LAUNCH ETA: October 2025

Retrievability is not discovery

4 min read

RAG pipelines retrieve a narrow slice of documents that sit closest in vector space to a query, but treating that narrowing as discovery ignores the fact that anything framed outside familiar patterns is never even surfaced for evaluation.1 While recent work has demonstrated the quantitative limits of embedding-based retrieval at scale, the concern here is qualitative: the tendency of embedding retrieval to privilege precedent and suppress novelty.

Discovery has always been constrained by filters: once link structure and keyword overlap, now vector proximity. The shift removes the web’s original filtering system, backlinks as a proxy for trust and PageRank as a crude map of collective attention, and replaces it with a pipeline where retrieval is dominated by embedding similarity. That substitution is the qualitative bias in embedding retrieval: a conservative filter that privileges familiar patterns. The outcome is a filtering regime where retrievability is governed by how closely new content resembles known patterns, not by merit, usefulness, or originality.

This system treats retrieval as a nearest-neighbor problem in vector space. It does not parse reasoning, evaluate correctness, or promote quality unless that quality is already legible as semantic features. New projects, especially those with no existing footprint, are invisible unless they encode themselves near existing questions. Semantic clarity wins, not conceptual originality. Retrieval fails when novelty outruns precedent.
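To make the mechanism concrete, here is a minimal sketch of that nearest-neighbor step. The `embed` function is a toy hashed bag-of-words stand-in for a real embedding model, an assumption for illustration only; everything else is plain cosine similarity over a corpus.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in for a learned embedding model (hypothetical, illustration only):
    # a hashed bag-of-words vector. A real pipeline would call an encoder here.
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v

def top_k(query: str, corpus: list[str], k: int = 5) -> list[tuple[float, str]]:
    # Rank documents purely by cosine similarity to the query vector.
    # Nothing below checks correctness, quality, or novelty; only proximity.
    q = embed(query)
    q /= np.linalg.norm(q) or 1.0
    scored = []
    for doc in corpus:
        d = embed(doc)
        d /= np.linalg.norm(d) or 1.0
        scored.append((float(q @ d), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A document that shares vocabulary with the query rises; a document that describes the same capability in unfamiliar terms does not, which is the whole point.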

The ecosystem’s incentives are already set. These dynamics echo the earlier web, where keyword matching and backlinks favored the familiar, but embeddings deepen the effect by encoding similarity at the semantic level. Embedding models are trained on known corpora. Retrieval systems rank documents by vector similarity to past queries. LLMs predict the next token from surrounding context, not from verification or evaluation. No actor in the system corrects for quality unless explicitly trained to do so. None of the defaults penalize noise if the noise looks familiar. That’s the contradiction: the tools reward semantic mimicry, not semantic value.

Publishing a high-quality project without describing it in relation to common problems, known formats, and shared vocabulary guarantees it remains undiscovered. Publishing a low-quality clone that mimics the structure of well-known projects can guarantee inclusion. Retrieval pipelines rarely see the difference, because what matters now is not how well something works but whether it embeds close enough to something already seen.
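A toy comparison makes the asymmetry visible. The project blurbs below are invented for illustration, and `embed` is the same hashed bag-of-words stand-in as above rather than a real encoder; still, the clone that reuses category vocabulary scores higher against a typical query than the original project described in its own terms.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashed bag-of-words embedding (illustrative stand-in for a real model).
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v

def cosine(a: str, b: str) -> float:
    x, y = embed(a), embed(b)
    return float(x @ y / ((np.linalg.norm(x) * np.linalg.norm(y)) or 1.0))

query = "lightweight vector database for semantic search"

# Hypothetical project descriptions, invented for illustration.
clone = "A lightweight vector database for semantic search over embeddings."
novel = "An index that reasons over execution traces to rank tools by observed behavior."

print(cosine(query, clone))  # high: near-verbatim overlap with the query
print(cosine(query, novel))  # low: no shared vocabulary, regardless of quality
```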

Creators working on new tools or products can no longer rely on classical discovery dynamics. There is no search index that rewards exploratory clicking. No backlink trail that elevates reputation through citation. Visibility now depends on placement inside the model’s vector neighborhoods, through documentation phrasing, metadata, content examples, and corpus presence. Publishing to GitHub is not enough unless the README uses vocabulary already tied to existing categories. Launching on a personal website is irrelevant unless it gets scraped and placed into a retrievable corpus. Syndication isn’t about reach; it’s about inclusion in the few data sources that actually inform retrieval: curated lists, documentation aggregators, package registries, and static content channels already ingested by pretraining pipelines.

This is not a neutral shift. The system does not reward technical work directly. It rewards work that is described in ways the system already understands. It forces conformity at the semantic level to achieve discoverability. Any claim that LLMs enable better discovery must account for the fact that retrieval is a conservative force: it pulls toward known vectors, not outliers. Tail projects remain tail projects unless they express themselves in the same form as head examples.

To fix this, retrievers would need to incorporate alternate signals: execution traces, usage telemetry, curated review weightings, temporal recency biasing. None of these are standard in deployed systems. Absent that, discovery remains a function of surface similarity. And the only reliable antidote is to encode new things in old shapes, until the system can represent quality as something other than vector proximity to precedent.
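As a sketch of what blending those signals could look like, the fields `usage_score`, `review_weight`, and `days_since_update` below are hypothetical and the weights are arbitrary; nothing here describes a deployed system.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    similarity: float        # cosine similarity from the embedding retriever
    usage_score: float       # hypothetical: normalized usage telemetry, 0..1
    review_weight: float     # hypothetical: curated review signal, 0..1
    days_since_update: int   # hypothetical: staleness proxy

def blended_score(c: Candidate,
                  w_sim: float = 0.6,
                  w_usage: float = 0.2,
                  w_review: float = 0.15,
                  w_recency: float = 0.05) -> float:
    # Exponential decay so recently updated work is not buried under old precedent.
    recency = math.exp(-c.days_since_update / 180.0)
    return (w_sim * c.similarity
            + w_usage * c.usage_score
            + w_review * c.review_weight
            + w_recency * recency)

def rerank(candidates: list[Candidate]) -> list[Candidate]:
    # The extra signals only reorder what the embedding retriever already surfaced;
    # similarity still decides who makes the candidate pool at all.
    return sorted(candidates, key=blended_score, reverse=True)
```

The limitation is in the last comment: reranking cannot rescue anything the similarity gate never let through, which is exactly the problem the next paragraph turns to.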

No change in modeling architecture fixes this: as long as retrieval depends on top-k embedding similarity, content outside the threshold is invisible. Leaf documents without neighboring vectors are never surfaced, and the system has no mechanism to notice what it missed. Visibility remains the precondition for reasoning, and embedding fit remains the gate, unless indexing itself expands representation, through summaries or alternate views, to give outliers a chance to be seen.
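One way indexing could expand representation is to store several views per document, say the original text plus a generated summary and an alternate phrasing, and score a document by its best-matching view. A minimal sketch under those assumptions, reusing the toy embedding from earlier:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashed bag-of-words embedding (illustrative stand-in for a real model).
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def index_views(doc_id: str, views: list[str]) -> list[tuple[str, np.ndarray]]:
    # Each view (original text, summary, alternate phrasing) gets its own vector,
    # so an outlier document has several chances to sit near a query.
    return [(doc_id, embed(view)) for view in views]

def best_view_score(query: str, index: list[tuple[str, np.ndarray]]) -> dict[str, float]:
    # Score each document by its best-matching view rather than a single vector.
    q = embed(query)
    scores: dict[str, float] = {}
    for doc_id, vec in index:
        s = float(q @ vec)
        scores[doc_id] = max(s, scores.get(doc_id, float("-inf")))
    return scores
```

Extra views widen the neighborhoods a document can fall into; they do not judge quality, but they give unfamiliar work more chances to be seen at all.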

The work ahead is deciding whether to repeat the old cycle, where novelty vanishes until it bends toward precedent, or to build retrieval systems that can surface the unfamiliar on purpose.


  1. Weller, O., Boratko, M., Naim, I., & Lee, J. (2025). On the Theoretical Limitations of Embedding-Based Retrieval. arXiv:2508.21038. https://arxiv.org/abs/2508.21038  ↩︎