Retrieval is the layer where most document-intelligence engagements live or die. Almost every "the model is dumb on our domain" report we are called in to fix is a retrieval failure dressed up as a model failure. A model that is fed the wrong context is going to summarise the wrong context honestly. The fix is not to fine-tune. The fix is to make the retrieval layer correct.
What this layer covers
- Chunking the corpus into retrievable units
- Generating embeddings for the units
- Storing them in an index that supports dense + lexical retrieval
- At query time: fetching the right units, in the right order, with the right permissions applied
- Versioning all of the above so the system is reproducible after a model upgrade
The four pieces
1. Chunking
Naive token-windowed chunking ("every 500 tokens, with a 50-token overlap") is the default in most tutorials. It is wrong for almost every production document corpus: it splits clauses, separates headings from their content, and treats a contract page like a poem.
We use layout-aware chunking instead, structured per corpus:
- Contracts: chunk by clause. Carry the section heading and clause number as metadata. Each chunk knows what part of which document it belongs to.
- Invoices: chunk by line item, plus a header context (vendor, date, currency) attached to every line.
- Specifications: chunk by section, with the section number prepended into the chunk text.
- Research papers: chunk by paragraph within section. Section title attached.
Token counts are a constraint, not a strategy. If a clause is 1200 tokens, the chunk is 1200 tokens. If a line item is 30 tokens, the chunk is 30 tokens with the header prepended. The point of retrieval is to return semantically meaningful units, not to return them at uniform length.
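To make the clause-level strategy concrete, here is a minimal sketch of a chunker that splits on clause markers and records where each chunk sits in the source. The `Chunk` fields, the `chunk_by_clause` name, and the clause-marker regex are assumptions made for illustration; a real contract corpus would take its clause boundaries from the layout parser rather than from a regex.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    anchor: str          # clause number, e.g. "2.1"
    heading: str         # remainder of the clause's first line
    byte_start: int      # offsets into the UTF-8 encoded source text
    byte_end: int
    corpus_version: str
    text: str


# Assumed marker format: a line starting with "1." or "2.1" or "2.1." then text.
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)*\.|\d+(?:\.\d+)+)[ \t]+(.+)$", re.MULTILINE)


def chunk_by_clause(doc_id: str, text: str, corpus_version: str) -> list[Chunk]:
    """Split text at clause markers; chunk length follows the clause, not a token window."""
    matches = list(CLAUSE_RE.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[start:end].rstrip()
        byte_start = len(text[:start].encode("utf-8"))
        chunks.append(
            Chunk(
                doc_id=doc_id,
                anchor=m.group(1).rstrip("."),
                heading=m.group(2).strip(),
                byte_start=byte_start,
                byte_end=byte_start + len(body.encode("utf-8")),
                corpus_version=corpus_version,
                text=body,
            )
        )
    # Text before the first clause marker is ignored in this sketch.
    return chunks
```

Because each chunk carries a byte range, slicing the encoded source with `byte_start:byte_end` recovers the chunk text exactly, which is what makes a failing retrieval traceable back to the document.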
"We use the default chunker from the framework." If you cannot describe how a single chunk maps back to a position in the source document, you have made retrieval debugging impossible. Every chunk should carry document ID, section/clause anchor, byte range, and corpus version as metadata.
2. Embedding
One model, one dimensionality, written once across the corpus. The embedding model is a structural commitment: changing it requires re-embedding everything, and the new distance topology will not match the old one. So we treat the embedding model with the same discipline as a schema.
Default choice: Amazon Titan Embed Text V2 (Bedrock) or Cohere Embed v3 (Bedrock), depending on which scores better on the customer's domain eval. We pick by running both against a representative test set, not by reading the model card. Sometimes Voyage AI's models win on a particular domain, and we are happy to use them when the eval says so.
Versioning is non-negotiable. Every chunk in the index carries the embedding model ID and version that produced it. When we upgrade the model, we upgrade the whole index — never half — and the evaluation harness reruns before the new index becomes the default. A retrieval index in a state where half the chunks are V1 and half are V2 is broken; the distance metric is no longer meaningful.
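One way to enforce the "whole index, never half" rule is a guard that runs before an index is promoted. The sketch below assumes each chunk record is a dict with hypothetical `embedding_model_id` and `embedding_model_version` keys; the function names are illustrative, not part of any framework.

```python
from collections import Counter


def embedding_versions(chunks):
    """Count chunks per (model ID, model version) pair."""
    return Counter(
        (c["embedding_model_id"], c["embedding_model_version"]) for c in chunks
    )


def assert_single_embedding_version(chunks):
    """Raise if the index mixes embedding models; return the single pair otherwise."""
    versions = embedding_versions(chunks)
    if len(versions) > 1:
        raise ValueError(f"mixed embedding versions in index: {dict(versions)}")
    # An empty index has no version to report.
    return next(iter(versions), None)
```

Running a check like this as a gate in the re-embedding job turns a silent quality regression into a loud deployment failure.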
3. Index — hybrid retrieval
Pure-vector retrieval is wrong for production document corpora. Documents are full of names, numbers, case IDs, SKUs, and proper nouns that have no semantic neighbours, so pure-vector retrieval misses them in favour of "near" but irrelevant chunks. Pure-lexical (BM25) retrieval is wrong for the opposite reason: it misses paraphrases.
The default architecture is hybrid: dense vectors + BM25, fused at query time with Reciprocal Rank Fusion (RRF) or a weighted combination. OpenSearch supports both natively in a single index; pgvector + PostgreSQL full-text is fine when the customer already has Postgres operationalised and wants one fewer service.
RRF works like this: each retrieval method (dense and lexical) returns its top K chunks in ranked order. Each chunk gets a score of 1 / (k + rank) from each method that returned it, summed across methods. The top N chunks by combined score win. Here k is a smoothing constant, distinct from the retrieval depth K; 60 is the canonical default. The whole thing is ~10 lines of code on top of two normal retrieval calls. It is the single highest-leverage quality improvement available to the retrieval layer, and it costs almost nothing to add.
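A minimal sketch of that fusion step, assuming each retrieval method hands back a list of chunk IDs ordered best-first. The function name `rrf_fuse` and the alphabetical tie-break are choices made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant in 1 / (k + rank); ranks start at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest combined score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


dense = ["a", "b", "c"]
lexical = ["c", "a", "d"]
fused = rrf_fuse([dense, lexical])
print([chunk_id for chunk_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Chunks that both methods return ("a" and "c") outrank chunks that only one method found, even when that one method ranked them highly.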
4. Permission filtering — at the index, not after
If the customer's data has row-level permissions, and it almost always does, the retrieval layer has to respect them. The mistake is to retrieve first, then filter the results by permission. That pattern has a quiet compliance leak: you have already shown the embedding endpoint the documents the user is not allowed to see. Even if the answer is correct, the auditor sees the embedding API call in the trace, and the engagement becomes a much harder conversation.
The correct pattern is to filter at the index, using a metadata pre-filter on the user's identity tags. OpenSearch supports this with a filter clause on the query; pgvector supports it with a normal SQL WHERE. The retrieval call returns only the chunks the user is allowed to see, and the model never sees the rest. The auditor sees a clean trace.
Filter on permissions at the index, not after retrieval. Otherwise you have leaked the document existence to the embedding endpoint, even if the user never saw the content.
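As an illustration of the index-side filter, the sketch below builds the two OpenSearch request bodies with the permission tags inside the query itself. The field names `embedding`, `text`, and `acl_tags` are hypothetical, and the exact form of k-NN filtering depends on the OpenSearch version and k-NN engine in use, so treat the shapes as a starting point to check against your cluster's documentation.

```python
def dense_query(vector, acl_tags, k=50):
    """k-NN query body with the permission filter applied inside the vector search."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # hypothetical vector field name
                    "vector": vector,
                    "k": k,
                    "filter": {"terms": {"acl_tags": acl_tags}},
                }
            }
        },
    }


def lexical_query(text, acl_tags, k=50):
    """BM25 query body; the filter clause restricts matches without affecting scores."""
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [{"match": {"text": text}}],  # hypothetical text field name
                "filter": [{"terms": {"acl_tags": acl_tags}}],
            }
        },
    }
```

Both bodies carry the same tag list, so neither retrieval path can return a chunk outside the user's permissions.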
Default reference architecture
Ingest pipeline
- Document lands in S3 under the data account, KMS-encrypted, with content-hash and version metadata.
- An S3 event triggers an SQS message to the parser queue.
- Parser Lambda (or ECS task for large documents) pulls the document, layout-parses it, and writes the parsed form back to S3 alongside the raw bytes.
- Chunker Lambda reads the parsed form, applies the corpus-specific chunking strategy, and writes chunks to a chunks table (DynamoDB or S3, depending on size).
- Embedder Lambda batches chunks, calls Bedrock embeddings, writes vectors back into the chunks store, and indexes into OpenSearch with permission tags attached.
- Every step logs the corpus version it was operating on, so a re-ingest can be triggered from a clean state if a chunker or embedder upgrade lands.
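The content-hash and version metadata from the first step can be sketched as a small record builder. SHA-256, the `ingest_record` name, and the `parsed/<corpus_version>/<hash>.json` key layout are assumptions for this example; the pipeline above does not prescribe them.

```python
import hashlib


def ingest_record(raw: bytes, source_key: str, corpus_version: str) -> dict:
    """Build the metadata that travels with a document through the ingest steps."""
    content_hash = hashlib.sha256(raw).hexdigest()
    return {
        "source_key": source_key,
        "content_hash": content_hash,
        "corpus_version": corpus_version,
        # Keying parsed output by version and hash makes a re-ingest idempotent:
        # the same bytes under the same corpus version land at the same key.
        "parsed_key": f"parsed/{corpus_version}/{content_hash}.json",
    }
```

With this layout, a chunker or embedder upgrade bumps `corpus_version` and writes to fresh keys, leaving the previous version's output intact for comparison.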
Query pipeline
- User query arrives at the orchestration layer with the user's identity tags resolved.
- Orchestrator builds two queries: a dense query (embedding of the user query) and a lexical query (the user query as text), both with the user's permission tags as a pre-filter.
- Both queries hit OpenSearch in parallel.
- RRF fusion combines the rankings. Top N chunks returned, with their metadata (document ID, section anchor, content hash).
- Orchestrator passes the chunks plus metadata to the model layer for grounded answer synthesis.
- When the answer comes back, the orchestrator logs the chunks used alongside the answer for auditability.
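The query-side steps can be sketched as one function that runs both searches in parallel and fuses the results. The `embed`, `dense_search`, and `lexical_search` callables are placeholders for whatever clients the orchestrator actually uses; each search is assumed to return chunk IDs ordered best-first, already restricted by the permission tags.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve(query_text, acl_tags, embed, dense_search, lexical_search, k=60, top_n=10):
    """Run dense and lexical retrieval in parallel, then fuse with RRF."""
    vector = embed(query_text)
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_search, vector, acl_tags)
        lexical_future = pool.submit(lexical_search, query_text, acl_tags)
        dense_ids = dense_future.result()
        lexical_ids = lexical_future.result()

    # Reciprocal Rank Fusion: sum 1 / (k + rank) across both rankings.
    scores = {}
    for ranking in (dense_ids, lexical_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]

    # Return the intermediate rankings too, so they can be logged for audit.
    return {"dense": dense_ids, "lexical": lexical_ids, "fused": fused}
```

Passing the clients in as arguments keeps the function testable with stubs and makes it easy to swap the index substrate without touching the fusion logic.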
Build vs. buy at this layer
Default: build. Buy the index substrate (OpenSearch Serverless, pgvector, Pinecone, Weaviate). Build everything else: chunking, the embedding pipeline, fusion, filtering, re-ranking.
The temptation: "end-to-end RAG-as-a-service" vendors that promise to handle ingestion, chunking, and retrieval generically. They handle it generically. Generic ingestion is fine for the demo but fails on the cases that matter to the customer's business (long-tail proper nouns, specific document layouts, multi-tenant permissions), and you cannot debug it because the chunks and embeddings are not yours. Almost every retrieval-quality fire we are brought in to fight started with a bought RAG layer.
Re-ranking — when, and when not
Re-ranking is the step after hybrid retrieval: take the top 20-50 chunks, score them with a cross-encoder model (Cohere Rerank, Voyage Rerank, or a hosted Bedrock equivalent), keep the top 5-10. Cross-encoders score much more accurately than the dense + lexical fusion did, because they actually consider the query and the chunk together.
Use re-ranking when:
- The corpus is large (millions of chunks) and the first-stage retrieval is noisy.
- The customer's queries are highly variable and benefit from the cross-encoder's pairwise judgement.
- The latency budget tolerates an additional ~100-300ms.
Skip re-ranking when:
- The corpus is small (under ~50K chunks) and hybrid retrieval already returns clean results.
- The latency budget is tight and the eval shows minimal gain.
- The re-ranker's hosted cost would exceed the marginal quality improvement at the customer's traffic level.
Re-ranking is the pattern most often recommended without checking in current AI engineering writing. Always check on the eval first. If hybrid retrieval + RRF is already returning the right chunks, re-ranking adds cost and latency for no gain.
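A re-ranker on/off comparison on the eval can be as small as the sketch below. It assumes an eval set of (query, set of relevant chunk IDs) pairs and uses recall@k as the metric; `retrieve` and `rerank` are placeholder callables, and a real harness would likely track latency and cost alongside quality.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def compare_rerank(eval_set, retrieve, rerank, k=5):
    """Mean recall@k with and without the re-ranker over the same candidates."""
    if not eval_set:
        return {"baseline": 0.0, "reranked": 0.0}
    baseline, reranked = [], []
    for query, relevant in eval_set:
        candidates = retrieve(query)  # first-stage (hybrid) results, best-first
        baseline.append(recall_at_k(candidates, relevant, k))
        reranked.append(recall_at_k(rerank(query, candidates), relevant, k))
    n = len(eval_set)
    return {"baseline": sum(baseline) / n, "reranked": sum(reranked) / n}
```

If the two numbers are close on a representative eval set, the re-ranker is adding latency and cost without earning its place.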
What we log on every retrieval
The retrieval call log is what makes debugging possible. Every retrieval emits:
- The query (text + dense vector if practical).
- The user identity tags applied as the pre-filter.
- The dense top-K chunk IDs with scores.
- The lexical top-K chunk IDs with scores.
- The fused top-N chunk IDs after RRF.
- Latency at each step.
- The embedding model ID + version and the index corpus version.
The same trace lets us answer "why did the model return that?", "did the user actually have permission to see this chunk?", and "did the embedding model upgrade silently change the distance topology?" — all of which are routine investigation work in the operate phase.
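The fields listed above map naturally onto a single structured log record. The sketch below is one possible shape; the `RetrievalTrace` class and its field names are illustrative, and the dense query vector is left out for brevity.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RetrievalTrace:
    query: str
    acl_tags: list            # identity tags applied as the pre-filter
    dense_top_k: list         # [[chunk_id, score], ...]
    lexical_top_k: list       # [[chunk_id, score], ...]
    fused_top_n: list         # chunk IDs after RRF
    latency_ms: dict          # step name -> milliseconds
    embedding_model_id: str
    embedding_model_version: str
    corpus_version: str

    def to_json(self) -> str:
        """Serialise as one JSON line, with stable key order for diffing."""
        return json.dumps(asdict(self), sort_keys=True)
```

Emitting one such line per retrieval means a permission question or a suspected topology shift can be answered by querying logs, without having to reproduce the request.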
The five mistakes we see
1. Naive chunker
Token-windowed chunking on a structured corpus. Look at any failing retrieval; you will find chunks that contain half a clause and the first sentence of the next.
2. Pure-vector retrieval on a named-entity-heavy corpus
Contract IDs, customer names, SKUs, case numbers. None of them have meaningful semantic neighbours. Pure-vector retrieval misses them, and the model gets context that almost answers the user's question.
3. Post-retrieval permission filtering
The compliance leak above. Always filter at the index.
4. Embedding model upgrade without re-embedding the corpus
Half the chunks at V1, half at V2. Cosine similarity is no longer measuring what you think it's measuring. The system silently degrades, sometimes by a lot, and nothing in observability tells you why.
5. Re-ranker on by default
Adding latency and cost without checking that the underlying retrieval actually needed it. Always run a re-ranker on/off A/B on the eval before paying for it in production.
How this connects to the other layers
Retrieval depends on the infrastructure layer putting OpenSearch in a private VPC with the right KMS policy. It feeds the orchestration layer the grounded context that lets the model answer faithfully. It is graded by the evaluation layer's coverage and faithfulness scores. It is logged through the observability layer for debugging and audit. Without all four of those other layers being honest, retrieval is just a pile of floating-point math.
Related: the infrastructure layer reference architecture, the tooling catalog, the concepts & standards glossary, and the document-intelligence end-to-end walkthrough (the long-form essay where retrieval is one of nine pieces).