A Production Document-Intelligence Stack: The Nine Layers, End to End

Most "AI for documents" demos work because the demo set is curated. You point a model at a PDF, get a clean answer, ship a screen recording. The problem starts the day the system has to answer questions about documents it has never seen, on a deadline, with a human auditor watching.

This is a walkthrough of the production document-intelligence stack we build on Quantum Leap engagements. It maps to the same nine layers we use on every AI engagement. The point is not "which model is best"; that decision changes every six weeks. The point is the rest of the system. The model is one layer of nine, and the other eight are where the system silently breaks.

The shape of a real document-intelligence engagement

A representative engagement looks like this: a customer hands us a few thousand contracts (or invoices, or specs, or filings) and asks for a system that can answer structured questions about them — extract the parties, flag non-standard clauses, surface the renewal date, cross-reference against a policy. Sometimes it has to run as a chat interface for an analyst. Sometimes it has to feed a downstream workflow that affects money.

Whichever shape it takes, the system has to do four things well: ingest the documents losslessly, index them so retrieval is fast and exact, answer with grounded citations, and refuse when it doesn't know. Each of those four things leans on a different subset of the nine layers.

Layer 01 — Infrastructure

Default: Bedrock on a VPC endpoint, in the customer account, in a region they have already approved. Embedding and inference traffic never crosses the public internet. OpenSearch Serverless in the same account, in the same region. KMS keys owned by the customer.

This is the layer that determines whether your legal and compliance teams ever sign off, so we settle it on the first call. We will not build on a stack that sends document text to a vendor-hosted endpoint by default. The marginal latency saving is not worth the conversation with the customer's CISO.

Layer 02 — Data

The least glamorous layer and the one most demos cheat on. Three questions decide whether the rest of the stack stands up:

Where is the source of truth? SharePoint, S3, a vendor PDF API, a scanned-paper archive nobody owns? We need write-once URIs we can re-fetch.
How are permissions modelled? If two analysts can ask the same question and get different answers because of row-level access, the retrieval layer needs to filter on identity. We design that on day one, not after launch.
What does "fresh" mean? Daily ingest? Streaming? Once-and-done? The right answer drives whether we build a real pipeline (Airflow, Step Functions) or a nightly batch job in Lambda.

We parse scanned PDFs with Textract, and born-digital documents that have real structure with a layout-aware parser (Unstructured, LlamaParse). We keep both the raw bytes and the parsed form, addressable by the same content hash. When the parser changes in 18 months you will be grateful.
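A minimal sketch of what "addressable by the same content hash" can look like, assuming a local directory stands in for the object store. The function names, the file layout, and the parser_version label are illustrative assumptions, not a fixed convention.

```python
import hashlib
from pathlib import Path


def content_key(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes; this is the shared address."""
    return hashlib.sha256(raw).hexdigest()


def store_document(raw: bytes, parsed_text: str, parser_version: str, root: Path) -> str:
    """Write the raw bytes and one parsed form under the same content hash."""
    key = content_key(raw)
    doc_dir = root / key[:2] / key  # two-character prefix keeps directories small
    doc_dir.mkdir(parents=True, exist_ok=True)
    (doc_dir / "raw.bin").write_bytes(raw)
    # Parsed output is named by parser version (hypothetical label), so a
    # re-parse with a newer parser sits next to the old output instead of
    # overwriting it.
    (doc_dir / f"parsed.{parser_version}.txt").write_text(parsed_text, encoding="utf-8")
    return key
```

Because the key depends only on the raw bytes, re-ingesting an unchanged file lands in the same place, and a parser upgrade only adds a new parsed file beside the old one.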

Layer 03 — Models

Claude on Bedrock for reasoning and answer synthesis. Amazon Titan or Cohere embeddings on Bedrock for retrieval. Per-task selection, not per-vendor loyalty. The model is the layer that will get replaced most often, so we wrap calls in a thin abstraction that takes a model_id and per-model parameters. No "best model" hard-coded anywhere.
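One way to sketch that thin abstraction is below. It deliberately avoids any vendor SDK: the backend is an injected callable, and the class names, task names, and model IDs are placeholders we made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# A backend is any callable taking (model_id, prompt, params) and returning text.
# In production this would wrap the vendor client; here it is left abstract.
Backend = Callable[[str, str, Dict[str, Any]], str]


@dataclass
class ModelConfig:
    model_id: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ModelRouter:
    backend: Backend
    tasks: Dict[str, ModelConfig]  # task name -> model choice, kept in config

    def complete(self, task: str, prompt: str) -> str:
        """Look up the model for this task and delegate to the backend."""
        cfg = self.tasks[task]
        return self.backend(cfg.model_id, prompt, cfg.params)


def fake_backend(model_id: str, prompt: str, params: Dict[str, Any]) -> str:
    # Stand-in for a real client call; echoes its inputs so routing is visible.
    return f"{model_id}|{params.get('max_tokens')}|{prompt}"


router = ModelRouter(
    backend=fake_backend,
    tasks={"synthesis": ModelConfig("example-reasoning-model", {"max_tokens": 512})},
)
```

Swapping a model then means editing the tasks mapping, not the call sites.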

One opinion we hold across engagements: do not fine-tune until you have evaluated the retrieval layer. Most reports of "the model is dumb on our domain" are actually retrieval failures dressed up as model failures. Fine-tuning a model that is being fed the wrong context just memorises the wrong answer.

Layer 04 — Retrieval

This is where document-intelligence engagements live or die. Three pieces:

Chunking. Layout-aware, not naive token-windowed. For contracts we chunk by clause, with the section heading carried as metadata. For invoices, by line item plus header context. For specifications, by section with the section number prepended. Token counts are a constraint, not a strategy.
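As a rough illustration of clause-level chunking, the sketch below splits plain text on numbered headings and carries the section number and heading as metadata. The heading pattern is an assumption that fits simple "1. Title" or "12.3 Title" layouts; real contracts usually need parser layout output rather than a regex.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed heading shape: a dotted number, optional trailing dot, then a
# capitalised title on its own line.
HEADING = re.compile(
    r"^(?P<number>\d+(?:\.\d+)*)\.?\s+(?P<title>[A-Z][^\n]*)$", re.MULTILINE
)


@dataclass
class Chunk:
    section_number: str
    heading: str
    text: str


def chunk_by_clause(document: str) -> List[Chunk]:
    """Split a document into one chunk per numbered clause.

    Text before the first heading is ignored in this sketch.
    """
    matches = list(HEADING.finditer(document))
    chunks: List[Chunk] = []
    for i, m in enumerate(matches):
        # A clause body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        body = document[m.end():end].strip()
        chunks.append(Chunk(m.group("number"), m.group("title").strip(), body))
    return chunks
```

A token limit would then be enforced per chunk, splitting oversized clauses further, rather than deciding where the boundaries fall.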

Embedding. One model, one dimensionality, written once across the corpus. Re-embedding is expensive and changes the distance topology, so we version the embedding model alongside the index. When you upgrade the model, you upgrade the whole index — never half.
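One lightweight way to make "version the embedding model alongside the index" hard to get wrong is to derive the index name from the model and dimensionality, and to refuse queries embedded with anything else. The naming scheme and class below are illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingVersion:
    model_id: str
    dimensions: int

    def index_name(self, corpus: str) -> str:
        """Build an index name that encodes the embedding model and dimensionality."""
        slug = re.sub(r"[^a-z0-9]+", "-", self.model_id.lower()).strip("-")
        return f"{corpus}-{slug}-{self.dimensions}d"


def check_query_compatible(index_version: EmbeddingVersion, query_version: EmbeddingVersion) -> None:
    """Raise if a query vector comes from a different model than the index."""
    if index_version != query_version:
        raise ValueError(
            f"index built with {index_version}, query embedded with {query_version}"
        )
```

With this scheme, an upgrade builds a complete new index under a new name and switches readers over in one step, so a half-migrated index never serves traffic.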

Index. Hybrid retrieval: dense vectors for semantic similarity, BM25 for exact-term match. Documents are full of names, numbers, and case IDs that have no semantic neighbours — pure-vector retrieval misses them. The fusion step (RRF or weighted) is where most quality wins come from, and it costs almost nothing to add.
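For reference, reciprocal rank fusion (RRF) fits in a few lines. This sketch assumes each retriever returns an ordered list of document IDs; k=60 is the commonly used default constant, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank) over lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # hypothetical vector-search results
bm25_hits = ["c", "a", "d"]    # hypothetical keyword-search results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Here "a" and "c" appear in both lists and rise to the top, while "b" and "d", each found by only one retriever, follow. RRF uses ranks only, so the dense and BM25 scores never need to be put on a common scale.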

Filter on permissions at the index, not after retrieval. If you filter after, restricted chunks have already left the index and entered the application layer, where they can reach a reranker, a cache, or a log line. That is a compliance leak even if the final answer is correct.
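A sketch of pushing the permission filter into the query itself, written as an OpenSearch-style k-NN request body. The field names embedding and allowed_groups are hypothetical, and whether the filter clause is applied during the vector search depends on the engine and version in use, so treat the shape as an assumption to verify against your cluster.

```python
from typing import Any, Dict, Sequence


def build_filtered_knn_query(
    query_vector: Sequence[float], user_groups: Sequence[str], k: int = 10
) -> Dict[str, Any]:
    """Build a k-NN query body that only considers chunks the user may read."""
    if not user_groups:
        # Fail closed: a user with no groups gets no query, not an unfiltered one.
        raise PermissionError("user has no groups; refusing to build a query")
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # hypothetical vector field name
                    "vector": list(query_vector),
                    "k": k,
                    "filter": {
                        "bool": {
                            "filter": [
                                # hypothetical keyword field listing permitted groups
                                {"terms": {"allowed_groups": list(user_groups)}}
                            ]
                        }
                    },
                }
            }
        },
    }
```

The groups come from the authenticated identity, never from the question text, so a prompt cannot widen its own access.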

Layer 05 — Orchestration

The default shape is not "an agent." It is a single-turn pipeline: retrieve, ground, answer, cite. Most document-intelligence questions do not need multi-step reasoning. They need the retrieval layer to surface the right paragraphs and the model to summarise honestly.
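The single-turn shape can be sketched as one function with the retriever and the model injected as callables. All names, the prompt wording, and the refusal message are illustrative assumptions; citing every retrieved passage is a simplification of real citation handling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    chunk_id: str
    text: str


@dataclass
class Answer:
    text: str
    citations: List[str]


REFUSAL = "I could not find this in the provided documents."


def answer_question(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    generate: Callable[[str], str],
) -> Answer:
    """Retrieve, ground, answer, cite; refuse when retrieval comes back empty."""
    passages = retrieve(question)
    if not passages:
        return Answer(REFUSAL, [])
    # Ground: number each passage so the model can refer to it by index.
    context = "\n\n".join(
        f"[{i}] ({p.chunk_id}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    prompt = (
        "Answer using only the numbered passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return Answer(generate(prompt), [p.chunk_id for p in passages])
```

There is no loop and no tool selection, so every failure is attributable to one of two calls.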

Agents enter the picture when the question requires sequencing — "find all contracts with the same counterparty, then check whether any have a non-standard indemnity clause." That is two retrievals and a comparison, which a one-shot pipeline cannot do. We use Bedrock Agents or Claude with MCP servers for this, with explicit tool boundaries and audit logs on every call.

The cost of an agent is how it fails: a single-turn pipeline fails visibly, while an agent fails by silently going off on a tangent. We only ship agents when the workflow earns the additional surface area and we have an evaluation harness that catches the tangent case.

Layer 06 — Tools

Tools are how the system reaches outside the document corpus — fetching a customer record, checking a policy database, writing something to a system of record. Three rules:

Every tool has a named owner. When the API changes, we know who to call.
Side effects are reversible or signed-off. A tool that mutates production records goes behind a human-in-the-loop boundary by default.
Every tool call is audit-logged with the prompt that triggered it. When something looks wrong six months later, we want the receipts.
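The three rules can be enforced in one wrapper, sketched below. The Tool fields, the status strings, and the in-memory list standing in for an audit store are assumptions for illustration.

```python
import json
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


class ApprovalRequired(Exception):
    """Raised when a mutating tool is called without a human sign-off."""


@dataclass
class Tool:
    name: str
    owner: str      # named owner to contact when the API changes
    mutates: bool   # True if the tool writes to a system of record
    fn: Callable[..., Any]


def call_tool(
    tool: Tool,
    args: Dict[str, Any],
    triggering_prompt: str,
    audit_log: List[str],
    approved_by: Optional[str] = None,
) -> Any:
    """Audit-log the call, then run it unless it mutates without approval."""
    blocked = tool.mutates and approved_by is None
    record = {
        "ts": time.time(),
        "tool": tool.name,
        "owner": tool.owner,
        "args": args,
        "prompt": triggering_prompt,
        "approved_by": approved_by,
        "status": "blocked" if blocked else "allowed",
    }
    # Log before executing so a crash inside the tool still leaves a record.
    audit_log.append(json.dumps(record, sort_keys=True))
    if blocked:
        raise ApprovalRequired(f"{tool.name} mutates state and needs sign-off")
    return tool.fn(**args)
```

Read-only tools pass straight through; a mutating tool raises ApprovalRequired until a reviewer identity is supplied, and both outcomes leave a log line containing the prompt.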

Layer 07 — Evaluation

The contract for the engagement. Before the build starts, we co-author a test set with the customer's domain experts: real questions, real expected answers, real edge cases. Three score categories on every question:

Faithfulness. Does the answer follow from the retrieved context, or did the model fill in plausible-sounding text from training?
Coverage. Did the retrieval layer surface the chunks that contain the answer?
Refusal correctness. When the answer is not in the corpus, does the system say so — instead of inventing one?

The harness reruns on every model bump, every parser change, every chunking-strategy change. It is part of the deliverable. The customer owns it. When we hand off the engagement, the harness is the customer's lever for catching regressions without us.
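The scoring core of such a harness can be small. In this sketch, a case with no expected chunks is treated as "not in the corpus", and the faithfulness verdict is supplied from outside (a domain expert or a grader), since it cannot be computed from IDs alone. The class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class EvalCase:
    question: str
    expected_chunk_ids: Set[str]  # empty set: the answer is not in the corpus


@dataclass
class EvalResult:
    retrieved_chunk_ids: Set[str]
    refused: bool
    faithful: bool  # external verdict; ignored when the system refused


def _rate(hits: int, total: int) -> Optional[float]:
    return hits / total if total else None


def score(cases: List[EvalCase], results: List[EvalResult]) -> Dict[str, Optional[float]]:
    """Compute coverage, refusal correctness, and faithfulness rates."""
    pairs = list(zip(cases, results))
    answerable = [(c, r) for c, r in pairs if c.expected_chunk_ids]
    unanswerable = [(c, r) for c, r in pairs if not c.expected_chunk_ids]
    answered = [r for _, r in pairs if not r.refused]
    return {
        # Coverage: retrieval surfaced every chunk that contains the answer.
        "coverage": _rate(
            sum(c.expected_chunk_ids <= r.retrieved_chunk_ids for c, r in answerable),
            len(answerable),
        ),
        # Refusal correctness: the system refused when the corpus has no answer.
        "refusal_correctness": _rate(sum(r.refused for _, r in unanswerable), len(unanswerable)),
        # Faithfulness: share of given answers judged to follow from the context.
        "faithfulness": _rate(sum(r.faithful for r in answered), len(answered)),
    }
```

Running score on the same cases before and after a model bump, parser change, or chunking change turns "answers look off" into three numbers that either moved or did not.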

Layer 08 — Observability

Token spend per request, p50/p95/p99 latency, retrieval hit rate, refusal rate, tool-call success rate, downstream error rate. Same dashboard for finance and engineering. A document-intelligence system that answers correctly but triples your monthly Bedrock bill is not a system that works.
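A sketch of how a few of those dashboard numbers could be derived from per-request records, using only the standard library. The record fields are assumptions, and the percentile method is one reasonable choice among several.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    latency_ms: float
    tokens: int
    refused: bool


def summarize(records: List[RequestRecord]) -> Dict[str, float]:
    """Summarise latency percentiles, token spend, and refusal rate.

    Needs at least two records, because statistics.quantiles does.
    """
    # With n=100 there are 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(
        [r.latency_ms for r in records], n=100, method="inclusive"
    )
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "tokens_per_request": sum(r.tokens for r in records) / len(records),
        "refusal_rate": sum(r.refused for r in records) / len(records),
    }
```

Multiplying tokens_per_request by request volume and the current per-token price gives the spend figure finance cares about, from the same records engineering uses for latency.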

We log the retrieved chunks alongside every answer so debugging is possible after the fact. A bad answer is not a model problem until you have ruled out a retrieval problem, and you cannot rule out a retrieval problem if you did not log what was retrieved.

Layer 09 — Governance

Prompt-injection defense at the boundary. No tool execution from text extracted from a document — the document is data, never an instruction source. PII detection on the way in, masking on the way out for unauthorized users. Versioned prompts, versioned model IDs, versioned chunking strategies — so every answer is reproducible from the audit log.
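To make "reproducible from the audit log" concrete, here is a sketch of an answer record that stores the versions next to the retrieved chunk IDs. The field names are assumptions. The fingerprint identifies the configuration that produced an answer; it does not by itself guarantee the model returns identical text on a rerun.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class AnswerRecord:
    question: str
    answer: str
    retrieved_chunk_ids: List[str]
    model_id: str
    prompt_version: str
    chunking_version: str

    def fingerprint(self) -> str:
        """Hash the inputs that determined the answer, excluding the answer text."""
        inputs = asdict(self)
        inputs.pop("answer")
        payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def to_log_line(self) -> str:
        """Serialise the full record plus its fingerprint as one JSON line."""
        return json.dumps({**asdict(self), "fingerprint": self.fingerprint()}, sort_keys=True)
```

Two records with the same fingerprint but different answers point at model nondeterminism; different fingerprints show which version changed.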

Governance is not a layer you add at the end. It shapes the infrastructure layer (where the data can live), the data layer (how permissions are modelled), the retrieval layer (where filtering happens), and the tool layer (what side effects are permitted). The document-intelligence systems we ship pass a written audit because the audit was the spec.

What gets handed back

At handoff, the customer's team owns: the IaC for the whole stack, the parsers and chunking strategy, the embedding pipeline, the retrieval API, the evaluation harness, the observability dashboards, a runbook for "what to do when answers start looking off," and a one-page kill-switch procedure for shutting the system down fast if something is wrong.

What they do not get from us: model weights (those are vendor-side), proprietary glue code (there is none; it is Apache 2.0 across the stack), or our continued involvement. The healthiest document-intelligence engagements we run end with the customer operating the system without us calling in.

If you are scoping a document-intelligence build and want to talk through the layer-by-layer choices for your domain — what to build, what to buy, what to defer — that is exactly the conversation a Quantum Labs spike starts. Send a paragraph.