The models layer is the one customers want to spend the most time discussing and where the decision has the lowest payoff per hour. Foundation models change every six weeks. The right answer in January is the wrong answer in March. Treat the choice as structurally cheap to revisit, and spend the engineering hours on the layers — retrieval, evaluation, governance — that actually determine whether the system works.
What this layer covers
- Picking foundation models for the engagement's tasks
- Per-task model selection (reasoning vs. classification vs. embedding)
- Where the model runs (Bedrock vs direct API vs self-hosted)
- Versioning the model ID + parameters in every call
- The thin abstraction that makes swapping cheap
- When fine-tuning is actually warranted
The shape we end up at
Claude Sonnet (via Bedrock) for reasoning and answer synthesis. Amazon Titan Text Embeddings V2 or Cohere Embed v3, both hosted on Bedrock, for retrieval (chosen by eval, not by marketing). Claude Haiku for the smaller classification and routing tasks where Sonnet is overkill. The whole set is wrapped behind a thin abstraction so any one of them can be replaced in a config change.
Per-task selection is the discipline. The mistake is to standardise on one model everywhere — usually the most capable, usually the most expensive — and pay 4× for tasks a smaller model handles deterministically. The eval harness scores each task with the model that was actually used; that lets us compare alternatives at the task boundary instead of pretending one model fits all.
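A minimal sketch of what comparing alternatives at the task boundary can look like, assuming the eval harness emits one record per scored call. The record layout, the model names, the scores, the costs, and the 0.90 quality bar are all illustrative placeholders, not output from a real harness.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical eval records: (task_id, model_name, score in [0, 1], USD per call).
EVAL_RUNS = [
    ("route_ticket", "small-model", 0.97, 0.0004),
    ("route_ticket", "small-model", 0.95, 0.0004),
    ("route_ticket", "large-model", 0.98, 0.0050),
    ("route_ticket", "large-model", 0.96, 0.0050),
    ("synthesise_answer", "small-model", 0.70, 0.0010),
    ("synthesise_answer", "small-model", 0.74, 0.0010),
    ("synthesise_answer", "large-model", 0.91, 0.0120),
    ("synthesise_answer", "large-model", 0.93, 0.0120),
]


def cheapest_passing_model(runs, min_score):
    """For each task, return the cheapest model whose mean score meets min_score."""
    grouped = defaultdict(list)
    for task_id, model, score, cost in runs:
        grouped[(task_id, model)].append((score, cost))

    choice = {}  # task_id -> (model, mean cost)
    for (task_id, model), rows in grouped.items():
        avg_score = mean(score for score, _ in rows)
        avg_cost = mean(cost for _, cost in rows)
        if avg_score < min_score:
            continue  # this model does not clear the bar for this task
        best = choice.get(task_id)
        if best is None or avg_cost < best[1]:
            choice[task_id] = (model, avg_cost)
    return {task_id: model for task_id, (model, _) in choice.items()}


selection = cheapest_passing_model(EVAL_RUNS, min_score=0.90)
print(selection)
```

With these placeholder numbers the routing task lands on the smaller model and answer synthesis stays on the larger one. A task with no passing model simply does not appear in the result, which is a signal to look at the task or the retrieval feeding it rather than to pick a model anyway.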
Default reference architecture
Where the model runs
- Bedrock — the default. VPC endpoint, customer-managed KMS key on the inference log group, IAM-scoped to the workload account's service roles. The model never leaves the AWS boundary; the inference traffic never crosses the public internet. This is the posture that gets signed off on by customer security teams without a six-week conversation.
- Anthropic API directly — when the customer is comfortable with the egress (rare for enterprise, common for fast-moving teams). Latency is sometimes lower than Bedrock; access to the very newest model versions is sometimes earlier.
- Self-hosted (EC2 GPU / SageMaker) — almost never our default. The three real exceptions: regulatory residency that no vendor meets, a fine-tune that genuinely beats hosted models on the customer's eval, or workloads at a scale where the GPU TCO actually wins. We have been asked about this often. The answer is almost always "no, use Bedrock and put the engineering hours into retrieval."
The model abstraction
Every model call goes through a single function that takes a task_id, resolves it to a model_id + parameters via config, makes the call, and logs the (task_id, model_id, model_version, prompt_hash, latency, tokens, cost) tuple. The config lives in source control and can be flipped per environment. Swapping Claude Sonnet for Haiku on a routing task is one PR.
Critically, the abstraction is thin — it does not try to be model-agnostic in the LangChain sense, where one interface pretends to abstract over wildly different model APIs. It is just a per-task router with logging and config. Each model still gets called with the parameters it actually wants. Generic wrappers silently strip features (cache control, system prompt structure, structured-output mode) that matter in production.
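The per-task router described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production implementation: `TaskConfig`, `TASK_CONFIG`, `PRICE_PER_1K`, `call_model`, and the shape of the client's return value are hypothetical names chosen for the example, the prices are placeholders rather than real pricing, and the model IDs are only examples of pinned Bedrock-style identifiers.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskConfig:
    model_id: str                               # pinned, versioned model identifier
    params: dict = field(default_factory=dict)  # passed to the model untouched


# Hypothetical config; in practice this would be loaded from a file in source control.
TASK_CONFIG = {
    "route_ticket": TaskConfig(
        "anthropic.claude-3-haiku-20240307-v1:0",
        {"max_tokens": 16, "temperature": 0.0},
    ),
    "synthesise_answer": TaskConfig(
        "anthropic.claude-3-5-sonnet-20240620-v1:0",
        {"max_tokens": 1024, "temperature": 0.2},
    ),
}

# Placeholder USD prices per 1,000 (input, output) tokens. NOT real pricing.
PRICE_PER_1K = {
    "anthropic.claude-3-haiku-20240307-v1:0": (0.001, 0.002),
    "anthropic.claude-3-5-sonnet-20240620-v1:0": (0.010, 0.020),
}


def call_model(task_id, prompt, client, log):
    """Resolve task_id to a model via config, make the call, log one record."""
    cfg = TASK_CONFIG[task_id]
    started = time.perf_counter()
    # `client` is any callable wrapping the provider SDK (assumed interface).
    response = client(cfg.model_id, prompt, cfg.params)
    latency_s = time.perf_counter() - started

    in_price, out_price = PRICE_PER_1K[cfg.model_id]
    cost = (response["input_tokens"] * in_price
            + response["output_tokens"] * out_price) / 1000
    log({
        "task_id": task_id,
        "model_id": cfg.model_id,
        "model_version": response["model_version"],  # version the provider resolved
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "latency_s": round(latency_s, 4),
        "tokens": response["input_tokens"] + response["output_tokens"],
        "cost_usd": round(cost, 6),
    })
    return response["text"]


def fake_client(model_id, prompt, params):
    """Stand-in for a real provider call so the sketch runs offline."""
    return {"text": "billing", "model_version": model_id,
            "input_tokens": 120, "output_tokens": 3}


records = []
answer = call_model("route_ticket", "Classify: 'I was charged twice'",
                    fake_client, records.append)
print(answer)
print(json.dumps(records[0], indent=2))
```

The router never inspects or rewrites `params`; each task's config carries whatever its model expects, which is what keeps provider-specific features available. Swapping the model for a task is an edit to `TASK_CONFIG` and nothing else.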
Versioning
Every deliverable identifies, in writing, which model is in use for each task, at what version, with what parameters, and how those choices were made. When the model bumps, the eval harness reruns and the report goes to the customer before any production deploy. See Principle 05.
"We use the best AI model" is not an answer we are willing to give.
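One way to catch a model bump before it reaches production is to compare the resolved version in each inference trace against the version named in the deliverable. The sketch below assumes a simple trace layout and uses made-up version strings; `EXPECTED_VERSION`, `TRACES`, and `find_version_drift` are hypothetical names, not part of any harness.

```python
# Hypothetical manifest: the version each task is documented to use.
EXPECTED_VERSION = {
    "route_ticket": "small-model@2024-03",
    "synthesise_answer": "large-model@2024-06",
}

# Hypothetical inference traces; only the fields needed here are shown.
TRACES = [
    {"task_id": "route_ticket", "model_version": "small-model@2024-03"},
    {"task_id": "synthesise_answer", "model_version": "large-model@2024-06"},
    {"task_id": "synthesise_answer", "model_version": "large-model@2024-09"},
]


def find_version_drift(traces, expected):
    """Return {task_id: sorted unexpected versions} for traces that differ from the manifest."""
    drift = {}
    for trace in traces:
        task_id = trace["task_id"]
        resolved = trace["model_version"]
        if resolved != expected.get(task_id):
            drift.setdefault(task_id, set()).add(resolved)
    return {task_id: sorted(versions) for task_id, versions in drift.items()}


drift = find_version_drift(TRACES, EXPECTED_VERSION)
print(drift)
```

A non-empty result is the trigger to rerun the eval harness and update the written record before anything ships.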
Fine-tuning — when, and almost-never-otherwise
The single most common cause of wasted engineering hours in customer engagements is fine-tuning before the retrieval layer is correct. The flow goes: the model returns wrong answers → the team concludes the model does not know the domain → they spend six weeks fine-tuning → the answers are still wrong, because the model was being fed the wrong context all along.
Do not fine-tune until you have evaluated the retrieval layer. Most "the model is dumb on our domain" reports are retrieval failures dressed up as model failures.
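A rough sketch of how wrong answers can be triaged before anyone reaches for fine-tuning, assuming a small labelled set where each query lists the document IDs that should have been retrieved. The case layout and the choice of recall@k as the retrieval metric are assumptions made for the illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each labelled query needs at least one relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def triage(cases, k=5):
    """Split wrong answers into retrieval failures and candidate model failures."""
    retrieval_failures, model_failures = [], []
    for case in cases:
        if case["answer_correct"]:
            continue
        if recall_at_k(case["retrieved"], case["relevant"], k) < 1.0:
            # The model never saw all the context it needed.
            retrieval_failures.append(case["query"])
        else:
            # The right context was present and the answer was still wrong.
            model_failures.append(case["query"])
    return retrieval_failures, model_failures


# Hypothetical labelled cases.
CASES = [
    {"query": "q1", "retrieved": ["d9", "d4"], "relevant": ["d1"], "answer_correct": False},
    {"query": "q2", "retrieved": ["d2", "d7"], "relevant": ["d2"], "answer_correct": False},
    {"query": "q3", "retrieved": ["d3"], "relevant": ["d3"], "answer_correct": True},
]

retrieval_failures, model_failures = triage(CASES, k=2)
print("retrieval:", retrieval_failures)
print("model:", model_failures)
```

Only the queries in the second bucket are evidence about the model, and even those point at prompting before they point at fine-tuning.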
The real cases for fine-tuning, in order of how often they're the right call:
- Format / style constraints — the customer needs a very specific output structure that prompting alone is not nailing. Small adapter fine-tune on a few hundred examples. Cheap, fast.
- Domain language — the corpus has heavy domain-specific terminology the base model handles awkwardly. Fine-tune on a representative span of the corpus. Verify with a real eval, not a smile test.
- Cost-driven distillation — at scale, distill a frontier model's behaviour into a smaller model that can serve cheaper. Worth the work when the per-token bill is the dominant cost line.
Almost never the right call:
- "To make the model 'know' our facts" — RAG handles this with no training and stays current.
- "Because frontier models hallucinate" — they will hallucinate after fine-tuning too. Constrain with retrieval, refusal, and evaluation. See the concepts glossary.
Build vs. buy at this layer
Default: buy. Frontier model training is a capital-intensive R&D enterprise. The customer is not in that business; we are not in that business. Buy Claude, GPT, Gemini, Titan, Cohere. Wrap them behind the thin abstraction. Swap as the eval indicates.
Things to build at this layer:
- The model abstraction. ~50 lines of code. Names the per-task model choices; logs every call; flips with config.
- The model-version tracking. Every inference trace records the model ID + version. Required for reproducibility and audit.
- The cost telemetry. Token counts and per-task costs tagged by engagement and workload. See Layer 08.
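As an illustration of the cost telemetry item, the per-call log records can be rolled up by whatever tags are attached to them. The field names (`engagement`, `workload`, `task_id`, `tokens`, `cost_usd`) and the sample records are assumptions for this sketch.

```python
from collections import defaultdict


def cost_by_tag(records, keys=("engagement", "workload", "task_id")):
    """Sum tokens, cost, and call count for each distinct combination of tag keys."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0, "calls": 0})
    for rec in records:
        bucket = totals[tuple(rec[key] for key in keys)]
        bucket["tokens"] += rec["tokens"]
        bucket["cost_usd"] += rec["cost_usd"]
        bucket["calls"] += 1
    return dict(totals)


# Hypothetical log records with placeholder values.
LOG = [
    {"engagement": "acme", "workload": "support", "task_id": "route_ticket",
     "tokens": 120, "cost_usd": 0.0001},
    {"engagement": "acme", "workload": "support", "task_id": "route_ticket",
     "tokens": 130, "cost_usd": 0.0002},
    {"engagement": "acme", "workload": "support", "task_id": "synthesise_answer",
     "tokens": 2400, "cost_usd": 0.0300},
]

totals = cost_by_tag(LOG)
for tags, agg in sorted(totals.items()):
    print(tags, agg)
```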
Things to buy at this layer:
- Foundation models themselves. Always.
- Re-rankers — Cohere Rerank, Voyage. Don't build a cross-encoder from scratch unless the eval actively requires a custom one.
- Embedding models — Bedrock-hosted by default. Picked by eval.
The five mistakes we see
1. Standardising on one model everywhere
Usually Claude Sonnet, GPT-4-class. Pays 4× for tasks a smaller model handles deterministically. Per-task selection takes an afternoon to set up and pays back forever.
2. Fine-tuning before fixing retrieval
Discussed above. The single most expensive mistake at this layer.
3. Generic model wrapper that strips features
Using a wrapper that promises portability across providers, then losing prompt caching, structured-output mode, system-prompt structure, or the right tokenizer for cost estimation. The thin abstraction wins; the all-providers-equal wrapper does not.
4. No model version in the inference log
The model bumped silently — Bedrock published a new minor version, the alias caught up — and behaviour changed. Without the version in the trace, the team is debugging blind. Pin the model version in the model ID; log the resolved version in every call.
5. Building a self-hosted GPU farm "to save money"
The math almost never works at engagement scale. We have audited deployments where the GPU TCO came out at 3-5× the Bedrock spend once amortisation, scaling overhead, and operational time were counted. The exceptions are real but rare; do the spreadsheet honestly before committing.
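The spreadsheet in the last item can start as small as the sketch below. It assumes rented GPU instances billed hourly rather than purchased hardware, folds scaling overhead into a single `headroom` multiplier, and counts operational time as hours at a loaded rate. Every input in the example call is a placeholder chosen for easy arithmetic, not a measured figure or a real price.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def self_hosted_monthly_cost(instance_hourly_usd, instances,
                             ops_hours, ops_hourly_usd, headroom=1.0):
    """Always-on GPU instances (scaled by headroom) plus the people who run them."""
    compute = instance_hourly_usd * instances * headroom * HOURS_PER_MONTH
    return compute + ops_hours * ops_hourly_usd


def hosted_monthly_cost(tokens_in, tokens_out, usd_per_1k_in, usd_per_1k_out):
    """Per-token bill for the same traffic on a hosted model."""
    return tokens_in / 1000 * usd_per_1k_in + tokens_out / 1000 * usd_per_1k_out


# Placeholder inputs only; substitute the engagement's real numbers.
self_hosted = self_hosted_monthly_cost(
    instance_hourly_usd=10.0, instances=2,
    ops_hours=40, ops_hourly_usd=100.0, headroom=1.5,
)
hosted = hosted_monthly_cost(
    tokens_in=1_000_000_000, tokens_out=200_000_000,
    usd_per_1k_in=0.003, usd_per_1k_out=0.015,
)
print(f"self-hosted: ${self_hosted:,.0f}/month")
print(f"hosted:      ${hosted:,.0f}/month")
print(f"ratio:       {self_hosted / hosted:.2f}x")
```

The point of writing it down is that the operational hours and the headroom multiplier become explicit inputs someone has to defend, instead of costs that get left out.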
How it connects to the other layers
The models layer sits on top of Layer 01 (the Bedrock VPC endpoint, the IAM scoping). It consumes context from Layer 04 (without good retrieval, no model is good enough). It is graded by Layer 07 — the harness scores each task with the model that actually produced the answer, which is what makes per-task selection an evidence-based decision rather than a vibes one. Its costs surface in Layer 08 as the largest line item in most engagements.
Related: the retrieval layer reference architecture, the evaluation layer reference architecture, the tooling catalog, and the concepts & standards glossary.