The build-vs-buy question is the most expensive AI decision most teams make, and almost nobody makes it on purpose. It usually gets made one tool at a time, in the order that a vendor happens to demo, by whoever was free that week. Eighteen months later the bill arrives in three forms: a SaaS line item that grew, an integration that nobody owns, and a piece of internal infrastructure that duplicates something the vendor already does well.
This is the framework we use at Quantum Labs engagements to settle it deliberately. It runs through the same nine layers we use to think about any AI system, and applies one consistent test at each layer. The test is not "is the vendor good?" (they usually are). It is "would building this earn us anything we care about?"
The one test
At every layer, the decision comes down to a single question:
Is the work at this layer a differentiator for our business, or is it a commodity we just need to function?
Build differentiators. Buy commodities. Refuse to do either backward. Most of the over-engineered AI platforms we audit got that wrong at one specific layer — they built the commodity and bought the differentiator — and have been paying for it ever since.
Three secondary tests trim the edges:
- Speed of change. If the layer changes faster than your team can absorb (foundation models, every six weeks), buy it and stay current. If it changes slower than your team can absorb (your domain ontology), build it and own it.
- Switching cost. If switching vendors at this layer would require touching every other layer, build the seam yourself. If switching is a config change, the lock-in is weak enough to walk away from, so buying is safe.
- Compliance surface. If a vendor cannot meet the compliance bar your business already meets, you do not have a build-vs-buy decision at this layer. You have a build-or-don't-ship decision.
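The one test and its three trimmers can be written down as a small decision function. This is a minimal sketch, not a tool we ship: the `LayerAssessment` fields and the `recommend` function are hypothetical names, and the order in which the tests are applied (compliance first, then differentiation tempered by speed of change, then switching cost) is our assumption about how to break ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerAssessment:
    name: str
    differentiator: bool               # the one test
    changes_faster_than_team: bool     # speed of change
    switch_touches_other_layers: bool  # switching cost
    vendor_meets_compliance: bool      # compliance surface


def recommend(a: LayerAssessment) -> str:
    # Compliance first: with no compliant vendor there is nothing to buy.
    if not a.vendor_meets_compliance:
        return "build-or-don't-ship"
    # Build only differentiators the team can keep up with (assumed tie-break).
    if a.differentiator and not a.changes_faster_than_team:
        return "build"
    # Otherwise buy, and own the seam when switching would ripple outward.
    if a.switch_touches_other_layers:
        return "buy, build the seam"
    return "buy"
```

Filling in one `LayerAssessment` per layer with the whole team in the room is usually more valuable than the function itself, because it forces the disagreement about what counts as a differentiator into the open.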
Layer 01 — Infrastructure
Default: buy. AWS, Azure, GCP. Nobody is going to differentiate on rack management. The interesting question at this layer is which cloud and which region — not whether to operate one yourself. The exception is regulated workloads where on-prem is the only legal option; even then, "buy" usually means Kubernetes on a vendor distribution rather than ground-up.
The mistake we see: teams build their own Kubernetes platform because they want "portability." Portability is a real concern; it just rarely justifies a year of platform engineering when a managed K8s distribution gives you 80% of it.
Layer 02 — Data
Default: build the pipeline, buy the substrate. The storage layer (S3, Postgres, OpenSearch) is bought. The ingestion, parsing, lineage, and access-control logic on top is built — because it is exactly the part that encodes how your business thinks about its data.
This is the layer with the most premature buying. Vendor "data platforms for AI" promise to handle ingestion and chunking generically. They handle it generically. Generic ingestion is a commodity; the way it is wired into your permissions, lineage, and audit logs is the differentiator. Buying the wrapper means rewriting the integration every time the vendor changes.
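To make "built on top of a bought substrate" concrete, here is a minimal sketch of ingestion that carries lineage and access control with every chunk. The `IngestedChunk` type, the `ingest` and `visible_to` functions, and the fixed-size chunking are all illustrative assumptions; a real pipeline would parse documents properly and write to whatever store you bought.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestedChunk:
    text: str
    source_uri: str            # lineage: where this text came from
    allowed_groups: frozenset  # access control copied from the source system
    ingested_at: datetime      # audit: when it entered the pipeline


def ingest(text, source_uri, allowed_groups, chunk_size=200):
    """Split text into fixed-size chunks that each keep lineage and ACL."""
    now = datetime.now(timezone.utc)
    groups = frozenset(allowed_groups)
    return [
        IngestedChunk(text[i:i + chunk_size], source_uri, groups, now)
        for i in range(0, len(text), chunk_size)
    ]


def visible_to(chunks, user_groups):
    """Keep only chunks the user is permitted to read."""
    user = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user]
```

The point of the sketch is that permissions and lineage travel with the data from the first step, so every downstream layer can enforce them without asking the vendor.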
Layer 03 — Models
Default: buy. Frontier models change every six weeks. The team that trains its own foundation model from scratch is, in 99% of business contexts, lighting capital on fire to end up 12 months behind the vendor curve. Buy Claude, GPT, Gemini, or a hosted Llama. Wrap them behind a thin abstraction so you can swap.
Three real exceptions: (1) a domain so narrow that a small fine-tune on top of an open base model genuinely beats general-purpose for your workload, and you have the eval harness to prove it; (2) a workload so cost-sensitive that owning a quantized model on your own GPU is cheaper than per-token billing at scale; (3) a regulatory requirement that the model itself stay inside a boundary no vendor can offer.
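A thin abstraction over bought models can be as small as one interface and a registry. The sketch below is an assumption about shape, not any vendor's API: `ChatModel`, `EchoModel`, `REGISTRY`, and `get_model` are hypothetical names, and `EchoModel` stands in for an adapter that would call a real vendor SDK.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the rest of the system is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in adapter; a real one would call the vendor SDK here."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


# One adapter per vendor, selected by configuration rather than by code changes.
REGISTRY: dict[str, ChatModel] = {
    "vendor_a": EchoModel("vendor_a"),
    "vendor_b": EchoModel("vendor_b"),
}


def get_model(config: dict) -> ChatModel:
    return REGISTRY[config["model"]]
```

With this in place, swapping vendors is the config change that the switching-cost test asks for: `get_model({"model": "vendor_b"})` and nothing else moves.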
Layer 04 — Retrieval
Default: build, on a bought vector store (OpenSearch, pgvector, a dedicated vector DB). Build the chunking, the hybrid retrieval, the metadata-filter logic, and the re-ranking yourself. This is where domain-specific knowledge lives.
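As an illustration of the logic that sits on top of the store, the sketch below filters two ranked result lists by permission and then merges them with reciprocal rank fusion. RRF is one common fusion technique that we are using as a stand-in here, not a claim about what any particular engagement uses; `rrf`, `hybrid_search`, and the `doc_groups` mapping are hypothetical names, and the ranked lists would come from your keyword and vector queries.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(keyword_hits, vector_hits, doc_groups, user_groups):
    """Apply the permission filter before fusion so hidden docs never rank."""
    user = set(user_groups)

    def allowed(doc_id):
        return bool(doc_groups.get(doc_id, set()) & user)

    return rrf([
        [d for d in keyword_hits if allowed(d)],
        [d for d in vector_hits if allowed(d)],
    ])
```

Because the filter and the fusion are your code, a bad result can be traced to a specific list, rank, or permission rule, which is exactly what a black-box service does not allow.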
"End-to-end RAG-as-a-service" vendors will sell you a black box that works for the generic case. The black box fails on the cases that matter to your business — long-tail names, your specific document layout, your permission rules — and you cannot debug it because the chunks and embeddings are not yours. Almost every retrieval-quality fire we are called in to put out started with a bought RAG layer.
Layer 05 — Orchestration
Default: buy the runtime, build the policy. Use Bedrock Agents, LangGraph, Temporal, or whatever fits the shape — the runtime that holds an agent's state and routes tool calls is a commodity. The agent's policy — which tools it can call, in what order, with what guardrails, with what escalation path — is yours.
The mistake: building a homegrown orchestration runtime because the bought one was "too opinionated." Eighteen months later the homegrown runtime is the most fragile thing in production and the team is quietly rebuilding what the vendor already had.
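The policy half of this layer is small enough to keep as plain code next to whichever runtime you buy. Below is a minimal sketch under our own assumptions: `AgentPolicy` and `authorize` are hypothetical names, and the three outcomes ("allow", "deny", "escalate") are an illustrative vocabulary, not part of Bedrock Agents, LangGraph, or Temporal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPolicy:
    allowed_tools: frozenset         # which tools the agent may call at all
    needs_human_approval: frozenset  # tools that always go to a person first
    max_tool_calls: int              # guardrail against runaway loops


def authorize(policy: AgentPolicy, tool: str, calls_so_far: int) -> str:
    """Decide what happens to one proposed tool call."""
    if tool not in policy.allowed_tools:
        return "deny"
    if calls_so_far >= policy.max_tool_calls:
        return "escalate"  # budget exhausted: hand off to a human
    if tool in policy.needs_human_approval:
        return "escalate"
    return "allow"
```

The runtime calls `authorize` before dispatching each tool call; if you later change runtimes, the policy object moves with you unchanged.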
Layer 06 — Tools
Default: build. Tools are how the AI system touches your business — your APIs, your records, your side effects. There is no vendor for "the integration to your internal CRM with your specific permission boundary." Build them as ordinary backend code, with named owners, audit logs, and tests. The agent layer just discovers them and calls them.
The temptation is to use a "tool marketplace" — pre-built integrations for every SaaS. They are fine for personal-productivity agents. They are the wrong shape for production, because they bypass the permission model your business already enforces in every other code path.
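"Ordinary backend code" means a tool looks like any other service function: it checks the caller's permissions, writes an audit record, and is testable on its own. The sketch below assumes an in-memory `CRM` dictionary, a `get_account_balance` tool, and a team-based permission rule; all three are hypothetical placeholders for your real systems.

```python
import logging

audit = logging.getLogger("audit")

# Stand-in for the real CRM; in production this is an API or database call.
CRM = {"acct-1": {"owner_team": "sales-emea", "balance": 1200}}


def get_account_balance(user_teams, account_id):
    """Tool: read one account balance. Owner: crm-platform team (hypothetical)."""
    record = CRM[account_id]
    permitted = record["owner_team"] in user_teams
    # Audit every attempt, including the refused ones.
    audit.info("get_account_balance account=%s permitted=%s", account_id, permitted)
    if not permitted:
        raise PermissionError(f"no access to {account_id}")
    return record["balance"]
```

The agent acts with the permissions of the user it is working for, so the tool enforces the same boundary as every other code path into the CRM.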
Layer 07 — Evaluation
Default: build. This is the most important layer to own outright. The test set is the contract. The rubric encodes what "the system is working" means in your domain. The harness re-runs on every model bump and chunking change. If you do not own those, you cannot defend the system to your own auditors.
Buy the substrate: a test runner, a scoring framework, an LLM-as-judge primitive if you need one. The test set itself, the rubrics, the domain ground-truth — those are written by your subject-matter experts, with you. We deliver the harness as code your team owns and can re-run without us in the room.
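The smallest useful version of such a harness is a list of cases, a rubric, and a pass threshold. This is a sketch under stated assumptions: `Case`, `run_eval`, the substring rubric, and the 0.9 threshold are all illustrative, and a real rubric would be written by subject-matter experts and would likely be richer than a required phrase.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    question: str
    must_contain: str  # simplest possible rubric: a required phrase


def run_eval(system: Callable[[str], str], cases, threshold=0.9):
    """Run every case through the system and compare against the rubric."""
    passed = [
        c.must_contain.lower() in system(c.question).lower()
        for c in cases
    ]
    rate = sum(passed) / len(cases)
    return {"pass_rate": rate, "ok": rate >= threshold}
```

Here `system` is any callable that takes a question and returns an answer, so the same test set can be re-run unchanged after a model bump or a chunking change.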
Layer 08 — Observability
Default: buy. Datadog, CloudWatch, Honeycomb, LangSmith for prompt-level traces. The differentiator at this layer is not the storage of metrics; it is what you choose to instrument. Instrument token spend, retrieval hit rate, refusal rate, tool-call success, downstream error correlation — whichever vendor you buy.
Building your own observability stack for AI specifically is a sinkhole. The vendors are better at it than your team will be in year one. Spend the saved time building the dashboards that actually reflect how your finance and engineering teams reason about cost.
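What to instrument can be decided before choosing a vendor. The sketch below rolls per-request events up into the metrics named above; the event field names (`tokens`, `retrieval_hit`, `refused`, `tool_calls`, `tool_successes`) and the `summarize` function are assumptions for illustration, not any vendor's schema.

```python
def summarize(events):
    """Aggregate per-request events into the headline AI metrics."""
    n = len(events)
    tool_calls = sum(e["tool_calls"] for e in events)
    tool_successes = sum(e["tool_successes"] for e in events)
    return {
        "tokens": sum(e["tokens"] for e in events),
        "retrieval_hit_rate": sum(e["retrieval_hit"] for e in events) / n,
        "refusal_rate": sum(e["refused"] for e in events) / n,
        # None when no tools were called, to avoid dividing by zero.
        "tool_success_rate": tool_successes / tool_calls if tool_calls else None,
    }
```

Emitting these same fields to whichever vendor you buy keeps the dashboards portable if the vendor changes.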
Layer 09 — Governance
Default: build, with bought primitives. Prompt-injection filters, PII detectors, content classifiers — buy them as primitives. The policy on top — what gets blocked, what gets logged, what gets escalated to a human, who is allowed to override — is yours. No vendor can answer "is this allowed at your company" for you.
Two anti-patterns: (1) building your own PII detector from scratch when AWS Comprehend and several open-source models exist; (2) trusting a vendor's default policy as your policy because checking it would have taken an afternoon and nobody had the afternoon.
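The split between bought primitives and built policy can be sketched as a mapping from detector findings to actions. Everything named here is hypothetical: the finding labels stand in for whatever your bought detectors emit, and `POLICY`, `SEVERITY`, `OVERRIDE_ROLES`, and `decide` illustrate one possible shape, including our assumption that an override downgrades the action to "log" so it is never silent.

```python
# What each finding means at this company; the detectors only report findings.
POLICY = {"prompt_injection": "block", "pii": "escalate", "profanity": "log"}
SEVERITY = ["allow", "log", "escalate", "block"]  # least to most severe
OVERRIDE_ROLES = {"compliance-officer"}


def decide(findings, role=None, override=False):
    """Pick the most severe action that any finding calls for."""
    action = max(
        (POLICY.get(f, "log") for f in findings),  # unknown findings are logged
        key=SEVERITY.index,
        default="allow",
    )
    # Only named roles may override, and an override is still logged.
    if override and role in OVERRIDE_ROLES and action in ("escalate", "block"):
        return "log"
    return action
```

Reading this table takes less than the afternoon that checking a vendor's default policy would, and it is the artifact your auditors will actually ask for.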
Putting it together: a one-page decision sheet
For the engagements we run, the answer-by-layer usually shakes out like this:
- Infrastructure — buy
- Data — build pipeline, buy substrate
- Models — buy
- Retrieval — build
- Orchestration — buy runtime, build policy
- Tools — build
- Evaluation — build (with bought substrate)
- Observability — buy
- Governance — build, with bought primitives
That distribution is not universal — there are domains where Models is a build, and there are early-stage teams where Retrieval is fine to buy temporarily. But it is the shape we keep landing on, and it is a useful starting position to argue against, layer by layer, with your own constraints in mind.
The two failure modes
We see these two most often:
"Build everything" — typically a strong engineering team new to AI. They build their own model serving, their own vector store, their own observability. They are now 18 months behind the vendor curve at every layer and the platform is the only thing the team has time to maintain. They ship a small AI feature a year after a competitor shipped four.
"Buy everything" — typically a leadership team under deadline pressure. They wire together six SaaS layers, none of which know about the other five, none of which respect their permission model. The result works on the demo set, fails in production, and there is no layer they own well enough to debug. By the time the first incident hits, three of the six vendors have pivoted.
The good shape is "buy the commodities, build the differentiators, own the seams that connect them." Most of the work in a Quantum Labs engagement is exactly the seams.
If you are about to make a stack-wide build-vs-buy call and want a second opinion before the vendor contracts get signed — that is exactly the conversation a two-week Quantum Labs spike is built for. Send us the picture.