This is the working catalog of tools we currently pick at each
layer of the AI stack on customer engagements. It is opinionated
on purpose — neutrality is not useful when a customer needs to make
a decision in the next two weeks. We update it as the field shifts
and as new tools earn their place against the existing defaults.
The shape: for each layer, the default we reach for
when starting clean, the alternatives we will use
when the default does not fit, and the notes that
explain why. None of this is universal — engagements have
constraints (existing investments, regulatory posture, team skills)
that legitimately change the answer. The defaults are the starting
position.
The full layer reference architectures live in the per-layer
Research pages — start with
Infrastructure (Layer 01)
and
Retrieval (Layer 04),
which are live today; the rest are being written.
01 — Infrastructure
| Slot | Default | Alternatives | Notes |
| Cloud | AWS | Azure, GCP (customer-driven) | WOSB + AWS Select Partner posture. Deep CDK + Sprintsail bench. |
| IaC | CDK (TypeScript) + Sprintsail primitives | Terraform, Pulumi | Sprintsail for the application surface (~80%), raw CDK for the AI-specific resources (Bedrock, OpenSearch, KMS). |
| Compute (orchestration) | Lambda (Node.js / Python) | ECS Fargate, EKS | Lambda for the API + orchestration path. ECS when long-running parsers or per-request state push past Lambda limits. |
| Secrets | AWS Secrets Manager | Parameter Store (SecureString) | Customer-managed KMS key, automatic rotation where the secret supports it. |
| Identity | Cognito (customer-facing) + IAM roles (service-to-service) | SSO / OIDC integration to existing IdP | Cross-account roles, never shared SDK credentials. |
02 — Data
| Slot | Default | Alternatives | Notes |
| Object storage | S3 (KMS-encrypted, customer-managed key) | Azure Blob, GCS | Versioned bucket. Object Lock for the raw corpus on regulated workloads. |
| Relational | RDS Postgres (Aurora Serverless v2 when bursty) | RDS MySQL, DynamoDB (for non-relational) | Postgres carries pgvector well if the customer wants a single DB story. |
| Document parsing | Textract (scanned), Unstructured / LlamaParse (born-digital) | Mistral OCR, Azure Document Intelligence | Keep raw bytes + parsed form side-by-side, content-hash addressable. |
| Pipeline | S3 events + SQS + Lambda | Step Functions, Airflow (MWAA), Temporal | Step Functions when the workflow has explicit state machines worth modelling. |
| Lineage / catalog | Custom (small) on DynamoDB | OpenLineage + Marquez (when at scale) | For engagement-size corpora, a custom lineage table beats running a full catalog service. |
03 — Models
| Slot | Default | Alternatives | Notes |
| Reasoning / synthesis | Claude (via Bedrock) | GPT-4-class (Anthropic API, Azure OpenAI), Llama 3.3 (Bedrock) | Bedrock-hosted by default for VPC posture. Anthropic API directly when the customer is comfortable with the egress. |
| Embeddings | Titan Embed Text V2 (Bedrock) | Cohere Embed v3, Voyage AI | Pick by running both on the customer's domain eval. Versioned in the index. |
| Re-ranker | Cohere Rerank (when needed) | Voyage Rerank, hosted Bedrock equivalents | Off by default; on when the eval shows it earns its keep. |
| Small / classification | Claude Haiku (fast, cheap) | Cohere Command R, Llama-class small | For deterministic classification tasks where a frontier model is overkill. |
04 — Retrieval
| Slot | Default | Alternatives | Notes |
| Index | OpenSearch Serverless (vector + BM25 in one index) | pgvector + PG FTS, Pinecone, Weaviate | OpenSearch when starting clean. pgvector when the customer already runs Postgres and wants one fewer service. |
| Retrieval pattern | Hybrid dense + BM25, fused with RRF (k=60) | Dense-only (small corpora), lexical-only (rare) | RRF is ~10 lines of code and is the single highest-leverage quality move. |
| Chunking | Custom, layout-aware per corpus | Framework defaults (last resort only) | Carry document ID + section anchor + byte range as metadata on every chunk. |
| Permission filter | Pre-filter on identity tags at the index | (none — post-retrieval filtering is a compliance leak) | See Layer 04 for the full reasoning. |
05 — Orchestration
| Slot | Default | Alternatives | Notes |
| Pattern | Single-turn pipeline (retrieve → ground → answer) | Agent (Bedrock Agents, LangGraph, MCP) | Agents only when the workflow actually requires multi-step reasoning with tool use. |
| Runtime | Lambda (single-turn) / Step Functions (multi-step) | Temporal, LangGraph runtime | Temporal when the workflow has long-running human-in-the-loop branches. |
| Tool protocol | Claude tool-use (Bedrock or direct) + MCP for portable servers | OpenAI function calling, custom | MCP for tools we want to share across customers; native tool-use for engagement-specific. |
| Caching | Prompt caching (Anthropic) + ElastiCache for retrieval results | — | Prompt caching is the single biggest cost lever once a system is in steady state. |
06 — Tools
| Slot | Default | Alternatives | Notes |
| Internal tools | Custom Lambdas / Fargate tasks behind named IAM roles | — | No "tool marketplace" pre-built integrations. Each tool has a named owner and an audit log. |
| Side-effect boundary | Human-in-the-loop confirmation for any production-mutating call | Allow-list of safe reversible calls | Reversibility is the deciding factor. Mutating prod records always needs sign-off. |
| Audit log | Every tool call → DynamoDB + S3 (long-term) | CloudTrail (for AWS-API tools) | Log the prompt + tool call + result, not just the call. |
07 — Evaluation
| Slot | Default | Alternatives | Notes |
| Harness | Custom on top of pytest + a small scoring library | Promptfoo, Ragas, OpenAI Evals | The harness is the contract; the customer owns it. We use whatever framework lets them re-run it without us. |
| Test set authoring | Co-authored with the customer's domain experts | — | Synthetic test sets miss the cases that actually matter. Real questions from real users. |
| Scoring | Rubric-based, LLM-as-judge with human spot-checks | Exact match, semantic similarity | LLM-as-judge with a strict rubric. Validate the rubric on a human-scored sample before trusting it at scale. |
| Trigger | CI on every PR + scheduled nightly + on every model bump | — | The harness reruns automatically. If it does not, it stops being the contract. |
08 — Observability
| Slot | Default | Alternatives | Notes |
| Metrics + dashboards | CloudWatch (default), Datadog (when customer already has it) | Grafana / Prometheus | One dashboard for finance and engineering — token spend next to p95 latency. |
| Prompt-level tracing | LangSmith (Anthropic-friendly) or Langfuse (self-host) | OpenLLMetry, custom | Langfuse when the customer wants the tracing data inside their boundary. |
| Cost | AWS Cost Explorer + Bedrock per-model breakdown | Vendor (Vantage, CloudHealth) | Tag every Bedrock call with engagement / customer / workload to slice spend. |
| Alerting | CloudWatch alarms on the four AI metrics (token spend, refusal rate, tool-call success, retrieval hit rate) | Datadog monitors | Alert on AI-specific metrics, not just CPU/memory. |
09 — Governance
| Slot | Default | Alternatives | Notes |
| PII detection | AWS Comprehend (default), Presidio (self-host) | — | Detect on the way in, mask on the way out for unauthorized users. |
| Prompt-injection defense | Layered: input sanitization + system prompt structure + tool-call whitelisting | Lakera Guard, Rebuff | No tool call from text extracted from a retrieved document. Document content is data, never an instruction source. |
| Audit trail | Inference logs → KMS-encrypted CloudWatch → S3 archive (long-term) | — | Customer-managed KMS key. Retention based on compliance regime, not "indefinite." |
| Policy as code | OPA / Rego when the policy has enough branches to warrant it | Custom Lambda authorizers | For simple allow-lists, a custom Lambda is cheaper than introducing OPA. |
Things deliberately not in the catalog
We get asked about these often enough that explaining why they are
not our default is worth doing in writing.
- End-to-end RAG-as-a-service vendors — the bought black box. The promise is generic ingestion + retrieval; the reality is debugging is impossible because the chunks and embeddings are not yours. See Layer 04 for the full reasoning.
- Self-hosted foundation models at engagement scale — the GPU TCO almost never beats Bedrock's per-token at the volumes a customer engagement produces. Real exceptions exist but they are rare.
- Tool marketplaces for production agentic systems — pre-built SaaS integrations that bypass the customer's permission model. Fine for personal-productivity agents, wrong shape for production.
- Homegrown orchestration runtime — the "the bought one is too opinionated" trap. Eighteen months later the homegrown runtime is the most fragile thing in production.
Versioning
This catalog is updated as we change our picks. Significant changes
are dated and called out below. Minor refinements are quiet.
- 2026-06 — Initial publication. Defaults reflect the stack we run today across active Quantum Leap engagements.
If you spot a tool we do not list that we should evaluate, send a
note — happy to talk about why we have or have not landed on it.