Tooling catalog — what we use at each layer

This is the working catalog of tools we currently pick at each layer of the AI stack on customer engagements. It is opinionated on purpose — neutrality is not useful when a customer needs to make a decision in the next two weeks. We update it as the field shifts and as new tools earn their place against the existing defaults.

The shape: for each layer, the default we reach for when starting clean, the alternatives we will use when the default does not fit, and the notes that explain why. None of this is universal — engagements have constraints (existing investments, regulatory posture, team skills) that legitimately change the answer. The defaults are the starting position.

The full layer reference architectures live in the per-layer Research pages — start with Infrastructure (Layer 01) and Retrieval (Layer 04), which are live today; the rest are being written.

01 — Infrastructure

Slot	Default	Alternatives	Notes
Cloud	AWS	Azure, GCP (customer-driven)	WOSB + AWS Select Partner posture. Deep CDK + Sprintsail bench.
IaC	CDK (TypeScript) + Sprintsail primitives	Terraform, Pulumi	Sprintsail for the application surface (~80%), raw CDK for the AI-specific resources (Bedrock, OpenSearch, KMS).
Compute (orchestration)	Lambda (Node.js / Python)	ECS Fargate, EKS	Lambda for the API + orchestration path. ECS when long-running parsers or per-request state push past Lambda limits.
Secrets	AWS Secrets Manager	Parameter Store (SecureString)	Customer-managed KMS key, automatic rotation where the secret supports it.
Identity	Cognito (customer-facing) + IAM roles (service-to-service)	SSO / OIDC integration to existing IdP	Cross-account roles, never shared SDK credentials.

02 — Data

Slot	Default	Alternatives	Notes
Object storage	S3 (KMS-encrypted, customer-managed key)	Azure Blob, GCS	Versioned bucket. Object Lock for the raw corpus on regulated workloads.
Relational	RDS Postgres (Aurora Serverless v2 when bursty)	RDS MySQL, DynamoDB (for non-relational)	Postgres carries pgvector well if the customer wants a single DB story.
Document parsing	Textract (scanned), Unstructured / LlamaParse (born-digital)	Mistral OCR, Azure Document Intelligence	Keep raw bytes + parsed form side-by-side, content-hash addressable.
Pipeline	S3 events + SQS + Lambda	Step Functions, Airflow (MWAA), Temporal	Step Functions when the workflow has explicit state machines worth modelling.
Lineage / catalog	Custom (small) on DynamoDB	OpenLineage + Marquez (when at scale)	For engagement-size corpora, a custom lineage table beats running a full catalog service.

03 — Models

Slot	Default	Alternatives	Notes
Reasoning / synthesis	Claude (via Bedrock)	GPT-4-class (Anthropic API, Azure OpenAI), Llama 3.3 (Bedrock)	Bedrock-hosted by default for VPC posture. Anthropic API directly when the customer is comfortable with the egress.
Embeddings	Titan Embed Text V2 (Bedrock)	Cohere Embed v3, Voyage AI	Pick by running both on the customer's domain eval. Versioned in the index.
Re-ranker	Cohere Rerank (when needed)	Voyage Rerank, hosted Bedrock equivalents	Off by default; on when the eval shows it earns its keep.
Small / classification	Claude Haiku (fast, cheap)	Cohere Command R, Llama-class small	For deterministic classification tasks where a frontier model is overkill.

04 — Retrieval

Slot	Default	Alternatives	Notes
Index	OpenSearch Serverless (vector + BM25 in one index)	pgvector + PG FTS, Pinecone, Weaviate	OpenSearch when starting clean. pgvector when the customer already runs Postgres and wants one fewer service.
Retrieval pattern	Hybrid dense + BM25, fused with RRF (k=60)	Dense-only (small corpora), lexical-only (rare)	RRF is ~10 lines of code and is the single highest-leverage quality move.
Chunking	Custom, layout-aware per corpus	Framework defaults (last resort only)	Carry document ID + section anchor + byte range as metadata on every chunk.
Permission filter	Pre-filter on identity tags at the index	(none — post-retrieval filtering is a compliance leak)	See Layer 04 for the full reasoning.

05 — Orchestration

Slot	Default	Alternatives	Notes
Pattern	Single-turn pipeline (retrieve → ground → answer)	Agent (Bedrock Agents, LangGraph, MCP)	Agents only when the workflow actually requires multi-step reasoning with tool use.
Runtime	Lambda (single-turn) / Step Functions (multi-step)	Temporal, LangGraph runtime	Temporal when the workflow has long-running human-in-the-loop branches.
Tool protocol	Claude tool-use (Bedrock or direct) + MCP for portable servers	OpenAI function calling, custom	MCP for tools we want to share across customers; native tool-use for engagement-specific.
Caching	Prompt caching (Anthropic) + ElastiCache for retrieval results	—	Prompt caching is the single biggest cost lever once a system is in steady state.

06 — Tools

Slot	Default	Alternatives	Notes
Internal tools	Custom Lambdas / Fargate tasks behind named IAM roles	—	No "tool marketplace" pre-built integrations. Each tool has a named owner and an audit log.
Side-effect boundary	Human-in-the-loop confirmation for any production-mutating call	Allow-list of safe reversible calls	Reversibility is the deciding factor. Mutating prod records always needs sign-off.
Audit log	Every tool call → DynamoDB + S3 (long-term)	CloudTrail (for AWS-API tools)	Log the prompt + tool call + result, not just the call.

07 — Evaluation

Slot	Default	Alternatives	Notes
Harness	Custom on top of pytest + a small scoring library	Promptfoo, Ragas, OpenAI Evals	The harness is the contract; the customer owns it. We use whatever framework lets them re-run it without us.
Test set authoring	Co-authored with the customer's domain experts	—	Synthetic test sets miss the cases that actually matter. Real questions from real users.
Scoring	Rubric-based, LLM-as-judge with human spot-checks	Exact match, semantic similarity	LLM-as-judge with a strict rubric. Validate the rubric on a human-scored sample before trusting it at scale.
Trigger	CI on every PR + scheduled nightly + on every model bump	—	The harness reruns automatically. If it does not, it stops being the contract.

08 — Observability

Slot	Default	Alternatives	Notes
Metrics + dashboards	CloudWatch (default), Datadog (when customer already has it)	Grafana / Prometheus	One dashboard for finance and engineering — token spend next to p95 latency.
Prompt-level tracing	LangSmith (Anthropic-friendly) or Langfuse (self-host)	OpenLLMetry, custom	Langfuse when the customer wants the tracing data inside their boundary.
Cost	AWS Cost Explorer + Bedrock per-model breakdown	Vendor (Vantage, CloudHealth)	Tag every Bedrock call with engagement / customer / workload to slice spend.
Alerting	CloudWatch alarms on the four AI metrics (token spend, refusal rate, tool-call success, retrieval hit rate)	Datadog monitors	Alert on AI-specific metrics, not just CPU/memory.

09 — Governance

Slot	Default	Alternatives	Notes
PII detection	AWS Comprehend (default), Presidio (self-host)	—	Detect on the way in, mask on the way out for unauthorized users.
Prompt-injection defense	Layered: input sanitization + system prompt structure + tool-call whitelisting	Lakera Guard, Rebuff	No tool call from text extracted from a retrieved document. Document content is data, never an instruction source.
Audit trail	Inference logs → KMS-encrypted CloudWatch → S3 archive (long-term)	—	Customer-managed KMS key. Retention based on compliance regime, not "indefinite."
Policy as code	OPA / Rego when the policy has enough branches to warrant it	Custom Lambda authorizers	For simple allow-lists, a custom Lambda is cheaper than introducing OPA.

Things deliberately not in the catalog

We get asked about these often enough that explaining why they are not our default is worth doing in writing.

End-to-end RAG-as-a-service vendors — the bought black box. The promise is generic ingestion + retrieval; the reality is debugging is impossible because the chunks and embeddings are not yours. See Layer 04 for the full reasoning.
Self-hosted foundation models at engagement scale — the GPU TCO almost never beats Bedrock's per-token at the volumes a customer engagement produces. Real exceptions exist but they are rare.
Tool marketplaces for production agentic systems — pre-built SaaS integrations that bypass the customer's permission model. Fine for personal-productivity agents, wrong shape for production.
Homegrown orchestration runtime — the "the bought one is too opinionated" trap. Eighteen months later the homegrown runtime is the most fragile thing in production.

Versioning

This catalog is updated as we change our picks. Significant changes are dated and called out below. Minor refinements are quiet.

2026-06 — Initial publication. Defaults reflect the stack we run today across active Quantum Leap engagements.

If you spot a tool we do not list that we should evaluate, send a note — happy to talk about why we have or have not landed on it.

Tooling catalog — what we use at each layer | Orion Research