Tooling catalog — what we use at each layer | Orion Research

The tools Orion currently uses at each layer of the AI stack — defaults, alternatives, and the rationale for each. Versioned, opinionated, updated as the field shifts.

This is the working catalog of tools we currently pick at each layer of the AI stack on customer engagements. It is opinionated on purpose — neutrality is not useful when a customer needs to make a decision in the next two weeks. We update it as the field shifts and as new tools earn their place against the existing defaults.

The shape: for each layer, the default we reach for when starting clean, the alternatives we will use when the default does not fit, and the notes that explain why. None of this is universal — engagements have constraints (existing investments, regulatory posture, team skills) that legitimately change the answer. The defaults are the starting position.

The full layer reference architectures live in the per-layer Research pages — start with Infrastructure (Layer 01) and Retrieval (Layer 04), which are live today; the rest are being written.

01 — Infrastructure

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Cloud | AWS | Azure, GCP (customer-driven) | WOSB + AWS Select Partner posture. Deep CDK + Sprintsail bench. |
| IaC | CDK (TypeScript) + Sprintsail primitives | Terraform, Pulumi | Sprintsail for the application surface (~80%), raw CDK for the AI-specific resources (Bedrock, OpenSearch, KMS). |
| Compute (orchestration) | Lambda (Node.js / Python) | ECS Fargate, EKS | Lambda for the API + orchestration path. ECS when long-running parsers or per-request state push past Lambda limits. |
| Secrets | AWS Secrets Manager | Parameter Store (SecureString) | Customer-managed KMS key, automatic rotation where the secret supports it. |
| Identity | Cognito (customer-facing) + IAM roles (service-to-service) | SSO / OIDC integration to existing IdP | Cross-account roles, never shared SDK credentials. |

02 — Data

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Object storage | S3 (KMS-encrypted, customer-managed key) | Azure Blob, GCS | Versioned bucket. Object Lock for the raw corpus on regulated workloads. |
| Relational | RDS Postgres (Aurora Serverless v2 when bursty) | RDS MySQL, DynamoDB (for non-relational) | Postgres carries pgvector well if the customer wants a single DB story. |
| Document parsing | Textract (scanned), Unstructured / LlamaParse (born-digital) | Mistral OCR, Azure Document Intelligence | Keep raw bytes + parsed form side-by-side, content-hash addressable. |
| Pipeline | S3 events + SQS + Lambda | Step Functions, Airflow (MWAA), Temporal | Step Functions when the workflow has explicit state machines worth modelling. |
| Lineage / catalog | Custom (small) on DynamoDB | OpenLineage + Marquez (when at scale) | For engagement-size corpora, a custom lineage table beats running a full catalog service. |
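The "content-hash addressable" note on document parsing can be sketched as follows. This is a minimal illustration, not our production layout: the `raw/` and `parsed/` key prefixes, the function names, and the record fields are all hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def content_key(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(raw).hexdigest()


def storage_keys(raw: bytes) -> dict[str, str]:
    """Derive side-by-side object keys for the raw bytes and the parsed form.

    The 'raw/' and 'parsed/' prefixes are hypothetical; any layout works as
    long as both keys are derived from the same digest.
    """
    digest = content_key(raw)
    return {"raw": f"raw/{digest}", "parsed": f"parsed/{digest}.json"}


def lineage_record(raw: bytes, parser: str, parser_version: str) -> dict:
    """Build a small lineage row (hypothetical fields) for a custom table."""
    keys = storage_keys(raw)
    return {
        "content_sha256": content_key(raw),
        "raw_key": keys["raw"],
        "parsed_key": keys["parsed"],
        "parser": parser,
        "parser_version": parser_version,
    }
```

Because the keys depend only on the bytes, re-ingesting an unchanged document maps to the same objects, and a changed document gets a new address rather than overwriting the old one.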

03 — Models

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Reasoning / synthesis | Claude (via Bedrock) | Claude (direct Anthropic API), GPT-4-class (Azure OpenAI), Llama 3.3 (Bedrock) | Bedrock-hosted by default for VPC posture. Anthropic API directly when the customer is comfortable with the egress. |
| Embeddings | Titan Embed Text V2 (Bedrock) | Cohere Embed v3, Voyage AI | Pick by running the candidates on the customer's domain eval. Record the embedding model version in the index. |
| Re-ranker | Cohere Rerank (when needed) | Voyage Rerank, hosted Bedrock equivalents | Off by default; on when the eval shows it earns its keep. |
| Small / classification | Claude Haiku (fast, cheap) | Cohere Command R, Llama-class small | For deterministic classification tasks where a frontier model is overkill. |
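Picking an embedding model on the customer's domain eval can be as simple as comparing recall@k across candidates. The sketch below assumes an eval set of (question, relevant document IDs) pairs and a `search` callable per candidate that returns ranked document IDs; both shapes, and the function names, are assumptions for illustration.

```python
from typing import Callable, Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_recall(
    eval_set: Sequence[tuple[str, set[str]]],
    search: Callable[[str], Sequence[str]],
    k: int = 5,
) -> float:
    """Average recall@k over the eval set for one candidate's search function."""
    scores = [recall_at_k(search(question), relevant, k) for question, relevant in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `mean_recall` once per candidate, with the same eval set and the same `k`, gives a like-for-like number to compare before any re-ranker is switched on.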

04 — Retrieval

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Index | OpenSearch Serverless (vector + BM25 in one index) | pgvector + PG FTS, Pinecone, Weaviate | OpenSearch when starting clean. pgvector when the customer already runs Postgres and wants one fewer service. |
| Retrieval pattern | Hybrid dense + BM25, fused with RRF (k=60) | Dense-only (small corpora), lexical-only (rare) | RRF is ~10 lines of code and is the single highest-leverage quality move. |
| Chunking | Custom, layout-aware per corpus | Framework defaults (last resort only) | Carry document ID + section anchor + byte range as metadata on every chunk. |
| Permission filter | Pre-filter on identity tags at the index | (none; post-retrieval filtering is a compliance leak) | See Layer 04 for the full reasoning. |
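For reference, reciprocal rank fusion with k=60 looks roughly like this. It is a minimal sketch of the standard formula (each document scores the sum of 1 / (k + rank) over the lists it appears in); the function name, the `top_n` parameter, and the tie-break on document ID are our illustrative choices.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Ranks are 1-based. Ties are broken by document ID so the output is
    deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


dense_hits = ["a", "b", "c"]  # hypothetical dense (vector) ranking
bm25_hits = ["b", "d", "a"]   # hypothetical BM25 ranking
fused = rrf_fuse([dense_hits, bm25_hits])  # ['b', 'a', 'd', 'c']
```

RRF only needs ranks, not scores, so the dense and lexical score scales never have to be normalized against each other.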

05 — Orchestration

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Pattern | Single-turn pipeline (retrieve → ground → answer) | Agent (Bedrock Agents, LangGraph, MCP) | Agents only when the workflow actually requires multi-step reasoning with tool use. |
| Runtime | Lambda (single-turn) / Step Functions (multi-step) | Temporal, LangGraph runtime | Temporal when the workflow has long-running human-in-the-loop branches. |
| Tool protocol | Claude tool-use (Bedrock or direct) + MCP for portable servers | OpenAI function calling, custom | MCP for tools we want to share across customers; native tool-use for engagement-specific. |
| Caching | Prompt caching (Anthropic) + ElastiCache for retrieval results |  | Prompt caching is the single biggest cost lever once a system is in steady state. |
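The single-turn pattern (retrieve → ground → answer) is small enough to sketch in full. The version below injects the retriever and the model call as plain callables so it stays independent of any SDK; the `Passage` type, the prompt wording, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Passage:
    doc_id: str
    text: str


def build_grounded_prompt(question: str, passages: Sequence[Passage]) -> str:
    """Number the retrieved passages and instruct the model to stay inside them."""
    sources = "\n".join(
        f"[{i}] ({p.doc_id}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources. Cite them as [n].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str], Sequence[Passage]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Retrieve, ground, answer: one pass, no loops, no tool use."""
    passages = retrieve(question)[:top_k]
    return generate(build_grounded_prompt(question, passages))
```

In a Lambda deployment, `retrieve` would wrap the index query and `generate` the model call; keeping them as parameters also makes the pipeline easy to exercise from an eval harness with stubs.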

06 — Tools

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Internal tools | Custom Lambdas / Fargate tasks behind named IAM roles |  | No "tool marketplace" pre-built integrations. Each tool has a named owner and an audit log. |
| Side-effect boundary | Human-in-the-loop confirmation for any production-mutating call | Allow-list of safe reversible calls | Reversibility is the deciding factor. Mutating prod records always needs sign-off. |
| Audit log | Every tool call → DynamoDB + S3 (long-term) | CloudTrail (for AWS-API tools) | Log the prompt + tool call + result, not just the call. |
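The side-effect boundary and the audit log fit naturally into one wrapper around tool execution. The sketch below is an assumption-heavy illustration: the tool names in `REVERSIBLE_TOOLS`, the `approve` callback standing in for human sign-off, and the `audit_sink` callable standing in for the DynamoDB / S3 writers are all hypothetical.

```python
import json
import time
from typing import Any, Callable, Mapping

# Hypothetical allow-list of tools whose effects are safe and reversible.
REVERSIBLE_TOOLS = {"search_tickets", "draft_reply"}


def run_tool(
    name: str,
    args: dict,
    prompt: str,
    tools: Mapping[str, Callable[..., Any]],
    approve: Callable[[str, dict], bool],
    audit_sink: Callable[[str], None],
) -> Any:
    """Run a tool behind the side-effect boundary and audit the attempt.

    Allow-listed tools run directly; everything else needs approve() to
    return True. The prompt, the call, and the result are logged together,
    including for refused calls.
    """
    approved = name in REVERSIBLE_TOOLS or approve(name, args)
    result = tools[name](**args) if approved else None
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "tool": name,
        "args": args,
        "approved": approved,
        "result": result,
    }
    audit_sink(json.dumps(record, default=str))
    if not approved:
        raise PermissionError(f"tool call {name!r} requires human sign-off")
    return result
```

Refused calls are logged before the exception is raised, so the audit trail shows what the model tried to do as well as what it was allowed to do.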

07 — Evaluation

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Harness | Custom on top of pytest + a small scoring library | Promptfoo, Ragas, OpenAI Evals | The harness is the contract; the customer owns it. We use whatever framework lets them re-run it without us. |
| Test set authoring | Co-authored with the customer's domain experts |  | Synthetic test sets miss the cases that actually matter. Real questions from real users. |
| Scoring | Rubric-based, LLM-as-judge with human spot-checks | Exact match, semantic similarity | LLM-as-judge with a strict rubric. Validate the rubric on a human-scored sample before trusting it at scale. |
| Trigger | CI on every PR + scheduled nightly + on every model bump |  | The harness reruns automatically. If it does not, it stops being the contract. |
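Validating the rubric on a human-scored sample comes down to measuring how often the judge agrees with the humans. The sketch below assumes integer rubric scores and a simple agreement rate; the 0.9 threshold is a placeholder assumption, not a figure from our engagements.

```python
from typing import Sequence


def judge_agreement(
    human_scores: Sequence[int],
    judge_scores: Sequence[int],
    tolerance: int = 0,
) -> float:
    """Share of items where the judge lands within `tolerance` of the human score."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be the same length")
    if not human_scores:
        raise ValueError("need at least one human-scored item")
    agree = sum(abs(h - j) <= tolerance for h, j in zip(human_scores, judge_scores))
    return agree / len(human_scores)


def rubric_is_trusted(
    human_scores: Sequence[int],
    judge_scores: Sequence[int],
    min_agreement: float = 0.9,  # hypothetical threshold; set per engagement
) -> bool:
    """Gate: only rely on the LLM judge at scale once agreement clears the bar."""
    return judge_agreement(human_scores, judge_scores) >= min_agreement
```

A check like `assert rubric_is_trusted(human, judge)` can sit in the pytest suite next to the scored cases, so a rubric or judge-model change that drifts from the human sample fails CI.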

08 — Observability

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| Metrics + dashboards | CloudWatch (default), Datadog (when customer already has it) | Grafana / Prometheus | One dashboard for finance and engineering, with token spend next to p95 latency. |
| Prompt-level tracing | LangSmith (Anthropic-friendly) or Langfuse (self-host) | OpenLLMetry, custom | Langfuse when the customer wants the tracing data inside their boundary. |
| Cost | AWS Cost Explorer + Bedrock per-model breakdown | Vendor (Vantage, CloudHealth) | Tag every Bedrock call with engagement / customer / workload to slice spend. |
| Alerting | CloudWatch alarms on the four AI metrics (token spend, refusal rate, tool-call success, retrieval hit rate) | Datadog monitors | Alert on AI-specific metrics, not just CPU/memory. |

09 — Governance

| Slot | Default | Alternatives | Notes |
| --- | --- | --- | --- |
| PII detection | AWS Comprehend (default), Presidio (self-host) |  | Detect on the way in, mask on the way out for unauthorized users. |
| Prompt-injection defense | Layered: input sanitization + system prompt structure + tool-call whitelisting | Lakera Guard, Rebuff | No tool call from text extracted from a retrieved document. Document content is data, never an instruction source. |
| Audit trail | Inference logs → KMS-encrypted CloudWatch → S3 archive (long-term) |  | Customer-managed KMS key. Retention based on compliance regime, not "indefinite." |
| Policy as code | OPA / Rego when the policy has enough branches to warrant it | Custom Lambda authorizers | For simple allow-lists, a custom Lambda is cheaper than introducing OPA. |
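To make the "simple allow-list" case concrete, here is a sketch of the decision function such a custom authorizer might wrap. The roles, actions, and the `ALLOW_LIST` structure are hypothetical, and the wiring into an actual Lambda authorizer (event parsing, policy response) is deliberately left out.

```python
# Hypothetical allow-list: role -> set of permitted actions.
ALLOW_LIST: dict[str, frozenset[str]] = {
    "analyst": frozenset({"query:read"}),
    "admin": frozenset({"query:read", "corpus:write"}),
}


def is_allowed(role: str, action: str, allow_list=ALLOW_LIST) -> bool:
    """Default-deny check: unknown roles and unlisted actions are refused."""
    return action in allow_list.get(role, frozenset())


def authorize(role: str, action: str) -> dict:
    """Return a small decision record (hypothetical shape) suitable for logging."""
    allowed = is_allowed(role, action)
    return {"role": role, "action": action, "effect": "Allow" if allowed else "Deny"}
```

Once the policy needs conditions beyond role and action (time windows, resource attributes, per-tenant overrides), the table stops being readable and OPA / Rego starts to earn its place.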

Things deliberately not in the catalog

We get asked about these often enough that explaining why they are not our default is worth doing in writing.

  • End-to-end RAG-as-a-service vendors: the bought black box. The promise is generic ingestion + retrieval; the reality is that debugging is impossible because the chunks and embeddings are not yours. See Layer 04 for the full reasoning.
  • Self-hosted foundation models at engagement scale: the GPU TCO almost never beats Bedrock's per-token pricing at the volumes a customer engagement produces. Real exceptions exist but they are rare.
  • Tool marketplaces for production agentic systems: pre-built SaaS integrations that bypass the customer's permission model. Fine for personal-productivity agents, wrong shape for production.
  • Homegrown orchestration runtime: the "bought one is too opinionated" trap. Eighteen months later the homegrown runtime is the most fragile thing in production.

Versioning

This catalog is updated as we change our picks. Significant changes are dated and called out below. Minor refinements are made without a changelog entry.

  • 2026-06 — Initial publication. Defaults reflect the stack we run today across active Quantum Leap engagements.

If you spot a tool we do not list that we should evaluate, send a note — happy to talk about why we have or have not landed on it.
