Layer 08 — Observability

AI systems fail in ways the traditional observability stack does not catch. Latency is fine, error rate is fine, the dashboard is green — and the system has been refusing 40% of valid queries since last Tuesday's silent model bump. The observability layer is what makes the AI-specific failure modes visible. Without it, the operate phase is blind in the dimensions that matter most.

What this layer covers

The four AI-specific metrics every production system should surface
Prompt-level tracing — the trail from request to answer
Cost telemetry — tagged per engagement, per workload, per task
Alerting on AI metrics, not just infrastructure metrics
One dashboard for finance and engineering
Retention strategy for inference logs

The four AI metrics

Alongside the usual latency, error rate, throughput, and resource utilisation, every production AI system surfaces:

Token spend

Per minute, per hour, per day. Broken down by task, by model, by engagement. Token spend is the single biggest cost line in most engagements, and the one most likely to triple silently on a prompt change. Set alert thresholds on rate of change, not just absolute value: "spend is 3x the rolling 7-day average" catches more than "spend is over $X."
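As a rough illustration of why the relative threshold catches more, here is a minimal sketch that compares the two alert styles on a low-volume workload. The function names, the hourly granularity, and the dollar figures are assumptions made for the example; only the 3x multiplier comes from the text.

```python
from statistics import mean

RATIO_THRESHOLD = 3.0  # "3x the rolling 7-day average", taken from the text


def spend_ratio(baseline_hourly_spend, current_hourly_spend):
    """Return current spend divided by the mean of the baseline window."""
    baseline = mean(baseline_hourly_spend)  # raises StatisticsError if empty
    if baseline == 0:
        # No historical spend: any spend at all is an unbounded increase.
        return float("inf") if current_hourly_spend > 0 else 0.0
    return current_hourly_spend / baseline


def relative_alert(baseline_hourly_spend, current_hourly_spend,
                   threshold=RATIO_THRESHOLD):
    """Fire when spend exceeds `threshold` times the rolling baseline."""
    return spend_ratio(baseline_hourly_spend, current_hourly_spend) > threshold


def absolute_alert(current_hourly_spend, ceiling):
    """Fire when spend exceeds a fixed ceiling, shown for comparison."""
    return current_hourly_spend > ceiling


# Hypothetical low-volume workload: $2/hour for seven days, then $6.50/hour.
baseline = [2.0] * (7 * 24)
current = 6.5
fired_relative = relative_alert(baseline, current)
fired_absolute = absolute_alert(current, ceiling=50.0)
```

In this made-up case the spend has more than tripled, so the relative check fires, while the fixed $50/hour ceiling stays silent.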

Refusal rate

The fraction of requests the system refused to answer. A sudden jump in refusal rate is one of the strongest signals that a model or retrieval problem is brewing: the system has lost confidence on a query class it used to handle. Equally important: a sudden drop in refusal rate can mean the system has become overconfident, which is a different kind of failure.

Retrieval hit rate

The fraction of queries where the retrieval layer surfaced at least one chunk above a confidence threshold. A low hit rate points to a coverage problem: the corpus is incomplete or the chunking is wrong. Combined with refusal rate, it lets us tell the difference between "the model is being honest about not knowing" (high refusal + low hit rate) and "the model is hallucinating fluently" (low refusal + low hit rate).
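The definition and the two-signal reading above could be sketched as follows. This is an illustration only: the cut-off values (`hit_low`, `refusal_high`) and the label strings are hypothetical, and the sketch deliberately labels only the two low-hit-rate cases the text describes.

```python
def retrieval_hit_rate(scores_per_query, confidence_threshold):
    """Fraction of queries with at least one chunk scoring above the threshold."""
    if not scores_per_query:
        raise ValueError("need at least one query")
    hits = sum(
        1 for scores in scores_per_query
        if any(score > confidence_threshold for score in scores)
    )
    return hits / len(scores_per_query)


def diagnose(refusal_rate, hit_rate, refusal_high=0.30, hit_low=0.50):
    """Combine the two rates; the cut-offs are assumed, engagement-specific values."""
    if hit_rate >= hit_low:
        return "hit rate not low; this check does not apply"
    if refusal_rate >= refusal_high:
        return "honest about not knowing"
    return "possibly hallucinating fluently"


# Four hypothetical queries, each with the scores of its retrieved chunks.
scores = [[0.9, 0.2], [0.1], [], [0.6]]
rate = retrieval_hit_rate(scores, confidence_threshold=0.5)
```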

Tool-call success rate

For agent-shaped systems. The fraction of tool calls that completed without error and were used in the final answer. A drop means tool contracts are breaking, the model is calling tools with wrong arguments, or the upstream APIs have changed. In agent-shaped systems, this metric is often the earliest signal of a production incident.
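Note that the definition has two conditions: a call counts as a success only if it completed without error and its result was used in the final answer. A minimal sketch, assuming a hypothetical `ToolCall` record with those two flags:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical per-call record; field names are assumptions."""
    name: str
    errored: bool
    used_in_answer: bool


def tool_call_success_rate(calls):
    """Fraction of calls that both completed without error and were used."""
    if not calls:
        return None  # no tool calls in the window, so the rate is undefined
    successes = sum(1 for c in calls if not c.errored and c.used_in_answer)
    return successes / len(calls)


calls = [
    ToolCall("search_orders", errored=False, used_in_answer=True),
    ToolCall("search_orders", errored=True, used_in_answer=False),
    ToolCall("get_invoice", errored=False, used_in_answer=False),  # ran, but ignored
    ToolCall("get_invoice", errored=False, used_in_answer=True),
]
success_rate = tool_call_success_rate(calls)
```

Counting the ignored call as a failure is what lets the metric surface cases where a tool technically works but no longer returns anything the model can use.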

Default reference architecture

Metric pipeline

Every inference Lambda emits structured logs to CloudWatch Logs, tagged with: task ID, model ID + version, engagement, workload, retrieval hit count, refusal flag, tool calls made, tokens in, tokens out, estimated cost. CloudWatch Metric Filters extract the numeric metrics; the AI-specific ones go to a dedicated namespace so they live alongside but not inside the infrastructure metrics.
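One way the per-inference record might look is sketched below. The JSON key names and the sample values are assumptions chosen to mirror the tag list above; the sketch relies only on the fact that anything a Lambda function writes to standard output lands in CloudWatch Logs.

```python
import json
import sys


def build_inference_log(*, task_id, model_id, model_version, engagement,
                        workload, retrieval_hit_count, refused, tool_calls,
                        tokens_in, tokens_out, estimated_cost_usd):
    """Assemble one structured record per inference call."""
    return {
        "task_id": task_id,
        "model_id": model_id,
        "model_version": model_version,
        "engagement": engagement,
        "workload": workload,
        "retrieval_hit_count": retrieval_hit_count,
        "refused": refused,
        "tool_calls": list(tool_calls),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "estimated_cost_usd": estimated_cost_usd,
    }


def emit(record, stream=sys.stdout):
    """Write the record as a single JSON line and return that line."""
    line = json.dumps(record, sort_keys=True)
    stream.write(line + "\n")
    return line


# Hypothetical values throughout.
line = emit(build_inference_log(
    task_id="summarise-contract", model_id="example-model", model_version="1",
    engagement="acme", workload="legal-review", retrieval_hit_count=3,
    refused=False, tool_calls=["lookup_clause"], tokens_in=1200,
    tokens_out=350, estimated_cost_usd=0.0125,
))
```

Keeping every field at the top level of a single JSON line keeps the numeric fields easy for a metric filter to address by name.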

Datadog when the customer already has it. Grafana + Prometheus when the customer prefers self-hosted. The substrate is interchangeable; the metric definitions are not.

Prompt-level tracing

LangSmith (Anthropic-friendly, hosted), Langfuse (self-host), or OpenLLMetry depending on the customer's data residency posture. Each trace captures: the conversation turn, the prompt sent, the retrieved chunks, the model response, the tool calls dispatched, the latency at each step, and the cost. Sampled at 10-20% in production; 100% in staging and during incident windows.
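The sampling policy can be made deterministic by hashing the trace ID, so that every component handling the same request reaches the same keep-or-drop decision. The sketch below is one possible approach, not a feature of any of the tools named above; `should_trace`, the environment strings, and the default rate are assumptions.

```python
import hashlib


def should_trace(trace_id, environment, incident_active=False,
                 production_rate=0.10):
    """Decide whether to record a full trace for this request.

    Staging and incident windows are traced at 100%; otherwise a stable
    hash of the trace ID is compared against the sampling rate.
    """
    if environment == "staging" or incident_active:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < production_rate


# The same trace ID always gets the same decision.
decision = should_trace("trace-0001", "production", production_rate=0.15)
repeat = should_trace("trace-0001", "production", production_rate=0.15)
```

A hash-based decision also means a trace kept at 10% is still kept when the rate is raised to 20%, so raising the rate never drops traces that were previously sampled.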

The trace is what makes post-incident review possible. "Why did the system return that answer six weeks ago" is a query against the trace store, not a guessing game.

Cost telemetry

Every Bedrock call is tagged at IAM level with the engagement, workload, and task ID. Cost Explorer breaks down spend on the same dimensions. Tagged cost is non-negotiable from day one: retrofitting tags onto a year of CloudTrail data is a project, whereas tagging at the start is a config line.

AWS Cost Explorer for the rollup, Bedrock per-model breakdowns in CloudWatch for the detail. Third-party (Vantage, CloudHealth) when the customer's finance team already has a tool they prefer. The data is the same; the dashboard front-end varies.
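To show what the tags buy, here is a small sketch that rolls estimated cost up along the same three dimensions from per-call records. It works on plain dictionaries rather than any AWS API; the key names and sample values are assumptions.

```python
from collections import defaultdict


def rollup_cost(records, dimensions=("engagement", "workload", "task_id")):
    """Sum estimated cost per unique combination of tag values."""
    totals = defaultdict(float)
    for record in records:
        # A record missing any tag raises KeyError: untagged spend
        # cannot be attributed, which is the point of tagging early.
        key = tuple(record[d] for d in dimensions)
        totals[key] += record["estimated_cost_usd"]
    return dict(totals)


# Hypothetical per-call records.
records = [
    {"engagement": "acme", "workload": "legal-review", "task_id": "summarise",
     "estimated_cost_usd": 0.25},
    {"engagement": "acme", "workload": "legal-review", "task_id": "summarise",
     "estimated_cost_usd": 0.5},
    {"engagement": "acme", "workload": "legal-review", "task_id": "classify",
     "estimated_cost_usd": 0.125},
]
by_task = rollup_cost(records)
by_engagement = rollup_cost(records, dimensions=("engagement",))
```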

One dashboard, two audiences

The dashboard the customer's finance team sees and the dashboard the engineering team sees are the same dashboard. Token spend next to p95 latency. Refusal rate next to error rate. When the AI metrics and the infrastructure metrics live on different surfaces, finance and engineering get different answers to "is the system healthy." Both are right: the system is healthy in one frame and unhealthy in the other. The shared dashboard is what forces the conversation to be one conversation. See Principle 06.

Alerting

Alarms on the four AI metrics, in addition to the usual infrastructure ones. Specifically:

Token spend rate > 3x rolling 7-day average
Refusal rate change > 15 percentage points in a 1-hour window
Retrieval hit rate drop > 10 percentage points sustained over 30 minutes
Tool-call success rate drop > 5 percentage points sustained over 30 minutes

Thresholds are engagement-specific; the shape is universal. The alert routes to the on-call engineer for the workload, with the relevant trace IDs already attached so the investigation can start in the trace store, not in CloudWatch search.
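The four rules could be evaluated along these lines. This is a sketch under stated assumptions: rates are expressed in percentage points (0 to 100), "sustained" is read as "every sample in the 30-minute window is below the baseline by more than the threshold", and the dictionary keys are hypothetical. The default thresholds are the ones listed above and would be overridden per engagement.

```python
DEFAULT_THRESHOLDS = {
    "spend_multiplier": 3.0,   # x rolling 7-day average
    "refusal_change": 15.0,    # percentage points, 1-hour window
    "hit_rate_drop": 10.0,     # percentage points, sustained 30 minutes
    "tool_success_drop": 5.0,  # percentage points, sustained 30 minutes
}


def sustained_drop(baseline, samples, min_drop):
    """True if every sample in the window is more than min_drop below baseline."""
    return bool(samples) and all(baseline - s > min_drop for s in samples)


def evaluate_alerts(m, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of the AI-metric alarms that should fire."""
    fired = []
    if m["spend_rate"] > thresholds["spend_multiplier"] * m["spend_7d_avg"]:
        fired.append("token_spend")
    # Refusal rate alerts in both directions: jumps and drops both matter.
    if abs(m["refusal_now"] - m["refusal_1h_ago"]) > thresholds["refusal_change"]:
        fired.append("refusal_rate")
    if sustained_drop(m["hit_rate_baseline"], m["hit_rate_last_30m"],
                      thresholds["hit_rate_drop"]):
        fired.append("retrieval_hit_rate")
    if sustained_drop(m["tool_success_baseline"], m["tool_success_last_30m"],
                      thresholds["tool_success_drop"]):
        fired.append("tool_call_success_rate")
    return fired


# Hypothetical snapshot of one workload.
snapshot = {
    "spend_rate": 7.0, "spend_7d_avg": 2.0,
    "refusal_now": 30.0, "refusal_1h_ago": 10.0,
    "hit_rate_baseline": 80.0, "hit_rate_last_30m": [68.0, 66.0, 69.0],
    "tool_success_baseline": 95.0, "tool_success_last_30m": [93.0, 92.0, 91.0],
}
fired = evaluate_alerts(snapshot)
```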

Build vs. buy at this layer

Default: buy. Datadog, CloudWatch, Honeycomb, LangSmith, Langfuse. The vendors are better at metric storage, dashboarding, and trace UI than we will be in year one. The differentiator at this layer is not the storage; it is what we choose to instrument and what we choose to alert on. Spend the engineering hours on instrumenting the four AI metrics correctly, not on rolling our own metric pipeline.

Things to build:

The metric definitions. What does "token spend" mean for this engagement? Which calls count? Which models?
The tag schema. Engagement, workload, task ID, identity tags. Consistent across every emitted log.
The alert thresholds. Engagement-specific. Tuned over the first two weeks of production traffic, not set arbitrarily.
The dashboards. The finance-and-engineering shared view. Reflects the actual cost and quality model, not the vendor's defaults.
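For the tag schema in particular, consistency is easiest to enforce with a check at emit time. The sketch below covers only the three tags the text names explicitly; the key names are assumptions, and any identity tags would be added to `REQUIRED_TAGS` once the engagement defines them.

```python
REQUIRED_TAGS = ("engagement", "workload", "task_id")


def missing_tags(record, required=REQUIRED_TAGS):
    """Return the required tags that are absent or blank in a log record."""
    missing = []
    for tag in required:
        value = record.get(tag)
        if value is None or not str(value).strip():
            missing.append(tag)
    return missing


# Hypothetical records: one complete, one with gaps.
complete = {"engagement": "acme", "workload": "legal-review", "task_id": "summarise"}
incomplete = {"engagement": "acme", "workload": "  "}
gaps = missing_tags(incomplete)
```

Running a check like this in the emitting code, and failing loudly in staging, is one way to keep untagged records from reaching production.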

The five mistakes we see

1. Generic APM, no AI metrics

Datadog set up beautifully for latency and error rate, no token spend graph, no refusal rate, no retrieval hit rate. The AI-specific failure modes are invisible. The team learns about them from a customer escalation.

2. Cost telemetry retrofitted

Spend goes up; nobody can break it down by engagement or task because the Bedrock calls were not tagged at IAM level. Retrofitting tags onto a year of historical data is a project at best. Tag from day one.

3. 100% trace sampling forever

Storing every prompt + response + retrieval trace forever. Costs surface a year later and the customer asks why observability is the second-largest line item after Bedrock itself. Sample at 10-20% in steady state, 100% during incidents, with retention matched to the audit regime.

4. Two dashboards

Finance gets one, engineering gets another. They tell different stories. The conversation about "is the system healthy" goes sideways every time. Shared dashboard from day one.

5. Alert thresholds based on absolute values

"Alert when spend exceeds $X / hour." Misses the case where spend triples because of a prompt change but stays under $X / hour because volume is low. Alert on rate of change, on rolling baselines, on percentile shifts — not on bare absolutes.

How it connects to the other layers

Observability sees signals from every other layer. Retrieval hit rate is a Layer 04 signal. Token spend by task is shaped by Layer 03's per-task model selection. Tool-call success rate comes from Layer 06. The trace store is what makes Layer 09's audit trail possible. The four AI metrics, taken together, are the early warning system for the rest of the stack drifting.

Without observability, the eval harness tells you the system was correct on the test set but says nothing about whether production traffic is staying inside the test-set distribution. The harness scores quality; observability scores behaviour in the wild.
