What this guide covers
Running an AI agent in production is a fundamentally different problem from running a deterministic web service. A REST endpoint either returns the correct JSON or it throws a 500. An agent can spend ten seconds and a pound's worth of LLM tokens, silently return a confident-sounding but factually wrong answer, and your monitoring dashboard will show a healthy green request. No exception raised. No error logged. No alert fired. Just a user who quietly stopped trusting your product.
This guide is for engineers who have an agent running — or nearly running — in production and need to debug it systematically rather than by guesswork. We will cover:
- Why distributed tracing is the right primitive for observing agents, and why logging and metrics alone fall short
- A concise introduction to OpenTelemetry: spans, traces, and context propagation
- How to instrument your LLM calls and tool calls as first-class spans with cost and token attributes
- Head-based versus tail-based sampling — and why the economics heavily favour tail sampling for agent workloads
- How to propagate cost tags from the HTTP request all the way into every LLM call, giving you per-user and per-tenant spend attribution
- A practical comparison of Jaeger, Grafana Tempo, Honeycomb, and Braintrust as backends, including free tiers and the AWS regions relevant for teams in India and the UK
- The common pitfalls — leaked context, zombie spans, and noisy auto-instrumentation — that trip up engineers new to OTel
- A real case study in which tail sampling caught a silent retrieval regression that traditional monitoring missed entirely
By the end, you will have a production-ready instrumentation pattern you can drop into any Python agent, along with a clear framework for choosing a backend and a sampling strategy that fits your traffic volume and budget.
Why debugging agents without traces is guesswork
Consider a typical multi-step agent: the user sends a question, the agent calls a retrieval function to fetch relevant documents, passes those documents with the question to an LLM, the LLM decides to call a tool (say, a SQL query executor), the result comes back, and the LLM synthesises a final answer. That is five or six distinct operations chained together, each with its own latency, its own cost, and its own failure modes.
Now ask yourself: when a user reports "the agent gave me the wrong answer", which step failed? Was the retrieval step returning irrelevant chunks? Did the LLM misread the context? Did the SQL tool return stale data? Did the LLM hallucinate a number when the tool call returned no rows? Without instrumentation, you are guessing. You might add some print statements, reproduce the query locally, and find the answer — or not, because the failure was non-deterministic and does not reproduce on demand.
The deeper problem is that many agent failures produce no error at all. They simply return wrong answers. The agent completed successfully, from the perspective of every health check you have. This is categorically different from a null pointer exception or a database timeout. Traditional observability (error rate, latency p99, request count) is designed for systems that fail loudly. Agents fail quietly.
Metrics can tell you that your average agent response time increased from 1.8 seconds to 2.4 seconds. They cannot tell you which step slowed down, for which class of queries, under which conditions. Logs can capture the input and output of each step if you remember to add the right log lines everywhere, but they give you no causal structure — no way to see that this log line at 14:03:01 caused that log line at 14:03:03. Distributed traces give you the causal tree. You can see the entire execution path for one request: which tool was called when, how long it waited for the LLM, what the LLM received and returned, and what the entire request cost in tokens and money. That structure is what makes agent debugging tractable at scale.
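To make the "causal tree" concrete, here is a minimal sketch that rebuilds the tree for one request from flat span records using their parent IDs. The `SpanRecord` shape and the `build_tree` and `render` helpers are hypothetical names chosen for illustration; they are not part of any OTel API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical flat span record, roughly as a backend might store it."""
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def build_tree(spans: list[SpanRecord]) -> dict[Optional[str], list[SpanRecord]]:
    """Index spans by parent id so each span's children can be looked up directly."""
    children: dict[Optional[str], list[SpanRecord]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def render(children, parent_id=None, depth=0) -> list[str]:
    """Depth-first walk that yields one indented line per span."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


spans = [
    SpanRecord("a", None, "agent.run", 2400),
    SpanRecord("b", "a", "tool.retrieve", 120),
    SpanRecord("c", "a", "llm.call", 1900),
    SpanRecord("d", "a", "tool.sql", 300),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Read top to bottom, the rendered tree answers the question that a 1.8 s to 2.4 s shift in average latency cannot: which child of the request consumed the time.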
OpenTelemetry in three minutes: spans, traces, and context propagation
OpenTelemetry (OTel) is a vendor-neutral observability framework maintained by the Cloud Native Computing Foundation. It defines a standard wire format (OTLP) and SDKs for emitting traces, metrics, and logs, and it is supported by most major observability backends. You write your instrumentation once and switch backends without rewriting a line of application code.
The three concepts you need to internalise before writing any instrumentation code are:
Span. A span is a named, timed operation with a start timestamp, an end timestamp, a set of key-value attributes, and a status (UNSET, OK, or ERROR). Think of it as one unit of work: "call GPT-4o", "execute SQL query", "fetch documents from vector store". Each span has a unique span ID and knows the span ID of its parent (if any).
Trace. A trace is the complete tree of spans for a single request. The root span (no parent) represents the top-level operation — typically the incoming HTTP request. Every downstream call — retrieval, tool invocation, LLM call — becomes a child span. Together they form a directed acyclic graph that represents the full causal history of that request.
Context propagation. When work crosses a process boundary (an HTTP call to a microservice, a message pushed onto a queue, a background task picked up by another worker), OTel serialises the current trace ID and span ID into the request headers using the W3C Trace Context standard. The receiving service reads those headers, creates a child span, and attaches it to the same trace. Without propagation, every cross-process hop breaks the trace and you end up with disconnected fragments.
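To show what actually travels between services, the sketch below parses and rebuilds a W3C `traceparent` header by hand. In practice the OTel propagators do this for you; `parse_traceparent` and `child_traceparent` are hypothetical helpers written only to illustrate the header layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags).

```python
import re
import secrets
from typing import NamedTuple, Optional

# Layout: version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Header to send downstream: same trace id, freshly generated span id."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
```

The trace ID is the only thing that stays constant across hops, which is why losing the header on a single hop splits one request into two unrelated traces.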
Here is the minimal Python setup for an OTel-instrumented agent, using the OTLP gRPC exporter to send spans to any compatible backend:
# requirements: opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource


def configure_otel(service_name: str, otlp_endpoint: str) -> trace.Tracer:
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    processor = BatchSpanProcessor(exporter)
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)


# Initialise once at startup
tracer = configure_otel(
    service_name="production-agent",
    otlp_endpoint="http://localhost:4317",  # or your Grafana/Honeycomb endpoint
)
The BatchSpanProcessor is critical for production. It buffers spans in memory and flushes them asynchronously, adding negligible overhead to your request path. Never use SimpleSpanProcessor in production — it exports spans synchronously, blocking the calling thread.
Instrumenting your LLM agent: wrapping tool calls and LLM calls as spans
The instrumentation pattern for an agent is straightforward in principle: every operation that has meaningful duration, cost, or failure modes becomes a span. In practice, this means at minimum: the top-level agent invocation, each LLM call, and each tool call. If your retrieval step involves multiple sub-operations (embedding, ANN lookup, re-ranking), those deserve their own child spans as well.
The key to useful spans is the attributes you attach. For LLM calls, you want to know the model, the input and output token counts, the cost, and whether the response was truncated. For tool calls, you want the tool name, the input arguments (sanitised if they contain PII), the duration, and the output status. This attribute data is what transforms a timing diagram into a debugging tool.
Define your span attribute keys as constants in a shared module and import them everywhere. Ad-hoc string keys like "llm_tokens" vs "llm.tokens" are the most common source of broken dashboards and failed alert queries. Standardise on the OpenTelemetry semantic conventions for GenAI attributes (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) where they exist — these are increasingly supported natively by backends.
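As a minimal sketch of such a shared module: the three `gen_ai.*` keys below follow the OpenTelemetry GenAI semantic conventions (which are still evolving, so check the current spec), while the module name `span_attrs.py`, the remaining keys, and the `llm_usage_attributes` helper are assumptions made for illustration.

```python
"""span_attrs.py: single source of truth for span attribute keys (hypothetical module)."""

# Keys taken from the OpenTelemetry GenAI semantic conventions
GEN_AI_REQUEST_MODEL = "gen_ai.request.model"
GEN_AI_USAGE_INPUT_TOKENS = "gen_ai.usage.input_tokens"
GEN_AI_USAGE_OUTPUT_TOKENS = "gen_ai.usage.output_tokens"

# Custom keys with no semantic-convention equivalent assumed here
LLM_COST_USD = "llm.cost_usd"
TOOL_NAME = "tool.name"
TOOL_DURATION_MS = "tool.duration_ms"


def llm_usage_attributes(model: str, input_tokens: int,
                         output_tokens: int, cost_usd: float) -> dict:
    """Build the attribute dict for an LLM span from one place, so keys never drift."""
    return {
        GEN_AI_REQUEST_MODEL: model,
        GEN_AI_USAGE_INPUT_TOKENS: input_tokens,
        GEN_AI_USAGE_OUTPUT_TOKENS: output_tokens,
        LLM_COST_USD: round(cost_usd, 6),
    }


attrs = llm_usage_attributes("gpt-4o", 1200, 300, 0.006)
```

A dict built this way can be passed to `span.set_attributes(...)` in one call, and a renamed key becomes a one-line change instead of a search across the codebase.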
Here is a production-grade agent wrapper that instruments both LLM calls and tool calls as spans, with full attribute coverage:
import time
from typing import Any, Callable

from opentelemetry import trace, baggage, context
from opentelemetry.trace import Status, StatusCode

# Reuse the tracer configured at startup
tracer = trace.get_tracer("production-agent")

# Pricing table (USD per 1k tokens); update as models and prices change
LLM_PRICE_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015},
    "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to a placeholder price instead of raising
    pricing = LLM_PRICE_PER_1K.get(model, {"input": 0.002, "output": 0.008})
    return (input_tokens / 1000 * pricing["input"]
            + output_tokens / 1000 * pricing["output"])


def traced_llm_call(
    model: str,
    messages: list[dict],
    llm_client: Any,
    **kwargs,
) -> Any:
    """Wrap any LLM call as an OTel span with full cost attribution."""
    # Baggage is read from the current context, set at the request boundary
    user_id = baggage.get_baggage("user.id") or "anonymous"
    tenant_id = baggage.get_baggage("tenant.id") or "default"
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.message_count", len(messages))
        span.set_attribute("user.id", user_id)
        span.set_attribute("tenant.id", tenant_id)
        # Copy feature context from baggage onto the span
        feature = baggage.get_baggage("feature.name")
        if feature:
            span.set_attribute("feature.name", feature)
        try:
            response = llm_client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs,
            )
            usage = response.usage
            input_tokens = usage.prompt_tokens
            output_tokens = usage.completion_tokens
            cost = calculate_cost(model, input_tokens, output_tokens)
            span.set_attribute("llm.input_tokens", input_tokens)
            span.set_attribute("llm.output_tokens", output_tokens)
            span.set_attribute("llm.cost_usd", round(cost, 6))
            span.set_attribute("llm.finish_reason", response.choices[0].finish_reason)
            span.set_status(Status(StatusCode.OK))
            return response
        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise


def traced_tool_call(
    tool_name: str,
    tool_fn: Callable,
    tool_input: dict,
) -> Any:
    """Wrap a tool invocation as an OTel span."""
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        # Record only the argument names; values may contain PII
        span.set_attribute("tool.input_keys", list(tool_input.keys()))
        start = time.perf_counter()
        try:
            result = tool_fn(**tool_input)
            duration_ms = (time.perf_counter() - start) * 1000
            span.set_attribute("tool.duration_ms", round(duration_ms, 2))
            span.set_attribute("tool.result_type", type(result).__name__)
            span.set_status(Status(StatusCode.OK))
            return result
        except Exception as exc:
            duration_ms = (time.perf_counter() - start) * 1000
            span.set_attribute("tool.duration_ms", round(duration_ms, 2))
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise


def run_agent(user_query: str, user_id: str, tenant_id: str, feature: str = "chat"):
    """Top-level agent span with baggage propagation."""
    ctx = baggage.set_baggage("user.id", user_id)
    ctx = baggage.set_baggage("tenant.id", tenant_id, context=ctx)
    ctx = baggage.set_baggage("feature.name", feature, context=ctx)
    # Attach the context so downstream baggage.get_baggage() calls can see it.
    # Passing ctx only as context= to start_as_current_span would set the
    # span's parent but leave the baggage out of the current context.
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("agent.run") as span:
            span.set_attribute("agent.query", user_query[:200])  # truncate long queries
            span.set_attribute("user.id", user_id)
            span.set_attribute("tenant.id", tenant_id)
            # ... agent orchestration logic here
            pass
    finally:
        context.detach(token)  # always detach to avoid leaking baggage
Notice that the baggage values (user.id, tenant.id, feature.name) are set once at the top of the request. Once the context carrying them is attached, every function called downstream can read them with baggage.get_baggage(), so you do not need to thread these values through every function call manually. Baggage does not appear on spans by itself; the wrappers above copy the values they need into span attributes. This is the design of OTel Baggage: it behaves like a thread-local variable that also travels across process boundaries.
For agents that use frameworks like LangChain, LlamaIndex, or CrewAI, community-maintained auto-instrumentation packages exist (opentelemetry-instrumentation-langchain, openinference-instrumentation-llama-index). They are a reasonable starting point, but they often add more noise than signal for complex agents. Wrapping your own LLM calls as shown above gives you finer control over which attributes you capture and how much you pay in storage costs at the backend.
Tail-based vs head-based sampling — choosing the right strategy
Sampling is necessary because collecting 100% of traces in a high-traffic production system is expensive. A team running 50,000 agent requests per day at an average of 30 spans per trace would generate 1.5 million spans daily. At typical SaaS backend pricing (roughly 0.1 USD per GB of ingested data), that adds up quickly. The question is not whether to sample, but how.
Head-based sampling makes the keep-or-discard decision at the start of a request, before any spans have been created. A common configuration is "keep 10% of all traces". This is simple, cheap, and deterministic. The fatal flaw for agent workloads is that you have no way to guarantee that the traces you keep are the interesting ones. If a retrieval regression affects 2% of queries, a 10% head-based sampler discards 90% of the affected traces, and the ones it keeps amount to roughly 0.2% of total traffic, scattered among unaffected traces with nothing to mark them as unusual. At low volume that is a handful per day, and you are likely to miss the pattern.
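The arithmetic is worth working through once. The sketch below uses the 50,000 requests per day figure from earlier in this section together with the 2% regression and 10% sample rate; `head_sampling_outcome` and `keep_trace` are hypothetical helpers, and `keep_trace` is only similar in spirit to the ratio-based samplers that SDKs ship, not a copy of any of them.

```python
def head_sampling_outcome(requests_per_day: int, affected_fraction: float,
                          sample_rate: float) -> dict[str, int]:
    """Expected daily trace counts under uniform head-based sampling."""
    affected = requests_per_day * affected_fraction
    return {
        "affected_traces": round(affected),
        "affected_kept": round(affected * sample_rate),
        "affected_lost": round(affected * (1 - sample_rate)),
        "total_kept": round(requests_per_day * sample_rate),
    }


def keep_trace(trace_id_hex: str, sample_rate: float) -> bool:
    """Ratio sampler sketch: the decision depends only on the trace id,
    so every service that sees the same id makes the same choice."""
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < sample_rate * 2**64


outcome = head_sampling_outcome(50_000, 0.02, 0.10)
print(outcome)
```

Under these assumptions, 900 of the 1,000 affected traces are discarded each day, and the 100 that survive sit among 5,000 kept traces with no flag distinguishing them.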
Tail-based sampling buffers the entire trace until it completes, then evaluates it against a set of rules before deciding whether to keep or discard it. This means you can guarantee that every trace containing an error span is kept, every trace exceeding your latency threshold is kept, and every trace exceeding a cost threshold is kept. Happy-path traces — the boring, fast, cheap ones — can be sampled aggressively. This is the correct strategy for agents.
The trade-off is that tail sampling requires a stateful collector (the OpenTelemetry Collector with the tail-sampling processor) that buffers spans in memory. For most teams, this means running one or more OTel Collector instances. It adds operational complexity, but the visibility gain is substantial.
| Sampling strategy | Data retention | Misses errors? | Latency overhead | Best for |
|---|---|---|---|---|
| Head-based (100%) | 100% | No | Near zero | Development, low-volume prod (under 1,000 req/day) |
| Head-based (10%) | 10% | Yes — 90% of errors discarded | Near zero | High-volume services where all requests are roughly equivalent |
| Tail-based (error + slow + costly) | 10–30% overall, 100% of errors/slow/costly | No | Buffering delay (typically 10–30s) | Production agents — catches silent failures and regressions |
| Tail-based (probabilistic fallback) | Configurable; typically 5–15% | No | Buffering delay (10–30s) | High-volume production; combined with deterministic rules above |
A practical tail-sampling policy for most agent deployments: keep 100% of traces where any span has status ERROR; keep 100% of traces where total duration exceeds the 95th percentile threshold (establish this from your first week of 100% sampling); keep 100% of traces where llm.cost_usd exceeds your 90th-percentile spend threshold; and probabilistically sample the remaining traces at 10–20%. This typically reduces your stored trace volume to 10–30% of total traffic while retaining every trace that matters for debugging.
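The policy above can be sketched as a plain decision function. This is an illustration of the rule ordering only, not the Collector's implementation; `TraceSummary`, `TailPolicy`, and `p95` are hypothetical names, and the thresholds are assumed to come from your own baseline week.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What the sampler knows once a trace has completed."""
    has_error: bool
    duration_ms: float
    cost_usd: float


@dataclass
class TailPolicy:
    latency_threshold_ms: float  # e.g. p95 duration from a week of 100% sampling
    cost_threshold_usd: float    # e.g. p90 cost from the same baseline
    fallback_rate: float = 0.15  # probabilistic rate for everything else

    def decide(self, t: TraceSummary, rng: random.Random) -> tuple[bool, str]:
        """Return (keep?, reason). Deterministic rules run before the random one."""
        if t.has_error:
            return True, "error"
        if t.duration_ms > self.latency_threshold_ms:
            return True, "slow"
        if t.cost_usd > self.cost_threshold_usd:
            return True, "costly"
        if rng.random() < self.fallback_rate:
            return True, "probabilistic"
        return False, "dropped"


def p95(values: list[float]) -> float:
    """95th percentile via the standard library (needs at least two values)."""
    return statistics.quantiles(values, n=100)[94]


baseline_durations = [float(ms) for ms in range(1, 101)]
policy = TailPolicy(latency_threshold_ms=p95(baseline_durations), cost_threshold_usd=0.05)
decision = policy.decide(TraceSummary(False, 5000.0, 0.01), random.Random())
```

Returning the reason alongside the decision is a small design choice that pays off later: stamping it on the kept trace lets you separate "kept because slow" from "kept by chance" when you compute rates from sampled data.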
Tail sampling requires the OpenTelemetry Collector's tail_sampling processor, which buffers spans in memory until a decision is made. The default decision wait time (decision_wait) is 30 seconds. If your agents have long-running steps (multi-minute background tasks), increase this value and size the collector's memory and trace buffer (num_traces) to match, because every in-flight trace is held for the full wait. Sizing the collector incorrectly, with too little memory for your span volume, causes spans to be dropped silently, which is worse than no sampling at all because it creates misleading gaps in your trace data.
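A back-of-envelope estimate helps with the sizing. The sketch below reuses the 50,000 requests per day and 30 spans per trace from earlier; the 2,000-byte average span size and the 10x peak factor are pure assumptions to replace with your own measurements, and `tail_buffer_estimate` is a hypothetical helper.

```python
def tail_buffer_estimate(requests_per_day: int, spans_per_trace: int,
                         decision_wait_s: float, avg_span_bytes: int,
                         peak_factor: float = 1.0) -> dict[str, float]:
    """Rough number of traces and bytes held while decisions are pending."""
    traces_per_second = requests_per_day / 86_400 * peak_factor
    traces_in_flight = traces_per_second * decision_wait_s
    spans_in_flight = traces_in_flight * spans_per_trace
    return {
        "traces_in_flight": traces_in_flight,
        "spans_in_flight": spans_in_flight,
        "buffer_mb": spans_in_flight * avg_span_bytes / 1_000_000,
    }


estimate = tail_buffer_estimate(
    requests_per_day=50_000,
    spans_per_trace=30,
    decision_wait_s=30,
    avg_span_bytes=2_000,  # assumption: measure your own span size
    peak_factor=10,        # assumption: peak traffic is 10x the daily average
)
print(estimate)
```

With these inputs the buffer holds roughly 174 traces and about 10 MB of span data at peak. The number grows linearly with the wait time, so a ten-minute wait for long-running agents means twenty times the buffer.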
Propagating cost tags for per-user and per-tenant attribution
Knowing that your agent spent $0.03 on LLM tokens for a given request is useful. Knowing that tenant Acme Ltd spent $180 last week and that 60% of their spend came from one feature flag is the information that drives pricing decisions, abuse detection, and capacity planning. The mechanism for attaching that business context to every span — without threading arguments through every function — is OTel Baggage.
Baggage is a key-value store that propagates alongside the trace context. When you set a baggage value at the edge of your system (in the HTTP request handler or the queue consumer), it is automatically available in every span created anywhere downstream in the same request, including in any microservice that receives the propagated trace context. Each child span can read the baggage and copy the relevant values into its own attributes for indexing by the backend.
The cost attribution pattern has three steps: inject baggage at the request boundary, read baggage in the LLM call wrapper and stamp it onto the span, and export an OTel metric (a counter) that accumulates cost by the baggage dimensions.
from typing import Any

from opentelemetry import baggage, context, metrics, trace
from opentelemetry.trace import Status, StatusCode
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# calculate_cost and run_agent are the functions defined in the previous listing
tracer = trace.get_tracer("production-agent")

# Configure metrics alongside traces
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317"),
    export_interval_millis=30_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("production-agent")

# Metric instruments
llm_cost_counter = meter.create_counter(
    name="agent.llm.cost_usd",
    description="Cumulative LLM cost in USD",
    unit="USD",
)
llm_tokens_counter = meter.create_counter(
    name="agent.llm.tokens",
    description="Cumulative LLM token usage",
    unit="tokens",
)


def traced_llm_call_with_metrics(
    model: str,
    messages: list[dict],
    llm_client: Any,
    **kwargs,
) -> Any:
    """LLM call wrapper with both span attributes and metric emission."""
    user_id = baggage.get_baggage("user.id") or "anonymous"
    tenant_id = baggage.get_baggage("tenant.id") or "default"
    feature = baggage.get_baggage("feature.name") or "unknown"
    # Metric dimensions: keep cardinality low (no free-form query text).
    # user.id can still be high-cardinality for a metrics backend; drop it
    # from the metric attributes if you have many users.
    dimensions = {
        "model": model,
        "user.id": user_id,
        "tenant.id": tenant_id,
        "feature.name": feature,
    }
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attributes(dimensions)
        try:
            response = llm_client.chat.completions.create(
                model=model, messages=messages, **kwargs
            )
            usage = response.usage
            input_tok = usage.prompt_tokens
            output_tok = usage.completion_tokens
            cost = calculate_cost(model, input_tok, output_tok)
            span.set_attribute("llm.input_tokens", input_tok)
            span.set_attribute("llm.output_tokens", output_tok)
            span.set_attribute("llm.cost_usd", round(cost, 6))
            # Emit to metrics backend for dashboards and alerts
            llm_cost_counter.add(cost, dimensions)
            llm_tokens_counter.add(input_tok, {**dimensions, "token_type": "input"})
            llm_tokens_counter.add(output_tok, {**dimensions, "token_type": "output"})
            return response
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise


# --- HTTP handler (FastAPI example) ---
from fastapi import FastAPI, Request
from opentelemetry import propagate

app = FastAPI()


@app.post("/agent/run")
async def agent_endpoint(request: Request, body: dict):
    # Extract incoming trace context (if upstream caller is also instrumented)
    ctx = propagate.extract(dict(request.headers))
    # Set business-context baggage at the request boundary
    feature = body.get("feature", "default")
    ctx = baggage.set_baggage("user.id", body["user_id"], context=ctx)
    ctx = baggage.set_baggage("tenant.id", body["tenant_id"], context=ctx)
    ctx = baggage.set_baggage("feature.name", feature, context=ctx)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("http.request.agent_run"):
            # Pass the feature through so run_agent does not fall back to its default
            result = run_agent(body["query"], body["user_id"], body["tenant_id"], feature)
            return {"result": result}
    finally:
        context.detach(token)  # Always detach to avoid context leaks
The call to context.detach(token) in the finally block is not optional. In an async Python server handling concurrent requests, failing to detach a context token means the baggage from one request leaks into subsequent requests handled by the same coroutine or thread. This is one of the most common OTel bugs in production Python agents, and it is entirely silent — the wrong user ID silently gets attributed to the wrong requests.
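The leak is easy to reproduce with the standard library alone. The sketch below uses a plain `contextvars.ContextVar` as a stand-in for OTel's context storage, so it illustrates the mechanism rather than OTel's own API; `current_user`, `handle_leaky`, and `handle_safe` are hypothetical names.

```python
import contextvars

# Stand-in for the context storage that holds baggage (illustrative assumption)
current_user = contextvars.ContextVar("current_user", default="anonymous")


def handle_leaky(user_id: str) -> str:
    """Sets the value and never restores it: the next caller inherits it."""
    current_user.set(user_id)
    return current_user.get()


def handle_safe(user_id: str) -> str:
    """Restores the previous value on the way out, even if the body raises."""
    token = current_user.set(user_id)
    try:
        return current_user.get()
    finally:
        current_user.reset(token)


handle_safe("user-b")
after_safe = current_user.get()    # back to the value from before the call

handle_leaky("user-a")
after_leaky = current_user.get()   # still the value the handler set

print(after_safe, after_leaky)
```

The token returned by `set` plays the same role as the token returned by `context.attach`: it is the only handle for restoring the previous state, so discarding it makes the leak permanent for that execution context.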
On the dashboard side, your cost counter with user.id and tenant.id dimensions flows into Grafana or Honeycomb as a time-series you can group and filter. You can set alerts on per-tenant spend thresholds, build a cost-by-feature breakdown, and catch unusual spend spikes — a runaway agent loop, a misconfigured retry budget — within minutes rather than hours.
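The grouping a dashboard performs can be sketched in a few lines, which is also handy for ad-hoc analysis of exported span data. The record shape, `spend_by`, and `over_budget` are hypothetical; a real backend would run the equivalent query over the `llm.cost_usd` attribute.

```python
from collections import defaultdict
from typing import Iterable


def spend_by(records: Iterable[dict], *keys: str) -> dict[tuple, float]:
    """Sum cost_usd grouped by the given attribute keys."""
    totals: dict[tuple, float] = defaultdict(float)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


def over_budget(totals: dict[tuple, float], budget_usd: float) -> list[tuple]:
    """Groups whose accumulated spend exceeds the budget, sorted for stable output."""
    return sorted(group for group, total in totals.items() if total > budget_usd)


# Hypothetical per-call records, as exported from LLM spans
records = [
    {"tenant.id": "acme", "feature.name": "chat", "cost_usd": 0.5},
    {"tenant.id": "acme", "feature.name": "export", "cost_usd": 1.0},
    {"tenant.id": "globex", "feature.name": "chat", "cost_usd": 0.25},
]
per_tenant = spend_by(records, "tenant.id")
per_feature = spend_by(records, "tenant.id", "feature.name")
flagged = over_budget(per_tenant, budget_usd=1.0)
```

Grouping by `("tenant.id", "feature.name")` is what produces statements such as "60% of this tenant's spend came from one feature".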
Connecting to a backend: Jaeger, Grafana Tempo, Honeycomb, Braintrust
The OTel SDK and collector are backend-agnostic. All four backends discussed here accept spans over the OTLP protocol (gRPC or HTTP), so switching between them is a configuration change rather than an instrumentation change. Your choice of backend is driven by four factors: whether you prefer self-hosted control or managed SaaS, the free tier that matches your traffic volume, the cloud region nearest your users and data residency requirements, and whether you want LLM-specific features built in.
| Backend | Free tier | Cloud regions (Mumbai / London) | LLM-specific features | Setup effort |
|---|---|---|---|---|
| Jaeger (self-hosted) | Unlimited (self-hosted); storage cost only | Any region — deploy on AWS ap-south-1 (Mumbai) or eu-west-2 (London) | None built-in; custom via attributes | Medium — Docker Compose or Helm; needs separate storage (Cassandra or OpenSearch) |
| Grafana Tempo (cloud) | 50 GB traces/month free; 14-day retention | AWS ap-south-1 (Mumbai), eu-west-2 (London) available on paid plans; free tier US-only | None built-in; integrates with Grafana dashboards for custom LLM panels | Low — OTLP endpoint provided; integrate with existing Grafana instance |
| Honeycomb (SaaS) | 20M events/month free; 90-day retention on free plan (verify current tiers at honeycomb.io) | SaaS (US-hosted); EU data residency available on Team/Enterprise | Query by any attribute; BubbleUp for anomaly detection; excellent for high-cardinality LLM attributes | Very low — send OTLP directly; no collector needed for basic setup |
| Braintrust (SaaS) | Generous free tier; verify current limits at braintrust.dev (pricing changes) | SaaS (US-hosted) | Native LLM spans (model, tokens, cost, prompts); prompt playground; evals integrated with traces | Very low — drop-in SDK wrapper; OTel-compatible OTLP endpoint |
For teams in India with data residency requirements, Jaeger self-hosted on AWS ap-south-1 is the most straightforward path to keeping trace data within Indian borders. For teams in the UK concerned about GDPR and data sovereignty, Jaeger on eu-west-2 or Grafana Tempo on an EU-region paid plan are the pragmatic choices. Honeycomb's EU data residency is available from the Team tier upwards, which starts at approximately $130/month — reasonable once you are past the free-tier volume.
Braintrust deserves special mention for teams who are primarily debugging agent behaviour rather than infrastructure. Its native understanding of LLM spans — it renders prompt templates, token counts, and costs in a purpose-built UI — eliminates significant dashboard-building work. The trade-off is that it is US-hosted SaaS with no on-premises option, which disqualifies it for regulated environments. For early-stage teams where debugging speed matters more than data residency, it is the lowest-friction starting point.
Start with Braintrust or Honeycomb free tier to build intuition for what your traces look like. Once you have defined your key span attributes and sampling rules, migrate to Grafana Tempo (if you are already in the Grafana ecosystem) or self-hosted Jaeger (if data residency is a constraint). The OTel abstraction means this migration is a one-line config change in your collector — your application code does not change at all.
Common pitfalls: leaking trace context, zombie spans, noisy spans
Instrumentation bugs are insidious because they do not cause your agent to malfunction — they corrupt your observability data, causing you to draw wrong conclusions about a system that is actually working fine, or to miss real problems in a system that is failing.
Leaking trace context is the most common bug. In async Python (asyncio or FastAPI), the OTel context is stored in a ContextVar. If you attach a context without detaching it in a finally block, it leaks into subsequent coroutines. In practice this means user A's baggage (and therefore their user.id and tenant.id) contaminates user B's traces. Your cost attribution dashboards become wrong. Always use context.attach() paired with context.detach(token) in a try/finally. If you are using thread pools, be aware that ContextVar does not propagate across ThreadPoolExecutor boundaries by default — you must copy the context explicitly using contextvars.copy_context().
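The thread-pool case can be sketched with the standard library. A plain `ContextVar` stands in here for the OTel context (an assumption for illustration), and `tenant` and `read_tenant` are hypothetical names.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

tenant = contextvars.ContextVar("tenant", default="default")


def read_tenant() -> str:
    """Runs in a worker thread and reports which tenant it can see."""
    return tenant.get()


tenant.set("acme")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Submitted directly: the worker does not receive the caller's context here
    without_copy = pool.submit(read_tenant).result()

    # Snapshot the caller's context and run the function inside that snapshot
    ctx = contextvars.copy_context()
    with_copy = pool.submit(ctx.run, read_tenant).result()

print(without_copy, with_copy)
```

On a standard CPython build the first call typically sees the default value while the second sees `acme`. Take a fresh snapshot per task, since one `Context` object cannot be entered by two threads at the same time.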
Zombie spans are spans that are started but never ended, usually because an exception or early return skipped a manual span.end() call. With the start_as_current_span context manager, this should not happen: the context manager records the exception, ends the span, and lets the exception propagate. The zombie pattern typically appears when engineers use the lower-level tracer.start_span() API and call span.end() by hand without a try/finally. A span that is never ended does not get exported. You lose the data silently. Prefer the context manager API in almost all cases.
Noisy auto-instrumentation spans arise when you enable OTel's auto-instrumentation libraries (opentelemetry-instrumentation-requests, opentelemetry-instrumentation-httpx) alongside manual instrumentation. Every HTTP call your agent makes — including the polling loop checking whether an async LLM response is ready — creates a span. You can easily end up with 200 spans per request, 190 of which are noise. The fix is to use auto-instrumentation selectively, or to apply span filters in your OTel Collector configuration to drop spans below a duration threshold (e.g., drop any span under 5 ms that is not marked as an error).
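The filtering rule can be prototyped offline before committing it to Collector configuration. The sketch below applies the "under 5 ms and not an error" rule to exported span dicts; the record shape, `is_noise`, and `drop_noise` are hypothetical, and the extra rule of keeping any span that has children is an added assumption so that dropping a short parent does not orphan its subtree.

```python
def is_noise(span: dict, min_duration_ms: float = 5.0) -> bool:
    """Short and not an error: a candidate for dropping."""
    return span["status"] != "ERROR" and span["duration_ms"] < min_duration_ms


def drop_noise(spans: list[dict], min_duration_ms: float = 5.0) -> list[dict]:
    """Drop noisy leaf spans; keep every span that is some other span's parent."""
    parent_ids = {span["parent_id"] for span in spans}
    return [
        span for span in spans
        if span["span_id"] in parent_ids or not is_noise(span, min_duration_ms)
    ]


spans = [
    {"span_id": "1", "parent_id": None, "name": "agent.run", "duration_ms": 2000, "status": "OK"},
    {"span_id": "2", "parent_id": "1", "name": "http.poll", "duration_ms": 2, "status": "OK"},
    {"span_id": "3", "parent_id": "1", "name": "http.poll", "duration_ms": 1, "status": "ERROR"},
    {"span_id": "4", "parent_id": "1", "name": "llm.call", "duration_ms": 1500, "status": "OK"},
    {"span_id": "5", "parent_id": "1", "name": "tool.wrapper", "duration_ms": 3, "status": "OK"},
    {"span_id": "6", "parent_id": "5", "name": "tool.sql", "duration_ms": 40, "status": "OK"},
]
kept_ids = [span["span_id"] for span in drop_noise(spans)]
```

Running a rule like this over a day of exported traces shows how much volume it removes, and whether it would delete anything you would have wanted during an incident.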
High-cardinality attribute mistakes can be expensive at the backend. Attributes like user.id and tenant.id are fine — they have bounded cardinality. Raw query text, on the other hand, is effectively unbounded. Storing the full user query as a span attribute in a SaaS backend with per-event billing will generate significant cost and make your traces slow to query. Truncate to 200 characters or store a hash; keep the full text in your logs with a correlation ID if you need it for debugging.
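A minimal sketch of the truncate-or-hash approach, using only the standard library. The attribute names (`agent.query_preview`, `agent.query_sha256`, `agent.query_length`) and the `query_attributes` helper are hypothetical; the 16-character digest prefix is an arbitrary choice that trades collision resistance for shorter attribute values.

```python
import hashlib


def query_attributes(query: str, max_chars: int = 200) -> dict[str, object]:
    """Bounded-size span attributes that still let you correlate with full-text logs."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return {
        "agent.query_preview": query[:max_chars],  # readable, bounded length
        "agent.query_sha256": digest[:16],         # stable id for identical queries
        "agent.query_length": len(query),          # cheap signal for oversized inputs
    }


long_query = "What is the liability cap in clause 14? " * 50
attrs = query_attributes(long_query)
```

Logging the same digest next to the full query text gives you the correlation ID mentioned above. Note that a preview can still contain PII, so drop it if your data policy requires that.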
Case study — catching a silent retrieval regression with tail sampling
The following case is representative of a class of production issue we encounter repeatedly in agent deployments (details anonymised). In late 2025, an engineering team at a B2B SaaS company running a RAG-based contract analysis agent encountered a problem that their existing monitoring completely missed. The agent was a multi-step pipeline: the user asked a question about a contract, the agent retrieved relevant clauses using a dense vector search, passed the clauses to an LLM with a structured extraction prompt, and returned a JSON object with specific data fields populated.
Three weeks after a document ingestion schema change — a routine update to how clause boundaries were detected — customer success started receiving complaints that the agent was returning incomplete answers on questions about liability clauses. No error rates had changed. No latency alerts had fired. The agent was returning 200 responses as fast as ever. The only signal was a mild uptick in average LLM output tokens, which nobody had built an alert for.
The team had deployed OTel instrumentation one month earlier, with a tail-sampling policy configured to retain all traces where any span exceeded 4,000ms or where llm.output_tokens exceeded 800. The liability-clause queries were triggering both conditions: the retrieval step was returning semantically poor chunks (because the schema change had split liability clauses across chunk boundaries), the LLM was receiving irrelevant context and spending more tokens expressing uncertainty, and the response time was increasing as the LLM generated longer hedged answers.
Happy-path traces — questions about payment terms and delivery schedules, which were unaffected by the schema change — were being probabilistically sampled at 15% and mostly discarded. But every liability-clause query was being retained by the tail sampler.
When the team searched their trace backend for traces where retrieval.top_score was below 0.7 (an attribute they had added to their vector search span), they found that every liability query had retrieval confidence scores of 0.42–0.55, while the same queries from three weeks earlier (before the schema change, stored in the 90-day retention window) showed scores of 0.78–0.88. The trace timeline showed the retrieval span returning quickly — the right latency — but with low-confidence chunks. The LLM then spent three times as long generating a response that essentially said "the provided context does not contain enough information to answer this confidently".
The root cause was identified in under two hours: the schema change had altered how chunk IDs were generated, causing the embedding index to be partially invalidated. The chunks being served were from before the schema migration. The fix was a targeted re-indexing job for the affected document type. Without the tail-sampled traces retaining exactly the requests that exhibited the problem, the team estimates they would have spent several days in log archaeology before reaching the same conclusion — if they reached it at all without user complaints guiding them.
Conclusion and next steps
AI agents fail in ways that traditional observability was not designed to catch. The combination of OpenTelemetry distributed traces, tail-based sampling, and cost-attributed baggage gives you the observability primitive that matches how agents actually fail: silently, across multiple asynchronous steps, with costs and behaviours that vary by user and query type rather than uniformly across all traffic.
The practical starting point is straightforward. Wrap your LLM calls and tool calls with spans using the start_as_current_span context manager. Add the six key attributes: model, input tokens, output tokens, cost, tool name, and tool duration. Set baggage at your request boundary with user ID, tenant ID, and feature name. Start with 100% sampling to build your baseline, then introduce tail-based rules once you have calibrated your latency and cost thresholds. Pick whichever backend has the lowest friction for your team — Braintrust or Honeycomb to start, Jaeger or Tempo for production data-residency requirements.
Next steps to explore alongside this guide: improving your retrieval pipeline to reduce the low-confidence chunk problem entirely, cutting LLM costs using the spend data your new cost attribution surfaces, and evaluating self-hosted inference once your trace data tells you which models and call patterns dominate your spend.