// Production / LLM Observability & Reliability
OTel-native LLM observability.
OpenTelemetry GenAI conventions wired into the stack you already run - Prometheus, Grafana, Datadog, Langfuse, LangSmith, Phoenix. Per-tenant cost attribution, streaming-aware latency (TTFT, inter-token, total), and alerts that page the right on-call before the CFO sends a calendar invite.
// What we see
When the bill doubles, the team finds out from finance - not the dashboard.
01
Cost attribution is a week of log diving
Finance asks which product line is burning the API budget. The answer takes someone three days of grepping through provider invoices and request logs. By the time the team has the numbers, the spike is already in the next month's bill.
02
Latency dashboards lie about user experience
Total request duration looks fine on Datadog. Users complain that the chat feels slow. The metric the team is watching isn't the one users feel - TTFT, inter-token latency, and stalls between tokens are invisible until someone instruments them.
03
Cost spikes page the on-call after the fact
Provider quotas catch overruns at month-end. The internal alerting fires when the credit card declines. Real-time spike detection - based on rate-of-change against a rolling baseline - is something most teams add only after their first $20k surprise.
// Case Study
Production LLM Processing at Surfer Scale
We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.
300B+
tokens processed
100k+
credits sold in 6 months
5 months
from concept to full product release

// What we do
Three layers, from instrument to act.
Telemetry stack built on OpenTelemetry GenAI conventions, wired into your existing Prometheus/Grafana/Datadog/Langfuse/LangSmith - never a parallel observability silo.
Instrument - OTel GenAI conventions
OpenTelemetry semantic conventions (gen_ai.* span attributes) across every LLM call site. Auto-instrumentation for OpenAI, Anthropic, Bedrock, Gemini, vLLM, TGI. Distributed traces propagate across RAG, agent, and tool-use pipelines so a slow request shows up as a single trace, not a guessing game.
- OTel auto-instrumentation for OpenAI, Anthropic, Bedrock, vLLM, TGI
- Streaming-aware latency: TTFT, inter-token, tokens/sec, total p50/p95/p99
- Per-request labels: tenant, feature, route, model, model version, region
- Cardinality budgets - high-cardinality fields routed to traces, not metrics
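A rough sketch of what that instrumentation looks like in Python with the OpenTelemetry API. The gen_ai.* attribute names follow the incubating GenAI semantic conventions (verify them against the semconv version you pin); the app.* attributes, metric names, and the instrumented_stream helper are illustrative, not a fixed schema.

```python
# Sketch: wrapping a streaming LLM call with OTel GenAI span attributes plus
# TTFT and inter-token histograms. gen_ai.* names follow the incubating GenAI
# semantic conventions; app.* attributes and metric names are illustrative.
import time
from opentelemetry import trace, metrics

tracer = trace.get_tracer("llm.instrumentation")
meter = metrics.get_meter("llm.instrumentation")
ttft_hist = meter.create_histogram("gen_ai.client.time_to_first_token", unit="s")
itl_hist = meter.create_histogram("gen_ai.client.inter_token_latency", unit="s")

def instrumented_stream(chunks, *, provider, model, route, tenant):
    """Wrap whatever iterator of streamed text chunks your provider SDK yields."""
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.system", provider)
        span.set_attribute("gen_ai.request.model", model)
        # High-cardinality identifiers ride on the span, never on metric labels.
        span.set_attribute("app.tenant.id", tenant)
        span.set_attribute("app.route", route)

        dims = {"gen_ai.request.model": model, "app.route": route}  # bounded dimensions
        start = prev = time.monotonic()
        first = True
        for chunk in chunks:
            now = time.monotonic()
            if first:
                ttft_hist.record(now - start, attributes=dims)   # time to first token
                first = False
            else:
                itl_hist.record(now - prev, attributes=dims)     # inter-token latency
            prev = now
            yield chunk
```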
Attribute - cost dashboards finance trusts
Real-time spend computed from inflight token counts and posted prices, reconciled daily against actual provider invoices. Per-tenant, per-feature, per-route attribution that holds up in a CFO meeting and answers "which product line burned the budget" in a single Grafana query.
- Real-time $/route + nightly invoice reconciliation (OpenAI, Anthropic, Bedrock)
- Cache hit-rate metrics - prompt caching, KV cache, semantic cache
- Per-tenant token + $ dashboards exported to Grafana / Datadog / Looker
- Cost per successful business outcome (definable per route)
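A minimal sketch of the metering side, using prometheus_client. The model names, prices, and label set are placeholders - the real price table lives in config and gets reconciled against the invoice nightly, as above.

```python
# Sketch: real-time spend metering from token counts and a posted price table.
# Model names and prices are placeholders, not current list prices; nightly
# invoice reconciliation is what finance signs off on.
from prometheus_client import Counter

PRICES_PER_MTOK = {                      # USD per million tokens (placeholder)
    "example-large": {"input": 3.00, "output": 15.00},
    "example-small": {"input": 0.25, "output": 1.25},
}

llm_spend_usd = Counter(
    "llm_spend_usd_total",
    "Estimated LLM spend in USD, metered from token usage",
    # The tenant label is only viable when the tenant count fits the
    # cardinality budget; otherwise attribute per tenant via traces/warehouse.
    ["tenant", "feature", "route", "model"],
)

def record_cost(*, tenant, feature, route, model, input_tokens, output_tokens):
    p = PRICES_PER_MTOK[model]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    llm_spend_usd.labels(tenant=tenant, feature=feature, route=route, model=model).inc(cost)
    return cost
```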
Act - alerts and budget controls
Three-tier alerting: absolute budget breaches page on-call, rate-of-change spikes go to Slack, predictive month-end forecasts go to email. Per-tenant rate limits and budget enforcement at the gateway, with graceful degradation to smaller models when budgets get tight.
- PagerDuty / Opsgenie / Slack routing on cost, latency, error, refusal signals
- Per-tenant $ + token budgets enforced at the gateway, audit-logged
- Tail-based trace sampling: 100% of errors + slow tails, 1–5% baseline
- Graceful degradation to smaller models on budget pressure
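The tiering logic itself is small; the work is in picking thresholds that fit your traffic. A sketch of the decision, assuming hourly per-tenant spend samples are already queryable - the 3x spike factor and 1.2x forecast margin are illustrative defaults, not recommendations.

```python
# Sketch of the three-tier decision, assuming hourly per-tenant spend samples
# are already available (e.g. queried from Prometheus). Thresholds and routing
# targets are illustrative defaults.
from statistics import mean

def classify_spend(hourly_spend, month_to_date, monthly_budget, days_left):
    # Assumes at least a week or so of hourly samples is available.
    baseline = mean(hourly_spend[-7 * 24:-1])        # rolling ~7-day hourly baseline
    current = hourly_spend[-1]
    forecast = month_to_date + current * 24 * days_left

    if month_to_date >= monthly_budget:
        return "page"       # absolute breach -> PagerDuty / Opsgenie
    if baseline > 0 and current > 3 * baseline:
        return "slack"      # rate-of-change spike -> Slack channel
    if forecast > 1.2 * monthly_budget:
        return "email"      # predictive month-end overrun -> email digest
    return None
```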
// Method fit
If you're running LLMs in production without observability, you're flying blind.
skip it if
It's a toy or prototype with a tiny bill
Internal tool nobody uses, weekend hackathon, paper draft - fine, the provider's own dashboard is enough.
For anything with real users or real money flowing through it, treat observability as load-bearing.
use it if
Almost everything else. If real users hit the system or real money flows through it, you need this.
Most agencies sell "LLM integration" and skip the observability layer. That's how teams find out about cost spikes from finance instead of from a dashboard, and about latency regressions from customer support instead of from on-call.
We don't have a softer position. Wire OTel week one. Cost dashboards by week three. Alerts and budget controls by week six.
Without this layer you don't know what your system actually costs or what your users actually feel. You just haven't been told yet.
// How we work
Wire OTel first. Build the dashboards. Hand off the alerts.
Instrument the call sites, build the dashboards your team will actually open, then wire the alerts into the on-call you already run. Every step ships working before we move to the next.
01
Wire OTel instrumentation across LLM call sites
OpenTelemetry GenAI conventions deployed across OpenAI, Anthropic, Bedrock, Gemini, vLLM - every call site exporting gen_ai.* spans and metrics. Distributed tracing connected to your existing trace backend (Tempo, Jaeger, Datadog, Honeycomb). First useful dashboards within 1–2 weeks.
02
Build cost attribution + latency dashboards
Per-tenant, per-route, per-model spend in real time, reconciled nightly against actual provider invoices. Streaming-aware latency dashboards (TTFT, inter-token, total) with p50/p95/p99 histograms. Cache hit-rate metrics. Dashboards-as-code so they live in Git, not someone's bookmarks.
03
Wire alerts + budget controls, hand off the runbooks
Three-tier alerts (absolute breach pages on-call, rate-of-change goes to Slack, predictive forecasts go to email) routed through your existing PagerDuty/Opsgenie/Slack. Per-tenant budgets enforced at the gateway. We hand off the runbook for adding new tenants and the playbook for tuning alert thresholds.
// Expert insight
“Most teams treat observability as the cost of running production. The teams that pull ahead treat it as the moat. Six months of high-resolution traces - which prompts your users actually hit, which paths fail, where the long tail of edge cases lives - is an asset a competitor with the same API access can't shortcut. Model parity is the floor of the industry now. What you build on top of your trace store is what differentiates, and it only accumulates if you're capturing the data in the first place.”
Karol Gawron
Head of R&D @ bards.ai
// Why bards.ai
Why us, instead of a Datadog SE and a three-week trial.
Vendor support engineers can sell you a tool. They can't define what cost-per-outcome means for your product, design the cardinality budget that doesn't blow up your TSDB, or set thresholds that don't page on weekend dips. We've operated platforms at scale - we know which metrics get used in incidents and which ones gather dust.
OpenTelemetry-native by default
We build on OTel GenAI conventions so your telemetry isn't locked to one vendor's SDK or one observability platform. Swap Datadog for Honeycomb tomorrow - your data comes with you.
1B+ tokens/day instrumented in production
We've built the cost-attribution and alerting layer behind a top-10 SEO product. The patterns we deploy are the ones we ourselves run on-call against.
Prometheus + Grafana specialists
Dashboards-as-code, version-controlled alongside your platform. Cardinality budgets that don't surprise the SRE team. No more clicking around to recreate a panel.
Cross-tool fluency
Datadog LLM Observability, Langfuse, LangSmith, Phoenix, Helicone, New Relic - we know which tool fits which job and which trade-offs each one carries.
On-prem & air-gapped capable
Telemetry that stays inside your perimeter - for environments where SaaS observability vendors aren't an option. Self-hosted Langfuse + Prometheus stacks shipped end-to-end.
Senior team, no juniors
Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your instrumentation sprint.
// FAQ
Common questions about LLM observability
Which observability stack do you work with?
Whichever one you already run. Prometheus + Grafana, Datadog LLM Observability, New Relic - we instrument with OTel and ship to it. If you have nothing yet, the default we'd reach for is Langfuse or Phoenix for traces plus Prometheus + Grafana for metrics - both open-source and self-hostable. The point is the data, not the tool.
How do you keep per-tenant attribution from blowing up metric cardinality?
Tenants, users, and request IDs go to traces and logs - not metric labels. Metrics carry bounded dimensions: route, model, model version, tier, region. We define a cardinality budget upfront, monitor active series count, and alert when a new label is about to cost more than the metric is worth. High-cardinality questions get answered via trace queries or the analytics warehouse, not Prometheus.
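As a rough illustration of what a cardinality budget means in practice - a pre-merge check on worst-case series count, with the budget and label counts made up for the example:

```python
# Sketch: pre-merge cardinality check. Worst-case active series for a metric
# is the product of its label cardinalities; budget and counts are examples.
from math import prod

SERIES_BUDGET_PER_METRIC = 10_000

def estimated_series(label_cardinalities: dict[str, int]) -> int:
    return prod(label_cardinalities.values()) if label_cardinalities else 1

labels = {"route": 25, "model": 8, "model_version": 3, "tier": 3, "region": 4}
assert estimated_series(labels) <= SERIES_BUDGET_PER_METRIC, (
    "label set exceeds the cardinality budget - move the new field to trace attributes"
)
```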
Do you trace every request?
Tail-based sampling. We keep 100% of errors and slow requests (above p95), plus 1–5% of normal traffic. For trace-based eval and debugging, we increase the rate per-tenant or per-route on demand. Metrics are never sampled - those are aggregated server-side and stay accurate.
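A sketch of the sampling decision itself - in production it typically lives in the OpenTelemetry Collector's tail_sampling processor rather than application code; the 2% baseline and p95 input mirror the policy above.

```python
# Sketch of the tail-based sampling decision on a finished trace.
import random

def keep_trace(is_error: bool, duration_s: float, p95_s: float,
               baseline_rate: float = 0.02) -> bool:
    if is_error:
        return True                            # keep 100% of errors
    if duration_s > p95_s:
        return True                            # keep 100% of slow tails
    return random.random() < baseline_rate     # 1-5% of normal traffic
```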
How do your cost numbers stay consistent with the provider invoice?
Two layers. A real-time spend dashboard built from inflight token counts and posted prices, accurate to within a few percent. And a daily reconciliation job that pulls actual provider invoices (OpenAI, Anthropic, Bedrock) and compares them line-by-line to our metered numbers - discrepancies trigger an alert. Finance gets numbers that match the invoice; engineering gets numbers that update in seconds.
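A sketch of the reconciliation check, assuming both sides are already normalized to USD per (day, model); the 3% tolerance is an illustrative default.

```python
# Sketch of the daily reconciliation check: metered spend vs invoice line
# items, both normalized to {(day, model): usd}. Tolerance is illustrative.
def reconcile(metered: dict, invoiced: dict, tolerance: float = 0.03) -> list:
    discrepancies = []
    for key, invoice_usd in invoiced.items():
        metered_usd = metered.get(key, 0.0)
        if invoice_usd and abs(metered_usd - invoice_usd) / invoice_usd > tolerance:
            discrepancies.append((key, metered_usd, invoice_usd))
    return discrepancies   # non-empty result triggers the reconciliation alert
```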
Can you instrument self-hosted models, not just API providers?
Yes - vLLM, TGI, TensorRT-LLM, Ray Serve, and custom inference servers with the same OTel conventions. KV-cache hit rate, batch utilization, queue depth, GPU utilization - all surfaced alongside the API-side cost metrics for a unified picture across self-hosted and vendor models.
How does cost spike detection avoid false alarms?
Three signals, layered. Absolute thresholds (per-tenant budget breach), rate-of-change (spend velocity vs. a rolling 7-day baseline), and predictive (forecasted month-end spend at the current rate). Each tier routes differently - predictive goes to email and Slack; absolute breaches page on-call. Combined, they catch real spikes within 5–15 minutes without firing on weekend traffic dips.
How long until we see results?
First useful dashboards in 1–2 weeks for a single application. Full instrumentation rollout across multiple services with budgets, alerts, and trace integration usually lands in 4–6 weeks. We work in increments - the per-route latency dashboard ships before the cost-attribution one, so the team gets value while the deeper integration progresses.
// Let's ship it
See your LLM workload before your CFO does.
Tell us your stack, your routes, and the questions you currently can't answer. We'll come back with an instrumentation plan and a number - usually within a business day.
Karol Gawron
Head of R&D @ bards.ai