// Production / LLM Observability & Reliability

Regressions your dashboards don't show.

A prompt edit silently breaks 8% of traces. A vendor model update flips refusal behavior overnight. A retrieval tweak drops citation accuracy by half on long context. We build the detection layer - diff-based paired eval, behavior-cluster analysis, semantic equivalence checks, and online change-point detection on production metrics - that catches them before customer support does.

// What we see

The metric the team watches isn't the one that broke.

01

Headline metric flat. Enterprise long-context down 11%

Aggregated win rate looks fine after the prompt change. Sliced by tenant tier, language, and context length, the regression is concentrated where the team isn't looking. The dashboard averages it away across the cohorts that don't have the problem.

02

The vendor updated the model under you

OpenAI / Anthropic / Gemini push a silent model update. Refusal rate moves 1.5pp on a Tuesday. Without a held-out canary set with fixed prompts and a change-point detector watching the output distribution, you find out three weeks later when an enterprise customer escalates.

03

Seventy diffs is too many to review by hand

The eval flagged 70 traces. The team has 20 minutes. Without behavior-cluster analysis grouping diffs by failure mode, the senior reviewer eyeballs the first ten, calls it noise, and ships. The 60 traces clustered into the same systematic regression don't get caught.

// Case Study

Production LLM Processing at Surfer Scale

We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.

  • 300B+

    tokens processed

  • 100k+

    credits sold in 6 months

  • 5 months

    from concept to full product release

Read the case study

// What we do

Three layers, from diff to drift.

Compare every change to baseline on the dimensions that actually break. Cluster the diffs so a human can review patterns instead of traces. Watch production for the regressions only real traffic surfaces.

Diff-based eval, paired and segmented

Every prompt or model change runs against a paired golden set sampled from production traces. Baseline vs candidate, scored by metric and by segment (tenant tier, language, intent, context length). Bootstrap CIs and McNemar's keep "better" statistically real. Position swapping handles judge bias. Semantic equivalence detection avoids false positives on outputs that are textually different but meaningfully identical.

  • Paired comparison on golden sets, stratified by segment
  • Calibrated LLM-as-judge with position swap and length normalization
  • Bootstrap CIs + McNemar's for paired significance testing
  • Embedding + rubric-based semantic equivalence to suppress false positives
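
In spirit, the paired check is small. A minimal sketch, assuming boolean pass/fail judge verdicts for each golden-set example - names, defaults, and structure are illustrative, not our production code:

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact_p(baseline_pass: np.ndarray, candidate_pass: np.ndarray) -> float:
    """Exact McNemar test on paired pass/fail outcomes from the judge.

    Only discordant pairs carry signal: examples where exactly one of
    baseline/candidate passed. Under the null (no real change) they split 50/50.
    """
    b = int(np.sum(baseline_pass & ~candidate_pass))  # baseline passed, candidate failed
    c = int(np.sum(~baseline_pass & candidate_pass))  # candidate passed, baseline failed
    if b + c == 0:
        return 1.0
    return binomtest(c, n=b + c, p=0.5, alternative="two-sided").pvalue

def bootstrap_delta_ci(baseline_pass, candidate_pass, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI on the paired pass-rate delta (candidate minus baseline)."""
    rng = np.random.default_rng(seed)
    diffs = candidate_pass.astype(float) - baseline_pass.astype(float)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    return np.quantile(diffs[idx].mean(axis=1), [alpha / 2, 1 - alpha / 2])

# Both run per segment (tenant tier, language, intent, context length), so an
# enterprise long-context regression cannot hide inside a flat aggregate.
```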

Behavior-cluster analysis on the diffs

Seventy regression traces become three clusters with twelve representative examples. Embedding-based clustering on diff'd output pairs, LLM-labeled with human-readable summaries. Reviewers see patterns and decide once, instead of eyeballing the first ten and calling it noise.

  • Embedding clustering on diff vectors with task-tuned thresholds
  • Cluster labels generated by LLM, edited by human reviewers
  • Drill-down from cluster → representative trace → full context
  • Persistent labels for tracking recurring regressions across releases
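
A minimal sketch of the clustering step, assuming each regressed trace has already been rendered as a diff text and embedded with whatever embedding model you use - the threshold and helper names are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_diffs(diff_embeddings: np.ndarray, distance_threshold: float = 0.35) -> np.ndarray:
    """Group regressed traces by failure mode: hierarchical clustering on
    cosine distances between their diff embeddings. The threshold is
    task-tuned in practice; 0.35 is only a starting point."""
    condensed = pdist(diff_embeddings, metric="cosine")
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=distance_threshold, criterion="distance")

def representatives(diff_embeddings: np.ndarray, labels: np.ndarray, per_cluster: int = 4) -> dict:
    """Pick the traces closest to each cluster centroid for human review;
    these are also what the LLM sees when it drafts the cluster summary."""
    reps = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = diff_embeddings[idx].mean(axis=0)
        order = np.argsort(np.linalg.norm(diff_embeddings[idx] - centroid, axis=1))
        reps[int(c)] = idx[order[:per_cluster]].tolist()
    return reps
```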

Production change-point + drift detection

Some regressions only show up in production - vendor model updates, traffic distribution shifts, retrieval index decay. We run continuous evaluation against a held-out canary set with fixed prompts and a change-point detector watching the output distribution. Alerts fire within minutes of a statistically significant shift.

  • Online change-point detection on quality, refusal, latency, error
  • Held-out canary set with fixed prompts to catch silent vendor updates
  • Embedding-drift monitors on input distributions and output distributions
  • Per-route, per-model, per-tenant time-series with alerting
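
The detector itself can be simple. A minimal sketch of a one-sided CUSUM on a daily canary metric - baseline window, slack, and threshold are illustrative defaults, tuned per route in practice:

```python
import numpy as np

def cusum_alert(series: np.ndarray, baseline_days: int = 14,
                k_sigmas: float = 0.5, h_sigmas: float = 4.0):
    """One-sided CUSUM on a daily canary metric (e.g. refusal rate).

    Baseline mean/std come from the trailing window before monitoring starts;
    k is the drift we tolerate, h the decision threshold, both expressed in
    baseline standard deviations. Returns the first index where the statistic
    crosses h, else None.
    """
    mu = float(series[:baseline_days].mean())
    sigma = float(series[:baseline_days].std(ddof=1)) or 1e-9
    k, h = k_sigmas * sigma, h_sigmas * sigma
    s = 0.0
    for t in range(baseline_days, len(series)):
        s = max(0.0, s + (series[t] - mu - k))  # accumulate only upward excursions
        if s > h:
            return t  # change point: alert carries the metric, route, and date
    return None
```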

// Method fit

Regression detection earns its keep when changes ship often.

skip it if

  • You don't have evals yet

    Regression detection is a layer on top of evals - the diff is between baseline and candidate scores, both produced by the eval framework. Without that framework, there's nothing to diff. Start with the eval engagement; this layers cleanly on top.

    Custom LLM Evaluation Frameworks
  • You ship LLM changes once a quarter

    At that cadence, your senior reviewer can hand-check a few hundred examples on each release and catch most of what matters. Regression detection earns its keep when changes ship weekly+ and the cost of missing one outweighs the cost of running the system.

  • Your headline metric is the only thing your customers care about

    For some single-purpose products with one clear quality dimension (extract this field, classify this ticket), watching the headline metric in production is most of the way there. Multi-dimensional regression detection pays off when quality is a vector - tone, citation accuracy, refusal calibration, schema compliance - and regressions concentrate in segments.

use it if

You ship prompt or model changes weekly+, your customers span tiers/languages/use-cases, and at least one segment quietly regressed without the team noticing in the last six months.

You've been surprised by a vendor model update - silent refusal-rate shift, output style flip, latency tail change - and want the next one caught at the metric, not at the support ticket.

Your eval already runs on every change but the team is drowning in diffs to review by hand, and you want behavior-cluster grouping to compress 70 traces into 3 patterns.

// How we work

Wire diff eval. Cluster the failures. Watch production continuously.

Diff-based eval against your existing golden set is the fast win. Behavior-cluster analysis comes once the diff volume justifies it. Production change-point detection runs continuously after the rollout machinery is in place.

01

Wire diff-based eval into CI

Paired eval on every prompt/model change, segmented by tenant tier, language, intent, and context length. Bootstrap CIs and McNemar's for significance. Semantic equivalence checks to keep false-positive rate low. PR comments with breakdowns by segment. First useful gates within 2–3 weeks.
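
The gate logic on top of those statistics is deliberately small. A sketch of the per-segment decision that becomes the PR comment - field names, regression budget, and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class SegmentResult:
    segment: str      # e.g. "enterprise / long-context"
    delta: float      # candidate minus baseline pass rate
    ci_low: float     # lower bound of the bootstrap CI on delta
    p_value: float    # paired McNemar p-value
    n: int            # paired examples in the segment

def gate(results: list[SegmentResult], budget: float = -0.02,
         alpha: float = 0.05, min_n: int = 200) -> tuple[bool, list[str]]:
    """Fail the PR if any sufficiently sampled segment shows a statistically
    significant regression worse than the budget; emit one comment line each."""
    ok, lines = True, []
    for r in results:
        failed = r.n >= min_n and r.p_value < alpha and r.ci_low < budget
        ok = ok and not failed
        lines.append(f"{'FAIL' if failed else 'ok  '} {r.segment}: "
                     f"delta {r.delta:+.1%}, CI low {r.ci_low:+.1%}, "
                     f"p={r.p_value:.3f}, n={r.n}")
    return ok, lines
```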

02

Add behavior-cluster analysis on the diffs

Embedding-based clustering on regression traces, LLM-labeled with human-readable cluster summaries. Side-by-side diff UI with cluster context and trace history. Persistent cluster labels so recurring regressions are tracked across releases. The 70-trace review becomes a 3-cluster review.

03

Wire production change-point + drift detection

Held-out canary set with fixed prompts running continuously against your live model - catches silent vendor updates within hours. Online change-point detection on quality, refusal, latency. Embedding-drift monitors on input/output distributions. Alerts routed into your existing PagerDuty/Opsgenie/Slack on-call.
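
For the drift monitors, a minimal sketch of one way to compare a current window of embeddings against a reference window - KS tests on random 1-D projections with a Bonferroni correction; the projection count and alpha are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def embedding_drift(reference: np.ndarray, current: np.ndarray,
                    n_projections: int = 20, alpha: float = 0.01, seed: int = 0):
    """Compare today's input (or output) embeddings against a reference window.

    High-dimensional two-sample testing is reduced to KS tests on random
    1-D projections with a Bonferroni-corrected alpha; production monitors are
    tuned per route. Returns (drifted, smallest p-value seen).
    """
    rng = np.random.default_rng(seed)
    worst_p = 1.0
    for _ in range(n_projections):
        v = rng.normal(size=reference.shape[1])
        v /= np.linalg.norm(v)
        worst_p = min(worst_p, ks_2samp(reference @ v, current @ v).pvalue)
    return worst_p < alpha / n_projections, worst_p
```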


// Expert insight

The regressions that hurt aren't the ones that show up in your headline metric - they're concentrated in a segment, a language, a tenant tier, or a context-length bucket nobody slices by. The whole job is making sure the system slices by them automatically, before a customer notices first. The teams that get this stop hearing about regressions from support; the teams that don't think their dashboards are working until they aren't.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of a senior engineer eyeballing diffs in a notebook.

Manual diff review scales until it doesn't - usually around the time a vendor model update lands and the team is suddenly debugging a refusal-rate shift instead of shipping. We've debugged enough silent LLM regressions in production to know which detection patterns earn their keep and which generate noise nobody acts on.

1B+ tokens/day in production

We've operated regression detection on platforms serving real customers - not just spun up a notebook against a benchmark.

CLARIN-PL research lineage

Spun out of academic NLP. We treat regression analysis as the statistical problem it is - paired comparisons, segment effects, change-point detection.

Statistics-first methodology

Bootstrap CIs, McNemar's, online change-point algorithms, and significance testing are reflexes - not slides we put in a deck.

16+ open-source models on Hugging Face

80K+ monthly downloads. We've watched models change behavior across versions - both ours and our customers' vendors - and have the scars to prove it.

On-prem & air-gapped capable

Detection pipelines that run inside your perimeter on your data - including environments where outbound traffic is blocked.

Senior team, no juniors

Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your engagement.

// FAQ

Common questions about LLM regression detection

How do you catch regressions that don't show up in the headline metric?

Two layers. Per-segment evaluation - every metric is broken down by tenant tier, language, intent, context length, and any other dimension that matters. A 1% headline regression that's actually 8% on enterprise long-context surfaces immediately. And behavior-cluster analysis on the diffs themselves - embedding-based clustering groups failures by pattern so a 70-trace regression becomes 3 clusters with 12 reviewable examples.

How big does the golden set need to be?

Power analysis sets the size, not folklore. For a primary metric where you need to detect a 2% regression with 80% power, a calibrated LLM-as-judge typically needs 1,500–3,000 paired examples. We start there, then add segments - at least 200–500 examples per critical segment (enterprise tier, regulated languages, high-stakes intents). Most production suites land between 2K and 10K examples, run as smoke (200–500) on PR and full nightly.
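
For intuition, here's the unpaired back-of-the-envelope version of that power calculation - the paired design on a shared golden set needs meaningfully fewer examples, which is where the 1,500–3,000 range comes from:

```python
from scipy.stats import norm

def n_per_arm(p_baseline: float, min_detectable_drop: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm to detect a drop from p_baseline, two-sided alpha,
    using the standard unpaired two-proportion normal approximation."""
    p1, p2 = p_baseline, p_baseline - min_detectable_drop
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round((z_a + z_b) ** 2 * var / (p1 - p2) ** 2))

# Detecting a 2pp drop from an 85% pass rate at 80% power:
# n_per_arm(0.85, 0.02) -> about 5,270 examples per arm if unpaired;
# pairing on the same golden set with correlated judge scores cuts this sharply.
```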

How do you keep false positives from drowning the team?

Three filters before anything pages a human. Statistical significance - bootstrap CIs and McNemar's where applicable, no flag without p-value. Semantic equivalence - embedding-based and rubric-based checks catch outputs that are textually different but mean the same thing. And cluster size - single-trace anomalies route to a low-priority review queue while clustered patterns trigger gates. Review-queue-to-real-regression ratio typically lands between 2:1 and 4:1.
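
The embedding half of that equivalence filter is a cheap first pass. A minimal sketch, assuming the outputs are already embedded - the threshold is illustrative and task-tuned, with borderline pairs escalated to the rubric-based judge:

```python
import numpy as np

def likely_equivalent(emb_baseline: np.ndarray, emb_candidate: np.ndarray,
                      threshold: float = 0.95) -> bool:
    """Treat a flagged diff as a probable false positive when the two outputs
    embed nearly identically. Pairs just below the threshold go to the
    rubric-based judge instead of straight to the review queue."""
    cos = float(emb_baseline @ emb_candidate /
                (np.linalg.norm(emb_baseline) * np.linalg.norm(emb_candidate)))
    return cos >= threshold
```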

Can you catch silent vendor model updates?

Yes - that's one of the main use cases. We run continuous evaluation against a held-out canary set with fixed prompts and fixed input distribution. When the vendor updates, output distribution shifts, judge scores move, and our change-point detector flags the date and the metric. Combined with input/output embedding drift monitors, we typically catch silent updates within hours.

Do regressions block releases automatically, or does a human decide?

Hard gates for safety regressions (refusal, policy violation, prompt injection) and clear quality breaches (>5% regression with high statistical confidence). Human review for ambiguous cases - small regressions, mixed segment effects, novel failure modes. Split is typically 80/20 automation/human, and human decisions get logged as labels so the auto-classifier improves over time.

How does this tie into prompt and model versioning?

We integrate with whatever you use - Langfuse, PromptLayer, LaunchDarkly, custom Git-based - and tag every eval run with prompt version, model version, and config hash. When a regression appears in production, you can trace it back to the exact change that introduced it. If you don't have prompt versioning yet, we set it up as part of the engagement.
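
The tagging itself is deliberately boring. A sketch of the kind of run tag attached to eval runs and production traces - field names and values are illustrative:

```python
import hashlib
import json

def config_hash(prompt_text: str, model_id: str, params: dict) -> str:
    """Deterministic hash over everything that can change model behavior."""
    payload = json.dumps({"prompt": prompt_text, "model": model_id, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Attached to every eval run and every production trace, so a change point
# flagged in production joins back to the exact deploy that introduced it.
run_tag = {
    "prompt_version": "support-answer-v14",   # illustrative
    "model_version": "vendor-model-2025-05",  # illustrative
    "config_hash": config_hash("<prompt text>", "vendor-model-2025-05",
                               {"temperature": 0.2, "max_tokens": 1024}),
}
```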

How long does it take to set up?

Diff-based eval on PRs and a baseline regression suite typically lands in 2–3 weeks. Behavior-cluster analysis adds another 2 weeks. Production change-point detection and human review queues - 6–8 weeks total for a production-grade setup. We work in increments so each milestone delivers usable capability before the deeper layers ship.

// Let's ship it

Catch the regression before your customer does.

Tell us your application, your prompt-change frequency, and the last regression that caught you off guard. We'll come back with a detection plan and a number - usually within a business day.


Karol Gawron

Head of R&D @ bards.ai