// Production / LLM Observability & Reliability
Canary and shadow for LLMs.
Prompt edits and model swaps look harmless until 8% of enterprise traces silently regress. We build the rollout machinery - shadow traffic with paired output capture, canary stages with statistical gates (win-rate, paired bootstrap, McNemar's), and auto-rollback wired into your existing API gateway, Ray Serve, Envoy, or LaunchDarkly setup.
// What we see
The smoke test passed. Customer success became the regression detector.
01
Fifty hand-picked prompts always pass
The team's smoke test is the same 50 prompts that worked when the system shipped. The 8% of enterprise traces that depend on long-range coherence aren't in the suite. The deploy goes out, the regression hits, customer success files the ticket two days later.
02
Eyeballing 200 outputs isn't statistics
Stochastic outputs need paired bootstrap CIs and McNemar's, not a glance at side-by-side examples. "Looks better to me" is how teams ship a model upgrade that quietly improves consumer traffic and degrades enterprise traffic. A minimal sketch of the paired test we mean follows this list.
03
Rollbacks happen at 3am, not at the gate
Without an automated rollback armed on guardrail metrics, a regression discovered post-deploy means a manual revert under pressure. The team that's already worked all day is the one debugging the rollout at 3am.
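To make point 02 concrete - a minimal sketch of a paired bootstrap CI on the win-rate delta. The judge-score arrays and names are illustrative; the point is that resampling keeps the per-prompt pairing intact instead of eyeballing examples.

```python
import numpy as np

def paired_bootstrap_ci(baseline: np.ndarray, candidate: np.ndarray,
                        n_boot: int = 10_000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float, float]:
    """Paired bootstrap CI for the mean per-prompt delta (candidate - baseline).

    Both arrays hold judge scores for the SAME prompts, in the same order,
    so resampling preserves the pairing.
    """
    rng = np.random.default_rng(seed)
    deltas = candidate - baseline                   # per-prompt paired differences
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)           # mean delta per bootstrap replicate
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(deltas.mean()), float(lo), float(hi)

# mean_delta, lo, hi = paired_bootstrap_ci(baseline_scores, candidate_scores)
# ship = lo > 0.0   # the gate passes only if the whole CI sits above zero
```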
// Case Study
Production LLM Processing at Surfer Scale
We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.
300B+
tokens processed
100k+
credits sold in 6 months
5 months
from concept to full product release

// What we do
Three layers, from shadow to auto-rollback.
Mirror the candidate against real traffic. Stage the rollout with statistical gates. Auto-revert when a guardrail breaches. Built on top of the gateway, flag system, and CI you already run.
Shadow traffic + paired output capture
Every production request hits the candidate too - asynchronously, with no impact on baseline latency. Outputs land in a side-by-side comparison store with full trace correlation. By the time the canary starts, you already know how the candidate behaves on the real distribution. A minimal middleware sketch follows the list below.
- Async mirroring via Envoy / API gateway / Ray Serve
- Sample 5–100% of traffic with explicit cost controls
- Pre-canary signals - latency, error rate, output diff metrics
- Side-by-side output storage with prompt/model/version tagging
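One way the mirroring can look at the application layer when it isn't done in Envoy or Ray Serve - a minimal asyncio sketch. The candidate endpoint, sample rate, and the paired-output store interface are all illustrative.

```python
import asyncio
import random
import uuid

import httpx

CANDIDATE_URL = "http://candidate-llm.internal/v1/generate"  # illustrative internal endpoint
SHADOW_SAMPLE_RATE = 0.10                                     # mirror 10% of traffic (cost control)

client = httpx.AsyncClient(timeout=30.0)

async def mirror_to_candidate(payload: dict, baseline_response: dict, store) -> None:
    """Fire-and-forget shadow call; the user has already been answered by the baseline.

    `store` is whatever paired-output store you run: a table keyed by trace_id
    holding prompt, baseline output, candidate output, and model/version tags.
    """
    if random.random() >= SHADOW_SAMPLE_RATE:
        return
    trace_id = payload.get("trace_id") or str(uuid.uuid4())

    async def _shadow() -> None:
        try:
            resp = await client.post(CANDIDATE_URL, json=payload)
            await store.write(trace_id=trace_id, prompt=payload,
                              baseline=baseline_response, candidate=resp.json())
        except Exception:
            # A candidate failure must never surface as a user-facing error;
            # in practice, record a metric here instead of silently swallowing it.
            pass

    # create_task keeps the shadow call off the request's critical path
    asyncio.create_task(_shadow())
```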
Staged canary with statistical gates
Traffic ramps 1% → 10% → 50% → 100% with explicit gates between each stage, power-sized to detect the deltas you care about. Per-tenant rollout for high-risk customers (enterprise stays at 0% while consumer ramps). A sketch of one such gate follows the list below.
- Configurable stages, durations, and traffic percentages
- Win-rate eval with paired bootstrap CIs and McNemar's test
- Per-tenant + sticky-session canary for stateful agents and chat
- Feature flags via LaunchDarkly, Unleash, or built-in
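One concrete shape a gate can take - an exact McNemar's test on paired pass/fail outcomes for the same prompts under baseline and candidate, sketched with SciPy's exact binomial test. The function name and alpha are illustrative.

```python
from scipy.stats import binomtest

def mcnemar_gate(baseline_pass: list[bool], candidate_pass: list[bool],
                 alpha: float = 0.05) -> bool:
    """Exact McNemar's test on paired pass/fail outcomes.

    Only discordant pairs matter: prompts where exactly one variant passed.
    Under the null the candidate is no better, so candidate-only wins follow
    Binomial(n_discordant, 0.5).
    """
    cand_only = sum(c and not b for b, c in zip(baseline_pass, candidate_pass))
    base_only = sum(b and not c for b, c in zip(baseline_pass, candidate_pass))
    discordant = cand_only + base_only
    if discordant == 0:
        return False  # no evidence either way - hold the stage
    result = binomtest(cand_only, discordant, p=0.5, alternative="greater")
    # Advance only if the candidate wins significantly more of the
    # discordant pairs than chance would explain.
    return result.pvalue < alpha
```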
Auto-rollback + traffic control
A guardrail breach triggers an automated revert in under 30 seconds. Thresholds cover quality, latency, error, refusal, and safety metrics. Manual override for ambiguous regressions; an audit log on every rollback decision. A minimal sketch of the guardrail check follows the list below.
- Threshold-based triggers on quality, latency, error, refusal, safety
- Per-stage rollback policies and cooldown windows
- Kill switches that revert in under 30 seconds
- Postmortem-ready notifications via Slack / PagerDuty / Opsgenie
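A minimal sketch of the guardrail check that arms the revert. Metric names and thresholds are illustrative, and the traffic-control and notification callables stand in for whatever flag system and paging integration you already run.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Guardrail:
    metric: str
    threshold: float
    higher_is_worse: bool = True

# Illustrative thresholds - the real ones are configured per route and per stage.
GUARDRAILS = [
    Guardrail("error_rate", 0.02),
    Guardrail("p99_latency_ms", 2500),
    Guardrail("refusal_rate", 0.03),
    Guardrail("safety_violation_rate", 0.0),
    Guardrail("judge_quality_score", 0.80, higher_is_worse=False),
]

def breached_guardrails(metrics: dict[str, float]) -> list[str]:
    """Describe every guardrail breached in the current canary window."""
    breaches = []
    for g in GUARDRAILS:
        value = metrics.get(g.metric)
        if value is None:
            continue
        bad = value > g.threshold if g.higher_is_worse else value < g.threshold
        if bad:
            breaches.append(f"{g.metric}={value} (threshold {g.threshold})")
    return breaches

def maybe_rollback(metrics: dict[str, float],
                   set_candidate_traffic: Callable[[int], None],
                   notify: Callable[[str], None]) -> bool:
    """Kill switch: zero out candidate traffic and page the on-call on any breach."""
    breaches = breached_guardrails(metrics)
    if breaches:
        set_candidate_traffic(0)                     # flip the flag / route weight to baseline
        notify("Auto-rollback: " + "; ".join(breaches))
    return bool(breaches)
```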
// Method fit
Staged rollout machinery isn't the right move for every team.
skip it if
You ship LLM changes monthly or less
At low change frequency, a careful manual smoke test on a couple hundred prompts plus a fast revert path is enough. The full rollout machinery pays back when you're shipping prompt or model changes weekly+ and the cost of a regression is bigger than the cost of running the gates.
You don't have evals yet
Canary stages need gates. Gates need metrics. Without an eval framework that scores baseline vs candidate, the rollout has nothing to decide on. Start with the eval engagement - this one layers cleanly on top once that exists.
Custom LLM Evaluation Frameworks
You're still pre-PMF and the system shape is changing
Building rollout machinery for code that gets rewritten next sprint is plumbing for nothing. Ship the thing, find product fit, then layer rollout discipline on what stabilizes.
use it if
You're shipping prompt or model changes weekly+ to a multi-tenant production system, real customers feel the regressions, and the smoke test is starting to feel like luck.
You already have evals (calibrated LLM-as-judge, golden sets from production) and want to wire them as canary gates rather than as a manual step someone runs before merging.
You've been burned by a vendor model update or a prompt edit that quietly degraded a segment, and you want the next one caught at the gate - not by a customer.
// How we work
Mirror first. Stage with stats. Auto-rollback wired in.
Every engagement starts with shadow mirroring so the team has data on the candidate before any user traffic touches it. Then the canary state machine. Then the rollback wiring and CI integration.
01
Wire shadow traffic and paired output capture
Async traffic mirroring via your existing gateway (Envoy, Ray Serve, API gateway) with side-by-side output storage and trace correlation. Pre-canary metrics - latency, error rate, output diff - wired into your dashboards. First useful shadow data within 1–2 weeks.
02
Build the canary state machine with statistical gates
Configurable stages (1% → 10% → 50% → 100%), power-sized for your traffic volume. Win-rate eval with paired bootstrap and McNemar's. Per-tenant + sticky-session routing for stateful workloads. Feature flag integration with LaunchDarkly / Unleash or built-in flags.
03
Wire auto-rollback, integrate with CI, hand off the runbook
Guardrail-triggered revert under 30 seconds. PagerDuty/Opsgenie/Slack notifications with rollback context. Audit logs on every decision. Hand off the runbook for adding new gates, tuning thresholds, and per-tenant override patterns.
// Expert insight
“The hard part of LLM rollouts isn't the canary mechanics - it's deciding what 'better' even means. A new prompt with a 3% higher win rate but a 0.5pp higher refusal rate on enterprise traffic isn't an improvement, it's a tradeoff. The framework has to make that tradeoff visible at the gate, before someone has to call a customer back.”
Karol Gawron
Head of R&D @ bards.ai
// Why bards.ai
Why us, instead of an SRE who's never shipped an LLM.
Generic deploy machinery assumes deterministic outputs and clean conversion metrics. LLMs have neither. The patterns that work for a stateless API don't catch the regression where tone subtly drifts on 8% of enterprise traces. We've built rollouts at scale, and we know which gates earn their keep.
1B+ tokens/day in production
Staged rollouts of prompts, model swaps, and pipeline changes on platforms serving real customers - not just internal tooling.
Statistics-first by training
Spun out of CLARIN-PL research. Power analysis, paired tests, bootstrap CIs, and McNemar's are reflexes, not afterthoughts.
Ray Serve + Envoy + LaunchDarkly fluency
We build the rollout layer on top of your gateway, not in parallel to it. Your SREs aren't asked to learn a new traffic system.
Eval-integrated by default
Canary gates are only as good as the metrics they read. We integrate with calibrated LLM-as-judge, win-rate eval, and your existing observability stack so the gate decisions reflect what your customers care about.
On-prem & air-gapped capable
Rollout infrastructure that runs inside your perimeter - including environments where SaaS feature-flag vendors aren't an option.
Senior team, no juniors
Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your engagement.
// FAQ
Common questions about canary and shadow deployment
How is this different from standard A/B testing or a regular canary deploy?
Traditional A/B assumes deterministic outputs and a clean conversion metric. LLMs are stochastic - the same prompt produces different outputs on retry - and your real quality metric is multi-dimensional (faithfulness, tone, refusal rate, latency). Our framework handles non-determinism via paired comparisons, treats quality as a vector rather than a scalar, and weights tradeoffs explicitly instead of collapsing everything to one number.
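To make "quality as a vector" concrete, a small sketch with illustrative dimensions, deltas, and tolerances; the win-rate and refusal numbers echo the tradeoff described in the expert note above.

```python
# Candidate-minus-baseline deltas per quality dimension, each judged against its
# own tolerance. A 3% win-rate gain does not get to silently buy a 0.5pp
# refusal-rate regression.
DELTAS = {
    "win_rate":       +0.030,
    "refusal_rate":   +0.005,
    "faithfulness":   -0.002,
    "p95_latency_ms": +40.0,
}

TOLERANCES = {        # how much worse each dimension may get before the gate holds
    "win_rate":       -0.010,
    "refusal_rate":   +0.002,
    "faithfulness":   -0.010,
    "p95_latency_ms": +100.0,
}

HIGHER_IS_BETTER = {"win_rate", "faithfulness"}

def _violates(name: str, delta: float) -> bool:
    if name in HIGHER_IS_BETTER:
        return delta < TOLERANCES[name]   # dropped by more than allowed
    return delta > TOLERANCES[name]       # rose by more than allowed

def gate_decision(deltas: dict[str, float]) -> str:
    violations = [name for name, d in deltas.items() if _violates(name, d)]
    if not violations:
        return "advance"
    # Tradeoffs are surfaced for review instead of being averaged into one score.
    return "hold-for-review: " + ", ".join(violations)

print(gate_decision(DELTAS))   # -> "hold-for-review: refusal_rate"
```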
How do you decide whether the candidate is actually better?
Three approaches, used together. Win-rate eval - baseline and candidate scored on the same input via a calibrated LLM-as-judge with position swapping. Distributional metrics - refusal rate, hallucination rate, and tool-use accuracy compared across thousands of paired requests. Bootstrap confidence intervals on every metric, so you know when a 1.5% delta is real and when it's noise. We size canary stages with power analysis to detect the effects you care about.
What triggers an automatic rollback?
Configurable per route, but typically: any safety-metric breach (refusal, prompt injection, policy violation), error rate above 2× baseline, p99 latency above budget for 5+ minutes, and quality metrics dropping by more than a configured threshold with statistical significance. Anything ambiguous (a small regression on one metric, a gain on another) routes to a human review channel rather than auto-rolling back.
Do prompt changes and model swaps go through the same process?
Procedurally similar, technically different. Prompt changes can ship faster - a shorter shadow window, shorter canary stages - because the blast radius is bounded and the cost is just inference. Model swaps need a longer shadow phase to surface latency, cost, and tail-distribution shifts. Same framework, different default schedules and gate criteria.
Can we keep specific customers or tenants off the canary?
Yes - that's standard. Enterprise customers usually start at 0% rollout while consumer traffic ramps to 100%, then we enable per-tenant rollout after the consumer canary completes successfully. Sticky-session routing keeps individual users on a consistent variant for chat and agent workloads, where mid-conversation flips would be disorienting.
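A sketch of how sticky, per-tenant assignment can work: deterministic hashing of the session id keeps a conversation on one variant for its whole lifetime, and tenant pins override the percentage ramp. Tenant names and percentages are illustrative.

```python
import hashlib

CANARY_PERCENTAGE = 10         # current stage of the consumer ramp
TENANT_OVERRIDES = {           # illustrative per-tenant pins
    "enterprise-acme": 0,      # held at 0% until the consumer canary completes
    "design-partner-42": 100,  # opted in to the candidate
}

def assign_variant(tenant_id: str, session_id: str) -> str:
    """Deterministic, sticky assignment: same session -> same variant, every request."""
    pct = TENANT_OVERRIDES.get(tenant_id, CANARY_PERCENTAGE)
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < pct else "baseline"
```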
How long does each canary stage run?
Long enough to accumulate statistical power for your primary metric. For high-traffic routes (1M+ requests/day), 1–2 hours per stage is usually sufficient; lower-traffic routes may need 24–48 hours per stage. The framework calculates the required duration from your historical traffic shape and the minimum detectable effect you specify - and refuses to advance early.
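A back-of-envelope version of that calculation, assuming a normal approximation for a win-rate tested against 0.5; the minimum detectable effect, scored-pair rate, and significance settings are the inputs you specify.

```python
from math import ceil
from scipy.stats import norm

def required_paired_samples(mde: float, alpha: float = 0.05,
                            power: float = 0.8, base_rate: float = 0.5) -> int:
    """Paired comparisons needed to detect a win-rate shift of `mde` from `base_rate`."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p = base_rate + mde / 2                    # midpoint variance approximation
    return ceil(((z_alpha + z_beta) ** 2) * p * (1 - p) / mde ** 2)

def stage_duration_hours(mde: float, scored_pairs_per_hour: float) -> float:
    """How long a stage must run to accumulate that many scored pairs."""
    return required_paired_samples(mde) / scored_pairs_per_hour

# Example: required_paired_samples(0.02) -> ~4,900 pairs for a 2-point shift;
# at ~5,000 scored pairs/hour that's roughly an hour per stage.
```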
Does this replace our existing gateway or feature-flag stack?
No - we build on top of what you already run. If you have an API gateway (Envoy, Kong, AWS API Gateway), traffic splitting happens there. If you use LaunchDarkly or Unleash for flags, we wire into those. If you're on Ray Serve, we use its native traffic shifting. The rollout state machine and the metric gates are the part we add - the routing layer reuses your stack.
// Let's ship it
Ship LLM changes without praying they work.
Tell us how you currently roll out prompt and model changes, and where the last regression came from. We'll come back with a rollout architecture and a number - usually within a business day.
Karol Gawron
Head of R&D @ bards.ai