// Production / LLM Observability & Reliability

Custom LLM evals, in CI.

Off-the-shelf harnesses - Inspect AI, ragas, lm-eval-harness, DeepEval, promptfoo - get you to the starting line. They don't tell you whether your contract analyzer extracted the right clause, whether the chatbot's tone matches your brand, or how to grade an answer when ten different responses would all be correct. We build the layer that does, calibrate the judges that score it, and wire the result into your CI as a deploy gate.

// What we see

Public benchmarks pass. Real customers complain anyway.

01

MMLU going up tells you nothing about your product

The team upgrades the model, MMLU and HumanEval hold or improve, the dashboard says ship. Two days later the customer-success channel lights up about tone, citation accuracy, or refusal rate. The benchmark you watched isn't the one your customers grade you on.

02

There's no single right answer to compare against

How do you grade a customer-facing summary? A research synthesis? A coding assistant's explanation? The ground truth isn't a string - it's a region of acceptable answers. Exact-match metrics give you a flat zero on outputs your senior reviewer would happily accept.

03

Structured outputs fail where public benchmarks don't look

Schema compliance under noisy inputs, field-level accuracy, completeness against your form, refusal calibration on edge cases - these are the metrics that move the needle for B2B products. They're also the ones no public leaderboard tracks.

// Case Study

Production LLM Processing at Surfer Scale

We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.

  • 300B+

    tokens processed

  • 100k+

    credits sold in 6 months

  • 5 months

    from concept to full product release

Read the case study

// What we do

Three layers, from off-the-shelf to deploy gate.

Reuse what's already good. Build the domain-specific layer that doesn't exist yet. Wire the result to your CI so a regression on a metric your customers care about blocks the merge.

Start from open harnesses, not from scratch

Off-the-shelf frameworks have solved the runner, the parallelism, and the result schema. We pick the one that fits your stack - Inspect AI for general capability suites, ragas for RAG, promptfoo for prompt diff, lm-eval-harness for academic baselines - and start there. Your team doesn't pay for plumbing they could've imported.

  • Inspect AI, ragas, lm-eval-harness, DeepEval, promptfoo
  • Langfuse / LangSmith / Phoenix for trace-based evals
  • Weights & Biases or MLflow for experiment tracking
  • Toolchain chosen for your stack, not vendor preference
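
As a concrete starting point, a minimal Inspect AI task looks roughly like the sketch below. The sample data, task name, and scorer are placeholders for your own, and parameter names vary a little between Inspect AI versions, so treat this as a shape rather than a drop-in file.

```python
# minimal_task.py - first-week Inspect AI sketch (sample data and task name are placeholders)
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def clause_extraction_smoke():
    # A couple of hand-written samples; in practice this loads your golden set.
    dataset = [
        Sample(
            input="Extract the termination clause from the contract below: ...",
            target="Either party may terminate with 30 days' written notice.",
        ),
    ]
    return Task(
        dataset=dataset,
        solver=[generate()],  # call the model once per sample
        scorer=match(),       # string match to start; swap in a rubric scorer later
    )

# Run with e.g.: inspect eval minimal_task.py --model openai/gpt-4o-mini
```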

Build the domain layer that doesn't exist yet

This is the engagement. Golden sets sampled from your production traces. Rubrics that match how your senior reviewer actually grades. Metrics for the questions public benchmarks ignore - schema compliance, citation accuracy, brand-voice match, infinite-correct-answer evals via win-rate or rubric scoring.

  • Golden sets stratified by tenant, intent, and tail behavior
  • Win-rate evals with paired bootstrap and McNemar's test (when there's no single right answer)
  • Rubric scoring with explicit criteria for structured outputs and B2B-grade fields
  • Calibrated LLM-as-judge with human-vs-judge agreement tracked over time
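
For the win-rate comparisons above, the statistics reduce to a paired bootstrap interval on the win rate plus an exact McNemar test on the discordant pairs. A minimal sketch, assuming pairwise judgments encoded as +1 (candidate wins), -1 (baseline wins), and 0 (tie):

```python
"""Significance for pairwise win-rate evals - a sketch, not tied to any particular harness."""
import numpy as np
from scipy.stats import binomtest

# Placeholder per-example judgments: +1 candidate wins, -1 baseline wins, 0 tie.
outcomes = np.array([1, 1, 0, -1, 1, 1, -1, 1, 0, 1])

def win_rate(x: np.ndarray) -> float:
    decided = x[x != 0]
    return float((decided == 1).mean()) if decided.size else 0.5

def bootstrap_ci(x: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    rng = np.random.default_rng(0)
    stats = [win_rate(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Exact McNemar: among discordant pairs, is the win/loss split compatible with 50/50?
wins, losses = int((outcomes == 1).sum()), int((outcomes == -1).sum())
p_value = binomtest(wins, wins + losses, 0.5).pvalue

low, high = bootstrap_ci(outcomes)
print(f"win rate {win_rate(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}], McNemar p={p_value:.3f}")
```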

Wire it to CI as a deploy gate

An eval that runs once a quarter is documentation. The version your team trusts is the one where a failed run blocks a deploy. We integrate with your CI, set per-metric thresholds, and tier the suites so the smoke run finishes inside a PR feedback loop.

  • GitHub Actions / GitLab CI / Buildkite integration with PR comments
  • Tiered suites - smoke (50–200, on PR) / full (1–5K, nightly) / deep (10K+, weekly)
  • Per-metric thresholds with explicit rollback criteria
  • Sharded execution to keep PR runs under 15 minutes
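
The gate itself can be a short script the CI job runs after the suite: read the metric summary, compare it against versioned thresholds, and exit non-zero to block the merge. The file format and threshold values below are illustrative, not a fixed schema.

```python
#!/usr/bin/env python3
"""CI deploy-gate sketch: fail the job if any metric drops below its floor."""
import json
import sys

# Illustrative floors; in practice these live in the repo next to the eval suite.
THRESHOLDS = {
    "schema_compliance": 0.99,
    "citation_accuracy": 0.95,
    "win_rate_vs_baseline": 0.50,
}

def main(results_path: str = "eval_results.json") -> int:
    results = json.load(open(results_path))  # e.g. {"schema_compliance": 0.993, ...}
    failures = [
        f"{name}: {results.get(name, 0.0):.3f} < {floor:.3f}"
        for name, floor in THRESHOLDS.items()
        if results.get(name, 0.0) < floor
    ]
    for failure in failures:
        print(f"FAIL {failure}")
    return 1 if failures else 0  # non-zero exit status blocks the merge

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```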

// Method fit

You can't ship what you can't grade.

skip it if

  • It's a research prototype with no users

A notebook with twenty test cases is enough while you're still iterating on the structure of the system. Don't pay for CI integration before there's a CI to integrate with.

    For anything that touches a real customer, treat the eval suite as load-bearing.

use it if

Almost everything else. If real users will judge your output, you need evals to judge it first.

Most agencies sell "AI integration" with a smoke test on five hand-picked prompts. That's how teams learn about regressions from customer-success tickets instead of from a failed CI run.

We don't have a softer position. Off-the-shelf harness wired in week one (Inspect AI / ragas / promptfoo). Domain golden sets from your traces by week three. Calibrated LLM-as-judge and CI gates by week six.

Without this layer, every prompt change is a coin flip and every model upgrade is a leap of faith. You'll find out what broke - just not from the system you control.

// How we work

Reuse what works. Build what doesn't exist. Hand off the deploy gates.

Start from a working off-the-shelf harness within days. Build the domain layer over weeks. Hand off CI integration plus the runbook for adding cases and re-calibrating judges as the system evolves.

01

Stand up an off-the-shelf harness against your system

Inspect AI for capability suites, ragas for RAG-specific metrics, promptfoo for prompt diff, lm-eval-harness for academic baselines. Connected to your model, your prompts, and your trace store. Working dashboard inside the first week - even before the domain layer exists.
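
On the RAG path, the first run can be only a few lines. The sketch below assumes ragas' Dataset-based evaluate interface and placeholder data; the column names and metric set are what you'd adapt to your own traces, and the judge-based metrics need an LLM key configured.

```python
# ragas_smoke.py - first-week RAG metrics sketch (placeholder data; adapt columns to your trace store)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

samples = Dataset.from_dict({
    "question": ["What notice period does the contract require?"],
    "answer": ["Thirty days' written notice by either party."],
    "contexts": [["Section 9.2: Either party may terminate with 30 days' written notice."]],
})

report = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(report)  # per-metric scores you can push to Langfuse / W&B / MLflow
```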

02

Build the domain golden sets and rubrics

Production traces stratified by tenant, intent, and complaint signal. A few hundred to a few thousand examples labeled together with your domain experts. Rubrics for structured outputs, citation accuracy, tone, and the questions where infinite answers are correct (graded by win-rate, not exact match). Versioned, refreshed on a cadence we agree on.
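
Mechanically, the stratification is simple; the judgment is in choosing the strata. A sketch of sampling a golden set from a trace export, assuming a pandas DataFrame with tenant, intent, and complaint columns (your trace store's actual fields will differ):

```python
"""Stratified golden-set sampling sketch - column names are assumptions about your trace export."""
import pandas as pd

def sample_golden_set(traces: pd.DataFrame, per_stratum: int = 25, seed: int = 0) -> pd.DataFrame:
    # One cell per (tenant, intent, complaint-signal) combination, so tail behavior
    # and complained-about traffic are represented instead of drowned out.
    groups = traces.groupby(["tenant", "intent", "has_complaint"], group_keys=False)
    return groups.apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=seed))

traces = pd.read_parquet("traces_last_30d.parquet")  # placeholder export from your trace store
golden = sample_golden_set(traces)
golden.to_json("golden_set_v1.jsonl", orient="records", lines=True)  # versioned next to the rubrics
```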

03

Calibrate LLM-as-judge, wire to CI, hand off the runbook

100–500 human labels per metric to calibrate judges (agreement tracked over time, not assumed). Tiered eval suites in CI - smoke on PR, full nightly, deep weekly. Hand off the runbook for adding new cases, recalibrating after a judge model update, and refreshing golden sets when traffic shifts.
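
Calibration itself comes down to tracking agreement between the judge and the human labels, per metric and per judge-model version. A minimal sketch with raw agreement and Cohen's kappa (the label-file layout is an assumption):

```python
"""Judge-calibration sketch: agreement between human labels and LLM-as-judge verdicts."""
import json
from sklearn.metrics import cohen_kappa_score

# Assumed layout: one JSON object per line, {"example_id": ..., "human": "pass", "judge": "fail"}
records = [json.loads(line) for line in open("calibration_labels.jsonl")]

human = [r["human"] for r in records]
judge = [r["judge"] for r in records]

kappa = cohen_kappa_score(human, judge)
agreement = sum(h == j for h, j in zip(human, judge)) / len(records)
print(f"raw agreement {agreement:.2f}, Cohen's kappa {kappa:.2f}")
# Track both per judge-model version; a drop after a judge update triggers recalibration.
```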

// Expert insight

The first version of an eval suite is always wrong. The second version, after you've watched it disagree with your senior reviewer twenty times, is the one that earns the right to gate a deploy. The teams that get this don't think of evals as a checkbox - they treat the rubric as a living artifact, the same way they treat their tests. The teams that don't, ship a model upgrade because MMLU went up and find out from customer success three days later.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of a benchmark printout from your model vendor.

Your model vendor will hand you their leaderboard numbers. We build the evals that grade what your customers actually care about - and tell you the truth about whether the upgrade ships or rolls back.

CLARIN-PL research lineage

Spun out of academic NLP. We come from a tradition where benchmarks are scrutinized, not blindly trusted.

1B+ tokens/day in production

We've built eval pipelines that gate real deploys for real customers - not toy notebooks against public benchmarks.

Statistics-first methodology

Bootstrap CIs, paired tests, power analysis, and McNemar's on every comparison. Eyeballing leaderboards is how teams ship regressions.

16+ open-source models on Hugging Face

80K+ monthly downloads. We've evaluated more checkpoints than most teams have prompts.

On-prem & air-gapped capable

Eval pipelines that run inside your perimeter on your data - including environments where outbound traffic is blocked.

Senior team, no juniors

Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your engagement.

// FAQ

Common questions about LLM evaluation frameworks

What do public benchmarks like MMLU or HumanEval tell us about our product?

Nothing - until you ship. MMLU, HumanEval, GSM8K, MT-Bench, and AlpacaEval are useful for cross-model triage and for catching capability regressions on a model upgrade. They're useless for telling you whether your contract analyzer extracted the right clause, whether your chatbot's tone matches your brand, or whether your RAG cited the right source. Public benchmarks are a sanity floor; domain evals are the ceiling that matters for your product.

How do you evaluate tasks where many different answers would be correct?

Three approaches, often combined. Win-rate evals (your output vs. baseline, judged pairwise with paired bootstrap and McNemar's for significance) work for tone, persuasiveness, helpfulness. Rubric scoring breaks the answer into measurable criteria - citations present, schema compliant, no unsupported claims, on-brand voice - each scored independently. And reference-bag matching (acceptable answers as a set, not a single string) for cases where multiple specific outputs are correct.
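
The third approach is the simplest to show: a reference bag treats the target as a set of acceptable answers and counts a hit if the normalized output matches any of them. A minimal sketch; the normalization rules are illustrative, not prescriptive.

```python
"""Reference-bag matching sketch: several specific outputs all count as correct."""
import re

def normalize(text: str) -> str:
    # Illustrative normalization: lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def bag_match(output: str, acceptable: set[str]) -> bool:
    return normalize(output) in {normalize(a) for a in acceptable}

acceptable = {
    "30 days written notice",
    "thirty days' written notice",
    "written notice of 30 days",
}
print(bag_match("Thirty days' written notice.", acceptable))  # True
```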

How do you keep LLM-as-judge scores reliable?

Three layers. Calibration: judge-vs-human agreement on 100–500 labeled examples per metric. Bias correction: position swapping for pairwise comparisons, length normalization, self-preference penalties when the judge is also a candidate. Drift tracking: judge-vs-judge agreement across model versions, so when a vendor silently updates the model, we catch the metric shift before it corrupts a deploy decision.
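
Of those corrections, position swapping is the most mechanical: judge every pair twice with the order flipped and only record a preference when both orderings agree. In the sketch below, judge_pair is a hypothetical stand-in for your actual judge call.

```python
"""Position-swap sketch for pairwise LLM-as-judge (judge_pair is a hypothetical stand-in)."""

def judge_pair(first: str, second: str) -> str:
    """Hypothetical judge call; returns 'first', 'second', or 'tie'."""
    raise NotImplementedError

def debiased_verdict(candidate: str, baseline: str) -> str:
    forward = judge_pair(candidate, baseline)   # candidate shown first
    reverse = judge_pair(baseline, candidate)   # positions swapped
    if forward == "first" and reverse == "second":
        return "candidate"
    if forward == "second" and reverse == "first":
        return "baseline"
    return "tie"  # inconsistent or tied verdicts count as no preference
```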

When do you still need human evaluation?

Whenever the metric is genuinely subjective and your business cares - brand voice, persuasiveness, expert correctness in regulated domains, edge-case safety. Hybrid pipelines work best: human labels on the calibration set and ambiguous cases, LLM-as-judge for breadth. The split usually lands around 5% human, 95% automated, with disagreements flagged for review.

How do you keep eval cost and runtime under control?

Tiered suites - fast smoke (50–200 examples) on every PR, full (1–5K) nightly, deep (10K+) weekly or pre-release. Aggressive judge-call caching. Stratified subsamples where statistical power allows. Smaller calibrated judges where they agree with the larger model. Eval bills typically land at 1–5% of inference spend.
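
Judge-call caching is the single biggest lever: key the cache on everything that affects the verdict - rubric version, judge model, and the exact texts - so unchanged examples never hit the API twice. A minimal on-disk sketch; the directory layout and judge_fn signature are assumptions.

```python
"""Judge-call cache sketch: skip repeat calls for unchanged (rubric, model, prompt, output) tuples."""
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("judge_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(rubric_version: str, judge_model: str, prompt: str, output: str) -> str:
    payload = json.dumps([rubric_version, judge_model, prompt, output], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_judge(rubric_version, judge_model, prompt, output, judge_fn):
    path = CACHE_DIR / f"{cache_key(rubric_version, judge_model, prompt, output)}.json"
    if path.exists():
        return json.loads(path.read_text())
    score = judge_fn(prompt, output)  # judge_fn is your actual judge call
    path.write_text(json.dumps(score))
    return score
```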

How often do golden sets need refreshing?

Depends on traffic drift. Stable use cases (legal doc analysis, fixed-format extraction) - quarterly. Consumer-facing products with rapid intent drift - monthly. We set up data-drift monitors on production embeddings; when the distribution shifts past a threshold, the system flags the golden set for review.
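
The monitor itself can start simple: compare recent production embeddings against the embeddings of the current golden set and alert when the populations separate. The centroid-distance check below is a cheap first-order proxy for distribution shift, and the file names and threshold are placeholders.

```python
"""Embedding-drift sketch: flag the golden set for review when traffic moves away from it."""
import numpy as np

def centroid_cosine_distance(golden: np.ndarray, recent: np.ndarray) -> float:
    # Compare mean embeddings - a cheap first-order proxy for distribution shift.
    g, r = golden.mean(axis=0), recent.mean(axis=0)
    return 1.0 - float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r)))

DRIFT_THRESHOLD = 0.15  # placeholder; tune against historical shifts you consider benign

golden_emb = np.load("golden_set_embeddings.npy")    # placeholder artifacts from your pipeline
recent_emb = np.load("last_7d_trace_embeddings.npy")

if centroid_cosine_distance(golden_emb, recent_emb) > DRIFT_THRESHOLD:
    print("Drift detected: golden set flagged for refresh")
```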

Do we end up locked into a proprietary eval framework?

No. We build on top of existing harnesses where they fit (Inspect AI, lm-eval-harness, ragas, promptfoo) and extend them where they stop short. The framework, golden sets, rubrics, and judge configs are all your code in your repo. Swap the harness later and the data and rubrics carry over.

// Let's ship it

Build evals that earn the right to gate your deploys.

Tell us your application, your failure modes, and the metrics that would matter to your customers. We'll come back with an eval design and a number - usually within a business day.

Karol Gawron

Head of R&D @ bards.ai