// Production / Inference Optimization

Distillation. Smaller models, same answers.

Teacher-student training that takes a 70B model and gives you a 7B that holds up on your evals - at a fraction of the cost. We do this with eval gates, not vibes, and we tell you when distillation is the wrong tool.

distill.py

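import torch
import torch.nn.functional as F

def distill_loss(student, teacher, x, y, T=4.0, alpha=0.7):
  # Assumes both models return logits of shape (batch, classes).
  s_logits = student(x)
  with torch.no_grad():  # the teacher is frozen, so skip its gradients
    t_logits = teacher(x)

  # Soft-target KL at temperature T; the T*T factor keeps the gradient
  # scale comparable to the hard-label term.
  kl = F.kl_div(
    F.log_softmax(s_logits / T, dim=-1),
    F.softmax(t_logits / T, dim=-1),
    reduction="batchmean",
  ) * (T * T)

  # Hard-label cross-entropy at temperature 1.
  ce = F.cross_entropy(s_logits, y)
  return alpha * kl + (1 - alpha) * ce

# teacher: Llama-3-70B
# student: Llama-3.2-3B
# T=4.0, alpha=0.7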

70B → 3B · 23x cheaper · -2.1 pts on eval

// What we see

The frontier API bill is the symptom. The task is too narrow to justify it.

GPT-5 pricing on a classification step

A retrieval reranker, an intent classifier, a domain-specific extractor - steps that run thousands of times a day on narrow, well-defined inputs. Teams reach for the frontier model out of habit and end up with a $40K/month API bill for a task a 3B fine-tune would nail at $2K/month.

Latency the product team sold that the model can't hit

The product commits to a sub-500ms response. The 70B model takes 2 seconds on a loaded cluster. The latency budget and the model choice were made independently, and they don't fit. Swapping to a well-distilled 7B often closes that gap without touching the application.

Provider dependency baked into the unit economics

Every pricing change from OpenAI or Anthropic reprices the P&L. Every model update is a regression risk nobody tested. Teams discover this at the wrong moment - during a vendor negotiation or after a silent model update breaks production behavior.

// Case Study

Fine-tuned a small model to frontier quality - 50× cheaper at high volume

A customer's frontier-API entity-extraction pipeline worked, but the per-token bill was eating margin at the volume they wanted to ship. We split the task into hybrid retrieval plus two fine-tuned Gemini 2.5 Flash Lite models. The result: 98.3% F1 retention on the customer's existing eval suite, ~50× cheaper per 1,000 requests, and ~3× faster - without touching the prompt or the eval.

50× cheaper per 1,000 requests
3× lower end-to-end latency
98.3% F1 retention vs. frontier-API baseline

Read the case study

// What we deliver

Distillation pipelines, end to end.

From teacher selection to deployed student, with the eval harness checked into your repo so you can repeat it next quarter.

Teacher and student selection

The teacher determines the ceiling and the student determines the floor. We profile both before training starts - your task, your data, your latency budget.

Open-weight teachers: Llama, Qwen, DeepSeek, Mistral
Closed-model distillation via API logs and synthetic generation
Student sized to your latency budget, not arbitrary parameter count
Architecture matching where it helps, divergence where it pays off

Distillation losses that work

KL divergence on logits is the textbook answer. The real world wants attention transfer, hidden state matching, and task-specific objectives layered on top.

Soft-target KL divergence with temperature tuning
MSE on logits for sequence-level transfer
Hidden state and attention matrix alignment
Combined CE + distillation losses with curriculum scheduling
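
To make temperature tuning concrete, here is a minimal pure-Python sketch of what the temperature does to soft targets. The logits are made-up values and the helper names are ours for illustration; a real run would compute this on tensors inside the training loop.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits for a three-class example.
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [3.0, 2.5, 0.2]

for T in (1.0, 4.0):
    p_teacher = softmax_with_temperature(teacher_logits, T)
    p_student = softmax_with_temperature(student_logits, T)
    print(T, [round(p, 3) for p in p_teacher],
          round(kl_divergence(p_teacher, p_student), 4))
```

A higher temperature flattens the teacher distribution, so the student also sees how the teacher ranks the wrong classes rather than only its top pick.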

Distillation dataset curation

The data is the model. We mine your traffic, generate from the teacher, and filter aggressively - because teacher mistakes become student dogma.

Traffic mining and PII-safe replay from production
Teacher generation with self-consistency filtering
Difficulty-aware sampling so the student learns the hard cases
Domain coverage audits against your taxonomy

Eval harness and regression reports

If you can't measure it, you can't ship it. Every distilled model goes through a custom eval suite tied to deploy gates.

Task-specific eval suites, not just MMLU and GSM8K
Per-segment regression analysis on your real traffic
Pairwise judge models for open-ended tasks
Cost-per-quality curves so the tradeoff is explicit
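
Per-segment regression analysis can be sketched in a few lines. This is an illustration under assumptions: the record format, the segment names, and the `max_drop` threshold are hypothetical, and a real harness would also report sample sizes and confidence intervals.

```python
from collections import defaultdict

def per_segment_regression(records, max_drop=0.02):
    """Compare teacher and student accuracy per segment.

    records: iterable of (segment, teacher_correct, student_correct) tuples.
    Returns {segment: (teacher_acc, student_acc, flagged)} where flagged is
    True when the student trails the teacher by more than max_drop.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # count, teacher hits, student hits
    for segment, teacher_ok, student_ok in records:
        bucket = totals[segment]
        bucket[0] += 1
        bucket[1] += int(teacher_ok)
        bucket[2] += int(student_ok)

    report = {}
    for segment, (n, t_hits, s_hits) in totals.items():
        t_acc, s_acc = t_hits / n, s_hits / n
        report[segment] = (t_acc, s_acc, (t_acc - s_acc) > max_drop)
    return report

# Hypothetical eval records: the student matches the teacher on billing
# but misses one of four legal examples.
records = (
    [("billing", True, True)] * 4
    + [("legal", True, True)] * 3
    + [("legal", True, False)]
)
for segment, row in per_segment_regression(records).items():
    print(segment, row)
```

An aggregate score can hide a segment like this one, which is why the report is broken down before a deploy gate looks at it.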

Distillation vs alternatives

Sometimes quantization is the right call. Sometimes a sharper prompt and a smaller stock model wins. We tell you which lever to pull.

Quantization (AWQ, GPTQ, FP8) where the model is already small
Pruning and structured sparsity where compute is the bottleneck
LoRA and adapter routing instead of a full distillation
Honest recommendation when no compression is the answer

Production rollout and watch

A distilled model behaves differently under real traffic than on your eval set. We instrument the rollout so the regression doesn't surface in a customer ticket.

Shadow deployment alongside the teacher for direct comparison
Canary rollout gated on live eval metrics
Drift detection on input distribution and output quality
Easy rollback path with the teacher kept warm during ramp
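
As an illustration of input drift detection, the sketch below uses the Population Stability Index (PSI) over binned input features. PSI is one common choice that we are swapping in for the example, not necessarily the metric used on a given project; the bins and counts are made up.

```python
import math

def population_stability_index(baseline_counts, live_counts, eps=1e-6):
    """PSI between two histograms that share the same bins."""
    base_total = sum(baseline_counts)
    live_total = sum(live_counts)
    psi = 0.0
    for base, live in zip(baseline_counts, live_counts):
        p = max(base / base_total, eps)  # floor avoids log(0) on empty bins
        q = max(live / live_total, eps)
        psi += (q - p) * math.log(q / p)
    return psi

# Hypothetical prompt-length buckets: short, medium, long.
baseline = [500, 300, 200]      # traffic the student was trained on
live_stable = [50, 30, 20]      # same shape, smaller sample
live_shifted = [200, 300, 500]  # long prompts now dominate

print(round(population_stability_index(baseline, live_stable), 4))
print(round(population_stability_index(baseline, live_shifted), 4))
```

The alert threshold is a tuning decision per task; the point is that the check runs on live traffic continuously, not once at launch.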

// Method fit

Distillation earns its cost when the task is narrow and high-volume.

skip it if

Quantization gets you close enough
If your model is already a reasonable size and the cost gap is modest, AWQ INT4 or FP8 quantization gets 30-50% savings at near-zero engineering cost. Run the benchmark first - if quantization lands within your budget, distillation is overkill.
The task keeps shifting
Distillation locks in a capability snapshot. If your product is pre-PMF and the task definition changes sprint-to-sprint, you'll retrain before the first deployment pays back. Ship on the frontier API, find product fit, then distill what stabilizes.
Your teacher isn't reliable on the task
Distillation transfers both the teacher's strengths and its mistakes. If the teacher hallucinates 10% of the time on your domain, the student learns to hallucinate more efficiently. Fix the teacher's accuracy first - with better prompts, RAG, or fine-tuning - then distill.
The payback period is longer than 3 months
Simple math: engagement cost vs. monthly inference savings. If the payback period is longer than 3 months, the engagement rarely makes sense - models change too fast and the student needs retraining as the task evolves. We'll tell you this after a scoping call rather than take the project.
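
The payback math fits in one function. A minimal sketch: the monthly bills reuse the $40K and $2K figures from the example earlier on this page, while the $90K engagement cost is a made-up placeholder, not a quote.

```python
def payback_months(engagement_cost, current_monthly_cost, student_monthly_cost):
    """Months until cumulative inference savings cover the engagement cost."""
    monthly_savings = current_monthly_cost - student_monthly_cost
    if monthly_savings <= 0:
        return float("inf")  # the student never pays back
    return engagement_cost / monthly_savings

# Hypothetical engagement cost; monthly bills from the example above.
months = payback_months(90_000, 40_000, 2_000)
print(f"payback: {months:.1f} months, within 3 months: {months <= 3}")
```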

use it if

You have a high-volume, narrow task - extraction, classification, structured output, domain Q&A - where a frontier model is 5-20x more expensive than the task warrants.

Your teacher is consistently correct on that task (>90% eval score on your data) and you have enough production traffic to mine for training examples.

You want a reproducible training pipeline you can re-run as the task evolves - not a one-off checkpoint. The pipeline is the deliverable; the first student is just the first run.

// How we work

Eval first. Teach on filtered data. Shadow before you swap.

The number we're moving is 'student quality within X points of teacher at Y cost.' Without a task-specific eval in place before training starts, distillation is guesswork dressed up as engineering.
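
Written down as a gate, that target might look like the sketch below. The class, field names, and numbers are hypothetical; the real gate is defined per task during eval design.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float        # task eval score, in percentage points
    cost_per_1k: float  # inference cost per 1,000 requests

def passes_gate(teacher, student, max_gap_pts, max_cost_per_1k):
    """True when the student is within max_gap_pts of the teacher and under budget."""
    gap = teacher.score - student.score
    return gap <= max_gap_pts and student.cost_per_1k <= max_cost_per_1k

# Hypothetical numbers: X = 2.0 points, Y = 1.00 per 1,000 requests.
teacher = EvalResult(score=94.0, cost_per_1k=12.00)
student = EvalResult(score=92.5, cost_per_1k=0.60)
print(passes_gate(teacher, student, max_gap_pts=2.0, max_cost_per_1k=1.00))
```

Fixing X and Y before training starts is what keeps the gate honest: the thresholds are agreed up front, not adjusted to fit the result.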

Eval design + teacher profiling (week one)

Task-specific eval suite built from your production traces - not public benchmarks. We profile the teacher against it: per-segment accuracy, hallucination rate, and the tail of failures. The teacher gap sets the student's quality bar and tells us which training examples are safe to learn from.

Dataset construction + distillation run

Production traffic mined with PII-safe replay, teacher generation with self-consistency filtering, reward-model scoring to discard bad examples (we typically keep 20-40% of raw output). Distillation losses chosen per task - soft-target KL for general NLU, hidden-state alignment for structured outputs. Pilot on 10K examples first to validate the recipe before committing to the full run.

Shadow deployment + handoff

Student deployed alongside the teacher for direct comparison on real traffic. Canary ramp once shadow metrics hold. We hand off the training pipeline as code in your repo, the eval suite wired into CI, and a retraining runbook so the team can re-distill as the task evolves.

// Expert insight

“The mistake we see is teams distilling against academic benchmarks and then deploying for a product that looks nothing like MMLU. The eval has to be your eval - built from your traffic, judged the way your customers judge - or the distilled model will pass the test and fail the launch.”

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Researchers who train. Engineers who deploy.

Distillation is half ML research, half production engineering. We do both, and we don't hand off between them.

16+ open-source models on Hugging Face

80K+ monthly downloads. We've trained, distilled, and shipped models people actually use - not just papers about them.

CLARIN-PL spinoff, NLP research roots

We came out of one of Europe's strongest NLP research labs. Distillation losses and curriculum scheduling are not new territory for us.

Production training at scale

Multi-node training on H100s, FSDP, mixed precision, and the boring infrastructure that makes long runs not blow up overnight.

Eval harnesses we ship with

Every project includes a task-specific eval harness in your repo - so the next distillation, fine-tune, or prompt change is measurable too.

On-prem & air-gapped capable

We can train on your hardware, with your data, behind your firewall - including environments where outbound network is blocked.

Senior team, no juniors

Every engineer has trained and deployed models in production. We don't bring ramp-up time to your project.

// FAQ

Common questions about model distillation

On task-specific evals, a well-distilled student typically lands within 1-3 percentage points of a teacher 10x its size - sometimes closer when the task is narrow. On open-ended general evals the gap is larger. We measure this on your evals before promising anything, and we'll walk away if the regression is unacceptable.

They solve different problems. Quantization is cheapest when your model is already small enough - drop to FP8 or AWQ INT4 and ship. Pruning helps when compute, not memory, is the bottleneck. Distillation is the right choice when you need to drop one or two orders of magnitude in size and quantizing the big model isn't enough. Often the answer is a combination.

Yes - via API output distillation. We replay or generate prompts, capture teacher outputs, and train an open-weight student on the resulting dataset. Quality depends heavily on dataset coverage and filtering. Check your provider's terms of service first; some prohibit using outputs to train competing models.

Less than you'd think for narrow tasks - tens of thousands of high-quality teacher-labeled examples can be enough for a focused student. Broader behavior transfer wants hundreds of thousands to millions of examples, often generated and filtered. We typically combine real traffic, teacher synthetic generation, and curated public data.

We build a task-specific eval suite from your traffic and your acceptance criteria - not just MMLU and HellaSwag. For open-ended outputs we use pairwise judge models, often a separate strong LLM, validated against human ratings on a sample. The harness is checked into your repo so you can re-run it next quarter when something changes.

End-to-end distillation projects typically run 4 to 8 weeks. The first two weeks are dataset construction and eval harness; the next two to three are training and iteration; the rest is rollout and monitoring. We work in weekly increments with a measurable artifact at the end of each week.

Yes - we routinely train on customer infrastructure, including on-prem H100 clusters and air-gapped environments. We bring the training stack (FSDP, mixed precision, eval harness, checkpointing) and operate it on your boxes. Your data and your weights never leave your perimeter.

// Let's ship it

Cut your inference bill without cutting your quality.

Tell us your teacher, your task, and your latency budget. We'll come back with a distillation plan, an eval design, and a number - usually within a business day.

Book a meeting hello@bards.ai
