// Production / Inference Optimization
Distillation. Smaller models, same answers.
Teacher-student training that takes a 70B model and gives you a 7B that holds up on your evals - at a fraction of the cost. We do this with eval gates, not vibes, and we tell you when distillation is the wrong tool.
// What we see
The frontier API bill is the symptom. The task is too narrow to justify it.
01
GPT-5 pricing on a classification step
A retrieval reranker, an intent classifier, a domain-specific extractor - steps that run thousands of times a day on narrow, well-defined inputs. Teams reach for the frontier model out of habit and end up with a $40K/month API bill for a task a 3B fine-tune would nail for $2K/month.
02
Latency the product team sold that the model can't hit
The product commits to a sub-500ms response. The 70B model takes 2 seconds on a loaded cluster. The latency budget and the model choice were made independently, and they don't fit. Swapping to a well-distilled 7B often closes that gap without touching the application.
03
Provider dependency baked into the unit economics
Every pricing change from OpenAI or Anthropic reprices the P&L. Every model update is a regression risk nobody tested. Teams discover this at the wrong moment - during a vendor negotiation or after a silent model update breaks production behavior.
// Case Study
Fine-tuned a small model to frontier quality - 50× cheaper at high volume
The customer's frontier-API entity-extraction pipeline worked, but the per-token bill was eating margin at the volume they wanted to ship. We split the task into hybrid retrieval + two fine-tuned Gemini 2.5 Flash-Lite models. 98.3% F1 retention on the customer's existing eval suite, ~50× cheaper per 1000 requests, ~3× faster - without touching the prompt or the eval.
50×
cheaper per 1000 requests
3×
lower end-to-end latency
98.3%
F1 retention vs. frontier-API baseline

// What we deliver
Distillation pipelines, end to end.
From teacher selection to deployed student, with the eval harness checked into your repo so you can repeat it next quarter.
Teacher and student selection
The teacher determines the ceiling and the student determines the floor. We profile both before training starts - your task, your data, your latency budget.
- Open-weight teachers: Llama, Qwen, DeepSeek, Mistral
- Closed-model distillation via API logs and synthetic generation
- Student sized to your latency budget, not arbitrary parameter count
- Architecture matching where it helps, divergence where it pays off
Distillation losses that work
KL divergence on logits is the textbook answer. The real world wants attention transfer, hidden state matching, and task-specific objectives layered on top.
- Soft-target KL divergence with temperature tuning
- MSE on logits for sequence-level transfer
- Hidden state and attention matrix alignment
- Combined CE + distillation losses with curriculum scheduling
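To make the combined objective concrete, here is a minimal sketch of a soft-target KL term blended with hard-label cross-entropy for a single example. It is an illustration in plain Python, not a production recipe: the function names, the `alpha` weight, and the default temperature are assumptions, and a real training run would compute this on batched tensors in a deep learning framework.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def soft_target_kl(teacher_logits, student_logits, temperature):
    """KL(teacher || student) between temperature-softened distributions."""
    t_log = log_softmax(teacher_logits, temperature)
    s_log = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * CE(label, student) + (1 - alpha) * T^2 * KL(teacher || student).

    alpha and temperature are illustrative defaults, not tuned values.
    """
    hard = -log_softmax(student_logits)[label]
    soft = soft_target_kl(teacher_logits, student_logits, temperature)
    # The T^2 factor keeps the soft-term gradient scale comparable
    # when the temperature changes.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

A curriculum schedule could then be expressed as a function that changes `alpha` over training steps, for example leaning on the teacher's soft targets early and shifting weight to hard labels later.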
Distillation dataset curation
The data is the model. We mine your traffic, generate from the teacher, and filter aggressively - because teacher mistakes become student dogma.
- Traffic mining and PII-safe replay from production
- Teacher generation with self-consistency filtering
- Difficulty-aware sampling so the student learns the hard cases
- Domain coverage audits against your taxonomy
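Self-consistency filtering can be sketched as a majority vote over several teacher samples per prompt. The version below is a simplified illustration that assumes answers can be compared by exact match (as in classification or normalized extraction); the function name and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter

def self_consistency_filter(samples_by_prompt, min_agreement=0.8):
    """Keep a prompt only when enough teacher samples agree on one answer.

    samples_by_prompt maps each prompt to a list of answers sampled from
    the teacher. Returns (prompt, majority_answer) training pairs.
    """
    kept = []
    for prompt, answers in samples_by_prompt.items():
        if not answers:
            continue  # nothing sampled, nothing to learn from
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= min_agreement:
            kept.append((prompt, answer))
    return kept

# Hypothetical teacher samples for two prompts.
samples = {
    "Classify: 'refund my order'": ["refund", "refund", "refund", "refund", "billing"],
    "Classify: 'it is broken again'": ["complaint", "support", "refund", "complaint", "support"],
}
training_pairs = self_consistency_filter(samples, min_agreement=0.6)
```

Prompts where the teacher disagrees with itself are dropped rather than relabeled, which trades dataset size for label reliability.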
Eval harness and regression reports
If you can't measure it, you can't ship it. Every distilled model goes through a custom eval suite tied to deploy gates.
- Task-specific eval suites, not just MMLU and GSM8K
- Per-segment regression analysis on your real traffic
- Pairwise judge models for open-ended tasks
- Cost-per-quality curves so the tradeoff is explicit
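A per-segment regression check tied to a deploy gate might look like the sketch below. The segment names, the scores, and the 3-point `max_drop` tolerance are illustrative assumptions; a real gate would read them from the eval harness output.

```python
def per_segment_regressions(teacher_scores, student_scores, max_drop=0.03):
    """Return segments where the student trails the teacher by more than max_drop.

    Both arguments map a segment name to accuracy in [0, 1] measured on
    the same eval set. A segment missing from the student counts as 0.0.
    """
    failures = {}
    for segment, teacher_acc in teacher_scores.items():
        drop = teacher_acc - student_scores.get(segment, 0.0)
        if drop > max_drop:
            failures[segment] = round(drop, 4)
    return failures

def passes_deploy_gate(teacher_scores, student_scores, max_drop=0.03):
    """The gate passes only when no segment regresses beyond the tolerance."""
    return not per_segment_regressions(teacher_scores, student_scores, max_drop)

# Hypothetical eval results.
teacher = {"invoices": 0.95, "contracts": 0.92}
student = {"invoices": 0.94, "contracts": 0.85}
regressions = per_segment_regressions(teacher, student)
```

Gating per segment rather than on the aggregate score keeps a strong majority segment from hiding a regression on a small but important one.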
Distillation vs alternatives
Sometimes quantization is the right call. Sometimes a sharper prompt and a smaller stock model win. We tell you which lever to pull.
- Quantization (AWQ, GPTQ, FP8) where the model is already small
- Pruning and structured sparsity where compute is the bottleneck
- LoRA and adapter routing instead of a full distillation
- Honest recommendation when no compression is the answer
Production rollout and watch
A distilled model behaves differently under real traffic than on your eval set. We instrument the rollout so the regression doesn't surface in a customer ticket.
- Shadow deployment alongside the teacher for direct comparison
- Canary rollout gated on live eval metrics
- Drift detection on input distribution and output quality
- Easy rollback path with the teacher kept warm during ramp
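One common way to quantify input or output drift is the population stability index (PSI) between a reference window and live traffic. The sketch below assumes a categorical signal such as the predicted label; the function name and the `eps` floor are assumptions, and the thresholds people quote for PSI are rules of thumb rather than guarantees.

```python
import math
from collections import Counter

def population_stability_index(reference, live, eps=1e-6):
    """PSI between two samples of a categorical value (e.g. predicted label).

    0 means identical distributions; larger values mean more drift.
    eps floors empty categories so the logarithm stays defined.
    """
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref_counts, live_counts = Counter(reference), Counter(live)
    psi = 0.0
    for category in set(ref_counts) | set(live_counts):
        ref_p = max(ref_counts[category] / len(reference), eps)
        live_p = max(live_counts[category] / len(live), eps)
        psi += (live_p - ref_p) * math.log(live_p / ref_p)
    return psi

# Hypothetical label mix at eval time vs. in production.
baseline = ["billing"] * 50 + ["support"] * 50
drifted = ["billing"] * 90 + ["support"] * 10
```

Calling `population_stability_index(baseline, drifted)` on this example gives a value well above zero, which would be a reason to re-run the eval suite on a fresh traffic sample before trusting the student further.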
// Method fit
Distillation earns its cost when the task is narrow and high-volume.
skip it if
Quantization gets you close enough
If your model is already a reasonable size and the cost gap is modest, AWQ INT4 or FP8 quantization gets 30-50% savings at near-zero engineering cost. Run the benchmark first - if quantization lands within your budget, distillation is overkill.
The task keeps shifting
Distillation locks in a capability snapshot. If your product is pre-PMF and the task definition changes sprint-to-sprint, you'll retrain before the first deployment pays back. Ship on the frontier API, find product fit, then distill what stabilizes.
Your teacher isn't reliable on the task
Distillation transfers both the teacher's strengths and its mistakes. If the teacher hallucinates 10% of the time on your domain, the student learns to hallucinate more efficiently. Fix the teacher's accuracy first - with better prompts, RAG, or fine-tuning - then distill.
The payback period is longer than 3 months
Simple math: engagement cost vs. monthly inference savings. If the payback period is longer than 3 months, the engagement rarely makes sense - models change too fast and the student needs retraining as the task evolves. We'll tell you this after a scoping call rather than take the project.
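The payback arithmetic fits in a few lines. In the sketch below the monthly figures reuse the $40K and $2K example from earlier on this page, while the engagement cost is a purely hypothetical number chosen for illustration.

```python
def payback_months(engagement_cost, monthly_cost_before, monthly_cost_after):
    """Months until cumulative inference savings cover the engagement cost."""
    monthly_savings = monthly_cost_before - monthly_cost_after
    if monthly_savings <= 0:
        return float("inf")  # no savings, so the project never pays back
    return engagement_cost / monthly_savings

# Hypothetical engagement cost; monthly costs from the example above.
months = payback_months(
    engagement_cost=76_000,
    monthly_cost_before=40_000,
    monthly_cost_after=2_000,
)
worth_doing = months <= 3
```

With these inputs the savings are $38K per month, so the engagement pays back in two months and clears the 3-month bar.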
use it if
You have a high-volume, narrow task - extraction, classification, structured output, domain Q&A - where a frontier model is 5-20× more expensive than the task warrants.
Your teacher is consistently correct on that task (>90% eval score on your data) and you have enough production traffic to mine for training examples.
You want a reproducible training pipeline you can re-run as the task evolves - not a one-off checkpoint. The pipeline is the deliverable; the first student is just the first run.
// How we work
Eval first. Teach on filtered data. Shadow before you swap.
The number we're moving is 'student quality within X points of teacher at Y cost.' Without a task-specific eval in place before training starts, distillation is guesswork dressed up as engineering.
01
Eval design + teacher profiling (week one)
Task-specific eval suite built from your production traces - not public benchmarks. We profile the teacher against it: per-segment accuracy, hallucination rate, and the tail of failures. The teacher gap sets the student's quality bar and tells us which training examples are safe to learn from.
02
Dataset construction + distillation run
Production traffic mined with PII-safe replay, teacher generation with self-consistency filtering, reward-model scoring to discard bad examples (we typically keep 20-40% of raw output). Distillation losses chosen per task - soft-target KL for general NLU, hidden-state alignment for structured outputs. Pilot on 10K examples first to validate the recipe before committing to the full run.
03
Shadow deployment + handoff
Student deployed alongside the teacher for direct comparison on real traffic. Canary ramp once shadow metrics hold. We hand off the training pipeline as code in your repo, the eval suite wired into CI, and a retraining runbook so the team can re-distill as the task evolves.
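The "canary ramp once shadow metrics hold" step can be sketched as a small decision function. Everything here is an assumption for illustration: the ramp steps, the agreement metric (share of shadow requests where student and teacher outputs match), and the 0.97 threshold.

```python
def next_canary_share(current_share, shadow_agreement,
                      min_agreement=0.97,
                      steps=(0.01, 0.05, 0.25, 0.5, 1.0)):
    """Return the student's next traffic share, or 0.0 to roll back.

    shadow_agreement is the fraction of compared requests where the
    student's output matched the teacher's.
    """
    if shadow_agreement < min_agreement:
        return 0.0  # metrics no longer hold: send all traffic to the teacher
    for share in steps:
        if share > current_share:
            return share
    return 1.0  # already fully ramped
```

For example, `next_canary_share(0.05, 0.99)` moves the student from 5% to 25% of traffic, while `next_canary_share(0.05, 0.90)` returns 0.0 and hands traffic back to the teacher.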
// Expert insight
“The mistake we see is teams distilling against academic benchmarks and then deploying for a product that looks nothing like MMLU. The eval has to be your eval - built from your traffic, judged the way your customers judge - or the distilled model will pass the test and fail the launch.”
Karol Gawron
Head of R&D @ bards.ai
// Why bards.ai
Researchers who train. Engineers who deploy.
Distillation is half ML research, half production engineering. We do both, and we don't hand off between them.
16+ open-source models on Hugging Face
80K+ monthly downloads. We've trained, distilled, and shipped models people actually use - not just papers about them.
CLARIN-PL spinoff, NLP research roots
We came out of one of Europe's strongest NLP research labs. Distillation losses and curriculum scheduling are not new territory for us.
Production training at scale
Multi-node training on H100s, FSDP, mixed precision, and the boring infrastructure that makes long runs not blow up overnight.
Eval harnesses we ship with
Every project includes a task-specific eval harness in your repo - so the next distillation, fine-tune, or prompt change is measurable too.
On-prem & air-gapped capable
We can train on your hardware, with your data, behind your firewall - including environments where outbound network is blocked.
Senior team, no juniors
Every engineer has trained and deployed models in production. We don't bring ramp-up time to your project.
// FAQ
Common questions about model distillation
On task-specific evals, a well-distilled student typically lands within 1-3 percentage points of a teacher 10× its size - sometimes closer when the task is narrow. On open-ended general evals the gap is larger. We measure this on your evals before promising anything, and we'll walk away if the regression is unacceptable.
They solve different problems. Quantization is cheapest when your model is already small enough - drop to FP8 or AWQ INT4 and ship. Pruning helps when compute, not memory, is the bottleneck. Distillation is the right choice when you need to drop one or two orders of magnitude in size and quantizing the big model isn't enough. Often the answer is a combination.
Yes - via API output distillation. We replay or generate prompts, capture teacher outputs, and train an open-weight student on the resulting dataset. Quality depends heavily on dataset coverage and filtering. Check your provider's terms of service first; some prohibit using outputs to train competing models.
Less than you'd think for narrow tasks - tens of thousands of high-quality teacher-labeled examples can be enough for a focused student. Broader behavior transfer wants hundreds of thousands to millions of examples, often generated and filtered. We typically combine real traffic, teacher synthetic generation, and curated public data.
We build a task-specific eval suite from your traffic and your acceptance criteria - not just MMLU and HellaSwag. For open-ended outputs we use pairwise judge models, often a separate strong LLM, validated against human ratings on a sample. The harness is checked into your repo so you can re-run it next quarter when something changes.
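Validating a judge model against human ratings usually comes down to an agreement statistic on a labeled sample. One option is Cohen's kappa, which corrects raw agreement for chance. The sketch below assumes pairwise verdicts encoded as "A", "B", or "tie"; the function name and the example data are hypothetical.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between judge and human pairwise verdicts."""
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    # Agreement expected if both raters labeled independently at their base rates.
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on four comparisons.
human = ["A", "B", "A", "B"]
agreeing_judge = ["A", "B", "A", "B"]
uninformative_judge = ["A", "A", "B", "B"]
```

Here the agreeing judge scores 1.0 and the uninformative judge scores 0.0, even though the latter still matches the humans on half the items. That gap is why raw agreement alone can overstate how trustworthy a judge is.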
End-to-end distillation projects typically run 4 to 8 weeks. The first two weeks are dataset construction and eval harness; the next two to three are training and iteration; the rest is rollout and monitoring. We work in weekly increments with a measurable artifact at the end of each week.
Yes - we routinely train on customer infrastructure, including on-prem H100 clusters and air-gapped environments. We bring the training stack (FSDP, mixed precision, eval harness, checkpointing) and operate it on your boxes. Your data and your weights never leave your perimeter.
// Let's ship it
Cut your inference bill without cutting your quality.
Tell us your teacher, your task, and your latency budget. We'll come back with a distillation plan, an eval design, and a number - usually within a business day.