// Production / Inference Optimization

Distillation. Smaller models, same answers.

Teacher-student training that takes a 70B model and gives you a 7B that holds up on your evals - at a fraction of the cost. We do this with eval gates, not vibes, and we tell you when distillation is the wrong tool.

// Why distill

Cut inference cost without re-architecting the application.

01

5-20x cheaper inference

A well-distilled 7B serves your task at a small fraction of the GPU footprint of a 70B teacher - and runs on hardware you can actually scale to in a region where the big model's quota is gone.

02

Latency that fits your SLA

Smaller models mean shorter prefill, smaller KV cache, faster TTFT. The 70B is academically interesting; the 3B answers in 200ms and stays inside the latency budget your product team committed to.

03

Quality you can defend

We don't ship a smaller model and hope. Every distillation comes with a regression report on your evals - task-specific, not just MMLU - and a clear go/no-go before anything reaches production.

// What we deliver

Distillation pipelines, end to end.

From teacher selection to deployed student, with the eval harness checked into your repo so you can repeat it next quarter.

Teacher and student selection

The teacher sets the quality ceiling; the student's capacity determines how much of it you keep. We profile both before training starts - your task, your data, your latency budget.

  • Open-weight teachers: Llama, Qwen, DeepSeek, Mistral
  • Closed-model distillation via API logs and synthetic generation
  • Student sized to your latency budget, not arbitrary parameter count
  • Architecture matching where it helps, divergence where it pays off

Distillation losses that work

KL divergence on the teacher's softened logits is the textbook answer. The real world wants attention transfer, hidden state matching, and task-specific objectives layered on top - a minimal sketch of the combined loss follows the list below.

  • Soft-target KL divergence with temperature tuning
  • MSE on logits for sequence-level transfer
  • Hidden state and attention matrix alignment
  • Combined CE + distillation losses with curriculum scheduling
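
A minimal sketch of the first and last bullets combined - soft-target KL with temperature plus cross-entropy on hard labels - in PyTorch. The alpha weighting and temperature are illustrative placeholders, not the schedule we'd ship:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-target KL + hard-label CE; weights and temperature are placeholders."""
    # Soften both distributions; kl_div expects student log-probs vs. teacher probs.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps the soft term's gradient magnitude comparable as T grows.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against hard labels (token logits flattened to (N, vocab)).
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

In a real run, alpha and temperature would typically follow a schedule rather than stay fixed - that is where the curriculum piece comes in.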

Distillation dataset curation

The data is the model. We mine your traffic, generate from the teacher, and filter aggressively - because teacher mistakes become student dogma. The simplest version of that filtering is sketched after the list.

  • Traffic mining and PII-safe replay from production
  • Teacher generation with self-consistency filtering
  • Difficulty-aware sampling so the student learns the hard cases
  • Domain coverage audits against your taxonomy
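
One way to read the self-consistency bullet, as a sketch: sample the teacher several times per prompt and keep only the examples where a clear majority agrees. The `generate` and `extract_answer` helpers are hypothetical stand-ins for your teacher client and answer parser:

```python
from collections import Counter

def self_consistency_filter(prompts, generate, extract_answer,
                            n_samples: int = 5, min_agreement: float = 0.6):
    """Keep (prompt, answer) pairs only when the teacher agrees with itself.

    generate(prompt, n) -> list[str] and extract_answer(text) -> str are
    hypothetical helpers: the first samples n teacher completions, the second
    normalizes a completion into a comparable answer (a label, a number, ...).
    """
    kept = []
    for prompt in prompts:
        answers = [extract_answer(c) for c in generate(prompt, n_samples)]
        top_answer, count = Counter(answers).most_common(1)[0]
        if count / n_samples >= min_agreement:
            kept.append({"prompt": prompt, "answer": top_answer})
    return kept
```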

Eval harness and regression reports

If you can't measure it, you can't ship it. Every distilled model goes through a custom eval suite tied to deploy gates - one such gate is sketched after the list.

  • Task-specific eval suites, not just MMLU and GSM8K
  • Per-segment regression analysis on your real traffic
  • Pairwise judge models for open-ended tasks
  • Cost-per-quality curves so the tradeoff is explicit
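
A sketch of the regression gate behind the go/no-go, assuming per-segment scores are already computed for teacher and student; the segment names and the 2-point threshold are illustrative, not our defaults:

```python
def regression_gate(teacher_scores, student_scores, max_drop_pts: float = 2.0):
    """Fail the gate if any traffic segment regresses by more than max_drop_pts.

    Both arguments map segment name -> score in percentage points,
    e.g. {"billing": 91.4, "refunds": 88.0}; names here are placeholders.
    """
    report, go = {}, True
    for segment, teacher in teacher_scores.items():
        student = student_scores[segment]
        drop = teacher - student
        report[segment] = {"teacher": teacher, "student": student, "drop": drop}
        if drop > max_drop_pts:
            go = False
    return go, report
```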

Distillation vs alternatives

Sometimes quantization is the right call. Sometimes a sharper prompt and a smaller stock model wins. We tell you which lever to pull.

  • Quantization (AWQ, GPTQ, FP8) where the model is already small
  • Pruning and structured sparsity where compute is the bottleneck
  • LoRA and adapter routing instead of a full distillation
  • Honest recommendation when no compression is the answer

Production rollout and monitoring

A distilled model behaves differently under real traffic than on your eval set. We instrument the rollout so the regression doesn't surface in a customer ticket; the shadow pattern is sketched after the list.

  • Shadow deployment alongside the teacher for direct comparison
  • Canary rollout gated on live eval metrics
  • Drift detection on input distribution and output quality
  • Easy rollback path with the teacher kept warm during ramp
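
A sketch of the first bullet: the student answers the user, and a sample of traffic is mirrored to the teacher off the critical path. `serve_student`, `serve_teacher`, and `log_pair` are hypothetical hooks into your serving stack:

```python
import asyncio
import random

SHADOW_RATE = 0.1  # fraction of requests mirrored to the teacher

async def handle_request(request, serve_student, serve_teacher, log_pair):
    """Serve from the student; shadow a sample of traffic to the teacher."""
    student_answer = await serve_student(request)           # what the user gets
    if random.random() < SHADOW_RATE:
        async def shadow():
            teacher_answer = await serve_teacher(request)   # never on the hot path
            await log_pair(request, student_answer, teacher_answer)
        asyncio.create_task(shadow())  # fire and forget; keep a reference in real code
    return student_answer
```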

// Expert insight

The mistake we see is teams distilling against academic benchmarks and then deploying for a product that looks nothing like MMLU. The eval has to be your eval - built from your traffic, judged the way your customers judge - or the distilled model will pass the test and fail the launch.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Researchers who train. Engineers who deploy.

Distillation is half ML research, half production engineering. We do both, and we don't hand off between them.

16+ open-source models on Hugging Face

80K+ monthly downloads. We've trained, distilled, and shipped models people actually use - not just papers about them.

CLARIN-PL spinoff, NLP research roots

We came out of one of Europe's strongest NLP research labs. Distillation losses and curriculum scheduling are not new territory for us.

Production training at scale

Multi-node training on H100s, FSDP, mixed precision, and the boring infrastructure that makes long runs not blow up overnight.

Eval harnesses we ship with

Every project includes a task-specific eval harness in your repo - so the next distillation, fine-tune, or prompt change is measurable too.

On-prem & air-gapped capable

We can train on your hardware, with your data, behind your firewall - including environments where outbound network is blocked.

Senior team, no juniors

Every engineer has trained and deployed models in production. We don't bring ramp-up time to your project.

// FAQ

Common questions about model distillation

How much quality do we lose going from teacher to student?

On task-specific evals, a well-distilled student typically lands within 1-3 percentage points of a teacher 10x its size - sometimes closer when the task is narrow. On open-ended general evals the gap is larger. We measure this on your evals before promising anything, and we'll walk away if the regression is unacceptable.

Should we distill, quantize, or prune?

They solve different problems. Quantization is cheapest when your model is already small enough - drop to FP8 or AWQ INT4 and ship. Pruning helps when compute, not memory, is the bottleneck. Distillation is the right choice when you need to drop one or two orders of magnitude in size and quantizing the big model isn't enough. Often the answer is a combination.

Can you distill from a closed model we only access through an API?

Yes - via API output distillation. We replay or generate prompts, capture teacher outputs, and train an open-weight student on the resulting dataset. Quality depends heavily on dataset coverage and filtering. Check your provider's terms of service first; some prohibit using outputs to train competing models.
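
As a sketch of the capture step, assuming the OpenAI Python client and a prompt set you've already mined or generated; the model name, sampling settings, and file path are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def capture_teacher_outputs(prompts, model="gpt-4o", path="distill_data.jsonl"):
    """Write (prompt, teacher completion) pairs to a JSONL file for student training."""
    with open(path, "w") as f:
        for prompt in prompts:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            completion = response.choices[0].message.content
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```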

How much training data does distillation need?

Less than you'd think for narrow tasks - tens of thousands of high-quality teacher-labeled examples can be enough for a focused student. Broader behavior transfer wants hundreds of thousands to millions of examples, often generated and filtered. We typically combine real traffic, teacher synthetic generation, and curated public data.

How do you evaluate the distilled model?

We build a task-specific eval suite from your traffic and your acceptance criteria - not just MMLU and HellaSwag. For open-ended outputs we use pairwise judge models, often a separate strong LLM, validated against human ratings on a sample. The harness is checked into your repo so you can re-run it next quarter when something changes.

How long does a distillation project take?

End-to-end distillation projects typically run 4 to 8 weeks. The first two weeks are dataset construction and eval harness; the next two to three are training and iteration; the rest is rollout and monitoring. We work in weekly increments with a measurable artifact at the end of each week.

Can you train on our infrastructure?

Yes - we routinely train on customer infrastructure, including on-prem H100 clusters and air-gapped environments. We bring the training stack (FSDP, mixed precision, eval harness, checkpointing) and operate it on your boxes. Your data and your weights never leave your perimeter.

// Let's ship it

Cut your inference bill without cutting your quality.

Tell us your teacher, your task, and your latency budget. We'll come back with a distillation plan, an eval design, and a number - usually within a business day.

Karol Gawron

Head of R&D @ bards.ai