// Research / Custom Fine-tuning

SFT that actually ships.

Supervised fine-tuning the way the tooling now supports it: Axolotl or LlamaFactory configs you can re-run, TRL's SFTTrainer when you want closer-to-the-metal control, Unsloth when you need single-GPU long-context, FSDP when you need full fine-tunes at scale. LoRA / QLoRA / full FT chosen by the data shape, not by habit. Templates correct, eval gates wired, every run reproducible.

eval/sft-vs-frontier · brand-classification

F1 vs. cost: SFT lifts Gemini 2.5 Flash Lite from F1 0.84 → 0.91 - matching GPT-5.1 at ~1/20th the cost and ~1/3rd the latency.

// What we see

Most failed SFT runs failed before training started.

01

Wrong chat template, silently degraded model

Llama, Qwen, Mistral, Gemma each ship a different chat template. Train with the wrong one and the model technically learns - but loses the structured output behavior the base model had. The loss curve looks fine. The eval doesn't. Half the SFT engagements we audit hit this exact bug.
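
You can see the mismatch directly by rendering the same conversation through two families' tokenizers - a minimal sketch with Hugging Face transformers (model IDs are illustrative; some checkpoints are gated on the Hub):

```python
from transformers import AutoTokenizer

messages = [
    {"role": "user", "content": "Extract the brand: 'Acme running shoes'"},
    {"role": "assistant", "content": '{"brand": "Acme"}'},
]

# Each family wraps the same conversation in different special tokens -
# data rendered for one family silently misleads another.
for model_id in ("Qwen/Qwen2.5-7B-Instruct", "meta-llama/Meta-Llama-3-8B-Instruct"):
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"--- {model_id} ---")
    print(tok.apply_chat_template(messages, tokenize=False))
```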

02

LoRA targeting only attention, not MLPs

Default LoRA configs in old tutorials apply adapters only to q/k/v/o. On modern architectures that leaves performance on the floor - the MLP gate/up/down projections often matter more. Per Thinking Machines' recent work, LoRA underperforms when adapters aren't applied to all linear layers, especially on MoE models.
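
What that looks like in a PEFT config - a hedged sketch where rank, alpha, and dropout are placeholders to sweep, not a recommendation; module names match Llama/Qwen-style decoders:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=32,               # placeholder rank - sweep per task
    lora_alpha=64,
    lora_dropout=0.05,
    # Attention *and* MLP projections - not just the q/k/v/o default
    # from older tutorials.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    # Recent PEFT releases also accept target_modules="all-linear".
)
```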

03

Hyperparameters by folklore

Learning rates copy-pasted from blog posts. Epoch counts based on "3 sounds right." No ablations, no early stopping, no eval gates. The model overfits at epoch 2 but the training continues to epoch 5 because that's what the YAML said. Eval-gated training fixes most of this in two lines of config.
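
"Two lines" is barely an exaggeration. A minimal sketch with transformers (argument names shift slightly across versions - older releases spell eval_strategy as evaluation_strategy):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=5,                 # an upper bound, not a target
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,        # keep the best checkpoint, not the last
    metric_for_best_model="eval_loss",  # or your held-out F1 / win-rate
    greater_is_better=False,
)

# Stop when the held-out eval stops improving, whatever the YAML says about epochs.
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
```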

// Case Study

Fine-tuned a small model to frontier quality - 50× cheaper at high volume

Customer's frontier-API entity-extraction pipeline worked, but the per-token bill was eating margin at the volume they wanted to ship. We split the task into hybrid retrieval + two fine-tuned Gemini 2.5 Flash Lite models. 98.3% F1 retention on the customer's existing eval suite, ~50× cheaper per 1000 requests, ~3× faster - without touching the prompt or the eval.

  • 50× cheaper per 1000 requests
  • ~3× lower end-to-end latency
  • 98.3% F1 retention vs. frontier-API baseline

Read the case study

// What we do

Three layers, from framework choice to deploy gate.

Pick the training framework that fits the engagement. Get the templates and hyperparameters right. Wire eval gates so the run stops when it should and only the good checkpoints ship.

Framework + adapter strategy

Axolotl for config-driven multi-GPU runs, TRL SFTTrainer when we want low-level control, LlamaFactory for breadth + LlamaBoard UI, Unsloth for single-H100 long-context (90% VRAM reduction, FP8 viable on consumer cards), MS-Swift / OpenRLHF / VeRL when scale demands it. LoRA when iteration speed matters; QLoRA when memory's the bottleneck; full FT on FSDP / DeepSpeed ZeRO-3 when adapters underfit. A minimal QLoRA setup is sketched below the list.

  • Axolotl, TRL, LlamaFactory, Unsloth, MS-Swift, VeRL - picked per engagement
  • QLoRA-SFT 7B/13B on a single H100 in 18-36h, ~$50-100 compute
  • Full SFT on Llama 70B with FSDP / DeepSpeed ZeRO-3 across 8-32 GPUs
  • All-linear-layer LoRA targeting (q/k/v/o + gate/up/down) - not the default 4-module recipe
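
For illustration, the QLoRA path sketched with TRL + bitsandbytes - model, dataset, and hyperparameters are placeholders, and exact SFTTrainer arguments vary a little across TRL versions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-7B-Instruct"   # placeholder base model
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder data

# 4-bit NF4 quantization so a 7B fits on a single GPU with headroom.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=LoraConfig(
        task_type="CAUSAL_LM", r=16, lora_alpha=32, target_modules="all-linear"
    ),
    args=SFTConfig(
        output_dir="qlora-out",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=2,
    ),
)
trainer.train()
```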

Template + hyperparameter discipline

Chat-template parity with the base model (ChatML, Llama, Qwen, Gemma, Phi-4 - each gets the right one). Sample packing for token efficiency. Learning-rate schedule and warmup that match the model size, not a 7B cookbook applied to a 70B run. Sequence length sized for the data, not for a round number - see the sketch after the list.

  • Chat-template normalization with tokenizer-level validation
  • Sample packing + sequence length sized to the dataset's distribution
  • LR / warmup / weight-decay schedules tuned per model family
  • Gradient checkpointing + flash-attention-2 / flash-attention-3 by default
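
A sketch of sizing sequence length from the data rather than a round number, then packing in TRL - `tokenizer` and `dataset` stand in for the engagement's own, and the parameter is named max_seq_length in older TRL releases:

```python
import numpy as np
from trl import SFTConfig

# Token-length distribution of the rendered training examples.
lengths = [
    len(tokenizer.apply_chat_template(ex["messages"], tokenize=True))
    for ex in dataset
]
max_len = int(np.percentile(lengths, 99))

args = SFTConfig(
    output_dir="out",
    max_length=max_len,  # cover ~99% of examples instead of a hard-coded 4096
    packing=True,        # pack short examples together for token efficiency
)
```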

Eval-gated training + handoff

Eval suites mined from production traces and the failure modes you're trying to fix. Held-out preference eval, win-rate, and a small smoke set that runs every N steps. Early stopping on the eval, not on the loss. W&B or MLflow tracking with hashed configs and dataset fingerprints, so the run is reproducible six months later when you need to retrain. The promotion gate is sketched after the list.

  • Per-step eval on smoke suite, per-epoch eval on full suite
  • Early stopping driven by held-out eval, not by step count
  • W&B / MLflow tracking with config + dataset hash for full reproducibility
  • Final checkpoint promoted only on eval pass - not the last epoch by default
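
And a minimal sketch of the promotion gate itself - the threshold is a hypothetical bar agreed per engagement, and it assumes the tracked metric is greater-is-better (e.g. F1, not loss):

```python
PROMOTION_THRESHOLD = 0.90  # hypothetical quality bar

# trainer.state.best_metric holds the best held-out eval score seen during
# training when load_best_model_at_end / metric_for_best_model are configured.
best = trainer.state.best_metric

if best is not None and best >= PROMOTION_THRESHOLD:
    trainer.save_model("checkpoints/promoted")   # promote; push_to_hub() if desired
else:
    raise SystemExit(f"Eval gate failed: best={best}, bar={PROMOTION_THRESHOLD}")
```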

// Method fit

SFT is the right move when you have data and a base model.

skip it if

  • You don't have training data yet

    SFT eats demonstrations. If you don't have them, you need a synthesis pipeline first - teacher distillation, self-instruct, evol-instruct, persona-based generation. Get the data, then train. We do both halves; this is just the second one.

    Synthetic Data Pipelines
  • Prompt engineering already gets you there

    If a well-crafted system prompt + few-shot examples on a frontier model lands at your quality bar, SFT is overkill. The math is straightforward - fine-tuning costs more than prompt iteration unless inference cost or latency justifies a smaller deployed model.

  • SFT alone won't reach the bar - you need preference optimization

    If your eval shows SFT still has a gap and the gap is in subjective quality (style, safety, persuasiveness, last-mile reasoning), SFT is the bootstrap, not the destination. DPO/SimPO/KTO or GRPO is the next step - and the engagement cleanly stacks on top of this one.

    Preference Optimization (DPO / KTO / GRPO)
  • You're still pre-PMF and the workload keeps shifting

    Fine-tuning a model for a system that's getting rewritten next sprint is wasted compute. Ship the thing on a frontier API first, find product fit, then SFT what stabilizes - usually a 7B or 8B specialist that drops cost by 10-20× at the same quality.

use it if

  • You have demonstration data (real, synthetic, or both) and an open-weight base model - Llama, Qwen, Mistral, Gemma, Phi, DeepSeek - that you want to specialize.

  • Your inference bill or latency is high enough that replacing a frontier API with a fine-tuned 7B/8B/13B pays back inside a month.

  • You want a reproducible training pipeline you can re-run as your task evolves - not a one-off model handed back as a tarball.

// How we work

Get the data right. Get the templates right. Then train.

Almost every SFT failure we audit happened before the first gradient step. We spend the first week on data, templates, and eval - then training is the boring part.

01

Data audit + template normalization

Inspect the data: format, length distribution, balance, contamination against your evals. Pick the chat template the base model expects (Llama, Qwen, ChatML, Gemma) and validate it tokenizes round-trip cleanly. Eval suite mined from production traces or curated golden cases. By end of week one we know the run is set up to succeed before it starts.
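
The contamination check is the least glamorous part of that audit and the easiest to skip. A minimal sketch - `train_set` and `eval_set` stand in for your datasets, and the normalization and exact-match criterion are illustrative (in practice near-duplicates get checked too):

```python
def normalize(text: str) -> str:
    # Illustrative normalization: lowercase, collapse whitespace.
    return " ".join(text.lower().split())

train_inputs = {normalize(ex["input"]) for ex in train_set}
leaks = [ex for ex in eval_set if normalize(ex["input"]) in train_inputs]

if leaks:
    raise ValueError(f"{len(leaks)} eval examples also appear in the training data")
```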

02

Adapter strategy + framework choice + first run

QLoRA on Unsloth for single-H100 7B/13B engagements; Axolotl + LoRA when reproducibility matters and we're on multi-GPU; full SFT on FSDP when adapters underfit. Hyperparameter sweep on a small subset (~1k examples, 30 minutes) to lock learning rate and warmup before the full run. Typical baseline: 8B on 8×H100 with ~100K examples in ~20h, ~$620 in compute.
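
The sweep itself is short. A hedged sketch - the candidate learning rates are illustrative and `run_sft` is a hypothetical wrapper around whichever trainer the engagement uses (Axolotl, TRL, Unsloth):

```python
subset = dataset.shuffle(seed=42).select(range(1_000))  # small, fast slice

results = {}
for lr in (5e-6, 1e-5, 2e-5, 5e-5, 1e-4):
    # Hypothetical helper: brief training on the subset, returns held-out eval score.
    results[lr] = run_sft(subset, learning_rate=lr, max_steps=100)

best_lr = max(results, key=results.get)
print(f"Locking learning rate at {best_lr} for the full run")
```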

03

Eval-gated full run + handoff

Full training with per-step smoke eval and per-epoch full eval. Early stopping on eval, not on step count. W&B / MLflow run with hashed config + dataset fingerprint. Checkpoint promoted on eval pass. Hand off the Axolotl/TRL config + dataset + W&B project + retraining runbook so the team can re-run the recipe when the task evolves.
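
The hashes are the boring part that makes the rest reproducible. A minimal sketch with W&B - file names are illustrative:

```python
import hashlib
import json

import wandb

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open("sft_config.json") as f:          # illustrative config file
    config = json.load(f)

config["dataset_sha256"] = sha256_file("train.jsonl")   # dataset fingerprint
config["config_sha256"] = hashlib.sha256(
    json.dumps(config, sort_keys=True).encode()
).hexdigest()

wandb.init(project="customer-sft", config=config)        # every run carries both hashes
```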

// Expert insight

SFT used to be the whole training game. Now it's the bootstrap before preference optimization. But the ceiling of what comes next is set by how well SFT was done - wrong template, wrong LoRA targets, wrong learning rate, and you're handicapping every downstream step. The teams that win at fine-tuning are the ones who sweat the boring details before they touch the optimizer.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of a tutorial notebook and good intentions.

We've shipped 16+ open-weight models on Hugging Face with 80K+ monthly downloads - every one of them through the same SFT discipline we'd apply to your engagement. We know the failure modes because we've hit them.

16+ open-source models on Hugging Face

Polish-language SFTs, financial sentiment classifiers, fine-tuned chat models - 80K+ monthly downloads. We've shipped, served, and learned from each one.

Axolotl, TRL, Unsloth, LlamaFactory, MS-Swift

We pick the framework per engagement, not by habit. Configs and datasets stay with you, fully reproducible.

Full FT, LoRA, QLoRA fluency

Single-H100 QLoRA when iteration speed matters, all-linear-layer LoRA targeting per Thinking Machines' findings, full FT on FSDP / ZeRO-3 when adapters underfit. Choice driven by the data, not the default.

Eval-gated, not loss-gated

Held-out evals run per-step and per-epoch. Early stopping driven by what your customers care about, not by training loss flattening.

Template + tokenizer paranoia

We validate chat templates round-trip through the tokenizer before any GPU spins up. Half the failed SFT runs we've audited lost performance to a template bug nobody noticed.

Reproducible, license-aware

Every run hashed and tracked. Dataset provenance documented. License compatibility verified before training starts so you don't get surprised by an audit.

// FAQ

Common questions about SFT

When do you use QLoRA vs. LoRA vs. full fine-tuning?

QLoRA when memory is the bottleneck - single-H100 SFT of 7B/13B/34B models becomes feasible at the cost of slightly slower convergence. LoRA when you have multi-GPU but want fast iteration; per Thinking Machines, LoRA matches full FT on most SFT tasks if you target all linear layers (not just attention). Full FT on FSDP / ZeRO-3 when adapters underfit - common for large pretraining-like datasets, MoE models, or when you need to retrain the embedding layer.

How do you verify the chat template is actually correct?

Decode a tokenized training example back to text and compare it byte-by-byte with the base model's chat template documentation. The Llama-3 template, Qwen template, ChatML, Gemma template, and Phi-4 template are all subtly different - getting one wrong silently degrades structured output, refusal behavior, and tool-calling. Half the SFT failures we audit trace back to a template bug. We validate this in the data audit before any training spins up.
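
In code that check is small - decode one collated example and diff it against the tokenizer's own rendering of the same messages (`batch` and `messages` stand in for one example from your actual pipeline):

```python
# What the trainer will actually see, special tokens included.
decoded = tokenizer.decode(batch["input_ids"][0], skip_special_tokens=False)
# What the base model's own chat template says it should look like.
reference = tokenizer.apply_chat_template(messages, tokenize=False)

if decoded != reference:
    i = next(
        (i for i, (a, b) in enumerate(zip(decoded, reference)) if a != b),
        min(len(decoded), len(reference)),
    )
    raise ValueError(
        f"Template mismatch at char {i}: {decoded[i:i+40]!r} vs {reference[i:i+40]!r}"
    )
```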

How much training data do we need?

Less than people assume for narrow tasks, more than they assume for broad ones. For a focused capability (one domain, structured outputs, a specific tool-calling pattern), 5K-20K well-filtered examples is usually enough. For general instruction tuning, 50K-200K. Past 200K you're paying for marginal gains unless the task is genuinely broad. Quality of demonstrations dominates volume - well-filtered 10K beats noisy 100K every time.

Which training framework do you use?

Axolotl when you want config-driven, reproducible runs across multi-GPU and you'll iterate on the YAML over time. TRL's SFTTrainer when you want close-to-the-metal control or are integrating into a custom training loop. Unsloth when you're on a single H100 and need long-context or memory savings (90% VRAM reduction with their kernels). LlamaFactory for breadth of methods + LlamaBoard UI when you're triaging which approach works. MS-Swift for Chinese-language stack alignment. The configs stay with you regardless.

What does an SFT run cost and how long does it take?

QLoRA-SFT on Qwen-7B with ~50K examples on 1×H100: 12-24h, ~$30-60 in compute. Axolotl LoRA-SFT on Llama-3-8B with ~100K examples on 8×H100: ~20h, ~$620. Full FT on Llama-70B with FSDP across 16 H100s: 2-4 days, ~$5-10K. The bigger expense in most engagements is the data preparation, not the GPUs.

Can't we just fine-tune a closed model through the provider's API?

OpenAI, Anthropic, and Google each expose SFT through their fine-tuning APIs - useful when the closed model is the right base for your task and the algorithm doesn't matter. The downside: you don't own the weights, you're locked into the provider's pricing, and you can't move to DPO/GRPO. We default to open-weights (Llama, Qwen, Mistral, Gemma, DeepSeek) for any engagement where downstream RL matters or where pricing flexibility is part of the brief.

What do we get at the end of the engagement?

A reproducible training pipeline: Axolotl/TRL config in your repo, dataset with provenance, W&B / MLflow project with all runs, model checkpoint, eval suite, and a runbook for retraining as the task evolves. Not a tarball thrown over the wall. The configs stay yours so your team can re-run the recipe six months from now without us.

// Let's train it

Ship a fine-tune that actually does the job.

Tell us your base model, your data, and the task you want it specialized for. We'll come back with an adapter strategy, framework choice, and a number - usually within a business day.

Karol Gawron

Head of R&D @ bards.ai