// Research / Custom Fine-tuning
DPO and GRPO for last-mile quality.
SFT teaches imitation. Preference optimization teaches preference. We pick the algorithm that fits your data and reward signal - DPO/SimPO/KTO when you have preference pairs, GRPO with the DAPO patches when only a reward function captures the objective - run it on TRL, Unsloth, Axolotl, or VeRL depending on the scale, and ship it past the bar SFT couldn't reach.

train/reward + composite reward funcs · ~2h warmup, exponential after first lucky rollout, plateau at ~20h
// What we see
SFT got you 80% of the way there. The last 20% needs a different tool.
01
SFT plateaus on subjective objectives
Brand voice, persuasiveness, reasoning quality, tool-use reliability, customer-facing tone - the team trains an SFT model on the best demonstration data they can find and the output still reads as "close, but off." That's the SFT ceiling. The remaining quality lives in preferences the next-token loss doesn't capture.
02
DPO has a real footgun called likelihood displacement
On noisy or semantically-similar pairs, DPO can push the chosen probability down along with the rejected - just slightly less - while the freed probability mass drifts to a third style nobody asked for. The margin metric on TensorBoard looks fine the whole time. SimPO drops the reference model and uses length-normalized rewards; ORPO collapses SFT+DPO into one stage; KTO works without paired data. Picking the right variant is a real engineering decision, not a paper-citing exercise.
03
Reward hacking is the hard problem, not the algorithm
Once you switch to GRPO, the optimizer is the easy part. The hard part is designing rewards the model can't game. Length hacking, markdown abuse, EOS exploits, and answer-without-reasoning patterns are the canonical failures. Multiple complementary rewards, KL guardrails, held-out adversarial probes, and DAPO-style clip-higher all help - and we wire them in before the first epoch, not after we see the curves go weird.
// Case Study
Beating AI detectors with a GRPO-trained rewriter
We tried SFT. We tried DPO. Both produced outputs that still read as AI to detectors. GRPO trained from scratch hit near-perfect detection-evasion in a 20-hour run on 8xH100 - and ships today as Surfer's AI Content Humanizer.
0%
AI-detection score post-humanize
20h
GRPO training on 8xH100
50K
words per humanize request

// What we do
Three paths past the SFT ceiling.
DPO/SimPO/ORPO/KTO when you have preference pairs and want a cheap last-mile lift. GRPO (with the DAPO/Dr.GRPO patches that fix the original) when only a reward function captures your objective. Reward design as a first-class research problem - because that's the part that decides whether RL works.
DPO and the family - when you have preferences
Pairwise preferences → DPO with TRL, β (the implicit KL strength) tuned per dataset. Length-biased outputs → SimPO (no reference model, length-normalized). Want SFT and DPO collapsed into one stage → ORPO (Argilla shipped Zephyr-141B with 7k pairs this way). Only thumbs-up/down signal, no pairs → KTO. We pick the variant from the data shape, not from the abstract - a minimal TRL sketch follows the list below.
- DPO / SimPO / ORPO / KTO / IPO on TRL, LlamaFactory, Unsloth, Axolotl
- Single-H100 runs at 8B with Unsloth's 30% VRAM savings + 2× batch
- Preference data construction from production traces, support tickets, or LLM-as-judge with bias correction
- Held-out preference eval to catch likelihood displacement before it lands in prod
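The sketch is for orientation only - model and dataset names are placeholders, and TRL argument names shift between releases:

```python
# Sketch: pairwise-preference DPO on TRL. Model/dataset names are placeholders;
# beta trades the preference margin against drift from the reference policy.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"            # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects "prompt", "chosen", "rejected" columns.
train_ds = load_dataset("your-org/preference-pairs", split="train")  # placeholder

config = DPOConfig(
    output_dir="dpo-8b",
    beta=0.1,                       # tuned per dataset; higher = stay closer to the reference
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                    # reference model is handled internally when not passed
    args=config,
    train_dataset=train_ds,
    processing_class=tokenizer,     # `tokenizer=` on older TRL releases
)
trainer.train()
```

The likelihood-displacement check is then watching held-out chosen log-probabilities alongside the margin, not the margin alone.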
GRPO + the patches that make it actually work
Vanilla GRPO has known issues at scale - entropy collapse on low-prob tokens, length bias from sequence-level normalization, zero-gradient batches when all rollouts agree. We deploy the patches the field has converged on: DAPO's clip-higher and dynamic sampling, Dr.GRPO's bias-free advantage, GSPO/GMPO for MoE stability. vLLM or SGLang as the rollout engine, colocated or disaggregated depending on rollout cost. A config sketch follows the list below.
- GRPO with DAPO patches (clip-higher, dynamic sampling, token-level loss) on TRL v1.0 / VeRL / OpenRLHF
- vLLM colocate or SGLang disaggregated rollouts depending on long-tail variance
- Truncated importance sampling to handle inference-vs-training engine mismatch
- Verifiable rewards (math/code/schema/tool-use) plus calibrated LLM-judge for subjective dimensions
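For orientation only - placeholder model and dataset names, and the exact knob set (clip-higher, KL weight, reward scaling) depends on the TRL version you run:

```python
# Sketch: GRPO with a DAPO-style clip-higher knob on TRL. Placeholder names; flag
# availability varies by TRL version. Reward functions return one float per completion.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def exact_answer_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the reference answer appears in the completion, else 0.0.
    # Assumes standard (non-chat) prompts, so completions arrive as plain strings.
    return [1.0 if str(a) in c else 0.0 for c, a in zip(completions, answer)]

# Placeholder dataset with "prompt" and "answer" columns; extra columns are
# forwarded to the reward functions as keyword arguments.
train_ds = load_dataset("your-org/verifiable-prompts", split="train")

config = GRPOConfig(
    output_dir="grpo-8b",
    num_generations=16,            # group size for group-relative advantages
    max_completion_length=2048,
    epsilon=0.2,
    epsilon_high=0.28,             # DAPO clip-higher: keep low-probability exploration tokens alive
    beta=0.0,                      # no KL penalty - common for verifiable-reward runs
    use_vllm=True,                 # vLLM as the rollout engine
    bf16=True,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder
    reward_funcs=[exact_answer_reward],
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```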
Reward design + reward-hacking defenses
Per Kyle Corbitt: "the two hardest problems in modern RL are creating realistic environments and designing effective reward functions." We treat reward design as the engagement, not as the YAML field above the optimizer config. Multiple complementary signals, calibrated against humans, with held-out adversarial probes designed before training starts - a sketch of the composite shape follows the list below.
- Composite rewards (correctness + structure + style) with explicit weighting and dominance checks
- Calibrated LLM-judge (LLM-as-judge / RaR rubrics) with bias correction and human anchor
- Adversarial probes hand-crafted before training: outputs that score high but read terribly
- Track length, entropy, training reward, and held-out validation simultaneously - Cameron Wolfe's playbook
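The component names, weights, thresholds, and the <answer> tag convention below are made up for illustration - a sketch of the composite shape, not a drop-in reward for your objective:

```python
# Sketch: a composite reward with explicit weights and a length guardrail.
# Each component returns one score per completion; weights are illustrative.
import re

def correctness_reward(completions, answer, **kwargs):
    # Verifiable core signal: reference answer present or not.
    return [1.0 if str(a) in c else 0.0 for c, a in zip(completions, answer)]

def format_reward(completions, **kwargs):
    # Reward the requested structure (here: an <answer>...</answer> block).
    return [1.0 if re.search(r"<answer>.*?</answer>", c, re.DOTALL) else 0.0
            for c in completions]

def length_guardrail(completions, **kwargs):
    # Penalize padding: no penalty up to a character budget, linear penalty beyond it.
    budget = 1500
    return [0.0 if len(c) <= budget else -min(1.0, (len(c) - budget) / budget)
            for c in completions]

def composite_reward(completions, answer, **kwargs):
    # Dominance check baked into the weights: structure and length bonuses
    # (max 0.3) can never outvote correctness (1.0).
    c = correctness_reward(completions, answer, **kwargs)
    f = format_reward(completions, **kwargs)
    l = length_guardrail(completions, **kwargs)
    return [1.0 * ci + 0.3 * fi + 0.3 * li for ci, fi, li in zip(c, f, l)]
```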
// Method fit
RL fine-tuning earns its keep when SFT plateaus.
skip it if
SFT already gets you to your quality bar
If your eval shows SFT already lands at the target metric, RL fine-tuning is overkill - extra cost, extra failure modes, no upside.
Most teams should run SFT first, evaluate honestly, and only escalate to DPO/GRPO when the eval shows a real gap.
Supervised Fine Tuning (SFT)
You can't define a reward signal
RL needs something to optimize. Pairwise preference data (for DPO and family) or a programmatic / model-based reward function (for GRPO). If your team can't articulate what "better" means in a way an algorithm can score, the engagement is premature - fix the eval first, then come back.
Custom LLM Evaluation Frameworks
You don't have evals yet
Optimization without evals is just hope with a confident dashboard. Start with the eval engagement, then layer DPO/GRPO on top once the gap is something we can measure.
Custom LLM Evaluation Frameworks
You're still pre-PMF and it's not your core product
RL fine-tuning a model for a system that's getting rewritten next sprint is wasted compute. If the model isn't the product itself, ship the thing first, find product fit, then last-mile-tune what stabilizes. (If the model IS your core product, RL fine-tuning is the differentiator and the timing is fine.)
use it if
You've already done SFT and the eval shows a gap to your target - and you can articulate the gap in a way an algorithm can score (preference pairs, programmatic reward, calibrated judge).
You're working on a subjective or verifiable objective where SFT structurally can't reach the bar: brand voice, persuasiveness, reasoning quality, tool-use reliability, code/math correctness, safety calibration.
You're shipping at a scale where the last 5–10% of quality is worth the engagement. Reasonable rule of thumb: the cost of one bad output × your traffic > our engagement cost.
// How we work
Eval and reward first. SFT bootstrap. Then DPO or GRPO.
Every engagement starts with the eval suite and the reward function - the two things that decide whether RL fine-tuning will work at all. Then SFT as a baseline if you have demonstrations. Then the right algorithm for the signal you actually have, on the stack that fits the scale.
01
Build the eval and the reward function
Eval suite mined from production traces and the failure modes we're trying to fix. Reward function - verifiable where possible (math, code execution, schema compliance, tool-use success), calibrated LLM-as-judge or RaR rubrics where it has to be subjective. Adversarial probes designed before any training run, not after we see suspicious curves. By end of week one we know whether the objective is RL-tractable, or whether the engagement should pivot.
02
SFT bootstrap with the right framework for the scale
Single-GPU long-context Llama / Qwen / Mistral / Gemma → Unsloth (90% VRAM reduction, FP8 GRPO at 5GB on consumer cards). Multi-GPU config-driven runs → Axolotl. Breadth of methods + LlamaBoard UI for triage → LlamaFactory. Typical baseline: 8B on 8×H100 with ~100K examples in ~20h, around $620 in compute. A small SFT "cold start" of as few as 100 examples materially improves downstream RLVR per recent scaling-laws work.
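For orientation, the single-GPU Unsloth path looks roughly like this - a sketch with placeholder model and dataset names; exact arguments track the Unsloth and TRL versions installed:

```python
# Sketch: single-GPU SFT bootstrap via Unsloth + TRL's SFTTrainer. Placeholder
# names; LoRA rank, sequence length, and batch sizes are illustrative.
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",   # placeholder
    max_seq_length=8192,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder: expects a "text" column (or chat-style "messages" on newer TRL).
train_ds = load_dataset("your-org/demonstrations", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-8b",
                   per_device_train_batch_size=2,
                   gradient_accumulation_steps=8,
                   num_train_epochs=1),
    train_dataset=train_ds,
    processing_class=tokenizer,     # `tokenizer=` on older TRL releases
)
trainer.train()
```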
03
DPO/SimPO/ORPO/KTO or GRPO depending on the signal
Pairwise preferences → DPO/SimPO/ORPO with TRL on a single H100, 24-48h, β/α tuned per dataset. Programmatic or model-based reward → GRPO with DAPO patches on TRL v1.0, OpenRLHF, or VeRL (HybridFlow's 1.5–20× throughput on big runs). vLLM or SGLang as the rollout engine, with truncated importance sampling for engine mismatch. Hand off the training pipeline + eval suite + reward-hacking probes so the team keeps tuning after we leave.
// Expert insight
“Preference optimization used to be a finicky process. Recent developments - GRPO and the follow-ups (DAPO, Dr.GRPO, GSPO) - made it much more steady. SFT is great for fine-tuning structure and general concept, but it often fails to capture nuance. GRPO reliably gets you the last-mile performance now, and the tooling caught up - TRL v1.0, Unsloth, VeRL, vLLM rollouts - to the point where it's no longer a research project.”
Michał Pogoda-Rosikoń
Co-founder @ bards.ai
// Why bards.ai
Why us, instead of someone who's read the DPO paper.
Reading the paper gets you a notebook. Shipping a GRPO model that beats production AI detectors and survives reward hacking takes engagements you've already done - and the scars from the runs that didn't work.
GRPO shipped in production
We've trained and shipped GRPO models that solve objectives SFT and DPO couldn't reach - including the rewriter behind a top-tier SaaS content tool. Production scars, not paper-replication scars.
Tooling fluency across the stack
TRL v1.0 for orthodox runs, Unsloth for single-GPU long-context, Axolotl for config-driven SFT, VeRL/OpenRLHF for big multi-node RL, vLLM/SGLang as the rollout engine. We pick the framework that fits the engagement, not the one we used last time.
DPO/SimPO/ORPO/KTO/GRPO/DAPO fluency
We've implemented each from the papers and shipped the right one for the problem. Algorithm choice precedes hyperparameter sweeps - and we follow the field through DAPO, Dr.GRPO, GSPO so we're not building on yesterday's defaults.
Jaxpot - our open-source RL stack
JAX-based vectorized self-play with leagues, MCTS, and reproducible configs. Built and maintained by our team - the engineering muscle behind "RL on real environments" lives in-house.
10+ peer-reviewed publications
CLARIN-PL spinoff. The team has reviewed RL papers in NeurIPS / ICML cycles and reproduced the ones worth reproducing - and skipped the ones that aren't.
We tell you when RL is the wrong tool
Many problems are better solved with SFT, prompt engineering, ORPO with 7k pairs, or a different base model entirely. We say so before you spend the GPU budget.
// FAQ
Common questions about preference optimization
When does DPO make sense, and what's the catch?
When SFT has plateaued on a subjective objective and you have pairwise preference data. DPO directly optimizes "prefer A over B" without needing a separate reward model - it's cheap (single H100, 24-48h on a small model) and well-behaved with clean preferences. The catch: with noisy or semantically-similar pairs, DPO can hit likelihood displacement - both chosen and rejected log-probabilities drift down together, so the margin metric looks fine while the model gets worse on real evals. SimPO (length-normalized, no reference model) and ORPO (collapses SFT+DPO into one stage) sidestep different parts of this. Calibrated held-out evals catch the failure mode; vibes don't.
Why GRPO instead of PPO?
GRPO (DeepSeek-R1) drops the value model and uses group-relative advantages - dramatically less memory, often ~50% reduction vs PPO, and it runs on a single GPU with vLLM colocated rollouts. It's mathematically equivalent to RLOO up to a scaling constant. The original GRPO has known issues at scale (entropy collapse, length bias, zero-gradient batches) that DAPO and Dr.GRPO patch. We default to DAPO-style clip-higher + dynamic sampling + token-level loss; the tradeoff is more knobs but a more stable curve.
What do DAPO and Dr.GRPO actually fix?
DAPO (ByteDance, Mar 2025) hits AIME 2024 = 50 in half the steps of DeepSeek-R1-Zero-Qwen-32B by adding clip-higher (asymmetric ε to keep low-prob exploration tokens alive), dynamic sampling (drop prompts where every rollout agrees - zero gradient), token-level loss instead of sequence-level (fixes length bias), and overlong reward shaping (don't penalize truncated answers). Dr.GRPO removes the std-dev normalization in the advantage and the per-sequence length normalization - both were sources of bias. Plain GRPO often shows response length growing after rewards plateau; the patches stop that.
How do you defend against reward hacking?
Three layers. (1) Multiple complementary rewards - a single reward is easy to game; three weighted rewards aren't. (2) Adversarial probes designed up front - we hand-craft outputs that score high but read terribly; if the model produces those, the reward function is the bug. (3) Track length, entropy, training reward, and held-out validation simultaneously, per Cameron Wolfe's playbook. KL regularization helps in some setups; for verifiable-reward GRPO runs (DAPO and follow-ups), most groups now drop the KL penalty entirely - it isn't necessary and can hurt. The 2025 Anthropic paper on natural emergent misalignment from reward hacking is required reading; we treat reward hacking as a safety surface, not just a quality bug.
How much preference data do we need?
Less than people assume. Argilla/HF shipped Zephyr-141B with 7k pairs via ORPO. SmolLM3 used DPO/APO with HelpSteer-style mixes. At the larger end, Tülu 3 used 337–360k preference pairs for the 70B/405B mixes - and the key finding was that unique prompts matter; duplicating prompts with different responses doesn't help. Start small, evaluate on held-out preferences, scale up only when the eval shows the data is the bottleneck. Human preferences run $5–20 per sample; calibrated AI preferences (RLAIF) run under $0.01.
LoRA or full fine-tuning?
LoRA, in most cases, and more aggressively than people assume. Thinking Machines published recent work showing LoRA matches full fine-tuning even at rank-1 for RL - because policy gradients carry only ~1 bit of information per episode vs ~1000× more for SFT tokens. The optimal LoRA learning rate is roughly 10× the full-FT rate. LoRA underperforms when (a) the dataset is large enough to look like pre-training, (b) batch size is huge, or (c) it's not applied to all linear layers (especially MLPs and MoE). Lewis Tunstall's framing: "LoRA forgets less, which is why DPO with LoRA often works really well - it acts as a regularizer."
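In config terms that translates to roughly the following - a sketch; the rank, alpha, and the 10× learning-rate rule are the illustrative values from the claims above, not universal defaults:

```python
# Sketch: low-rank LoRA for an RL run, applied to all linear layers per the
# "LoRA matches full fine-tuning" findings cited above. Values are illustrative.
from peft import LoraConfig

lora_config = LoraConfig(
    r=1,                          # policy gradients carry little information; rank-1 can suffice for RL
    lora_alpha=32,
    target_modules="all-linear",  # include MLPs (and MoE experts), not just attention projections
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
# Rule of thumb from the same work: set the LoRA learning rate ~10x the full-FT rate.
```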
What does a typical engagement cost, and how long does it take?
SFT bootstrap on 8B with ~100K examples on 8×H100: ~20h, ~$620. DPO on the same scale: single H100, 24-48h, often under $50. GRPO depends entirely on rollout cost - multi-day on 8×H100 is normal for production-quality runs, with vLLM/SGLang as the rollout engine being the lever that matters. Engagement timeline: eval + reward function design in week one, SFT bootstrap in week two, DPO or GRPO iteration in weeks three through six - 6–10 weeks total for production-grade engagements with multiple reward components and tight quality bars. Every week ships a measurable improvement, not a status update.
// Let's ship it
Get past the SFT ceiling - without burning your compute budget on a bad run.
Tell us your model, your evals, and the gap you're trying to close. We'll come back with an algorithm choice (DPO, KTO, GRPO + DAPO patches, or honest "don't bother") and a number - usually within a business day.
Michał Pogoda-Rosikoń
Co-founder @ bards.ai