// Custom Fine-tuning

Fine-tuned a small model to frontier quality - 50× cheaper at high volume

The customer's frontier-API entity-extraction pipeline worked, but the per-token bill was eating margin at the volume they wanted to ship. We split the task into hybrid retrieval plus two fine-tuned Gemini 2.5 Flash Lite models: 98.3% F1 retention on the customer's existing eval suite, ~50× cheaper per 1000 requests, ~3× faster - without touching the prompt or the eval.

offices: USA / Europe

size: 100+ employees

industry: B2B SaaS - anonymized

revenue: -

// Outcomes

The numbers that matter

  • 50× cheaper per 1000 requests

  • ~3× lower end-to-end latency

  • 98.3% F1 retention vs. frontier-API baseline

01 · A working pipeline that cost too much to scale at high volume

The Challenge

The customer's job was straightforward to describe: read a piece of unstructured text (article, mention, social post), extract every named entity referenced, and normalize each to a canonical entry in a database of ~250K records - with disambiguation for the cases where the same surface form could map to several different entries depending on context.
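To make the task concrete, here's an illustrative input/output pair. The entity names and DB IDs below are invented for this writeup - the customer's data and schema are anonymized:

```python
# Illustrative only - entity names and DB IDs are invented, not customer data.
text = "Acme Corp announced a partnership with Acme Robotics at CES."

expected = [
    {"span": "Acme Corp",     "db_id": "org_001423"},   # disambiguated among similar names
    {"span": "Acme Robotics", "db_id": "org_118877"},   # shared surface-form prefix, different entity
    {"span": "CES",           "db_id": "event_000092"},
]
```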

A frontier model with a carefully tuned prompt did the job at the customer's target F1. The prompt was already locked. The eval suite was already built. The model worked.

The bill was the problem. At the volume the customer wanted to ship, the per-token cost on the frontier-API tier didn't survive contact with the unit economics. Swapping in a smaller model (Gemini 2.5 Flash Lite) as a drop-in replacement cut the cost but sent F1 well below the bar - cheaper, but a broken product. The team was stuck on the expensive tier, looking for a way to keep the quality without keeping the bill.

02 · Split the task. Add retrieval. Fine-tune both halves.

Approach

Step 1: Hybrid retrieval to bound the candidate pool.

The frontier model was being asked to do two things at once: extract entity mentions from text, AND match each to a row in a 250K-record database. The matching step was the expensive part - the model had to reason about the entire DB implicitly through context. We replaced that with a hybrid BM25 + dense-embedding retriever that narrows 250K candidates to a top-20 shortlist per extracted span. Top-20 in context is a tractable problem; top-250K via reasoning is not.
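A minimal sketch of that retriever, assuming rank_bm25 and sentence-transformers as the sparse and dense components and min-max score fusion - the case study doesn't pin down the actual stack or weights:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Stand-ins for the 250K-record canonical database.
db_entries = ["Acme Corp", "Acme Robotics", "Acme Corp Europe", "Ace Media Group"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")          # assumed embedding model
bm25 = BM25Okapi([e.lower().split() for e in db_entries])  # sparse index
db_vecs = encoder.encode(db_entries, normalize_embeddings=True)

def shortlist(span: str, k: int = 20, alpha: float = 0.5) -> list[str]:
    """Fuse BM25 and cosine scores, return the top-k candidate entries for one span."""
    sparse = np.asarray(bm25.get_scores(span.lower().split()))
    dense = db_vecs @ encoder.encode(span, normalize_embeddings=True)

    def norm(x: np.ndarray) -> np.ndarray:
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    fused = alpha * norm(sparse) + (1 - alpha) * norm(np.asarray(dense))
    return [db_entries[i] for i in np.argsort(-fused)[:k]]

print(shortlist("Acme Corp", k=3))
```

Min-max normalizing each signal keeps BM25's unbounded scores from drowning out the bounded cosine similarities; in practice the fusion weight would be tuned on held-out queries.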

Step 2: Synthetic data from the working pipeline.

We ran the existing frontier-API pipeline against a curated set of customer texts to generate training data - extraction outputs and normalization decisions on real domain data. After deduplication and contamination checks against the held-out eval, ~120K extraction examples and ~80K normalization examples remained.
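A sketch of that hygiene step, assuming exact-match hashing on whitespace-normalized text for both the dedup and the contamination check - the writeup doesn't specify the exact method used:

```python
import hashlib

def _key(text: str) -> str:
    # Hash of whitespace-normalized, lowercased text.
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def clean(generated: list[dict], eval_texts: list[str]) -> list[dict]:
    """Drop duplicates and any example whose text appears in the held-out eval."""
    eval_keys = {_key(t) for t in eval_texts}
    seen: set[str] = set()
    kept = []
    for ex in generated:                 # ex = {"text": ..., "labels": ...} from the frontier pipeline
        k = _key(ex["text"])
        if k in seen or k in eval_keys:  # duplicate, or contaminates the eval
            continue
        seen.add(k)
        kept.append(ex)
    return kept
```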

Step 3: Two fine-tuned Gemini 2.5 Flash Lite models.

Two LoRA-style fine-tunes via Vertex AI: one for extraction (text in → spans out), one for normalization (span + retrieved candidates → DB ID, or null if none match). Same chat template as the base, eval-gated training with the customer's existing eval suite, early stopping on the held-out F1.
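For reference, launching one of the two tunes through Vertex AI's supervised-tuning SDK looks roughly like this. The project, dataset paths, model string, and hyperparameters are placeholders, not the values used in the engagement; adapter_size is the LoRA-rank-style knob in Vertex AI's tuning API:

```python
import time
import vertexai
from vertexai.tuning import sft

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders

job = sft.train(
    source_model="gemini-2.5-flash-lite",                       # base model to tune
    train_dataset="gs://your-bucket/extraction_train.jsonl",    # chat-format JSONL
    validation_dataset="gs://your-bucket/extraction_val.jsonl",
    tuned_model_display_name="entity-extraction-v1",
    epochs=3,                        # placeholder hyperparameters
    adapter_size=4,                  # LoRA-rank-style adapter size
    learning_rate_multiplier=1.0,
)

while not job.has_ended:             # poll until the tuning job finishes
    time.sleep(60)
    job.refresh()

print(job.tuned_model_endpoint_name)
```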

Step 4: Evaluation against the locked eval suite.

The customer's eval was already in place - they didn't change a single test case to make our setup look good. We benchmarked the full pipeline (retrieval + extraction + normalization) on the same suite the frontier-API baseline was scored on.
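For intuition, one common way to score joint extraction + normalization is micro-F1 over (span, db_id) pairs; the customer's locked suite defines its own matching rules, so this is illustrative:

```python
def micro_f1(gold: list[set], pred: list[set]) -> float:
    """Micro-averaged F1 over per-document sets of (span, db_id) pairs."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = [{("Acme Corp", "org_001423")}]
pred = [{("Acme Corp", "org_001423"), ("CES", "event_000092")}]
print(round(micro_f1(gold, pred), 3))  # 0.667: one true positive, one false positive
```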

03 · 98.3% F1 retention. ~50× cheaper. ~3× faster.

Result

The fine-tuned Gemini 2.5 Flash Lite pipeline matched the frontier-API baseline on the customer's existing eval suite, while flipping the unit economics. The chart below shows the four points the customer evaluated - F1 against per-1000-request cost.

F1 vs Cost - fine-tuned Gemini 2.5 Flash Lite matches the frontier-API baseline at ~1/50th the cost and 1/3rd the latency.
  • 98.3% F1 retention vs. the frontier-API baseline - within noise on the customer's locked eval.
  • ~50× cheaper per 1000 requests - what unlocked the high-volume use case.
  • ~3× faster end-to-end latency - fewer reasoning hops, smaller model.
  • Base Flash Lite (no fine-tuning) sat well below the F1 bar - the lift from fine-tuning is what made the swap viable.

Total cost of the fine-tuning engagement (compute + a few iterations) paid back inside the first month at the customer's traffic volume. The training pipeline ships to the customer's repo so they can re-run it as the database grows.

// What they say

Our prompt was already locked and the eval was already in place. Bards.ai didn't ask us to change either. They split the task, added retrieval, and fine-tuned the small models - 98.3% F1 retention, 50× cheaper. Paid back inside the month.

Customer testimonial

Head of ML - anonymized

// Ready to ship?

Let's build something that delivers numbers like these.

Book a meeting