// Production / Agentic Workflows & RAG
RAG that works on your data.
Your demo answered every question. Now your customers are asking question 31, and you can't tell whether retrieval missed or the model hallucinated. We build the retrieval layer, the eval suite, and the regression gates that catch question 31 before your customers do.
// What we see
Demo works. Production breaks. Always in the same places.
01
Question 31 retrieves nothing
Your demo questions all hit. The 31st is phrased differently, uses internal jargon, or spans two documents - and the top-50 misses the right passage entirely. You don't know it's happening because the model fills in the gap with a confident wrong answer.
02
You can't tell what failed
A bad answer comes back. Was it retrieval that missed, or the model that hallucinated? Most teams can't tell, because there's no eval that separates the two. So fixes are guesses, and the same bug keeps coming back.
03
Every improvement is anecdotal
Someone tweaks the chunk size and 12 questions silently regress. The team that made it better can't prove it. The team that made it worse won't see it. Production drifts, and the only signal is when a customer complains.
// Case Study
Text-search across 200 live city camera feeds
Municipal operators type a description and the system surfaces matching events from across the city's live CCTV network. We built it for Neural; the City of Oława's Straż Miejska runs it on-prem. 200 cameras per server; review time on a typical incident dropped from ~8 hours of manual scrubbing to under 1 hour - an ~88% reduction.
200
live cameras per on-prem server
~88%
less time per incident review
~33K
residents covered (Oława)

// What we do
Three things that do most of the work.
Most of the wins aren't the latest paper. They're a hybrid index wired correctly, a reranker on top, and an eval suite the team actually runs.
Hybrid search, weighted per corpus
Pure dense embeddings miss exact terms - codes, IDs, proper nouns. BM25 + dense beats either one alone on almost every customer corpus. We tune the weights to your data, not a generic default.
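As a sketch of the fusion step: min-max normalize each ranker's scores, then blend with a per-corpus weight. The weight, document names, and toy scores below are illustrative, not defaults we ship.

```python
def fuse_scores(bm25, dense, alpha=0.4):
    """Hybrid fusion: normalize each score set to [0, 1], then blend with
    a per-corpus weight alpha (alpha=1.0 -> pure BM25, 0.0 -> pure dense)."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    nb, nd = norm(bm25), norm(dense)
    docs = set(nb) | set(nd)
    fused = {d: alpha * nb.get(d, 0.0) + (1 - alpha) * nd.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# BM25 favors the doc with the exact term; dense favors a paraphrase.
bm25 = {"doc_a": 12.0, "doc_b": 3.0}
dense = {"doc_b": 0.91, "doc_c": 0.88}
print(fuse_scores(bm25, dense, alpha=0.6))  # -> ['doc_a', 'doc_b', 'doc_c']
```

Tuning `alpha` per corpus is the whole point: an ID-heavy corpus wants more BM25, a paraphrase-heavy one wants more dense.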
A reranker on top of recall
Top-50 recall is cheap; top-5 precision is what users see. A cross-encoder reranking the top-50 lifts top-5 precision 8-15 points on most corpora, at a latency cost of a few tens of milliseconds - a budget we plan for.
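A minimal sketch of the rerank step. The token-overlap scorer here is a hypothetical stand-in; in production the scorer is a cross-encoder forward pass over (query, passage) pairs.

```python
def rerank(query, candidates, score_fn, k=5):
    """Rerank recall candidates (e.g. top-50 from hybrid search) with a
    cross-encoder style scorer; keep only the top-k for the prompt."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

# Stand-in scorer for illustration: fraction of query tokens in the doc.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

docs = ["refund policy for enterprise plans",
        "office snack rotation",
        "enterprise refund window is 30 days"]
print(rerank("enterprise refund window", docs, overlap_score, k=2))
```

The structure is what matters: a cheap wide net first, then an expensive precise scorer over only 50 pairs, which is why the latency cost stays in the tens of milliseconds.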
An eval suite your team runs
200-300 golden questions with expected citations. We build it first, before any retrieval changes. Every change after that ships with a measured delta - no anecdotal improvements, no silent regressions.
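The core metric is simple enough to sketch: the fraction of golden questions whose expected citation lands in the retrieved top-k. The question set and retriever stub below are made up for illustration.

```python
def citation_recall_at_k(golden, retrieve, k=5):
    """Fraction of golden questions whose expected citation appears in
    the top-k document IDs returned by the retrieval pipeline."""
    hits = sum(1 for q in golden if q["expected"] in retrieve(q["question"])[:k])
    return hits / len(golden)

golden = [
    {"question": "What is the enterprise refund window?",
     "expected": "policy.md#refunds"},
    {"question": "Which SSO providers are supported?",
     "expected": "sso.md#providers"},
]

# Hypothetical retriever stub; in practice this calls the real pipeline.
def retrieve(question):
    return ["policy.md#refunds", "faq.md#misc"] if "refund" in question else ["faq.md#misc"]

print(citation_recall_at_k(golden, retrieve, k=5))  # 1 of 2 hits -> 0.5
```

Because the golden set pins an expected citation, a miss here is unambiguously a retrieval failure, not a generation one - which is what lets you separate the two.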
// Method fit
Not every retrieval problem is a RAG problem.
skip it if
Knowledge fits in context
If your full corpus is under ~50K tokens, just stuff the prompt. Cheaper, faster, fewer moving parts to maintain. RAG adds infrastructure you'll be on the hook for.
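The cutoff is easy to check before committing to infrastructure. The ~4 characters per token figure is a rough English-text heuristic, not a guarantee; use a real tokenizer for a precise count.

```python
def fits_in_context(corpus_texts, budget_tokens=50_000, chars_per_token=4):
    """Rough check: estimate token count from character count. If the whole
    corpus fits under the budget, prompt stuffing beats RAG on simplicity."""
    estimated_tokens = sum(len(t) for t in corpus_texts) / chars_per_token
    return estimated_tokens <= budget_tokens

# A 120KB handbook (~30K tokens): stuff the prompt, skip the index.
print(fits_in_context(["x" * 120_000]))  # -> True
```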
The problem is the model
Wrong tone, wrong format, refusal behavior - those are model problems, not retrieval problems. Fine-tuning fixes them; RAG won't.
Supervised Fine-Tuning (SFT)
Your data is structured
Customer records, transactions, inventory belong in SQL or a graph DB. Retrieval over structured data should be queries, not embeddings.
use it if
RAG fits when your corpus is too big for context, the content is mostly unstructured (docs, tickets, code, transcripts), and questions are open-ended. That covers most production knowledge-Q&A.
// How we work
Eval first. Iterate in the open. Hand off code, not Confluence.
Every engagement starts with a shared eval and ends with your team running it in CI. Between those two points, your engineers watch the iteration as it happens - not in a Friday demo.
01
Shared eval as the contract
Week one, we sit with your team and write 100-300 golden questions with expected citations. The eval becomes the spec. No retrieval claim is allowed without a measured delta against it.
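The gate itself can be a few lines in CI. This sketch is illustrative (names and tolerance are ours, not a fixed standard): it fails the build when the golden-set score regresses past a small noise tolerance.

```python
def gate(baseline, candidate, min_delta=0.0, tol=0.005):
    """Regression gate for CI: fail the build if the candidate pipeline's
    eval score drops below baseline by more than a small tolerance."""
    delta = candidate - baseline
    if delta < min_delta - tol:
        raise SystemExit(f"eval regressed: {delta:+.3f} (baseline {baseline:.3f})")
    return delta

# e.g. recall@5 on the golden set: old pipeline vs. new chunking
print(f"delta {gate(0.81, 0.84):+.3f}")  # passes: the change helped
```

Run it as the last step of the eval job; a regression exits nonzero and the merge is blocked, which is the "no claim without a measured delta" rule made mechanical.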
02
Iterate in the open
Every training and retrieval run lands in a Weights & Biases workspace your team has access to. You see what we're trying, what's working, and what we're killing. The dashboard replaces the status report.
03
Hand off, then stay nearby
We hand off code, the eval suite in your CI, and a runbook your on-call can read at 11pm. Slack for 30 days after delivery for the questions that come up after we leave.
// Weights & Biases - shared workspace

Real workspace from a recent engagement. 38 runs, 14 tracked metrics across recall, precision, and coherence tests. Your engineers get access on day one - no PDF status reports, no surprise findings at the demo. Run labels are anonymized when the customer requires it.
// Expert insight
“The teams that ship great RAG don't have a secret embedding model. They have a 200-question golden set, a hybrid index, and the discipline to gate every change on the eval. Most of the "tricks" matter much less than that.”
Karol Gawron
Head of R&D @ bards.ai
// Why bards.ai
Why us, instead of two senior ML engineers you'd hire.
You could hire the team. It would take a year and they'd learn this on you. We've already learned it - on production engagements at Brand24, SurferSEO, Comcast, and others.
Embedding fine-tunes that compete with 7B baselines
Our internal mxbai-large fine-tunes have matched gte-Qwen2-7B on customer IR tasks at ~1/15 the parameter count. Methodology reproducible across corpora.
Eval-first methodology
Every retrieval change ships with a measured delta. The W&B workspace is shared with your team. No silent regressions for a cosmetic win.
Senior engineers only, no juniors
Every person on your engagement has shipped retrieval to production. No ramp-up tax, no learning on your dollar.
// FAQ
Common questions about production RAG
RAG or fine-tuning - which one do we need?
RAG when the knowledge changes, you need citations, or the corpus is too big for context. Fine-tuning when you need to change behavior, tone, or tool-calling shape. Production systems often use both - retrieval for facts, a lightly tuned model for the answer style.
Will retrieval scale to our corpus?
Sub-second p95 over millions of chunks is routine. Qdrant and pgvector both scale to hundreds of millions on the right hardware. Above that, sharding and hybrid index designs (HNSW + ScaNN, IVF-PQ for cold tiers) start mattering. We design for the scale you're heading to.
How do you handle multi-tenant isolation?
Either per-tenant collections (strongest isolation, more ops overhead) or a shared collection with metadata filters and query-layer enforcement (denser, but needs careful auditing). We pick by data sensitivity and tenant scale, then verify isolation with red-team retrieval queries.
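A toy sketch of the shared-collection variant, with brute-force scoring standing in for a real vector index. The point is structural: the tenant filter is injected server-side on every query, never trusted from the client payload.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tenant_search(index, tenant_id, query_vec, top_k=5):
    """Shared-collection isolation: hard-filter by tenant before ranking,
    so another tenant's documents can never reach the candidate set."""
    hits = [(dot(query_vec, d["vec"]), d["doc_id"])
            for d in index if d["tenant"] == tenant_id]
    hits.sort(reverse=True)
    return [doc_id for _, doc_id in hits[:top_k]]

index = [
    {"tenant": "acme",   "doc_id": "acme-1",   "vec": [0.9, 0.1]},
    {"tenant": "globex", "doc_id": "globex-1", "vec": [0.9, 0.1]},
]
print(tenant_search(index, "acme", [1.0, 0.0]))  # -> ['acme-1']
```

The red-team check then becomes concrete: craft queries as tenant A that would score highest against tenant B's documents, and assert B's IDs never appear.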
What does an engagement cost?
Engagements start at $40K. Most production-RAG projects land between $40K and $120K depending on corpus complexity, multi-tenant requirements, and whether embedding fine-tuning is in scope. We share a fixed-fee proposal after the first scoping call - no time-and-materials surprises.
// Let's ship it
Send us your eval. We'll send back a plan.
Tell us about the corpus, the question shapes you're failing on, and the recall bar. We'll come back with a retrieval design and an eval plan, usually within a business day. Engagements from $40K, typically 4-8 weeks.
Karol Gawron
Head of R&D @ bards.ai