// Production / Agentic Workflows & RAG

RAG that works on your data.

Your demo answered every question. Now your customers are asking question 31, and you can't tell whether retrieval missed or the model hallucinated. We build the retrieval layer, the eval suite, and the regression gates that catch question 31 before your customers do.

// What we see

Demo works. Production breaks. Always in the same places.

01

Question 31 retrieves nothing

Your demo questions all hit. The 31st is phrased differently, uses internal jargon, or spans two documents - and the top-50 misses the right passage entirely. You don't know it's happening because the model fills in the gap with a confident wrong answer.

02

You can't tell what failed

A bad answer comes back. Was it retrieval that missed, or the model that hallucinated? Most teams can't tell, because there's no eval that separates the two. So fixes are guesses, and the same bug keeps coming back.

03

Every improvement is anecdotal

Someone tweaks the chunk size and 12 questions silently regress. The team that made it better can't prove it. The team that made it worse won't see it. Production drifts, and the only signal is when a customer complains.

// Case Study

Text-search across 200 live city camera feeds

Municipal operators type a description and the system surfaces matching events from across the city's live CCTV network. We built it for Neural; the City of Oława's Straż Miejska (municipal guard) runs it on-prem. 200 cameras per server; review time on a typical incident dropped from ~8 hours of manual scrubbing to under 1 hour - an ~88% reduction.

  • 200

    live cameras per on-prem server

  • ~88%

    less time per incident review

  • ~33K

    residents covered (Oława)

Read the case study

// What we do

Three things that do most of the work.

Most of the wins aren't the latest paper. They're a hybrid index wired correctly, a reranker on top, and an eval suite the team actually runs.

Hybrid search, weighted per corpus

Pure dense embeddings miss exact terms - codes, IDs, proper nouns. BM25 + dense beats either one alone on almost every customer corpus. We tune the weights to your data, not a generic default.
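
A minimal sketch of the fusion step, assuming bm25_search and dense_search are thin wrappers over your existing indexes and treating the alpha weight as the knob we tune against the eval set:

  # Weighted fusion of lexical and dense scores - a sketch, not production code.
  # bm25_search / dense_search are assumed wrappers returning {doc_id: score};
  # alpha is the per-corpus weight tuned against the eval set.
  def hybrid_search(query: str, k: int = 50, alpha: float = 0.4) -> list[str]:
      lexical = bm25_search(query, k=200)    # exact terms: codes, IDs, proper nouns
      semantic = dense_search(query, k=200)  # paraphrases, jargon, cross-document phrasing

      def norm(scores: dict) -> dict:        # min-max normalize so the two scales are comparable
          if not scores:
              return {}
          lo, hi = min(scores.values()), max(scores.values())
          return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

      lexical, semantic = norm(lexical), norm(semantic)
      fused = {d: alpha * lexical.get(d, 0.0) + (1 - alpha) * semantic.get(d, 0.0)
               for d in set(lexical) | set(semantic)}
      return sorted(fused, key=fused.get, reverse=True)[:k]

Reciprocal rank fusion is the other common way to merge the two lists; which one wins on your corpus is an eval question, not a default.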

A reranker on top of recall

Top-50 recall is cheap; top-5 precision is what users see. A cross-encoder reranking the top-50 lifts top-5 precision 8-15 points on most corpora, at the cost of a few tens of milliseconds of latency we can budget for.
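
A sketch of that stage, assuming a stock sentence-transformers cross-encoder; the model named here is a common public baseline, and the one that ships is whichever survives the eval:

  # Rerank the top-50 recall set down to a precise top-5 - a sketch using a stock
  # sentence-transformers cross-encoder; swap in whichever model your eval prefers.
  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

  def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
      scores = reranker.predict([(query, p) for p in passages])  # joint query+passage scoring
      ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
      return [p for p, _ in ranked[:top_k]]

  # top5 = rerank(query, hybrid_search(query, k=50))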

An eval suite your team runs

200-300 golden questions with expected citations. We build it first, before any retrieval changes. Every change after that ships with a measured delta - no anecdotal improvements, no silent regressions.
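
The core metric is small enough to read in full - a sketch, assuming retrieve is your pipeline and returns chunk IDs:

  # Scoring retrieval against the golden set - a sketch. `golden` is a list of
  # {"question": ..., "expected_citations": [...]} records; `retrieve` is your
  # pipeline and returns chunk IDs.
  def citation_recall_at_k(golden: list[dict], retrieve, k: int = 5) -> float:
      hits = 0
      for case in golden:
          retrieved = set(retrieve(case["question"], k=k))
          if set(case["expected_citations"]) & retrieved:  # at least one expected passage came back
              hits += 1
      return hits / len(golden)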

// Method fit

Not every retrieval problem is a RAG problem.

skip it if

  • Knowledge fits in context

    If your full corpus is under ~50K tokens, just stuff the prompt. Cheaper, faster, fewer moving parts to maintain. RAG adds infrastructure you'll be on the hook for. A quick token-count check for this case is sketched at the end of this section.

  • The problem is the model

    Wrong tone, wrong format, refusal behavior - those are model problems, not retrieval problems. Fine-tuning fixes them; RAG won't.

    Supervised Fine-Tuning (SFT)
  • Your data is structured

    Customer records, transactions, inventory belong in SQL or a graph DB. Retrieval over structured data should be queries, not embeddings.

use it if

RAG fits when your corpus is too big for context, the content is mostly unstructured (docs, tickets, code, transcripts), and questions are open-ended. That covers most production knowledge-Q&A.
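
For the first "skip it" case above, the check is a few lines - a sketch using tiktoken, with the ~50K figure as the rule of thumb rather than a hard limit:

  # Token-count check for the "knowledge fits in context" case - a sketch using
  # tiktoken; the ~50K budget is the rule of thumb above, not a hard limit.
  import tiktoken

  def fits_in_context(corpus_texts: list[str], budget: int = 50_000) -> bool:
      enc = tiktoken.get_encoding("cl100k_base")
      total = sum(len(enc.encode(t)) for t in corpus_texts)
      return total <= budget  # True -> stuff the prompt and skip the RAG stack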

// How we work

Eval first. Iterate in the open. Hand off code, not Confluence.

Every engagement starts with a shared eval and ends with your team running it in CI. Between those two points, your engineers watch the iteration as it happens - not in a Friday demo.

01

Shared eval as the contract

Week one, we sit with your team and write 100-300 golden questions with expected citations. The eval becomes the spec. No retrieval claim is allowed without a measured delta against it.
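
One golden record might look like this - field names and content are illustrative, not a fixed schema:

  # One golden-set record - illustrative shape only, not a fixed schema.
  GOLDEN_EXAMPLE = {
      "id": "q-031",
      "question": "How do we handle a chargeback filed after the 90-day window?",
      "expected_citations": ["policies/chargebacks.md#after-90-days"],
      "notes": "Phrasing uses internal jargon; answer spans two documents.",
  }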

02

Iterate in the open

Every training and retrieval run lands in a Weights & Biases workspace your team has access to. You see what we're trying, what's working, and what we're killing. The dashboard replaces the status report.
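
Logging a run into that workspace is a few lines - a minimal sketch with a placeholder project name, reusing the eval pieces sketched earlier on this page:

  # Logging a retrieval run to the shared Weights & Biases workspace - a minimal
  # sketch; the project name and config keys are placeholders, and golden /
  # retrieve / citation_recall_at_k are the eval pieces sketched earlier.
  import wandb

  run = wandb.init(project="customer-rag-retrieval",
                   config={"alpha": 0.4, "rerank_top_k": 5})
  wandb.log({"citation_recall@5": citation_recall_at_k(golden, retrieve, k=5)})
  run.finish()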

03

Hand off, then stay nearby

We hand off code, the eval suite wired into your CI, and a runbook your on-call can read at 11pm. We stay on Slack for 30 days after delivery, for the questions that come up after we leave.
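
The gate in CI can be a single pytest check - a sketch, with a placeholder baseline that only moves through a reviewed change:

  # Regression gate for CI - a sketch. BASELINE is the last accepted score,
  # checked into the repo and only moved through a reviewed change.
  BASELINE_CITATION_RECALL_AT_5 = 0.68  # placeholder value

  def test_retrieval_does_not_regress():
      score = citation_recall_at_k(golden, retrieve, k=5)
      assert score >= BASELINE_CITATION_RECALL_AT_5, (
          f"citation recall@5 dropped to {score:.3f} - investigate before merging"
      )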

// Weights & Biases - shared workspace

Weights & Biases workspace from a recent engagement showing 38 training runs and 14 tracked metrics

Real workspace from a recent engagement. 38 runs, 14 tracked metrics across recall, precision, and coherence tests. Your engineers get access on day one - no PDF status reports, no surprise findings at the demo. Run labels are anonymized when the customer requires it.


// Expert insight

The teams that ship great RAG don't have a secret embedding model. They have a 200-question golden set, a hybrid index, and the discipline to gate every change on the eval. Most of the "tricks" matter much less than that.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of two senior ML engineers you'd hire.

You could hire the team. It would take a year and they'd learn this on you. We've already learned it - on production engagements at Brand24, SurferSEO, Comcast, and others.

Embedding fine-tunes that compete with 7B baselines

Our internal mxbai-large fine-tunes have matched gte-Qwen2-7B on customer IR tasks at ~1/15 the parameter count. Methodology reproducible across corpora.

Eval-first methodology

Every retrieval change ships with a measured delta. The W&B workspace is shared with your team. No silent regressions for a cosmetic win.

Senior engineers only, no juniors

Every person on your engagement has shipped retrieval to production. No ramp-up tax, no learning on your dollar.

// FAQ

Common questions about production RAG

When is RAG the right call, and when is fine-tuning?

RAG when the knowledge changes, you need citations, or the corpus is too big for context. Fine-tuning when you need to change behavior, tone, or tool-calling shape. Production systems often use both - retrieval for facts, a lightly tuned model for the answer style.

How fast is retrieval at scale?

Sub-second p95 over millions of chunks is routine. Qdrant and pgvector both scale to hundreds of millions of vectors on the right hardware. Above that, sharding and hybrid index designs (HNSW + ScaNN, IVF-PQ for cold tiers) start to matter. We design for the scale you're heading to.
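
As one illustration, not a sizing recommendation: creating a Qdrant collection with an HNSW index, with placeholder parameters:

  # A Qdrant collection with an HNSW index - an illustrative sketch; vector size
  # and HNSW parameters are starting points, not tuned values.
  from qdrant_client import QdrantClient, models

  client = QdrantClient(url="http://localhost:6333")
  client.create_collection(
      collection_name="docs",
      vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
      hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200),  # recall vs. latency knobs
  )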

How do you isolate tenants in a multi-tenant deployment?

Either per-tenant collections (strongest isolation, more ops) or a shared collection with metadata filters and query-layer enforcement (denser, but needs careful auditing). We pick by data sensitivity and tenant scale, then verify isolation with red-team retrieval queries.
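
The shared-collection variant, sketched against Qdrant's payload filters - tenant_id is an assumed field name, and the filter is applied at the query layer rather than trusted to the client:

  # Shared-collection tenancy via a payload filter - a sketch; tenant_id is an
  # assumed payload field and the filter is enforced at the query layer.
  from qdrant_client import QdrantClient, models

  def search_for_tenant(client: QdrantClient, tenant_id: str, query_vector, k: int = 50):
      return client.search(
          collection_name="docs",
          query_vector=query_vector,
          query_filter=models.Filter(must=[
              models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id))
          ]),
          limit=k,
      )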

What does an engagement cost?

Engagements start at $40K. Most production-RAG projects land between $40K and $120K depending on corpus complexity, multi-tenant requirements, and whether embedding fine-tuning is in scope. We share a fixed-fee proposal after the first scoping call - no time-and-materials surprises.

// Let's ship it

Send us your eval. We'll send back a plan.

Tell us about the corpus, the question shapes you're failing on, and the recall bar. We'll come back with a retrieval design and an eval plan, usually within a business day. Engagements from $40K, typically 4-8 weeks.

Karol Gawron

Head of R&D @ bards.ai