// Production / Agentic Workflows & RAG

RAG that works on your data.

Your demo answered every question. Now your customers are asking question 31, and you can't tell whether retrieval missed or the model hallucinated. We build the retrieval layer, the eval suite, and the regression gates that catch question 31 before your customers do.

// What we see

Demo works. Production breaks. Always in the same places.

01

Question 31 retrieves nothing

Your demo questions all hit. The 31st is phrased differently, uses internal jargon, or spans two documents - and the top-50 misses the right passage entirely. You don't know it's happening because the model fills in the gap with a confident wrong answer.

02

You can't tell what failed

A bad answer comes back. Was it retrieval that missed, or the model that hallucinated? Most teams can't tell, because there's no eval that separates the two. So fixes are guesses, and the same bug keeps coming back.

03

Every improvement is anecdotal

Someone tweaks the chunk size and 12 questions silently regress. The team that made it better can't prove it. The team that made it worse won't see it. Production drifts, and the only signal is when a customer complains.

// Case Study

Text-search across 200 live city camera feeds

Municipal operators type a description and the system surfaces matching events from across the city's live CCTV network. We built it for Neural; the City of Oława's Straż Miejska (municipal guard) runs it on-prem. 200 cameras per server; review time on a typical incident dropped from ~8 hours of manual scrubbing to under 1 hour - an ~88% reduction.

  • 200

    live cameras per on-prem server

  • ~88%

    less time per incident review

  • ~33K

    residents covered (Oława)

Read the case study

// What we do

Three things that do most of the work.

Most of the wins aren't the latest paper. They're a hybrid index wired correctly, a reranker on top, and an eval suite the team actually runs.

Hybrid search, weighted per corpus

Pure dense embeddings miss exact terms - codes, IDs, proper nouns. BM25 + dense beats either one alone on almost every customer corpus. We tune the weights to your data, not a generic default.
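
A minimal sketch of the fusion step, assuming bm25_search and dense_search are thin wrappers over your existing indexes and treating the alpha weight as the knob we tune against the eval set:

  # Weighted fusion of lexical and dense scores - a sketch, not production code.
  # bm25_search / dense_search are assumed wrappers returning {doc_id: score};
  # alpha is the per-corpus weight tuned against the eval set.
  def hybrid_search(query: str, k: int = 50, alpha: float = 0.4) -> list[str]:
      lexical = bm25_search(query, k=200)    # exact terms: codes, IDs, proper nouns
      semantic = dense_search(query, k=200)  # paraphrases, jargon, cross-document phrasing

      def norm(scores: dict) -> dict:        # min-max normalize so the two scales are comparable
          if not scores:
              return {}
          lo, hi = min(scores.values()), max(scores.values())
          return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

      lexical, semantic = norm(lexical), norm(semantic)
      fused = {d: alpha * lexical.get(d, 0.0) + (1 - alpha) * semantic.get(d, 0.0)
               for d in set(lexical) | set(semantic)}
      return sorted(fused, key=fused.get, reverse=True)[:k]

Reciprocal rank fusion is the other common way to merge the two lists; which one wins on your corpus is an eval question, not a default.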

A reranker on top of recall

Top-50 recall is cheap; top-5 precision is what users see. A cross-encoder reranking the top-50 lifts top-5 precision 8-15 points on most corpora, at the cost of a few tens of milliseconds of latency we can budget for.
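
A sketch of that stage, assuming a stock sentence-transformers cross-encoder; the model named here is a common public baseline, and the one that ships is whichever survives the eval:

  # Rerank the top-50 recall set down to a precise top-5 - a sketch using a stock
  # sentence-transformers cross-encoder; swap in whichever model your eval prefers.
  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

  def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
      scores = reranker.predict([(query, p) for p in passages])  # joint query+passage scoring
      ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
      return [p for p, _ in ranked[:top_k]]

  # top5 = rerank(query, hybrid_search(query, k=50))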

An eval suite your team runs

200-300 golden questions with expected citations. We build it first, before any retrieval changes. Every change after that ships with a measured delta - no anecdotal improvements, no silent regressions.
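
The core metric is small enough to read in full - a sketch, assuming retrieve is your pipeline and returns chunk IDs:

  # Scoring retrieval against the golden set - a sketch. `golden` is a list of
  # {"question": ..., "expected_citations": [...]} records; `retrieve` is your
  # pipeline and returns chunk IDs.
  def citation_recall_at_k(golden: list[dict], retrieve, k: int = 5) -> float:
      hits = 0
      for case in golden:
          retrieved = set(retrieve(case["question"], k=k))
          if set(case["expected_citations"]) & retrieved:  # at least one expected passage came back
              hits += 1
      return hits / len(golden)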

// Method fit

Not every retrieval problem is a RAG problem.

skip it if

  • Knowledge fits in context

    If your full corpus is under ~50K tokens, just stuff the prompt. Cheaper, faster, fewer moving parts to maintain. RAG adds infrastructure you'll be on the hook for. A quick token-count check for this case is sketched at the end of this section.

  • The problem is the model

    Wrong tone, wrong format, refusal behavior - those are model problems, not retrieval problems. Fine-tuning fixes them; RAG won't.

    Supervised Fine-Tuning (SFT)
  • Your data is structured

    Customer records, transactions, inventory belong in SQL or a graph DB. Retrieval over structured data should be queries, not embeddings.

use it if

RAG fits when your corpus is too big for context, the content is mostly unstructured (docs, tickets, code, transcripts), and questions are open-ended. That covers most production knowledge-Q&A.
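
For the first "skip it" case above, the check is a few lines - a sketch using tiktoken, with the ~50K figure as the rule of thumb rather than a hard limit:

  # Token-count check for the "knowledge fits in context" case - a sketch using
  # tiktoken; the ~50K budget is the rule of thumb above, not a hard limit.
  import tiktoken

  def fits_in_context(corpus_texts: list[str], budget: int = 50_000) -> bool:
      enc = tiktoken.get_encoding("cl100k_base")
      total = sum(len(enc.encode(t)) for t in corpus_texts)
      return total <= budget  # True -> stuff the prompt and skip the RAG stack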

// How we work

Eval first. Iterate in the open. Hand off code, not Confluence.

Every engagement starts with a shared eval and ends with your team running it in CI. Between those two points, your engineers watch the iteration as it happens - not in a Friday demo.

01

Shared eval as the contract

Week one, we sit with your team and write 100-300 golden questions with expected citations. The eval becomes the spec. No retrieval claim is allowed without a measured delta against it.
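
One golden record might look like this - field names and content are illustrative, not a fixed schema:

  # One golden-set record - illustrative shape only, not a fixed schema.
  GOLDEN_EXAMPLE = {
      "id": "q-031",
      "question": "How do we handle a chargeback filed after the 90-day window?",
      "expected_citations": ["policies/chargebacks.md#after-90-days"],
      "notes": "Phrasing uses internal jargon; answer spans two documents.",
  }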

02

Iterate in the open

Every training and retrieval run lands in a Weights & Biases workspace your team has access to. You see what we're trying, what's working, and what we're killing. The dashboard replaces the status report.
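
Logging a run into that workspace is a few lines - a minimal sketch with a placeholder project name, reusing the eval pieces sketched earlier on this page:

  # Logging a retrieval run to the shared Weights & Biases workspace - a minimal
  # sketch; the project name and config keys are placeholders, and golden /
  # retrieve / citation_recall_at_k are the eval pieces sketched earlier.
  import wandb

  run = wandb.init(project="customer-rag-retrieval",
                   config={"alpha": 0.4, "rerank_top_k": 5})
  wandb.log({"citation_recall@5": citation_recall_at_k(golden, retrieve, k=5)})
  run.finish()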

03

Hand off, then stay nearby

We hand off code, the eval suite wired into your CI, and a runbook your on-call can read at 11pm. We stay on Slack for 30 days after delivery, for the questions that come up after we leave.
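
The gate in CI can be a single pytest check - a sketch, with a placeholder baseline that only moves through a reviewed change:

  # Regression gate for CI - a sketch. BASELINE is the last accepted score,
  # checked into the repo and only moved through a reviewed change.
  BASELINE_CITATION_RECALL_AT_5 = 0.68  # placeholder value

  def test_retrieval_does_not_regress():
      score = citation_recall_at_k(golden, retrieve, k=5)
      assert score >= BASELINE_CITATION_RECALL_AT_5, (
          f"citation recall@5 dropped to {score:.3f} - investigate before merging"
      )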

// Weights & Biases - shared workspace

Weights & Biases workspace from a recent engagement showing 38 training runs and 14 tracked metrics

Real workspace from a recent engagement. 38 runs, 14 tracked metrics across recall, precision, and coherence tests. Your engineers get access on day one - no PDF status reports, no surprise findings at the demo. Run labels are anonymized when the customer requires it.


// Expert insight

The teams that ship great RAG don't have a secret embedding model. They have a 200-question golden set, a hybrid index, and the discipline to gate every change on the eval. Most of the "tricks" matter much less than that.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of two senior ML engineers you'd hire.

You could hire the team. It would take a year and they'd learn this on you. We've already learned it - on production engagements at Brand24, SurferSEO, Comcast, and others.

Embedding fine-tunes that compete with 7B baselines

Our internal mxbai-large fine-tunes have matched gte-Qwen2-7B on customer IR tasks at ~1/15 the parameter count. Methodology reproducible across corpora.

Eval-first methodology

Every retrieval change ships with a measured delta. The W&B workspace is shared with your team. No silent regressions for a cosmetic win.

Senior engineers only, no juniors

Every person on your engagement has shipped retrieval to production. No ramp-up tax, no learning on your dollar.

// FAQ

Common questions about production RAG

When is RAG the right call, and when is fine-tuning?

RAG when the knowledge changes, you need citations, or the corpus is too big for context. Fine-tuning when you need to change behavior, tone, or tool-calling shape. Production systems often use both - retrieval for facts, a lightly tuned model for the answer style.

How fast is retrieval at scale?

Sub-second p95 over millions of chunks is routine. Qdrant and pgvector both scale to hundreds of millions of vectors on the right hardware. Above that, sharding and hybrid index designs (HNSW + ScaNN, IVF-PQ for cold tiers) start to matter. We design for the scale you're heading to.
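
As one illustration, not a sizing recommendation: creating a Qdrant collection with an HNSW index, with placeholder parameters:

  # A Qdrant collection with an HNSW index - an illustrative sketch; vector size
  # and HNSW parameters are starting points, not tuned values.
  from qdrant_client import QdrantClient, models

  client = QdrantClient(url="http://localhost:6333")
  client.create_collection(
      collection_name="docs",
      vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
      hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200),  # recall vs. latency knobs
  )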

How do you isolate tenants in a multi-tenant deployment?

Either per-tenant collections (strongest isolation, more ops) or a shared collection with metadata filters and query-layer enforcement (denser, but needs careful auditing). We pick by data sensitivity and tenant scale, then verify isolation with red-team retrieval queries.
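
The shared-collection variant, sketched against Qdrant's payload filters - tenant_id is an assumed field name, and the filter is applied at the query layer rather than trusted to the client:

  # Shared-collection tenancy via a payload filter - a sketch; tenant_id is an
  # assumed payload field and the filter is enforced at the query layer.
  from qdrant_client import QdrantClient, models

  def search_for_tenant(client: QdrantClient, tenant_id: str, query_vector, k: int = 50):
      return client.search(
          collection_name="docs",
          query_vector=query_vector,
          query_filter=models.Filter(must=[
              models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id))
          ]),
          limit=k,
      )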

What does an engagement cost?

Engagements start at $40K. Most production-RAG projects land between $40K and $120K depending on corpus complexity, multi-tenant requirements, and whether embedding fine-tuning is in scope. We share a fixed-fee proposal after the first scoping call - no time-and-materials surprises.

// Let's ship it

Send us your eval. We'll send back a plan.

Tell us about the corpus, the question shapes you're failing on, and the recall bar. We'll come back with a retrieval design and an eval plan, usually within a business day. Engagements from $40K, typically 4-8 weeks.

Karol Gawron

Head of R&D @ bards.ai