// Research / Trust, PII & Safety

LLM guardrails that hold up on your fine-tune.

Off-the-shelf guardrails were trained on someone else's model and someone else's harms taxonomy. Fine-tunes shift the distribution; agents add a second attack surface through retrieved documents, MCP tool responses, and function-call outputs. We measure where Llama Guard drops on your fine-tune, train custom classifier heads where it does, and red-team the full attack surface - chat and agentic - before it goes live.

Book a meeting See related case study

// What we see

Generic safety classifiers miss your domain. They miss it quietly.

Llama Guard recall drops on your fine-tune

Fine-tuning shifts the model's distribution toward your domain, and the off-the-shelf safety classifier was never trained to match. Recall on a domain-tuned model is reliably worse than on the base. The team finds out from a support ticket, not from the eval.

The harms taxonomy doesn't match your trust policy

The vendor categories cover violence, hate, self-harm. Your trust team cares about competitor mentions, off-policy financial advice, and PII exfil. The classifier flags what was easy to label, not what your business needs flagged. The gap shows up in the wrong places.

The attacks that hit production aren't the ones on the demo

DAN and role-play variants are well-covered by Garak and public benchmarks. The incidents in deployed agents come from: MCP tool responses that override the system prompt, policy instructions embedded in RAG-retrieved documents, and multi-turn escalation that no single-request eval catches. Most guardrail stacks aren't instrumented at the agent tool boundary - the first signal is a support ticket, not a CI failure.

// Case Study

Guidemate - tourist AI agent for the Lubelskie region

We built Guidemate as a LangGraph-based agent embedded on the Lubelskie Voivodeship's official tourism portal - typed toolkit over the region's curated tourism knowledge, integrated with Poland's RCB government safety-warning feed, deployed on the LangGraph Platform agent service. Serves visitors in 30+ languages at ~$0.01 per message, with a guardrails layer running at 99.3% accuracy.

99.3%
guardrails accuracy
$0.01
per message
30+
languages supported

Read the case study

Guidemate - tourist AI agent for the Lubelskie region

// What we do

Three things that decide whether the guardrails hold.

Most safety failures aren't the absence of a classifier. They're a classifier trained on the wrong distribution, a taxonomy that doesn't match the trust policy, and a red-team that stops at public corpora.

Custom classifier heads on your distribution

Llama Guard 3 and NeMo Guardrails go in their right place; where the off-the-shelf taxonomy drops, we train distilled DeBERTa or ModernBERT heads on your domain. Sub-10ms inline inference, fine-tunable as the harm landscape shifts, with an active-learning loop on flagged production traffic.

Red-team beyond the public corpus

Garak runs the published attacks. Our team runs the ones that aren't published yet - domain-specific jailbreaks, indirect injection from RAG inputs and tool outputs, multi-turn escalation. The output is a report with concrete attack traces, not a vibes check.

Eval suite gating every deploy

Garak suite tuned to your model, plus your own red-team prompts, plus a regression set built from real incidents. Pass/fail thresholds per attack class block deploys before the cosmetic win lands. Streaming-aware on the output side - generation stops on the offending token, not after the whole response.

// Method fit

Custom guardrails aren't the right control for every LLM.

skip it if

You're calling a frontier API directly
If you call OpenAI or Anthropic and accept the vendor's safety policy as your safety policy, the off-the-shelf moderation endpoint plus a sane system prompt covers most of the surface. Custom is for fine-tunes, agentic systems, or domains where the vendor's taxonomy doesn't match yours.
Your real concern is data leakage
If 'PII goes to a third-party LLM' is the dominant risk, an egress-side redactor in front of the API solves it more cheaply than a guardrail layer. Different problem, different stack.
PII Redaction & LLM Data Privacy
The harms surface is small and static
If your use case is a narrow internal tool with a known input distribution and no adversarial users, a system prompt plus rule-based output validation is usually enough. Guardrails earn their cost when the input distribution is open and the user incentive to attack exists.

use it if

You've fine-tuned an open-weight model and need to verify the off-the-shelf classifier still holds on your distribution before shipping - not just on the base model benchmark.

You're running an agentic system - LangGraph, LlamaIndex, AWS AgentCore, custom - that consumes retrieved content, external tools, or MCP server responses. The injection attack surface is fundamentally different from a chat system.

Your harms taxonomy doesn't match the vendor categories. The classifier flags violence and self-harm; your trust team cares about competitor mentions, off-policy advice, and PII in generated text.

Enterprise procurement is asking for documented red-team results, classifier F1 on your distribution, and a CI-gated eval suite - not a link to a vendor safety page.

// How we work

Taxonomy first. Red-team in the open. Hand off the eval.

Every guardrail engagement starts with the harms taxonomy your trust team owns - and a measurement of where the off-the-shelf stack drops on your distribution. The classifier work comes after.

Taxonomy and gap measurement (week one)

We run a workshop with your trust and safety team to define what's blocked, flagged, allowed-but-logged, or out of scope. Then we measure where Llama Guard, NeMo, and the moderation API drop on your traffic. The number tells us which heads to train.

Iterate in a shared workspace

Every classifier head, every red-team round, every eval delta lands in a Weights & Biases or MLflow workspace your engineers can see. The dashboard replaces the status report - your team sees what passes, what fails, and what we're killing.

Hand off the eval and the runbook

We hand off the classifier heads, the integration code, the eval suite in your CI, and a runbook for the day a guard fires for the wrong reason - kill switches, tenant overrides, false-positive triage. Slack for 30 days after delivery.

// Expert insight

“Llama Guard is a strong default - until you fine-tune. Domain fine-tunes drift the model's behavior in ways the off-the-shelf classifier was never trained to catch, and the eval surface narrows in lockstep. Custom classifier heads exist because that gap is real, not because we want to sell more model training.”

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of two senior safety engineers you'd hire.

Adversarial ML rewards experience. We've shipped guardrails to production, watched them get attacked, and trained the heads from scratch - on our own published models.

Authors of the EU PII model

bardsai/eu-pii-anonymization-multilang is ours. Output PII validation runs on a model we publish - no licensing surprises, no opaque vendor stack on the safety path.

Agentic attack surface, not just chat jailbreaks

We test indirect injection through MCP tool responses, RAG-retrieved documents, and function-call outputs - the vectors that cause production incidents in 2025-2026, not the jailbreaks that are already on every public benchmark.

F1 measured on your distribution, not a vendor benchmark

We run the base classifier on a sample of your traffic before writing a line of integration code. The engagement scope is driven by that number, not by a sales estimate.

Evidence-grade deliverables for compliance buyers

Enterprise procurement asks for red-team reports, classifier calibration results, and eval suite pass/fail logs with per-attack-class thresholds. We produce the documentation format your buyers and auditors expect.

10+ peer-reviewed NLP publications

CLARIN-PL spinoff. We read the safety literature when it lands and ship classifiers from scratch - you get a custom guardrail head, not another wrapper.

Senior engineers only, no juniors

Every person on your engagement has shipped guardrails to production and watched them get attacked. No ramp-up tax on adversarial ML.

// FAQ

Common questions about LLM guardrails

Fine-tuning narrows the model's distribution toward your domain - and narrows the safety classifier's training distribution out of it. Llama Guard's recall on a domain-specific model is reliably worse than on the base it was trained against. The fix is a domain-tuned classifier head plus a real eval on your fine-tune, not a stronger off-the-shelf default.

Distilled classifier heads: 5-15ms per call on a T4 or L40S. Llama Guard 3 (8B) inline: 50-150ms depending on context length. For latency-sensitive routes we run a small classifier inline and a larger one async on a sample. Output guardrails run streaming - they stop generation on the offending token rather than waiting for the whole response.

Two ways. First, our red-team runs domain-specific attacks during the build - we don't ship without exploring the prompt space your customers actually inhabit. Second, the production system logs flagged inputs under a strict review policy, our team triages new attack patterns weekly, and the input classifier gets retrained on the new corpus.

Yes - and the threat model is meaningfully different. The dominant attack against production agents isn't a jailbreak in the user message; it's an instruction injected through a retrieved document, an MCP server response, or a tool's return value. We add input scrubbing at the agent tool boundary, structured-output validation on every tool call, and per-tool authorization policies that the agent can't override at runtime. Agent guardrails are a superset of chat guardrails - the underlying classifiers are the same, but the instrumentation points and the red-team playbook are distinct.

Article 15 of the EU AI Act requires high-risk AI systems to include appropriate accuracy, robustness, and cybersecurity measures - with technical documentation that supports post-market monitoring. A calibrated classifier suite with per-attack-class F1 measurements, a structured red-team report, and a CI-gated eval suite are the evidence artifacts that satisfy both internal audit and third-party conformity assessments. We produce these as part of every engagement; the documentation format is designed to map to Article 9 risk management and Article 15 technical requirements, not to serve as a marketing brochure.

Engagements start at $40K. Most guardrails projects land between $40K and $100K depending on classifier custom-training scope, red-team depth, and whether the eval suite is greenfield. Fixed-fee proposal after the first scoping call - no time-and-materials surprise.

// Let's ship it

Send us your harms taxonomy. We'll send back a plan.

Tell us about the model, the trust policy, and the production traffic shape. We'll come back with a guardrail architecture, a red-team plan, and a number - usually within a business day. Engagements from $40K, typically 4-8 weeks.

Book a meeting hello@bards.ai

Karol Gawron

Head of R&D @ bards.ai