// Research / Trust, PII & Safety

LLM guardrails that hold up on your fine-tune.

Off-the-shelf guardrails were trained on someone else's model and someone else's harms taxonomy. We measure where Llama Guard drops on your fine-tune, train custom classifier heads where it does, and red-team the result before it goes live - so the eval reflects what you ship, not a vendor whitepaper.

// What we see

Generic safety classifiers miss your domain. They miss it quietly.

01

Llama Guard recall drops on your fine-tune

Fine-tuning shifts the model's distribution toward your domain, and the off-the-shelf safety classifier was never trained to match. Recall on a domain-tuned model is reliably worse than on the base. The team finds out from a support ticket, not from the eval.

02

The harms taxonomy doesn't match your trust policy

The vendor categories cover violence, hate, self-harm. Your trust team cares about competitor mentions, off-policy financial advice, and PII exfil. The classifier flags what was easy to label, not what your business needs flagged. The gap shows up in the wrong places.

03

Indirect injection isn't on the demo

DAN and role-play prompts are well-covered. Indirect injection from a retrieved document or a tool output is the actual attack on production agents - and most stacks ship without testing for it. The first incident is a prompt smuggled in through a PDF.

// Case Study

Guidemate - tourist AI agent for the Lubelskie region

We built Guidemate as a LangGraph-based agent embedded on the Lubelskie Voivodeship's official tourism portal - typed toolkit over the region's curated tourism knowledge, integrated with Poland's RCB government safety-warning feed, deployed on the LangGraph Platform agent service. Serves visitors in 30+ languages at ~$0.01 per message, with a guardrails layer running at 99.3% accuracy.

  • 99.3%

    guardrails accuracy

  • $0.01

    per message

  • 30+

    languages supported

Read the case study
Guidemate - tourist AI agent for the Lubelskie region

// What we do

Three things that decide whether the guardrails hold.

Most safety failures aren't the absence of a classifier. They're a classifier trained on the wrong distribution, a taxonomy that doesn't match the trust policy, and a red-team that stops at public corpora.

Custom classifier heads on your distribution

Llama Guard 3 and NeMo Guardrails go in their right place; where the off-the-shelf taxonomy drops, we train distilled DeBERTa or ModernBERT heads on your domain. Sub-10ms inline inference, fine-tunable as the harm landscape shifts, with an active-learning loop on flagged production traffic.

Red-team beyond the public corpus

Garak runs the published attacks. Our team runs the ones that aren't published yet - domain-specific jailbreaks, indirect injection from RAG inputs and tool outputs, multi-turn escalation. The output is a report with concrete attack traces, not a vibes check.

Eval suite gating every deploy

Garak suite tuned to your model, plus your own red-team prompts, plus a regression set built from real incidents. Pass/fail thresholds per attack class block deploys before the cosmetic win lands. Streaming-aware on the output side - generation stops on the offending token, not after the whole response.

// Method fit

Custom guardrails aren't the right control for every LLM.

skip it if

  • You're calling a frontier API directly

    If you call OpenAI or Anthropic and accept the vendor's safety policy as your safety policy, the off-the-shelf moderation endpoint plus a sane system prompt covers most of the surface. Custom is for fine-tunes, agentic systems, or domains where the vendor's taxonomy doesn't match yours.

  • Your real concern is data leakage

    If 'PII goes to a third-party LLM' is the dominant risk, an egress-side redactor in front of the API solves it more cheaply than a guardrail layer. Different problem, different stack.

    PII Redaction & LLM Data Privacy
  • The harms surface is small and static

    If your use case is a narrow internal tool with a known input distribution and no adversarial users, a system prompt plus rule-based output validation is usually enough. Guardrails earn their cost when the input distribution is open and the user incentive to attack exists.

use it if

Custom guardrails fit when you've fine-tuned a model and the off-the-shelf classifier drops on it, your harms taxonomy doesn't match the vendor categories, you're running an agentic system that takes input from retrieved content or tools, or your buyers are starting to ask for evidence beyond a vendor claim.

// How we work

Taxonomy first. Red-team in the open. Hand off the eval.

Every guardrail engagement starts with the harms taxonomy your trust team owns - and a measurement of where the off-the-shelf stack drops on your distribution. The classifier work comes after.

01

Taxonomy and gap measurement (week one)

We run a workshop with your trust and safety team to define what's blocked, flagged, allowed-but-logged, or out of scope. Then we measure where Llama Guard, NeMo, and the moderation API drop on your traffic. The number tells us which heads to train.

02

Iterate in a shared workspace

Every classifier head, every red-team round, every eval delta lands in a Weights & Biases or MLflow workspace your engineers can see. The dashboard replaces the status report - your team sees what passes, what fails, and what we're killing.

03

Hand off the eval and the runbook

We hand off the classifier heads, the integration code, the eval suite in your CI, and a runbook for the day a guard fires for the wrong reason - kill switches, tenant overrides, false-positive triage. Slack for 30 days after delivery.

Karol Gawron

// Expert insight

Llama Guard is a strong default - until you fine-tune. Domain fine-tunes drift the model's behavior in ways the off-the-shelf classifier was never trained to catch, and the eval surface narrows in lockstep. Custom classifier heads exist because that gap is real, not because we want to sell more model training.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Why us, instead of two senior safety engineers you'd hire.

Adversarial ML rewards experience. We've shipped guardrails to production, watched them get attacked, and trained the heads from scratch - on our own published models.

Authors of the EU PII model

bardsai/eu-pii-anonymization-multilang is ours. Output PII validation runs on a model we publish - no licensing surprises, no opaque vendor stack on the safety path.

10+ peer-reviewed NLP publications

CLARIN-PL spinoff. We read the safety literature when it lands and ship classifiers from scratch - you get a custom guardrail head, not another wrapper.

Senior engineers only, no juniors

Every person on your engagement has shipped guardrails to production and watched them get attacked. No ramp-up tax on adversarial ML.

// FAQ

Common questions about LLM guardrails

Fine-tuning narrows the model's distribution toward your domain - and narrows the safety classifier's training distribution out of it. Llama Guard's recall on a domain-specific model is reliably worse than on the base it was trained against. The fix is a domain-tuned classifier head plus a real eval on your fine-tune, not a stronger off-the-shelf default.

Distilled classifier heads: 5-15ms per call on a T4 or L40S. Llama Guard 3 (8B) inline: 50-150ms depending on context length. For latency-sensitive routes we run a small classifier inline and a larger one async on a sample. Output guardrails run streaming - they stop generation on the offending token rather than waiting for the whole response.

Two ways. First, our red-team runs domain-specific attacks during the build - we don't ship without exploring the prompt space your customers actually inhabit. Second, the production system logs flagged inputs under a strict review policy, our team triages new attack patterns weekly, and the input classifier gets retrained on the new corpus.

Yes - and the threat model is different. Indirect prompt injection (a malicious payload in a retrieved document or tool output) is the dominant attack against agents. We add input scrubbing on retrieved content, structured-output validation on tool calls, and per-tool authorization policies. Agent guardrails are a superset of chat guardrails, not the same thing.

Engagements start at $40K. Most guardrails projects land between $40K and $100K depending on classifier custom-training scope, red-team depth, and whether the eval suite is greenfield. Fixed-fee proposal after the first scoping call - no time-and-materials surprise.

// Let's ship it

Send us your harms taxonomy. We'll send back a plan.

Tell us about the model, the trust policy, and the production traffic shape. We'll come back with a guardrail architecture, a red-team plan, and a number - usually within a business day. Engagements from $40K, typically 4-8 weeks.

Karol Gawron

Karol Gawron

Head of R&D @ bards.ai