// Research / Trust, PII & Safety

PII redaction that catches the languages Presidio misses.

Your DPO won't sign off on sending customer names and national IDs to a US LLM API. We sit between your app and the model with multilingual neural detection, reversible tokenization, and audit logs your compliance team can actually query - so the LLM never sees what it doesn't need to.

// What we see

Demo passes English. Production fails on Polish, German, French.

01

Presidio is English-first

The standard stack (Presidio + spaCy) handles English well and degrades quietly on everything else. Polish PESEL, German Steuer-ID, French NIR, Italian codice fiscale go undetected. The compliance team finds out when an auditor pulls a sample.

02

Regex misses the long tail

PESEL has a checksum so regex catches it. Names don't. Internal MRNs, employee IDs, account numbers wrapped in domain prose - those need a model. Teams that stop at regex ship a redactor that catches the easy 60% and leaks the rest.

03

Reversibility is an afterthought

Most teams discover at the end of week one that the LLM needs to see something useful in place of redacted spans, and that the response needs the originals back. Without stable placeholders and a vault, you either lose semantics or leak PII in the rehydration step.

// Case Study

GDPR-compliant PII redaction - 24 EU languages, drop-in

Every GDPR-bound team we worked with was hand-rolling their own preprocessing - regex stacks, per-language NER, hours of pipeline work per dataset, and coverage that stopped at the languages they had labelers for. We shipped `bardsai/eu-pii-anonimization-multilang` as the drop-in library that replaces all of it: one import covers 24 EU languages with a GDPR-aware tagging schema, recall-tuned because false negatives leak. 0.890 F2 on the Gretel benchmark; hours of pipeline work collapse to seconds per dataset.

  • 0.890

    F2 on Gretel PII benchmark

  • 24

    EU languages with one model

  • Hours → s

    preprocessing per dataset

Read the case study
GDPR-compliant PII redaction - 24 EU languages, drop-in

// What we do

Three things that make PII redaction hold up in regulated traffic.

The model that catches non-English entities, the tokenization layer that keeps the LLM useful, and the audit log your DPO can hand to a regulator. The rest is integration.

A model trained on EU languages

bardsai/eu-pii-anonymization-multilang - our published Hugging Face model, trained on 20+ languages including the ones Presidio degrades on. Fine-tunable on your domain (medical, legal, financial) when the public taxonomy doesn't cover your identifiers.

Reversible tokenization with a vault you own

Each entity is replaced with a stable placeholder like [PERSON_1]; the original lives in a vault on your infrastructure. The LLM reasons over placeholders, the response gets rehydrated on the way back. Per-tenant keys, rotation, and an optional one-way mode for stricter policies.

Per-entity F1 measured on your data

Vendor benchmarks aren't your distribution. We measure per-entity, per-language F1 on your held-out set before go-live, and tune confidence thresholds to your risk model - conservative on national IDs, more permissive on locations. The metric reflects what you're actually shipping.

// Method fit

Edge redaction isn't the right control for every privacy problem.

skip it if

  • Your LLM is already inside your perimeter

    If you're running open-weight models on your own hardware and customer data never leaves your network, the egress problem doesn't exist. Redaction inside your perimeter is a different design - usually about minimizing data in logs and traces, not at the model boundary.

    On-Prem & Air-Gapped LLM
  • You only handle English text

    Microsoft Presidio plus a tuned spaCy pipeline gets you a long way on English-only traffic. Custom is for the multilingual case, for domain identifiers spaCy doesn't carry, or when reversible tokenization is part of the requirement.

  • Your problem is structured data exfil

    Customer records leaving via SQL exports, API responses, or backup pipelines aren't an LLM problem - they're a data-loss-prevention problem. Different stack (DLP gateways, data classification, egress firewalls), different vendor list.

use it if

Edge PII redaction fits when you call third-party LLM APIs with customer text, operate across multiple languages, need reversible tokenization to keep responses useful, and have a DPO who needs an audit trail rather than a vendor claim.

// How we work

Measure first. Tokenize early. Hand off the audit trail.

Every PII engagement starts with a measurement on your data, not a vendor benchmark. The architecture is decided after we know which entities your traffic actually contains and at what rate.

01

Measure on your traffic (week one)

We run the base model on a sampled, redacted-for-handling slice of your traffic and report per-entity, per-language F1. The number you ship against is calibrated to your distribution before any integration code is written.

02

Integrate where the trust boundary is

Proxy, SDK, or sidecar - we pick by where your trust boundary sits, not by what's easiest to demo. Streaming-aware: PII tokens get redacted before they leave the boundary, not after the response is done.

03

Hand off the audit trail

We hand off the model, the vault schema, the integration code, and an audit-log dashboard your compliance team can query for Article 30 records. Slack for 30 days after delivery for the questions that come up after we leave.

Michał Pogoda-Rosikoń

// Expert insight

The off-the-shelf PII stack is English-first and quietly silent on Polish, German, or French national identifiers. We published the EU PII model because the Presidio-plus-spaCy pipeline was failing on the languages our customers actually operate in - and the compliance teams were finding out from auditors, not from us.

Michał Pogoda-Rosikoń

Co-founder @ bards.ai

Read the PII research

// Why bards.ai

Why us, instead of a Presidio plug-in and hope.

PII detection is a research problem dressed up as a compliance problem. We did the research first - and the same model in your deployment is the one we publish on Hugging Face.

Published EU PII model

bardsai/eu-pii-anonymization-multilang covers 20+ languages with open weights. The same model in your deployment is what we publish - no licensing surprises, no opaque vendor stack on the safety path.

10+ peer-reviewed NLP publications

CLARIN-PL spinoff. We publish in NLP venues and ship the same models to production - the research and the product are the same artifact.

Senior engineers only, no juniors

Every person on your engagement has shipped NLP to production. No ramp-up tax on a domain we already own.

// FAQ

Common questions about PII redaction

On our public benchmark, the multilingual model lands in the 0.92-0.97 F1 range for high-resource entities (PERSON, ORG, LOC, EMAIL) across PL, DE, FR, ES, IT, EN. National identifiers and structured tokens (IBAN, PESEL, IP) hit 0.99+ because they're regex-validatable. We always re-measure on your data before go-live - vendor benchmarks aren't a substitute for your distribution.

On a single A10 or T4, redaction adds 5-15ms per request for typical chat-length inputs and runs in parallel with first-token streaming. For long documents we batch. A distilled variant drops latency by ~3x at a measurable but small recall cost when the budget is tighter.

On detection, each entity is replaced with a stable placeholder ([PERSON_1], [NAT_ID_1]) and the original is stored in a vault keyed by request and tenant. The LLM operates on placeholders. On the response path we rehydrate. The vault lives on your infrastructure - we never see plaintext PII.

It addresses data minimization and pseudonymization (GDPR Art. 5, Art. 25, Art. 32). It does not replace a DPA with your LLM vendor, your Record of Processing, or your DPIA. We help wire those together - but a tool is not a compliance program.

Engagements start at $40K. Most PII redaction projects land between $40K and $90K depending on language coverage, custom-entity work, and integration topology (proxy / SDK / sidecar / all three). Fixed-fee proposal after the first scoping call - no time-and-materials surprise.

// Let's ship it

Send us a sample of your traffic. We'll send back a plan.

Tell us about the languages you operate in, the entities your DPO is worried about, and where the LLM call actually happens. We'll come back with an integration plan and an accuracy estimate on your distribution - usually within a business day. Engagements from $40K, typically 4-8 weeks.

Michał Pogoda-Rosikoń

Michał Pogoda-Rosikoń

Co-founder @ bards.ai