// GDPR PII Redaction

GDPR-compliant PII redaction - 24 EU languages, drop-in

Every GDPR-bound team we worked with was hand-rolling its own preprocessing: regex stacks, per-language NER, hours of pipeline work per dataset, and coverage that stopped at the languages it had labelers for. We shipped `bardsai/eu-pii-anonimization-multilang` as the drop-in library that replaces all of it: one import covers 24 EU languages with a GDPR-aware tagging schema, recall-tuned because false negatives leak. 0.890 F2 on the Gretel benchmark, and hours of pipeline work collapse to seconds per dataset.

offices: Wrocław, Poland

size: 20-50 employees

industry: AI R&D - open-source release

revenue: -

// Outcomes

The numbers that matter

  • 0.890

    F2 on Gretel PII benchmark

  • 24

    EU languages with one model

  • Hours → seconds

    preprocessing per dataset

01 · GDPR-bound preprocessing was a hand-rolled job at every team

The Challenge

Every team we talked to was solving the same problem from scratch. Before any LLM training, fine-tuning, or analytics could touch user-generated text, the data had to be scrubbed of PII to satisfy GDPR - and at each of them that scrubbing was a bespoke pipeline. Regex stacks for emails and phone numbers, off-the-shelf NER for names, custom rules for national IDs that the open-source models didn't know about, and a separate pipeline per language because the labels didn't transfer. Hours of preprocessing work per dataset, brittle in production, and coverage that stopped at whichever languages the team had labelers for.

GDPR scope is broad and language-specific. Names, addresses, phone formats, national IDs, and tax numbers are all country-specific entity types: a model that knows what a German Steuernummer looks like has no idea what a Polish PESEL is, what a Spanish DNI looks like, or how French SIREN numbers are formatted. Coverage across all 24 official EU languages was the whole job - and existing open-source NER tooling stopped short of that.

Recall is the metric that matters under GDPR. A false negative - a PII span the model missed - is a privacy violation that compounds downstream the moment the redacted text leaves the perimeter. A false positive - over-redacting a non-PII span - is recoverable. We needed a model tuned for the asymmetry: recall-first, with a tagging schema mapped to GDPR Article 4(1) categories, packaged so a downstream team could swap in one library call instead of rebuilding the pipeline.
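The asymmetry is exactly what the F-beta family encodes: at beta = 2, recall counts four times as much as precision, so a recall-heavy operating point outscores a precision-heavy one with the same raw numbers. A minimal sketch of the generic formula (not the released eval code):

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Swapping the same two numbers between precision and recall shows the
# asymmetry: under F2, the recall-heavy point wins.
high_recall = f_beta(precision=0.80, recall=0.90)     # ~0.878
high_precision = f_beta(precision=0.90, recall=0.80)  # ~0.818
```

Under plain F1 (beta = 1) the two points would score identically; F2 is what makes "missed spans are worse than over-redaction" show up in the metric.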

02 · Build the canonical tool. Tune for recall. Ship as a library.

Approach

Step 1: One model, all 24 EU languages, one tagging schema.

We worked with linguists to define a tagging schema mapped to GDPR Article 4(1) categories (names, IDs, contact data, location data, financial data, health data, etc.) that holds across all 24 official EU languages. One schema means one downstream contract for users of the library - no per-language branching in their preprocessing code.
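To make "one downstream contract" concrete, a tag set of this shape is what downstream code would branch on. The names below are our illustration of an Article 4(1)-aligned grouping, not the released schema:

```python
from enum import Enum

class PIITag(str, Enum):
    """Illustrative GDPR-aligned tag set (hypothetical names, not the
    released schema). One enum = one contract across all 24 languages."""
    NAME = "NAME"
    NATIONAL_ID = "NATIONAL_ID"   # PESEL, DNI, Steuernummer, ...
    CONTACT = "CONTACT"           # email, phone
    LOCATION = "LOCATION"         # address, city, postal code
    FINANCIAL = "FINANCIAL"       # IBAN, tax number
    HEALTH = "HEALTH"

def placeholder(tag: PIITag) -> str:
    """Downstream code keys on the tag, never on the language."""
    return f"[{tag.value}]"
```

The point of the single enum is that a Maltese national ID and a German one collapse to the same downstream label, so no consumer of the library ever writes per-language branching.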

Step 2: Per-country format awareness, not English-with-translation.

The naive workaround - train on English-heavy data and translate at inference - doesn't work for PII. National IDs and tax numbers are language- and country-specific by construction. The model is trained with country-specific format awareness for Polish PESEL, Spanish DNI, French SIREN, German Steuernummer, Italian Codice Fiscale, and the equivalent IDs in every other EU member state. Same model, same checkpoint - the format-specificity lives in the training data and the schema.
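To see why these formats are country-specific by construction: the Polish PESEL, for instance, is an 11-digit number whose last digit is a weighted checksum over the first ten - structure a generic NER model trained elsewhere has no reason to know. A standalone validator for just this one format (illustrative; the model learns formats from training data rather than running rules like this at inference):

```python
def pesel_checksum_ok(pesel: str) -> bool:
    """Validate the check digit of a Polish PESEL (11 digits).

    The first ten digits are weighted 1,3,7,9,1,3,7,9,1,3 and the
    check digit is (10 - weighted_sum % 10) % 10.
    """
    if len(pesel) != 11 or not pesel.isdigit():
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    total = sum(w * int(d) for w, d in zip(weights, pesel[:10]))
    return (10 - total % 10) % 10 == int(pesel[10])
```

A Spanish DNI (eight digits plus a mod-23 control letter) or a German Steuernummer follows entirely different rules, which is why per-country coverage can't be translated in from an English-heavy model.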

Step 3: Recall-tuned eval gates.

Training gates favored recall over precision. We benchmarked against the four standard public PII suites (OpenPII, Gretel, Nemotron-PII, Privy) plus an internal cross-lingual eval, and tracked per-language F2 (which weights recall higher than precision) as the primary metric. Per-language gates caught any single language regressing while the average improved - important for the less-resourced languages (Maltese, Estonian, Slovenian, Croatian) where the model is doing the most work.
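A gate of that shape is easy to sketch. The floor value and dict layout below are our illustration, not the actual eval harness:

```python
def passes_gates(per_lang_f2: dict,
                 previous: dict = None,
                 floor: float = 0.70) -> bool:
    """Every language must clear the floor, and none may regress
    against the previous checkpoint, even if the average improves."""
    if any(score < floor for score in per_lang_f2.values()):
        return False
    if previous is not None:
        return all(per_lang_f2[lang] >= previous[lang] for lang in previous)
    return True

# A checkpoint whose average went up can still fail: here Polish
# improved but Maltese slipped below both the floor and its prior score.
prev = {"pl": 0.86, "mt": 0.72}
cand = {"pl": 0.91, "mt": 0.69}   # average up, gate fails
```

The averaging trap this guards against is real: with 24 languages, a few high-resource languages can mask a regression on Maltese or Estonian unless each language is gated individually.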

Step 4: Drop-in library, permissive license.

The model ships as `bardsai/eu-pii-anonimization-multilang` on Hugging Face under a permissive license, with a thin Python wrapper exposing a `redact()` call that returns the redacted text plus span-level metadata for downstream auditing. One import replaces the hand-rolled pipeline; small enough to run inside a customer's perimeter so the data never has to leave for the redaction step. Preprocessing primitives shouldn't be vendor-locked.
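The span-to-redacted-text step can be sketched as follows. The spans here are hand-written stand-ins for what a Hugging Face token-classification pipeline over the model would emit (`start`, `end`, `entity_group` keys), and `apply_redactions` is our illustrative helper, not the library's actual `redact()`:

```python
def apply_redactions(text: str, spans: list) -> str:
    """Replace each detected span with its category tag. Working right
    to left keeps earlier character offsets valid as the text shrinks."""
    out = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:span["start"]] + f"[{span['entity_group']}]" + out[span["end"]:]
    return out

# Hand-written spans standing in for model output:
spans = [
    {"start": 8, "end": 20, "entity_group": "NAME"},
    {"start": 24, "end": 39, "entity_group": "EMAIL"},
]
redacted = apply_redactions("Contact Jan Kowalski at jan@example.com", spans)
# → "Contact [NAME] at [EMAIL]"
```

Returning the span list alongside the text is what makes downstream auditing possible: a compliance reviewer can check exactly which character ranges were removed and under which GDPR category.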

03 · Drop-in library. Seconds per dataset. In production at GDPR-bound teams.

Result

The model was benchmarked across the four standard public PII suites plus an internal cross-lingual eval. The headline numbers come out competitive across the board, with the strongest absolute score on Gretel and a recall profile that matches the actual job: missing a span is a privacy violation; over-redacting is recoverable. Hours of bespoke per-team preprocessing collapse to seconds per dataset.

  • 0.890 F2 on the Gretel PII benchmark - the strongest absolute single-benchmark result for the model.
  • 0.842 F2 on OpenPII, 0.715 on Nemotron-PII, 0.575 on Privy - competitive across all four public suites with a single multilingual model.
  • 0.826 average recall across the suite - the privacy-relevant metric, since false negatives leak and false positives only over-redact.
  • 24 EU official languages covered with one unified model and one GDPR-aware tagging schema - strongest gap on less-resourced languages (Maltese, Estonian, Slovenian, Croatian) where bespoke per-team labeling is the most impractical.
  • Hours of bespoke preprocessing collapse to seconds per dataset - one import call replaces the regex-and-NER stack, runs inside the customer's perimeter, no data leaves for the redaction step.
  • Published openly on Hugging Face (`bardsai/eu-pii-anonimization-multilang`) under a permissive license - used in production by GDPR-bound teams across the EU.

The library ships as the canonical preprocessing tool we point teams to when they're scoping a GDPR-bound text pipeline. The win is the seconds-per-dataset throughput once the preprocessing step stops being bespoke - that's what frees up the engineering time to focus on the actual model the customer cares about.

// Expert insight

Every team we worked with was hand-rolling their own PII pipeline before any LLM training could touch the data - regex, per-language NER, brittle in production. We built the library we wished existed: 24 EU languages, GDPR-aware tagging, recall-tuned for the privacy use case. One import call replaces hours of bespoke preprocessing.
Michał Pogoda-Rosikoń

Co-founder @ bards.ai

// Ready to ship?

Let's build something that delivers numbers like these.

Book a meeting