// Research / Computer Vision
Document AI & OCR pipelines that survive real scans.
OCR is solved on clean PDFs. Real intelligent document processing means layout-aware extraction, invoice OCR with line-item parsing, HTR on handwritten forms, multilingual scans, and pipelines that know when to ask a human. We've shipped all of it - including the engine behind EasyDocs.
// Why custom document AI
Cloud OCR breaks at the edges. We ship pipelines that don't.
01
Layout-aware, not just text
LayoutLMv3, Donut, and Nougat understand position, structure, and reading order - so tables stay tables, key-value pairs stay paired, and footnotes don't end up mid-paragraph.
02
On-prem when the data can't leave
Healthcare, legal, finance, and government often can't ship documents to a US-based API. We build pipelines that run inside your VPC, your data center, or air-gapped - same accuracy, no compliance fight.
03
Knows when to ask a human
Per-field confidence, calibrated thresholds, and a review queue that surfaces only the uncertain extractions. Your operators stop reviewing the 95% the model gets right.
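In code, the routing step can be as small as this sketch (the field names and the 0.9 cutoff are hypothetical - real systems calibrate a threshold per field):

```python
def route_extraction(fields, threshold=0.9):
    """Split extracted fields into auto-accepted and human-review sets.

    `fields` maps field name -> (value, confidence). Only fields below
    the calibrated threshold land in the review queue.
    """
    accepted, review = {}, {}
    for name, (value, conf) in fields.items():
        (accepted if conf >= threshold else review)[name] = value
    return accepted, review
```

With a 0.98-confidence total and a 0.62-confidence VAT ID, only the VAT ID reaches an operator - which is exactly how the review queue stays short.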
// Case Study
We trained EasyDocs' invoice extraction model
EasyDocs is the platform provider - they ship document management software to their own customers. We trained the fine-tuned NLP model that runs inside it, auto-extracting VAT numbers, totals, and addresses from invoices and learning from every user correction. Deployed on their servers, no external dependencies.
98%
field-level extraction accuracy
<300ms
inference time per invoice
On-prem
deployment with no external dependencies

// What we deliver
OCR, layout, structure, and the LLM glue.
We pick the right model at each stage of the pipeline rather than forcing one model to do everything badly.
OCR engines, picked per task
PaddleOCR, Surya, EasyOCR, and Tesseract baselines - chosen based on language, script, and image quality after benchmarking on your data.
- Surya for high-accuracy multilingual OCR including Polish
- PaddleOCR for printed-text speed and CJK scripts
- Custom-finetuned recognizers for handwriting and domain fonts
- Tesseract baselines as a sanity check, not a target
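The selection logic, reduced to a toy heuristic (the engine names are real, but this mapping is a simplified assumption - the actual choice comes from benchmarking on your data):

```python
def pick_ocr_engine(language, handwritten=False, born_digital=False):
    """Toy per-task OCR engine selection mirroring the list above.

    A real pipeline replaces this lookup with benchmark results
    measured on the customer's own documents.
    """
    if born_digital:
        return "pdf-text-layer"            # no OCR needed at all
    if handwritten:
        return "custom-finetuned-recognizer"
    if language in {"zh", "ja", "ko"}:
        return "paddleocr"                 # speed and CJK coverage
    return "surya"                         # multilingual default, incl. Polish
```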
Layout-aware understanding
Position matters. We use models that know where text sits on the page and how regions relate to each other.
- LayoutLMv3 for token classification and entity extraction
- Donut for end-to-end document understanding without OCR
- Nougat for academic papers, formulas, and multi-column scans
- Custom heads for forms, invoices, and proprietary templates
Tables, forms, and key-value pairs
The hard part of document AI. We treat tables and forms as first-class extraction targets, not afterthoughts.
- Camelot and pdfplumber for born-digital PDFs
- Hybrid CV + LLM pipelines for scanned and photographed tables
- Key-value extraction with schema validation
- Multi-page document linking and reading-order recovery
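Reading-order recovery for a single-column page can be sketched in a few lines (real documents need column and region detection first; `row_tol` is an assumed pixel tolerance for grouping words into lines):

```python
def reading_order(boxes, row_tol=10):
    """Recover naive reading order from (x, y, text) word boxes.

    Clusters boxes into rows by y-coordinate within `row_tol`,
    then sorts each row left-to-right. Single-column sketch only.
    """
    boxes = sorted(boxes, key=lambda b: b[1])  # top-to-bottom
    rows = []
    for x, y, text in boxes:
        if rows and abs(rows[-1][0][1] - y) <= row_tol:
            rows[-1].append((x, y, text))      # same visual line
        else:
            rows.append([(x, y, text)])        # new line
    ordered = []
    for row in rows:
        ordered.extend(t for _, _, t in sorted(row))  # left-to-right
    return ordered
```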
OCR + LLM hybrids
LLMs are great at normalization and bad at character recognition. We combine them to get the best of both.
- OCR output as grounded context for LLM extraction
- Schema-constrained generation with retry and validation
- Per-field confidence aggregation across CV and LLM signals
- Open-weight LLMs (Qwen-VL, InternVL) for on-prem deployment
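The validate-and-retry loop looks roughly like this (the two-field schema and the `llm_call` callable are placeholders - any model behind any interface fits):

```python
import json

SCHEMA_FIELDS = {"vat_number": str, "total": float}  # hypothetical invoice schema

def validate(payload):
    """Check that all required keys exist with the expected types."""
    return all(isinstance(payload.get(k), t) for k, t in SCHEMA_FIELDS.items())

def extract_with_retry(llm_call, ocr_text, max_attempts=3):
    """Ask an LLM (any callable: prompt -> JSON string) to extract fields
    from OCR output, re-prompting until the result validates."""
    prompt = f"Extract {list(SCHEMA_FIELDS)} as JSON from:\n{ocr_text}"
    for _ in range(max_attempts):
        try:
            payload = json.loads(llm_call(prompt))
        except json.JSONDecodeError:
            continue                    # malformed JSON: retry
        if validate(payload):
            return payload
    return None                         # escalate to human review
```

Returning `None` instead of a best guess is deliberate: a failed validation is a review-queue item, not an answer.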
Hardened for poor inputs
Real documents are crumpled, rotated, faxed, photographed, and partially redacted. The pipeline has to survive all of that.
- Deskew, dewarp, and binarization preprocessing
- Multi-resolution OCR for low-DPI scans
- Robustness benchmarks on synthetic degradations
- Polish and Eastern European script support out of the box
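As one concrete piece of that preprocessing stack, global binarization is often done with Otsu's method. A dependency-free sketch of the idea (production code would use OpenCV's `cv2.threshold` with the `THRESH_OTSU` flag):

```python
def otsu_threshold(pixels):
    """Global Otsu threshold: pick the cutoff that maximizes
    between-class variance of the grayscale histogram.

    `pixels` is a flat list of 0-255 intensities.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += hist[t]                 # background pixel count
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # foreground pixel count
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```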
Audit trail and HITL
Every extraction is traceable, every low-confidence field is reviewable, every change is logged.
- Field-level confidence and provenance back to source pixels
- Human-in-the-loop review queues with adjudication
- Audit logs suitable for SOC 2, HIPAA, and DORA
- Active learning: corrections feed back into retraining
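A field-level audit entry might look like this sketch (the record shape is illustrative, not EasyDocs' actual schema; the content hash makes later mutation of a log line detectable):

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(doc_id, field, value, confidence, bbox, corrected_by=None):
    """Build a tamper-evident audit entry for one extracted field.

    `bbox` is the source-pixel region (x0, y0, x1, y1) the value
    was read from - the provenance link back to the scan.
    """
    entry = {
        "doc_id": doc_id,
        "field": field,
        "value": value,
        "confidence": confidence,
        "bbox": list(bbox),
        "corrected_by": corrected_by,   # set when a human adjudicates
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    entry["sha256"] = digest
    return entry
```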

// Expert insight
“The biggest mistake in document AI is treating OCR as the whole problem. The real value is in the structure layer - tables, key-value pairs, reading order - and increasingly in the LLM that normalizes the output. OCR is just the first 30 percent.”
Norbert Ropiak
Co-founder @ bards.ai
// Why bards.ai
We built EasyDocs. We can build yours.
Production document AI in regulated industries. On-prem capable, multilingual, and audit-ready by default.
EasyDocs in production
Our document AI platform runs at 98% field-level accuracy in under 300ms per invoice - on-prem, GDPR-clean, and battle-tested in finance and the public sector.
Layout and OCR specialists
LayoutLMv3, Donut, Nougat, Surya - we've trained, fine-tuned, and shipped each in production for real customers.
Native Polish + Eastern European
Our team ships OCR and document models that handle Polish, Czech, Ukrainian, and other scripts that English-first vendors stumble on.
On-prem and air-gapped capable
Healthcare, legal, public sector. We've shipped document pipelines into facilities that block outbound traffic and require signed install bundles.
16+ open-source models on Hugging Face
We publish models as well as deploy them. Several of our OCR and embedding models are public, with 80K+ monthly downloads.
Senior team, no juniors
Every engineer has shipped document AI to paying customers. No ramp-up tax on your project.
// FAQ
Common questions about document AI pipelines
When is a hosted cloud OCR API enough?
They're great when your documents look like generic invoices and you're allowed to send data to a US/EU cloud API. They fall over on domain-specific layouts, multilingual content beyond their training mix, and any environment where data residency rules out cloud OCR. Custom pipelines also unlock per-field confidence calibration that hosted services hide.
Can the full pipeline run on-prem or air-gapped?
Yes - that's our default for regulated customers. The full stack (OCR, layout model, optional LLM) runs on customer-controlled GPUs with no outbound network calls. EasyDocs ships this way, as does most of our document work.
What accuracy can we realistically expect?
For structured forms with clean scans, 97-99% field-level accuracy is realistic. For semi-structured documents (invoices, contracts) with mixed quality, 92-97% with calibrated confidence and human review on uncertain fields. We always benchmark on a holdout of your real data before promising numbers.
Which languages and scripts do you support?
Latin scripts including Polish, Czech, Hungarian, German, French, Spanish, Portuguese; Cyrillic for Ukrainian and Russian; CJK via PaddleOCR and Surya. Domain-specific language (legal Latin, medical jargon, abbreviations) usually needs targeted fine-tuning, which we handle.
How long or complex can the documents be?
We've handled everything from one-page forms to 500-page legal contracts with cross-references, tables that span pages, footnotes, and inline annotations. The architecture changes - Donut for short forms, hierarchical models for long contracts - but the framing stays the same.
What does the output look like, and how does it integrate?
Output is structured JSON validated against a schema you define, delivered via REST/gRPC or pushed to a queue/database. We've integrated with SAP, Salesforce, custom ERPs, and document management systems. Every field carries confidence and provenance for downstream policy.
What happens with poor-quality scans?
Preprocessing (deskew, dewarp, denoise, binarization) plus multi-resolution OCR plus an LLM verification pass that catches OCR errors via context. For fundamentally unreadable inputs we surface them through the human-review queue rather than guessing - silent guesses are worse than asking.
// Let's ship it
Stop hand-keying documents. Ship a pipeline that doesn't.
Send us a sample of your documents and your target accuracy. We'll come back with a benchmark on a holdout slice and a deployment plan, usually within a business day.

Norbert Ropiak
Co-founder @ bards.ai