// Research / Computer Vision

Document AI & OCR pipelines that survive real scans.

OCR is solved on clean PDFs. Real intelligent document processing is layout-aware extraction, invoice OCR with line-item parsing, HTR on handwritten forms, multilingual scans, and pipelines that know when to ask a human. We've shipped both - including the engine behind EasyDocs.

// What we see

Cloud OCR hits 95% on the vendor's demo docs. Yours aren't demo docs.

01

Hosted services fail quietly on your actual document mix

Azure Document Intelligence and Google Document AI quote accuracy against clean, structured invoices. Your actual mix includes faxed contracts, photographed receipts, forms with handwritten corrections, and documents in Polish or German that no US vendor optimizes for. The gap shows up in production, not in the PoC.

02

Table extraction fails exactly where the value is

Free-text fields OCR passably. Line-item tables in invoices, multi-column financial statements, and structured forms with cross-references are where the extraction actually matters - and where region-based OCR without layout understanding falls apart. The expensive fields are the ones that break.

03

The compliance team needs a data residency answer you can't give yet

Healthcare, legal, and government documents can't go to a US API. The typical answer - 'we'll figure out on-prem later' - becomes a 6-month retrofitting project when the security review lands. Architecture that's air-gap-ready from day one costs about the same as one that isn't.

// Case Study

We trained EasyDocs' invoice extraction model

EasyDocs is the platform provider - they ship document management software to their own customers. We trained the fine-tuned NLP model that runs inside it, auto-extracting VAT numbers, totals, and addresses from invoices and learning from every user correction. Deployed on their servers, no external dependencies.

  • 98%

    field-level extraction accuracy

  • <300ms

    inference time per invoice

  • On-prem

    deployment with no external dependencies

Read the case study

// What we deliver

OCR, layout, structure, and the LLM glue.

We pick the right model at each stage of the pipeline rather than forcing one model to do everything badly.

OCR engines, picked per task

PaddleOCR, Surya, Docling, EasyOCR, and Tesseract baselines - chosen based on language, script, and image quality after benchmarking on your data, not by habit.

  • Surya for high-accuracy multilingual OCR including Polish and Eastern European scripts
  • PaddleOCR for printed-text speed and CJK scripts
  • Docling (IBM DS4SD) for high-fidelity PDF-to-structured-output with native table preservation - strong baseline for clean born-digital PDFs
  • Custom-finetuned recognizers for handwriting and domain-specific fonts
  • Tesseract baselines as a sanity check, not a target

Layout-aware understanding

Position matters. We use models that know where text sits on the page and how regions relate to each other.

  • LayoutLMv3 for token classification and entity extraction
  • Donut for end-to-end document understanding without OCR
  • Nougat for academic papers, formulas, and multi-column scans
  • Custom heads for forms, invoices, and proprietary templates

Tables, forms, and KV

The hard part of document AI. We treat tables and forms as first-class extraction targets, not afterthoughts.

  • Camelot and pdfplumber for born-digital PDFs
  • Hybrid CV + LLM pipelines for scanned and photographed tables
  • Key-value extraction with schema validation
  • Multi-page document linking and reading-order recovery
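
Reading-order recovery is easiest to picture on a single-column page. The sketch below is a naive heuristic, not the production approach: it groups OCR boxes into lines by vertical centre, then reads each line left to right. The `Box` type, the `reading_order` function, and the `line_tolerance` parameter are illustrative names, and multi-column or rotated pages need a real layout model.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One OCR text region with pixel coordinates (hypothetical shape)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(boxes: list[Box], line_tolerance: float = 0.5) -> list[Box]:
    """Group boxes into lines by vertical centre, then read left to right."""
    lines: list[list[Box]] = []
    for box in sorted(boxes, key=lambda b: (b.y0 + b.y1) / 2):
        centre = (box.y0 + box.y1) / 2
        if lines:
            ref = lines[-1][0]  # first box of the current line
            ref_centre = (ref.y0 + ref.y1) / 2
            ref_height = ref.y1 - ref.y0
            # Same line if the centres differ by less than a fraction of the line height.
            if abs(centre - ref_centre) <= line_tolerance * ref_height:
                lines[-1].append(box)
                continue
        lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b.x0)]


scrambled = [
    Box("1230.00", 200, 102, 260, 122),
    Box("Invoice", 10, 10, 80, 30),
    Box("Total", 10, 100, 60, 120),
    Box("No. 7", 100, 12, 150, 32),
]
print([b.text for b in reading_order(scrambled)])
```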

OCR + LLM hybrids

LLMs excel at normalization and schema enforcement; they're poor at raw character recognition. We combine them - OCR anchors the extraction, LLM verifies and normalizes - so you get the accuracy of specialized OCR with the flexibility of language understanding.

  • OCR output as grounded context for LLM extraction - avoids hallucination on structured fields
  • Schema-constrained generation with retry and validation loops
  • Direct vision LLM pipeline (Gemini 2.5 Flash, GPT-4o, Claude) for clean born-digital PDFs where OCR is overkill
  • Per-field confidence aggregation across CV and LLM signals
  • Open-weight VLMs (Qwen-VL, InternVL, Mistral Pixtral) for on-prem deployment
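
A minimal sketch of the retry-and-validation loop, under stated assumptions: `call_llm` is a hypothetical stand-in for whichever model client is used, and the `SCHEMA` fields and the VAT pattern are illustrative only, not a real validation rule for any country. The idea is that a rejected answer goes back to the model together with the list of violations.

```python
import json
import re
from typing import Callable

# Hypothetical schema: field name -> validator. Illustrative, not a customer schema.
SCHEMA: dict[str, Callable[[object], bool]] = {
    "vat_number": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Z]{2}[0-9A-Z]{8,12}", v) is not None,
    "total": lambda v: isinstance(v, (int, float))
    and not isinstance(v, bool)
    and v >= 0,
    "currency": lambda v: v in {"EUR", "PLN", "USD"},
}


def validate(payload: dict) -> list[str]:
    """Return human-readable schema violations (an empty list means valid)."""
    errors = [f"missing field: {name}" for name in SCHEMA if name not in payload]
    errors += [
        f"invalid value for {name}: {payload[name]!r}"
        for name, check in SCHEMA.items()
        if name in payload and not check(payload[name])
    ]
    return errors


def extract_with_retry(
    ocr_text: str, call_llm: Callable[[str], str], max_attempts: int = 3
) -> dict:
    """Ask the model for JSON, validate it, and feed violations back on retry."""
    prompt = f"Extract {sorted(SCHEMA)} as JSON using only this OCR text:\n{ocr_text}"
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"not valid JSON: {exc.msg}"]
        else:
            if isinstance(payload, dict):
                errors = validate(payload)
            else:
                errors = ["top level must be a JSON object"]
            if not errors:
                return payload
        prompt += "\nPrevious answer was rejected: " + "; ".join(errors)
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {errors}")
```

When every attempt fails, the function raises instead of returning a best guess, so the caller can send the document to human review.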

Hardened for poor inputs

Real documents are crumpled, rotated, faxed, photographed, and partially redacted. The pipeline has to survive all of that.

  • Deskew, dewarp, and binarization preprocessing
  • Multi-resolution OCR for low-DPI scans
  • Robustness benchmarks on synthetic degradations
  • Polish and Eastern European script support out of the box
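
To make the binarization step concrete, here is a dependency-free sketch of Otsu's method, which picks the grey-level threshold that best separates ink from paper. Otsu is our choice for illustration; the list above only says "binarization", and a real pipeline would normally use an image library and often an adaptive (local) threshold for unevenly lit photos. The function names are illustrative.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: list[int]) -> list[int]:
    """Map grey levels to 0 (ink) or 255 (paper) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]
```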

Audit trail and HITL

Every extraction is traceable, every low-confidence field is reviewable, every change is logged.

  • Field-level confidence and provenance back to source pixels
  • Human-in-loop review queues with adjudication
  • Audit logs suitable for SOC 2, HIPAA, and DORA
  • Active learning: corrections feed back into retraining
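
The four points above fit together roughly as in the sketch below. It is an illustration under assumptions, not the shipped design: `ExtractedField`, `AuditLog`, `route`, and `apply_correction` are hypothetical names, the log is an in-memory list rather than durable storage, and treating a human-corrected value as confidence 1.0 is a simplification.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float                # 0.0-1.0
    page: int                        # provenance: page index
    bbox: tuple[int, int, int, int]  # provenance: x0, y0, x1, y1 in source pixels


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, event: str, item: ExtractedField, **details) -> None:
        """Append one timestamped entry that points back to the source region."""
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "field": item.name,
            "page": item.page,
            "bbox": item.bbox,
            **details,
        })


def route(fields: list[ExtractedField], threshold: float, log: AuditLog):
    """Split fields into auto-accepted and human-review lists, logging both."""
    accepted, review_queue = [], []
    for item in fields:
        if item.confidence >= threshold:
            accepted.append(item)
            log.record("auto_accepted", item, confidence=item.confidence)
        else:
            review_queue.append(item)
            log.record("sent_to_review", item, confidence=item.confidence)
    return accepted, review_queue


def apply_correction(item, new_value, reviewer, log, training_set):
    """Log the operator's change and keep the pair as a retraining example."""
    log.record("corrected", item, reviewer=reviewer, old=item.value, new=new_value)
    training_set.append((item, new_value))
    return replace(item, value=new_value, confidence=1.0)
```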

// Method fit

Custom pipelines earn their cost over hosted APIs at the edges.

skip it if

  • Your documents are English, well-structured, and clean

    Azure Document Intelligence, Google Document AI, and Amazon Textract handle the common case well - fast to integrate, cheap to start, and good enough for standard invoice or receipt parsing in English. Custom is for the edges: multilingual, regulated, specialized layouts, or on-prem required.

  • You need a fast answer at low volume

    For prototyping or low-volume internal tooling, a vision LLM (Gemini 2.5 Flash, GPT-4o, Claude) with direct PDF or image input is the fastest path. At low volume the cost is acceptable and accuracy on clean docs is surprisingly good. Custom pipelines make sense once volume, degraded inputs, or data residency requirements rule that path out.

  • Your volume is below roughly 10K documents per month

    The fixed engineering cost of a custom pipeline isn't justified by the per-document savings at low volume. Hosted APIs plus a human review queue cover most low-volume regulated cases at acceptable cost.

use it if

Your document mix includes non-English text, domain-specific identifiers, or layouts that hosted services don't cover - and the accuracy gap is showing up as errors your operators hand-correct.

Data residency rules out sending documents to a US API, or your security team requires extraction to run inside your VPC or air-gapped environment.

Volume is high enough that per-page API cost compounds: at 100K pages per month, the difference between $0.015/page (custom on-prem) and $0.10/page (hosted API) is about $102K per year.

You need field-level confidence, active learning from operator corrections, and an audit trail for SOC 2, HIPAA, or DORA - not just a structured JSON response.
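
The arithmetic behind the per-page comparison above, as a small sketch. It covers only the per-page gap; the fixed engineering cost of a custom pipeline is not included, and the function name is illustrative. Prices are passed as strings so `Decimal` keeps them exact.

```python
from decimal import Decimal


def annual_cost_gap(pages_per_month: int, hosted_per_page: str, custom_per_page: str) -> Decimal:
    """Yearly difference in per-page spend between two options."""
    per_page_gap = Decimal(hosted_per_page) - Decimal(custom_per_page)
    return per_page_gap * pages_per_month * 12


gap = annual_cost_gap(100_000, "0.10", "0.015")
print(f"${gap:,.0f} per year")  # prints "$102,000 per year"
```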

// How we work

Benchmark on your docs first. Architecture decided after.

Every engagement starts with a held-out sample of your actual documents and a baseline measurement. The pipeline design follows the numbers - we don't recommend a layout model or an OCR engine before we've seen your data.

01

Holdout benchmark on your real documents

A stratified sample of your document types - best-case, typical, and worst-case inputs. We run your current state (hosted API or existing pipeline), our baseline model candidates, and the hybrid OCR + LLM approach against the same ground truth. You see per-field F1, per-document-type breakdown, and the latency and cost profile of each option before a line of integration code is written.
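
For readers who want to see what "per-field F1" means in practice, a minimal sketch follows. It assumes exact string match after whatever normalization you apply upstream, and it counts a wrong value as both a false positive and a false negative, which is one common convention rather than the only one. The function name and the dict-per-document input shape are assumptions.

```python
from collections import defaultdict


def per_field_f1(predictions: list[dict], ground_truth: list[dict]) -> dict[str, float]:
    """Compute F1 per field name over aligned lists of documents."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, truth in zip(predictions, ground_truth):
        for name in pred.keys() | truth.keys():
            p, t = pred.get(name), truth.get(name)
            if p is not None and p == t:
                counts[name]["tp"] += 1
            else:
                if p is not None:  # extracted something that is wrong or spurious
                    counts[name]["fp"] += 1
                if t is not None:  # a true value was missed or got wrong
                    counts[name]["fn"] += 1
    scores = {}
    for name, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[name] = 2 * c["tp"] / denom if denom else 0.0
    return scores
```

Running the same function over each document type separately gives the per-document-type breakdown.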

02

Build the pipeline in your environment

OCR engine selected per document type, not by default. Layout model chosen based on what your documents actually look like - LayoutLMv3 for token classification, Donut for end-to-end understanding, Nougat for academic and multi-column content. LLM verification layer where schema enforcement and normalization add accuracy over raw OCR. All work runs against your real documents, inside your infrastructure if on-prem is the requirement.

03

Hand off the confidence thresholds and the retraining loop

Per-field confidence thresholds calibrated to your operators' review capacity - surface the 5% of extractions your team needs to check, not 50%. Active learning loop wired so operator corrections feed back into model retraining. Audit trail per field, per document, per extraction run. Runbook for adding new document types and tuning thresholds as your document mix evolves.
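
One simple way to tie a threshold to review capacity, sketched under assumptions: take the confidence scores for a field on a validation set and pick the cut-off so that at most a chosen fraction falls below it. This sizes the queue only; it does not check that the fields above the threshold are actually correct, which needs the labelled holdout. The function names are illustrative.

```python
import math


def review_threshold(confidences: list[float], review_fraction: float) -> float:
    """Pick a threshold so at most `review_fraction` of fields score below it."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    if not 0.0 <= review_fraction <= 1.0:
        raise ValueError("review_fraction must be between 0 and 1")
    ranked = sorted(confidences)
    budget = int(len(ranked) * review_fraction)  # fields operators can review
    if budget >= len(ranked):
        return math.inf  # capacity covers everything: review it all
    return ranked[budget]


def needs_review(confidence: float, threshold: float) -> bool:
    """A field goes to the review queue when it scores below the threshold."""
    return confidence < threshold
```

With 100 validation scores and `review_fraction=0.05`, at most the five lowest-scoring fields fall below the returned threshold; tied scores can make the queue smaller, never larger.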


// Expert insight

The biggest mistake in document AI is treating OCR as the whole problem. The real value is in the structure layer - tables, key-value pairs, reading order - and increasingly in the LLM that normalizes the output. OCR is just the first 30 percent.

Norbert Ropiak

Co-founder @ bards.ai

See our open-source work

// Why bards.ai

We built the engine behind EasyDocs. We can build yours.

Production document AI in regulated industries. On-prem capable, multilingual, and audit-ready by default.

EasyDocs in production

The extraction model we trained for EasyDocs runs at 98% field-level accuracy in under 300ms per invoice - on-prem, GDPR-clean, and battle-tested in finance and public sector.

Layout and OCR specialists

LayoutLMv3, Donut, Nougat, Surya - we've trained, fine-tuned, and shipped each in production for real customers.

Native Polish + Eastern European

Our team ships OCR and document models that handle Polish, Czech, Ukrainian, and other scripts that English-first vendors stumble on.

On-prem and air-gapped capable

Healthcare, legal, public sector. We've shipped document pipelines into facilities that block outbound traffic and require signed install bundles.

16+ open-source models on Hugging Face

We publish models as well as deploy them. Several of our OCR and embedding models are public, with 80K+ monthly downloads.

Senior team, no juniors

Every engineer has shipped document AI to paying customers. No ramp-up tax on your project.

// FAQ

Common questions about document AI pipelines

Hosted OCR services are great when your documents look like generic invoices and you're allowed to send data to a US/EU cloud API. They fall over on domain-specific layouts, multilingual content beyond their training mix, and any environment where data residency rules out cloud OCR. Custom pipelines also unlock per-field confidence calibration that hosted services hide.

For clean, born-digital PDFs at low volume, sending documents straight to a vision LLM is often the right call - fast to ship, surprisingly accurate, no infra to run. It breaks down at scale (cost), on degraded inputs (hallucination from low-quality scans), on long multi-page documents (context window pressure distorts structure), and in any environment where the document can't leave your perimeter. Our typical recommendation: vision LLM for low-volume clean-doc prototyping, custom pipeline for production at scale or in regulated environments.

Yes, the pipeline can run fully on-prem - that's our default for regulated customers. The full stack (OCR, layout model, optional LLM) runs on customer-controlled GPUs with no outbound network calls. EasyDocs ships this way and so does most of our document work.

For structured forms with clean scans, 97-99% field-level accuracy is realistic. For semi-structured documents (invoices, contracts) with mixed quality, 92-97% with calibrated confidence and human review on uncertain fields. We always benchmark on a holdout of your real data before promising numbers.

We cover Latin scripts including Polish, Czech, Hungarian, German, French, Spanish, and Portuguese; Cyrillic for Ukrainian and Russian; and CJK via PaddleOCR and Surya. Domain-specific language (legal Latin, medical jargon, abbreviations) usually needs targeted fine-tuning, which we handle.

We've handled everything from one-page forms to 500-page legal contracts with cross-references, tables that span pages, footnotes, and inline annotations. The architecture changes - Donut for short forms, hierarchical models for long contracts - but the framing stays the same.

Output is structured JSON validated against a schema you define, delivered via REST/gRPC or pushed to a queue/database. We've integrated with SAP, Salesforce, custom ERPs, and document management systems. Every field carries confidence and provenance for downstream policy.

For low-quality scans we combine preprocessing (deskew, dewarp, denoise, binarization), multi-resolution OCR, and an LLM verification pass that catches OCR errors via context. Fundamentally unreadable inputs go to the human-review queue rather than being guessed at - silent guesses are worse than asking.

// Let's ship it

Stop hand-keying documents. Ship a pipeline that doesn't.

Send us a sample of your documents and your target accuracy. We'll come back with a benchmark on a holdout slice and a deployment plan, usually within a business day.


Norbert Ropiak

Co-founder @ bards.ai