Why off-the-shelf OCR fails for regulated documents
Every mid-sized dental chain, insurance brokerage, and accounting firm has tried an OCR tool at some point. It reaches 80% extraction accuracy on a clean sample, someone demos it to the ops VP, and then it dies in production.
The reason is simple: 80% accuracy means that, at best, one in five documents contains an error. For a marketing use case that may be acceptable. For EOBs, claim forms, or patient intake, one wrong value means one wrong bill, one wrong insurance reimbursement, or one mismatched patient record. The ops team cannot trust the output, so they check every document manually, and the tool becomes an expensive piece of middleware.
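If the 80% figure is measured per field rather than per document, the picture is worse, because a document is only usable when every field on it is right. A rough back-of-the-envelope sketch, assuming field errors are independent and a document carries nine fields (both are simplifying assumptions chosen for illustration, not measurements):

```python
def fully_correct_rate(field_accuracy: float, n_fields: int) -> float:
    """Share of documents with every field right, assuming independent field errors."""
    return field_accuracy ** n_fields


# Compare a few per-field accuracy levels for a hypothetical nine-field document.
for acc in (0.80, 0.95, 0.99):
    rate = fully_correct_rate(acc, 9)
    print(f"{acc:.0%} per field: {rate:.1%} of nine-field documents fully correct")
```

Under these assumptions, 80% per-field accuracy leaves only around 13% of documents completely correct, which is why manual checking creeps back in.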
The private LLM + HITL pattern
The architecture that works has three components, wired together so that AI does the speed work and humans do the judgement work.
- Private LLM on EU infrastructure (Hetzner or on-prem). No US data transfer, no third-party vendor contract to negotiate with the DPO.
- Structured extraction pipeline — the LLM returns JSON matching your document schema, not free text. Every field has a confidence score and a source-region reference back to the original document.
- HITL routing — high-confidence fields auto-populate. Low-confidence or high-value fields land in a human queue, matched to the right reviewer by document type and amount.
Nothing reaches a billing system, a patient record, or a regulator-facing export without a human signoff on the fields that matter. Nothing.
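A minimal sketch of how the structured output and the confidence split might fit together. The JSON shape, the field names, the `bbox` source-region format, and the 0.95 threshold are all assumptions made for illustration; a real pipeline would define them in its own document schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction step
    page: int          # source-region reference: page number ...
    bbox: tuple        # ... and bounding box (x0, y0, x1, y1) on that page


def parse_fields(raw_json: str) -> list[ExtractedField]:
    """Parse the model's JSON reply; a missing key raises KeyError rather than passing silently."""
    payload = json.loads(raw_json)
    return [
        ExtractedField(
            name=item["name"],
            value=item["value"],
            confidence=float(item["confidence"]),
            page=int(item["page"]),
            bbox=tuple(item["bbox"]),
        )
        for item in payload["fields"]
    ]


def split_for_review(fields, always_review=frozenset(), threshold=0.95):
    """Return (auto_populate, human_queue).

    Fields named in always_review go to a human no matter how confident the model is.
    """
    auto, queue = [], []
    for f in fields:
        if f.name in always_review or f.confidence < threshold:
            queue.append(f)
        else:
            auto.append(f)
    return auto, queue


# Hypothetical model reply for three fields of one document.
SAMPLE_REPLY = json.dumps({"fields": [
    {"name": "claim_number", "value": "CLM-001", "confidence": 0.99,
     "page": 1, "bbox": [40, 120, 210, 138]},
    {"name": "payment_amount", "value": "182.40", "confidence": 0.97,
     "page": 1, "bbox": [400, 310, 470, 328]},
    {"name": "service_date", "value": "2024-03-07", "confidence": 0.81,
     "page": 1, "bbox": [40, 160, 130, 178]},
]})

auto, queue = split_for_review(parse_fields(SAMPLE_REPLY), always_review={"payment_amount"})
print("auto-populate:", [f.name for f in auto])
print("human queue:  ", [f.name for f in queue])
```

Here `payment_amount` is sent to a reviewer despite its high confidence, because it is listed as a field that always needs signoff; `service_date` is queued because its confidence falls below the threshold.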
The audit-trail schema regulators actually want
When a DPA inspector or internal compliance officer asks for evidence, "we have a database of AI outputs" is not an answer. They want a CSV they can trace. At minimum, your schema should capture:
- Document ID + hash + S3 / storage path of the original.
- Extraction timestamp + model name + model version.
- Every field extracted with its confidence score.
- Human reviewer ID + review timestamp + decision (approve / override / reject).
- If overridden: the original AI value and the human-entered value, side by side.
- Downstream system writes triggered by approval (billing record ID, patient record ID, claim ID).
Export that as CSV and the inspector walks out in an hour instead of a week. The schema is also your insurance when a mistake happens — you can prove exactly where it went wrong.
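One way to lay this out is a flat table with one row per extracted field, so that every value can be traced on its own. The sketch below is an assumption about column names and layout, not a regulator-mandated format; the `AuditRow` class and its fields are hypothetical.

```python
import csv
import hashlib
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class AuditRow:
    document_id: str
    document_sha256: str       # hash of the original file
    storage_path: str          # where the original is kept
    extracted_at: str          # ISO 8601 timestamp
    model_name: str
    model_version: str
    field_name: str
    ai_value: str
    confidence: float
    reviewer_id: str
    reviewed_at: str           # ISO 8601 timestamp
    decision: str              # "approve", "override" or "reject"
    human_value: str           # filled only when decision == "override"
    downstream_record_id: str  # billing, patient or claim record written on approval


def sha256_hex(data: bytes) -> str:
    """Hash the original document bytes so later tampering is detectable."""
    return hashlib.sha256(data).hexdigest()


def to_csv(rows: list[AuditRow]) -> str:
    """Serialise audit rows to CSV text, with a header row taken from the dataclass."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(AuditRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()
```

Keeping the AI value and the human value in adjacent columns of the same row is what makes an override visible at a glance; a separate "corrections" table would force the inspector to join data by hand.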
GDPR posture — what your DPO will ask about
For regulated document processing, the defensible legal basis is usually a combination of Article 6(1)(b) (contract performance: the patient or client agreed to the service) and Article 6(1)(f) (legitimate interest in efficient back-office operations). Special-category data (health records, in the dental case) requires an additional Article 9(2) condition, most often 9(2)(h), which covers the provision and management of health care.
- EU hosting only — document in your RoPA entry with the specific region.
- Explicit consent boundary — anything touching marketing (not clinical or claim processing) needs its own consent trail.
- Data processing agreement signed with every AI vendor. For private LLM deployments on your own infrastructure, this reduces to the agreement with your existing hosting provider.
- Retention policy: how long you keep the extraction + audit record after the underlying document is closed.
Sample EOB schema + routing rules
For a typical EOB (Explanation of Benefits), the extraction schema looks roughly like this: patient identifier, claim number, service date, procedure codes, billed amount, allowed amount, patient responsibility, payment amount, and denial reasons if any. Each field is a separate extraction with its own confidence score.
Routing rules that work in practice:
- Billed amount under €100 + all fields > 95% confidence → auto-approve.
- Billed amount €100–€500 → junior reviewer queue, 4-hour SLA.
- Billed amount over €500 OR any denial → senior reviewer queue, same-day SLA.
- Any field under 70% confidence → mandatory human review of that field, regardless of amount.
- Any new provider or new CPT code not seen before → quarantine queue, ops manager reviews and extends the allow-list.
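The rules above can be sketched as a single routing function. The list does not say which rule wins when several apply, or where a sub-€100 document with a field between 70% and 95% confidence should go, so the precedence used here (quarantine first, then senior, then junior) and the fallback to the junior queue are assumptions. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum


class Queue(Enum):
    AUTO_APPROVE = "auto_approve"
    JUNIOR = "junior_review"    # 4-hour SLA
    SENIOR = "senior_review"    # same-day SLA
    QUARANTINE = "quarantine"   # ops manager reviews and extends the allow-list


@dataclass
class EobExtraction:
    billed_amount: Decimal        # in EUR
    confidences: dict             # field name -> confidence, 0.0 to 1.0
    provider_id: str
    procedure_codes: list
    denial_reasons: list = field(default_factory=list)


def must_review_fields(eob: EobExtraction) -> list:
    """Fields under 70% confidence: a human has to check these whatever the amount."""
    return [name for name, conf in eob.confidences.items() if conf < 0.70]


def route(eob: EobExtraction, known_providers: set, known_codes: set) -> Queue:
    # Assumed precedence: unseen providers or codes are quarantined before anything else.
    if eob.provider_id not in known_providers:
        return Queue.QUARANTINE
    if any(code not in known_codes for code in eob.procedure_codes):
        return Queue.QUARANTINE
    if eob.billed_amount > Decimal("500") or eob.denial_reasons:
        return Queue.SENIOR
    if eob.billed_amount >= Decimal("100"):
        return Queue.JUNIOR
    # Under EUR 100: auto-approve only if every field clears 95% confidence.
    if eob.confidences and all(c > 0.95 for c in eob.confidences.values()):
        return Queue.AUTO_APPROVE
    # Assumed fallback: small amounts with any weaker field go to the junior queue.
    return Queue.JUNIOR
```

Because anything under 70% also fails the 95% check, a low-confidence field can never be auto-approved in this sketch; `must_review_fields` exists so the review screen can highlight exactly which values the reviewer has to confirm or correct.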
Real result: Tax-Fin-Lex
Same pattern, different regulated industry. Tax-Fin-Lex needed AI-native retrieval across 1.4 million Slovenian legal documents — court rulings, statutes, regulations, doctrine. Off-the-shelf AI failed: factual accuracy is non-negotiable in legal research, and EU/Slovenian compliance ruled out US-hosted LLMs.
We built a private retrieval system on Hetzner EU, with semantic search over the corpus and structured court analysis that always links back to source documents. HITL guard rails mean every result is verifiable before a lawyer relies on it. Practising Slovenian lawyers now query in plain language and get cited answers in seconds instead of hours.
Where to start
If you process 50+ regulated documents per day and answer to a DPA or regulator, start with a 3-day AI Opportunity Audit (€900). Money back if we do not identify €3,000+ in annual savings. We will tell you honestly whether the pipeline fits before you commit a cent to construction.