The brief was simple. The problem wasn't.
An insurance company was processing thousands of documents a day — policy forms, claims, endorsements, certificates of insurance, loss runs. Each document type had its own layout, its own quirks, and its own set of fields that mattered. The operations team was spending an average of 15 minutes per document, extracting data by hand and keying it into their systems.
The ask: automate this. Make it fast, make it accurate, and make it work with the 60+ document types they already had, plus whatever new ones showed up next quarter.
On paper, this is a solved problem. OCR plus some extraction logic. In practice, everything that could go wrong did.
Attempt 1: The template approach
The first instinct was templates. Map each document type to a fixed layout. Define regions — "the policy number is always in the top-right corner, 40px from the header." Parse the OCR output, extract from the right coordinates, done.
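A template rule in this style is just a fixed bounding box per field. Here's a minimal sketch of the idea — the field names, form name, and coordinates are illustrative, not the production rule set:

```python
# Template-based extraction sketch: each document type maps fields to
# fixed pixel regions. All names and coordinates here are illustrative.

# OCR output: list of (word, x, y) tuples with top-left coordinates.
OcrWord = tuple[str, int, int]

# Template: field name -> (x_min, y_min, x_max, y_max) region in pixels.
ACORD_25_TEMPLATE = {
    "policy_number": (600, 40, 780, 80),   # "always" top-right corner
    "insured_name": (40, 120, 400, 160),
}

def extract_with_template(words: list[OcrWord], template: dict) -> dict:
    """Collect OCR words whose coordinates fall inside each field's region."""
    result = {}
    for field, (x0, y0, x1, y1) in template.items():
        hits = [w for w, x, y in words if x0 <= x <= x1 and y0 <= y <= y1]
        result[field] = " ".join(hits) or None
    return result

words = [("POL-123456", 620, 55), ("Acme", 60, 130), ("Corp", 120, 130)]
print(extract_with_template(words, ACORD_25_TEMPLATE))
```

The failure mode is visible in the structure: a shifted header or a few degrees of skew moves every coordinate, and the lookup silently returns the wrong text rather than erroring.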
This worked for about three document types. Then we got a batch from a different carrier whose "same" form had a slightly different layout. Then another carrier whose ACORD form was a scanned copy of a photocopy, tilted 3 degrees. Then a handwritten endorsement.
Templates are brittle. They break on every variation they haven't seen, and in insurance, variation is the entire game. We burned three weeks on this before I called it.
Attempt 2: The ML-everything approach
Swing the pendulum. Throw a model at it. We tried fine-tuning a LayoutLM variant on our document corpus. The idea: the model learns spatial relationships, understands where fields tend to appear, handles variation gracefully.
The model was decent in testing — 89% accuracy on a clean held-out set. In production, it collapsed:
- Scanned documents with noise dropped accuracy to the 60s
- New document types that weren't in the training set were a coin flip
- Retraining cycles took days, and every new carrier meant new annotations
- Confidence scores were unreliable — the model was confidently wrong too often
Pure ML was overkill for structured documents and not nearly robust enough for the messy ones. We needed something in between.
What actually worked: a pipeline, not a model
The breakthrough was treating this as an engineering problem with ML components, not an ML problem with engineering around it. The final architecture had four stages:
1. Intake and normalization
Before any intelligence runs, clean the input. Deskew scanned pages. Normalize resolution. Convert everything to a consistent format. Run OCR with confidence scores per word. This step alone fixed 30% of the accuracy issues — garbage in was the single biggest problem.
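The key move in this stage is treating per-word OCR confidence as a first-class signal instead of trusting the text wholesale. A minimal sketch of that idea (the threshold and type names are illustrative; real deskewing and OCR would come from an imaging library and an OCR engine):

```python
from dataclasses import dataclass

# Illustrative OCR output: word text plus the engine's per-word confidence.
@dataclass
class OcrWord:
    text: str
    confidence: float  # 0.0 to 1.0

def filter_low_confidence(words, threshold=0.6):
    """Split OCR output into trusted words and words needing review.

    Downstream extraction only sees trusted words; suspect words are
    surfaced to the review queue instead of being keyed in as-is.
    """
    trusted = [w for w in words if w.confidence >= threshold]
    suspect = [w for w in words if w.confidence < threshold]
    return trusted, suspect

words = [OcrWord("Policy", 0.98), OcrWord("N0.", 0.41), OcrWord("123456", 0.93)]
trusted, suspect = filter_low_confidence(words)
```

Carrying confidence through the pipeline like this is what later makes the validation stage's thresholds meaningful.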
2. Classification
A lightweight classifier — not a heavyweight transformer — to bucket each document into its type. This used a combination of keyword heuristics and a small trained model. Classification accuracy hit 99.2% because document types have strong textual signals. You don't need a billion parameters to distinguish a loss run from a certificate of insurance.
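The two-stage shape of the classifier can be sketched in a few lines — cheap keyword rules first, the small model only when no rule fires. The keywords and type names below are illustrative, not the production rule set:

```python
# Two-stage classifier sketch: keyword heuristics first, a small trained
# model as fallback. Keywords and document types are illustrative.

KEYWORD_RULES = {
    "loss_run": ["loss run", "claims history"],
    "certificate_of_insurance": ["certificate of liability", "acord 25"],
    "endorsement": ["this endorsement changes the policy"],
}

def classify(text: str, model_predict=None) -> str:
    """Return a document type, trying keyword rules before the model."""
    lowered = text.lower()
    for doc_type, keywords in KEYWORD_RULES.items():
        if any(kw in lowered for kw in keywords):
            return doc_type
    # No keyword hit: fall back to the small trained model
    # (e.g. logistic regression over tf-idf features).
    if model_predict is not None:
        return model_predict(text)
    return "unknown"

print(classify("ACORD 25 CERTIFICATE OF LIABILITY INSURANCE"))
```

Because most documents announce their type in the header text, the model fallback only has to handle the ambiguous tail.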
3. Extraction
Here's where it gets interesting. Instead of one model for all document types, we used a hybrid approach:
- Structured documents (known layouts): rule-based extraction with coordinate zones, but with fuzzy matching to handle layout drift
- Semi-structured documents (variable layouts): LLM-based extraction using few-shot prompting with document-type-specific examples
- Unstructured documents (free text, handwritten notes): LLM extraction with human-in-the-loop validation flagged by default
This tiered approach meant we weren't paying LLM latency and cost for documents that could be handled by simple rules, and we weren't trying to force rules onto documents that needed understanding.
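The "rules with fuzzy matching" tier deserves a concrete sketch, since it's what made coordinate zones survive layout drift: instead of trusting fixed pixels, locate a field's label in the OCR text even when the scan garbles it slightly, then take the value next to it. Labels and helper names here are illustrative, using the standard library's `difflib`:

```python
import difflib

# Fuzzy label matching sketch: find a field label in noisy OCR output,
# then grab the value that follows it. Labels are illustrative.

def find_label(ocr_words: list[str], label: str, cutoff: float = 0.8):
    """Return the index of the OCR word best matching `label`, or None."""
    matches = difflib.get_close_matches(label, ocr_words, n=1, cutoff=cutoff)
    return ocr_words.index(matches[0]) if matches else None

def extract_after_label(ocr_words: list[str], label: str):
    """Grab the token immediately following a fuzzily matched label."""
    idx = find_label(ocr_words, label)
    if idx is not None and idx + 1 < len(ocr_words):
        return ocr_words[idx + 1]
    return None

# A noisy scan turns "POLICY#" into "P0LICY#", but it still matches.
words = ["INSURED", "Acme", "P0LICY#", "GL-123456"]
print(extract_after_label(words, "POLICY#"))  # GL-123456
```

The same routing logic then decides per document type whether this tier, the few-shot LLM tier, or the always-reviewed tier handles extraction.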
4. Validation and human-in-the-loop
Every extraction result came with a confidence score. Above 95%: auto-approved. Between 80% and 95%: flagged for quick human review, with the extracted values pre-filled. Below 80%: sent to manual processing with the OCR text highlighted.
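The routing itself is deliberately boring. A sketch with the thresholds from above (queue names are illustrative):

```python
# Confidence-based routing sketch. Thresholds match the text above;
# the queue names are illustrative.

def route_extraction(confidence: float) -> str:
    """Map an extraction confidence score to a processing queue."""
    if confidence >= 0.95:
        return "auto_approve"        # straight into the system of record
    if confidence >= 0.80:
        return "quick_review"        # human confirms pre-filled values
    return "manual_processing"       # OCR text highlighted for full re-key

print(route_extraction(0.97))  # auto_approve
```

Keeping this logic trivial is the point: the thresholds are the one knob operations can tune without touching any model.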
The key insight: the human-in-the-loop wasn't a fallback, it was a feature. It made the system trustworthy from day one, before accuracy was perfect. And the review data fed back into improving the extraction rules and few-shot examples over time.
The numbers
Six months in:
- Average processing time: 30 seconds per document (down from 15 minutes)
- Extraction accuracy: 98.1% across all document types
- Auto-approval rate: 76% of documents needed zero human touch
- Daily volume: 3,000+ documents with a team that used to cap out at 400
What I'd tell you if you're building something similar
Don't start with the model. Start with the data pipeline. Clean input is worth more than a clever architecture. I've seen teams spend months fine-tuning extractors when the real problem was that their OCR was garbage.
Classify first, then extract. A document type is the strongest prior you have. Once you know what you're looking at, extraction gets dramatically easier. Most teams try to extract first and classify implicitly. That's backwards.
Tier your approach. Not every document needs an LLM. Not every document can be handled by rules. Design for the spectrum.
Build the human loop from day one. Don't treat human review as an admission of failure. Treat it as a data flywheel. Every correction makes the system better, and having it in place means you can ship with 85% accuracy and improve to 98% in production — instead of never shipping because you're chasing 99% in the lab.
The platform is still running. Last I checked, the operations team had forgotten what the old process was like. That's the best compliment an infrastructure project can get.