The brief was simple. The problem wasn't.

An insurance company was processing thousands of documents a day — policy forms, claims, endorsements, certificates of insurance, loss runs. Each document type had its own layout, its own quirks, and its own set of fields that mattered. The operations team was spending an average of 15 minutes per document, extracting data by hand and keying it into their systems.

The ask: automate this. Make it fast, make it accurate, and make it work with the 60+ document types they already had, plus whatever new ones showed up next quarter.

On paper, this is a solved problem. OCR plus some extraction logic. In practice, everything that could go wrong did.

Attempt 1: The template approach

The first instinct was templates. Map each document type to a fixed layout. Define regions — "the policy number is always in the top-right corner, 40px from the header." Parse the OCR output, extract from the right coordinates, done.
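The template idea can be sketched in a few lines — assuming OCR output as word positions, with illustrative field names and coordinates (not the real templates):

```python
# Minimal sketch of template-based extraction: collect whatever OCR text
# falls inside a hard-coded region for each field. Field names and
# coordinates here are illustrative only.

def extract_by_template(words, template):
    """words: list of (text, x, y) OCR word positions.
    template: {field_name: (x_min, y_min, x_max, y_max)} pixel regions."""
    result = {}
    for field, (x0, y0, x1, y1) in template.items():
        hits = [t for t, x, y in words if x0 <= x <= x1 and y0 <= y <= y1]
        result[field] = " ".join(hits) if hits else None
    return result

# "The policy number is always in the top-right corner" — until it isn't.
ACORD_TEMPLATE = {"policy_number": (480, 40, 612, 80)}

words = [("GL-2024-0117", 495, 55), ("Certificate", 100, 55)]
extract_by_template(words, ACORD_TEMPLATE)
```

The failure mode is visible in the code: a skewed scan or a carrier's slightly shifted layout moves words outside the box, and every field silently comes back `None`.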

This worked for about three document types. Then we got a batch from a different carrier whose "same" form had a slightly different layout. Then another carrier whose ACORD form was a scanned copy of a photocopy, tilted 3 degrees. Then a handwritten endorsement.

Templates are brittle. They break on every variation they haven't seen, and in insurance, variation is the entire game. We burned three weeks on this before I called it.

Attempt 2: The ML-everything approach

Swing the pendulum. Throw a model at it. We tried fine-tuning a LayoutLM variant on our document corpus. The idea: the model learns spatial relationships, understands where fields tend to appear, handles variation gracefully.

The model was decent in testing — 89% accuracy on a clean held-out set. In production, it collapsed.

Pure ML was overkill for structured documents and underkill for the messy ones. We needed something in between.

What actually worked: a pipeline, not a model

The breakthrough was treating this as an engineering problem with ML components, not an ML problem with engineering around it. The final architecture had four stages:

1. Intake and normalization

Before any intelligence runs, clean the input. Deskew scanned pages. Normalize resolution. Convert everything to a consistent format. Run OCR with confidence scores per word. This step alone fixed 30% of the accuracy issues — garbage in was the single biggest problem.
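One piece of that step — keeping only OCR words the engine was actually confident about — can be sketched as follows, assuming OCR output in a pytesseract-style dict shape (the threshold value is illustrative):

```python
# Sketch of the intake step's confidence filter. Deskewing and resolution
# normalization happen upstream; this pass drops low-confidence words so
# downstream extraction treats them as gaps rather than data.

def filter_ocr_words(ocr, min_conf=60.0):
    """ocr: {"text": [...], "conf": [...]} per-word OCR output.
    Returns (word, confidence) pairs above the threshold."""
    kept = []
    for text, conf in zip(ocr["text"], ocr["conf"]):
        if text.strip() and float(conf) >= min_conf:
            kept.append((text, float(conf)))
    return kept

ocr = {"text": ["POLICY", "NO.", "GL-2024-0117", "%#@"],
       "conf": [96.0, 91.5, 88.0, 12.0]}
filter_ocr_words(ocr)  # the 12%-confidence smudge is dropped
```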

2. Classification

A lightweight classifier — not a heavyweight transformer — to bucket each document into its type. This used a combination of keyword heuristics and a small trained model. Classification accuracy hit 99.2% because document types have strong textual signals. You don't need a billion parameters to distinguish a loss run from a certificate of insurance.
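The keyword half of that classifier might look like this — a hedged sketch where the keyword lists are illustrative, not the production set, and anything below the hit threshold defers to the small trained model:

```python
# Sketch of keyword-heuristic classification: score each document type
# by strong phrase signals in the OCR text. Keyword lists are examples.

KEYWORDS = {
    "certificate_of_insurance": ["certificate of liability", "acord 25"],
    "loss_run": ["loss run", "claims history", "incurred"],
    "endorsement": ["endorsement", "this endorsement changes the policy"],
}

def classify_by_keywords(text, min_hits=1):
    text = text.lower()
    scores = {dtype: sum(kw in text for kw in kws)
              for dtype, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No clear textual signal: fall through to the trained model.
    return best if scores[best] >= min_hits else None

classify_by_keywords("ACORD 25 Certificate of Liability Insurance")
# -> "certificate_of_insurance"
```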

3. Extraction

Here's where it gets interesting. Instead of one model for all document types, we used a hybrid approach: deterministic extraction rules for document types with stable, structured layouts, and an LLM with few-shot examples for the messy, variable ones.

This tiered approach meant we weren't paying LLM latency and cost for documents that could be handled by simple rules, and we weren't forcing rules onto documents that needed genuine understanding.
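The routing itself is simple once classification has done its job. A sketch, where the rule table and the `llm_extract` stub are hypothetical stand-ins:

```python
# Sketch of the tiered extraction router: types with reliable layouts get
# regex/rule extractors; everything else goes to the LLM path.
import re

RULE_EXTRACTORS = {
    "certificate_of_insurance": {
        "policy_number": re.compile(
            r"policy\s*(?:no\.?|number)[:\s]*([A-Z]{2}-\d{4}-\d{4})", re.I),
    },
}

def llm_extract(doc_type, text):
    """Placeholder for the few-shot LLM path (not implemented here)."""
    return {"_route": "llm"}

def extract(doc_type, text):
    rules = RULE_EXTRACTORS.get(doc_type)
    if rules is None:
        return llm_extract(doc_type, text)  # messy types: pay for understanding
    out = {}
    for field, pattern in rules.items():
        m = pattern.search(text)
        out[field] = m.group(1) if m else None
    return out

extract("certificate_of_insurance", "Policy Number: GL-2024-0117")
extract("handwritten_endorsement", "...")  # routed to the LLM tier
```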

4. Validation and human-in-the-loop

Every extraction result came with a confidence score. Above 95%: auto-approved. Between 80% and 95%: flagged for quick human review, with the extracted values pre-filled. Below 80%: sent to manual processing with the OCR text highlighted.
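The thresholds above amount to a three-way routing function; a minimal sketch (queue names are illustrative):

```python
# Sketch of the validation routing: confidence decides whether a result
# is auto-approved, pre-filled for quick review, or handed to an operator.

def route(result, confidence):
    if confidence >= 0.95:
        return ("auto_approve", result)    # straight into the system of record
    if confidence >= 0.80:
        return ("human_review", result)    # values pre-filled, quick confirm
    return ("manual_processing", result)   # OCR text highlighted for the operator

route({"policy_number": "GL-2024-0117"}, 0.97)  # -> auto_approve queue
route({"policy_number": "GL-2O24-O117"}, 0.83)  # -> human_review queue
```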

The key insight: the human-in-the-loop wasn't a fallback, it was a feature. It made the system trustworthy from day one, before accuracy was perfect. And the review data fed back into improving the extraction rules and few-shot examples over time.

The numbers

Six months in:

What I'd tell you if you're building something similar

Don't start with the model. Start with the data pipeline. Clean input is worth more than a clever architecture. I've seen teams spend months fine-tuning extractors when the real problem was that their OCR was garbage.

Classify first, then extract. A document type is the strongest prior you have. Once you know what you're looking at, extraction gets dramatically easier. Most teams try to extract first and classify implicitly. That's backwards.

Tier your approach. Not every document needs an LLM. Not every document can be handled by rules. Design for the spectrum.

Build the human loop from day one. Don't treat human review as an admission of failure. Treat it as a data flywheel. Every correction makes the system better, and having it in place means you can ship with 85% accuracy and improve to 98% in production — instead of never shipping because you're chasing 99% in the lab.


The platform is still running. Last I checked, the operations team had forgotten what the old process was like. That's the best compliment an infrastructure project can get.