Contract review pipeline with 0.4% hallucination rate
Built an evaluated extraction system over a clause-level test set. Replaced a stalled internal effort and now runs unattended in production.
Where they were
A legal SaaS company shipped a contract review feature that customers loved in demos and didn’t trust in production. The system extracted ~40 fields from uploaded contracts — payment terms, jurisdiction, termination clauses, liability caps — and a measurable share of those extractions were wrong in ways that weren’t obvious.
There was no eval set. There was no per-field accuracy number. There was a Slack channel where account executives forwarded customer complaints to engineering, and a quarterly internal meeting where someone described the system as “directionally correct.”
For a legal product, that is not a viable position.
What we changed
We built the test set first. Six weeks, end to end: a team of contract reviewers labelled 1,200 contracts at clause level, producing a ground-truth set of ~48,000 extractions tied to specific spans of source text. The data quality work alone was unglamorous and expensive; it is also the only reason the rest of the project worked.
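A ground-truth record of this shape, where each labelled extraction is tied to a specific span of source text, might look like the following sketch. The field names and schema here are illustrative assumptions, not the team's actual data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroundTruthExtraction:
    """One clause-level label tied to a span of source text.
    Schema and names are hypothetical, for illustration only."""
    contract_id: str
    field: str          # e.g. "governing_law", "liability_cap"
    value: str          # the reviewer's labelled answer
    span_start: int     # character offset of the supporting text
    span_end: int
    reviewer_id: str    # who labelled it, for adjudication

label = GroundTruthExtraction(
    contract_id="c-0001",
    field="governing_law",
    value="State of Delaware",
    span_start=10452,
    span_end=10521,
    reviewer_id="rev-07",
)
```

Tying every label to a span is what makes the later hallucination check possible: an extraction can be scored not just as right or wrong, but as supported or unsupported by the cited text.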
With the eval in place, we could measure. Field-by-field accuracy ranged from 71% (force majeure interpretations) to 98% (governing law). We rewrote the extraction system as a per-field ensemble: cheap structured extraction for high-accuracy fields, model-based with span-level citation for the rest, and a hallucination check that compares every extraction against retrieved source text.
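The per-field routing can be sketched roughly as below. The regex, field names, and the model-path stub are assumptions for illustration; the grounding check shown (cited span must contain the extracted value) is one simple way to implement the idea:

```python
import re

def regex_governing_law(text: str):
    """Cheap structured extraction for a high-accuracy field
    (hypothetical pattern, not the production extractor)."""
    m = re.search(r"governed by the laws of ([A-Za-z ]+?)[.,]", text)
    if m:
        return m.group(1), (m.start(1), m.end(1))
    return None, None

def model_extract(field: str, text: str):
    # Placeholder for the model-based path with span-level citation.
    raise NotImplementedError

def extract_field(field: str, text: str) -> dict:
    # Route: cheap structured extraction for high-accuracy fields,
    # model-based extraction with citation for the rest.
    if field == "governing_law":
        value, span = regex_governing_law(text)
    else:
        value, span = model_extract(field, text)
    # Hallucination check: the cited span must support the value.
    grounded = (
        value is not None
        and value.lower() in text[span[0]:span[1]].lower()
    )
    return {"field": field, "value": value, "span": span, "grounded": grounded}

result = extract_field(
    "governing_law",
    "This Agreement shall be governed by the laws of Delaware, as amended.",
)
```

An extraction that fails the grounding check is suppressed or flagged rather than shown to the customer, which is what drives the hallucination rate down.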
The 0.4% number — citation-grounded hallucination rate on the held-out test set — is what gets reported to enterprise legal teams during procurement. They ask for it specifically now.
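Computing a citation-grounded hallucination rate over a held-out set reduces to a simple loop, along these lines (record shape and names are assumed, matching the sketch above rather than the real pipeline):

```python
def hallucination_rate(extractions, source_texts) -> float:
    """Share of extractions whose value is not supported by the
    cited span in its source document. Illustrative only."""
    flagged = 0
    for ex in extractions:
        text = source_texts[ex["contract_id"]]
        cited = text[ex["span"][0]:ex["span"][1]]
        if ex["value"].lower() not in cited.lower():
            flagged += 1
    return flagged / len(extractions) if extractions else 0.0

# Tiny worked example: one grounded extraction, one hallucinated.
rate = hallucination_rate(
    [
        {"contract_id": "c1", "value": "Delaware", "span": (15, 23)},
        {"contract_id": "c1", "value": "New York", "span": (15, 23)},
    ],
    {"c1": "Governing law: Delaware. Cap: $1,000,000."},
)
```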
What it costs to maintain
Two engineers, half-time, mostly maintaining the eval set as new contract types come in. The model layer is monitored continuously; per-field regressions trigger a Slack alert and block the deploy.
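A deploy gate of this kind can be as simple as comparing per-field accuracy against the last released numbers and failing on any drop beyond a tolerance. The threshold and field names below are assumptions, not the production configuration:

```python
TOLERANCE = 0.01  # assumed tolerance, not the production value

def check_regressions(current: dict, baseline: dict) -> list:
    """Return fields whose eval accuracy dropped more than TOLERANCE."""
    return [
        field
        for field, acc in current.items()
        if acc < baseline.get(field, 0.0) - TOLERANCE
    ]

regressed = check_regressions(
    {"governing_law": 0.98, "force_majeure": 0.68},
    {"governing_law": 0.98, "force_majeure": 0.71},
)
# A non-empty list would fire the Slack alert and block the deploy.
```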
Outcome
- 0.4% hallucination rate across all extractions, citation-grounded
- 94% field-weighted accuracy on the held-out set (no documented baseline existed before)
- Two enterprise legal-team customers signed, citing the eval methodology in their procurement notes
- The internal “directionally correct” framing has been retired