Contract review pipeline with 0.4% hallucination rate
Built an evaluated extraction system over a clause-level test set. Replaced a stalled internal effort and now runs unattended in production.
Where they were
A legal SaaS company shipped a contract review feature that customers loved in demos and didn’t trust in production. The system extracted ~40 fields from uploaded contracts — payment terms, jurisdiction, termination clauses, liability caps — and a measurable share of those extractions were wrong in ways that weren’t obvious.
There was no eval set. There was no per-field accuracy number. There was a Slack channel where account executives forwarded customer complaints to engineering, and a quarterly internal meeting where someone described the system as “directionally correct.”
For a legal product, that is not a viable position.
What we changed
We built the test set first. Six weeks, end to end: a team of contract reviewers labelled 1,200 contracts at clause level, producing a ground-truth set of ~48,000 extractions tied to specific spans of source text. The data quality work alone was unglamorous and expensive; it is also the only reason the rest of the project worked.
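A ground-truth record of this shape, where each labelled extraction is tied to a specific span of source text, might look like the following sketch. The field names and schema here are illustrative assumptions, not the team's actual data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroundTruthExtraction:
    """One clause-level label tied to a span of source text.
    Schema and names are hypothetical, for illustration only."""
    contract_id: str
    field: str          # e.g. "governing_law", "liability_cap"
    value: str          # the reviewer's labelled answer
    span_start: int     # character offset of the supporting text
    span_end: int
    reviewer_id: str    # who labelled it, for adjudication

label = GroundTruthExtraction(
    contract_id="c-0001",
    field="governing_law",
    value="State of Delaware",
    span_start=10452,
    span_end=10521,
    reviewer_id="rev-07",
)
```

Tying every label to a span is what makes the later hallucination check possible: an extraction can be scored not just as right or wrong, but as supported or unsupported by the cited text.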
With the eval in place, we could measure. Field-by-field accuracy ranged from 71% (force majeure interpretations) to 98% (governing law). We rewrote the extraction system as a per-field ensemble: cheap structured extraction for high-accuracy fields, model-based with span-level citation for the rest, and a hallucination check that compares every extraction against retrieved source text.
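The per-field routing can be sketched roughly as below. The regex, field names, and the model-path stub are assumptions for illustration; the grounding check shown (cited span must contain the extracted value) is one simple way to implement the idea:

```python
import re

def regex_governing_law(text: str):
    """Cheap structured extraction for a high-accuracy field
    (hypothetical pattern, not the production extractor)."""
    m = re.search(r"governed by the laws of ([A-Za-z ]+?)[.,]", text)
    if m:
        return m.group(1), (m.start(1), m.end(1))
    return None, None

def model_extract(field: str, text: str):
    # Placeholder for the model-based path with span-level citation.
    raise NotImplementedError

def extract_field(field: str, text: str) -> dict:
    # Route: cheap structured extraction for high-accuracy fields,
    # model-based extraction with citation for the rest.
    if field == "governing_law":
        value, span = regex_governing_law(text)
    else:
        value, span = model_extract(field, text)
    # Hallucination check: the cited span must support the value.
    grounded = (
        value is not None
        and value.lower() in text[span[0]:span[1]].lower()
    )
    return {"field": field, "value": value, "span": span, "grounded": grounded}

result = extract_field(
    "governing_law",
    "This Agreement shall be governed by the laws of Delaware, as amended.",
)
```

An extraction that fails the grounding check is suppressed or flagged rather than shown to the customer, which is what drives the hallucination rate down.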
The 0.4% number — citation-grounded hallucination rate on the held-out test set — is what gets reported to enterprise legal teams during procurement. They ask for it specifically now.
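Computing a citation-grounded hallucination rate over a held-out set reduces to a simple loop, along these lines (record shape and names are assumed, matching the sketch above rather than the real pipeline):

```python
def hallucination_rate(extractions, source_texts) -> float:
    """Share of extractions whose value is not supported by the
    cited span in its source document. Illustrative only."""
    flagged = 0
    for ex in extractions:
        text = source_texts[ex["contract_id"]]
        cited = text[ex["span"][0]:ex["span"][1]]
        if ex["value"].lower() not in cited.lower():
            flagged += 1
    return flagged / len(extractions) if extractions else 0.0

# Tiny worked example: one grounded extraction, one hallucinated.
rate = hallucination_rate(
    [
        {"contract_id": "c1", "value": "Delaware", "span": (15, 23)},
        {"contract_id": "c1", "value": "New York", "span": (15, 23)},
    ],
    {"c1": "Governing law: Delaware. Cap: $1,000,000."},
)
```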
What it costs to maintain
Two engineers, half-time, mostly maintaining the eval set as new contract types come in. The model layer is monitored continuously; per-field regressions trigger a Slack alert and block the deploy.
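A deploy gate of this kind can be as simple as comparing per-field accuracy against the last released numbers and failing on any drop beyond a tolerance. The threshold and field names below are assumptions, not the production configuration:

```python
TOLERANCE = 0.01  # assumed tolerance, not the production value

def check_regressions(current: dict, baseline: dict) -> list:
    """Return fields whose eval accuracy dropped more than TOLERANCE."""
    return [
        field
        for field, acc in current.items()
        if acc < baseline.get(field, 0.0) - TOLERANCE
    ]

regressed = check_regressions(
    {"governing_law": 0.98, "force_majeure": 0.68},
    {"governing_law": 0.98, "force_majeure": 0.71},
)
# A non-empty list would fire the Slack alert and block the deploy.
```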
Outcome
- 0.4% hallucination rate across all extractions, citation-grounded
- 94% field-weighted accuracy on the held-out set (no documented baseline existed before)
- Two enterprise legal-team customers signed, citing the eval methodology in their procurement notes
- The internal “directionally correct” framing has been retired