# Contract review pipeline with 0.4% hallucination rate

- Industry: Legal SaaS
- Duration: 6 months
- Outcome: 0.4% hallucination rate
- Summary: Built an evaluated extraction system over a clause-level test set. Replaced a stalled internal effort and now runs unattended in production.

---

## Where they were

A legal SaaS company shipped a contract review feature that customers loved in demos and didn't trust in production. The system extracted ~40 fields from uploaded contracts — payment terms, jurisdiction, termination clauses, liability caps — and a measurable share of those extractions were wrong in ways that weren't obvious.

There was no eval set. There was no per-field accuracy number. There was a Slack channel where account executives forwarded customer complaints to engineering, and a quarterly internal meeting where someone described the system as "directionally correct."

For a legal product, that is not a viable position.

## What we changed

We built the test set first. Six weeks, end to end: a team of contract reviewers labelled 1,200 contracts at clause level, producing a ground-truth set of ~48,000 extractions tied to specific spans of source text. The data quality work alone was unglamorous and expensive; it is also the only reason the rest of the project worked.
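For a sense of what each record looks like, here is a minimal, span-anchored shape for one labelled extraction. The field names and types are illustrative, not the client's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabelledExtraction:
    """One ground-truth extraction, anchored to the clause and span it came from."""
    contract_id: str       # one of the ~1,200 labelled contracts
    field: str             # e.g. "governing_law", "liability_cap"
    value: str             # the reviewer-approved answer
    clause_id: str         # clause-level anchor within the contract
    span: tuple[int, int]  # character offsets of the supporting span in the clause text

# Roughly 1,200 contracts x ~40 fields ≈ 48,000 records of this shape.
```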

With the eval in place, we could measure. Field-by-field accuracy ranged from 71% (force majeure interpretations) to 98% (governing law). We rewrote the extraction system as a per-field ensemble: cheap structured extraction for high-accuracy fields, model-based with span-level citation for the rest, and a hallucination check that compares every extraction against retrieved source text.
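Sketched in code, the per-field routing looks roughly like this. The helpers, field set, pattern, and grounding heuristic are stand-ins for the real structured extractor, model call, and hallucination check, not the production implementation.

```python
import re
from dataclasses import dataclass

STRUCTURED_FIELDS = {"governing_law"}  # high-accuracy fields take the cheap path

@dataclass
class Extraction:
    field: str
    value: str | None
    span: tuple[int, int] | None  # character offsets into the contract text
    supported: bool               # did the grounding check pass?

def structured_extract(field: str, text: str) -> tuple[str | None, tuple[int, int] | None]:
    # Stand-in for the cheap structured extractor (one pattern per field).
    m = re.search(r"governed by the laws of ([^.,;]+)", text)
    return (m.group(1), m.span(1)) if m else (None, None)

def model_extract_with_citation(field: str, text: str) -> tuple[str | None, tuple[int, int] | None]:
    # Stand-in for the model-based extractor, which must return a source span.
    raise NotImplementedError("call the extraction model here")

def extract_field(field: str, text: str) -> Extraction:
    if field in STRUCTURED_FIELDS:
        value, span = structured_extract(field, text)
    else:
        value, span = model_extract_with_citation(field, text)
    # Grounding check: the value must appear in the text the citation points at.
    supported = bool(value and span and value.lower() in text[span[0]:span[1]].lower())
    return Extraction(field, value, span, supported)
```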

The 0.4% number — the citation-grounded hallucination rate on the held-out test set — is what gets reported to enterprise legal teams during procurement. They ask for it specifically now.
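As a rough formula, the rate is the share of emitted extractions whose cited span does not support the value. The snippet below reuses the illustrative `Extraction` record from the sketch above; it is not the exact reporting code.

```python
def hallucination_rate(extractions: list[Extraction]) -> float:
    """Share of emitted extractions whose value is not supported by its cited span."""
    emitted = [e for e in extractions if e.value is not None]
    unsupported = sum(1 for e in emitted if not e.supported)
    return unsupported / len(emitted) if emitted else 0.0
```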

## What it costs to maintain

Two engineers, half-time, mostly maintaining the eval set as new contract types come in. The model layer is monitored continuously; per-field regressions trigger a Slack alert and block the deploy.
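A deploy gate of that shape can be as simple as the sketch below; the baseline values, tolerance, and field names here are illustrative, not the production configuration.

```python
import sys

# Illustrative deploy gate; baselines, tolerance, and field names are assumptions.
BASELINE = {"governing_law": 0.99, "liability_cap": 0.93}  # per-field accuracy on the eval set
TOLERANCE = 0.02  # how far a field may drop before the deploy is blocked

def regressed_fields(current: dict[str, float]) -> list[str]:
    """Fields whose current eval accuracy fell more than TOLERANCE below baseline."""
    return [f for f, base in BASELINE.items() if current.get(f, 0.0) < base - TOLERANCE]

if __name__ == "__main__":
    current = {"governing_law": 0.99, "liability_cap": 0.88}  # latest eval run
    bad = regressed_fields(current)
    if bad:
        print(f"blocking deploy; regressed fields: {bad}")  # and post the Slack alert
        sys.exit(1)
```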

## Outcome

- 0.4% hallucination rate across all extractions, citation-grounded
- 94% field-weighted accuracy on the held-out set (up from a previous undocumented baseline)
- Two enterprise legal-team customers signed, citing the eval methodology in their procurement notes
- The internal "directionally correct" framing has been retired

---

## Navigation

- [Back to case studies](/case-studies)
- [Home](/)
- Markdown for agents: <https://veth.io/case-studies/contract-review.md>