← Back to Blog

Typed RAG Contract: Cutting AI Coding Cost by Making Hallucinations Impossible-by-Type

By Eric Bush · July 5, 2026 · 8 min read

A neat stack of legal contracts with a metal fountain pen resting on top, symbolising a strict schema

The Idea in One Sentence

This week's featured Towards Data Science essay argues that most RAG hallucination problems come from letting the model return prose at all. Instead of asking the model to "answer the question," the schema asks it to fill typed values: an Amount becomes a record with value, currency, and unit, not a string. Every claim is pinned to specific line ranges in the retrieved passages. Self-assessment fields let the model declare confidence and completeness, which the pipeline routes on.

The engineering argument is interesting. The cost argument is more important. Typed answers do not just reduce hallucination — they collapse three expensive categories of downstream work: retries, human-in-the-loop review, and stronger-model fallbacks. Any one of those changes moves the cost curve; combined, they can cut RAG bills 40-70% without touching the model.

Where Free-Text RAG Bleeds Money

A production RAG pipeline built on free-text answers has predictable failure modes. Each one costs money in a way that shows up on a different budget line:

  • Retries after regex or downstream parser fails. "The rent is $3,200 per month starting January 15" — is the amount 3200 or "3,200"? Is currency USD or unspecified? Is January 15 the start of the lease or the start of billing? Every parse failure re-runs the model, doubling or tripling per-answer cost.
  • Escalation to a stronger model. When your Haiku-tier pipeline can't reliably answer, teams instinctively swap to Opus-tier — a 12-25x price increase per answer.
  • Human review queues. Prose answers require prose review. A reviewer spends 30-90 seconds per answer at $50-100 fully-loaded per hour.
  • Post-hoc validation LLMs. Many teams add a second LLM to verify the first LLM's output. That doubles input tokens on every request.

Cost Line-by-Line: 10,000-Query RAG Workload

Consider a document-intelligence pipeline processing 10,000 queries per day, retrieving ~3,000 tokens of context per query and generating ~500 tokens per answer. Baseline model: Claude Haiku 4.5. Numbers below use rough per-query billing assumptions ($0.80/M input, $4/M output for Haiku):

Cost line Free-text RAG Typed contract RAG
Base inference (10k queries)$44$44
Parser retries (12% failure rate baseline)$5.30$0.90 (schema violation retries only)
Escalation to Opus (2% escalation rate)$120$40 (schema flags edge cases explicitly)
Post-hoc validator LLM$44$0 (validation happens in Python via provenance fields)
Human review (1% queue rate)$130$65 (typed values are faster to review)
Daily total$343$150

That is a 56% reduction on the same volume, same model, same data. The typed contract does not do anything magical — it just moves work from expensive branches (Opus escalation, human review) to cheap deterministic branches (Python-side comparison, provenance line-check).

The Big Idea: "Extract First, Compare in Python"

The single most cost-relevant rule in the essay is extract first, compare in Python. For any comparative question — "is amount A larger than amount B?", "does clause X contradict clause Y?" — the model's only job is to extract typed values from each passage. The comparison happens in deterministic code.

This matters for cost because comparison is exactly where prose answers hallucinate confidently. Ask a model "is 4.8% larger than 5.1%?" in prose and it will occasionally get the answer wrong depending on framing. Ask it to extract "4.8" and "5.1" as typed Percentage values and then let Python compare them, and it never gets it wrong. Every avoided hallucination is a retry, an escalation, or a human review you did not pay for.

Provenance as a Free Validator

The other cost lever is the provenance requirement — every extracted value points to specific line ranges in the retrieved passages. That means a downstream validator can check "did the model actually see this claim?" without re-reading the whole document or asking another LLM. Deterministic provenance validation is measured in microseconds and costs nothing per query beyond the CPU time. Replacing a $0.004 verifier-LLM call with a $0 provenance check across 10k queries per day is $40 back in your pocket.

Where It Breaks

Typed contracts are not free. Two costs are real:

  • Schema design engineering. Someone has to define Amount, Date, Percentage, Contract Term, and every domain-specific type. This is an upfront cost of one-two engineer-weeks for a medium-complexity domain.
  • Slightly higher output tokens per answer. A typed record with provenance fields is 30-60% longer than a raw prose answer. This offsets some of the retry savings — usually by 5-10% of total spend.

Both are one-time or small linear costs; they are worth taking on any pipeline processing more than ~500 queries per day. Below that volume, the schema engineering does not amortize.

How to Adopt Without Blowing Up Your Pipeline

  1. Start with your two most expensive query classes. Extraction and comparison usually dominate spend. Move those to typed contracts first.
  2. Instrument retry rate before and after. If retry rate does not drop by 60%+ after the schema, your schema is not tight enough — usually a missing enum or a string-typed field that should be structured.
  3. Use provenance fields as your primary validator. Delete the post-hoc validator LLM. If you cannot delete it, your provenance is not strong enough.
  4. Add "completeness" and "confidence" self-assessment fields. These let the pipeline route unclear cases to Opus or to a human — instead of always escalating on every answer.
  5. Compute completeness deterministically, do not just ask the model. Peek at parts of the document the model did not see. This catches truncated-list errors, where the model literally cannot know that more items existed.

Why This Beats Buying a Bigger Model

The default reaction to a hallucinating RAG pipeline is to buy a stronger model. Moving from Haiku to Sonnet cuts hallucination but multiplies cost 3-4x. Moving from Sonnet to Opus multiplies cost 6-8x. Typed contracts get most of the accuracy improvement of a model upgrade at roughly 5% of the added spend — because the fix is architectural, not model-tier.

For any team hitting reliability ceilings on Haiku- or Sonnet-tier RAG, the correct order of operations is: typed contract first, provenance validator second, extract-and-compare third, and only then, if you still have accuracy gaps, escalate to a bigger model. The essay is a good technical primer for the schema design. Treat this post as its cost-side companion.

Want to calculate exact costs for your project?

Frequently Asked Questions

What is a typed answer contract in RAG?

A schema that forces the LLM to return structured, typed fields (an Amount with value/currency/unit, a Date, a Percentage, etc.) instead of free-text prose. Each field is pinned to specific line ranges in the retrieved passages via provenance metadata, and self-assessment fields let the model declare confidence and completeness.

How much money does a typed RAG contract actually save?

In a representative 10k-query/day pipeline, replacing free-text answers with typed contracts cuts total daily cost from ~$343 to ~$150, roughly 56%. The savings come from lower retry rate, less escalation to premium models, no post-hoc validator LLM, and faster human review — not from cheaper base inference.

Is a typed RAG contract better than upgrading to a stronger model?

Almost always, for cost-conscious teams. Upgrading Haiku to Sonnet or Sonnet to Opus multiplies inference cost 3-8x, while typed contracts get most of the reliability improvement at roughly 5% of the added spend. Adopt schema-as-contract first; escalate the model tier only if reliability gaps remain.

What is the biggest hidden cost of moving to typed contracts?

One-time schema engineering: someone must design Amount, Date, Percentage, and domain-specific types (Contract Term, Case ID, Product SKU, etc.). Budget one to two engineer-weeks for a medium-complexity domain. On pipelines under ~500 queries per day, this cost does not amortize.

What is the 'extract first, compare in Python' rule?

For any comparative RAG query — 'is A larger than B?', 'does X contradict Y?' — the model's only job is to extract typed values. The comparison itself happens in deterministic Python code. This eliminates a class of hallucination that would otherwise force retries, escalations, or human review, saving both money and latency.