Mistral Leanstral 1.5: The Cost of AI Formal Verification vs Manual Security Audit

By Eric Bush · July 5, 2026 · 9 min read

What Mistral Just Released

On July 3, Mistral AI released Leanstral 1.5 under an Apache-2.0 license: a 119B-total, 6B-active-parameter model specifically trained for formal verification in Lean 4. The headline numbers are unusually crisp for a benchmark launch. It saturates miniF2F. It solves 587 of 672 PutnamBench problems. It hits state-of-the-art on FATE-H (87%) and FATE-X (34%). Most importantly for engineering teams, when Mistral pointed it at 57 open-source repositories, it uncovered five previously unknown bugs — including in Rust codebases, where formal verification tooling has historically been thin.

A free Hugging Face weight drop and a free API tier mean the marginal API cost of running Leanstral 1.5 is $0 today. That is the number worth staring at. For the first time, a serious formal-verification workflow can run at zero LLM cost per property, which changes the break-even math against manual audit engagements.

Cost Baseline: The Manual Audit

A formal security audit of a mid-sized Rust codebase (say, 30-50k lines) from a reputable shop like Trail of Bits or Sigma Prime typically runs $40,000-$120,000 for a 4-6 week engagement, delivering a report covering 20-60 specific properties. That works out to roughly $1,500-$3,000 per verified property, plus the engineering time to review the report, argue findings, and land fixes.

Now compare Leanstral 1.5 in an agent loop. On the Mistral team's own run (57 open-source repos, 5 bugs found), the ratio was 11.4 repos per bug. If we assume roughly 100-140 hours of human time in total to set up the harnesses and review outputs, or about 2-2.5 hours per repo, then at a fully loaded rate of $150/hour that is $15,000-$21,000 for 5 verified bugs, or $3,000-$4,200 per bug. On the surface, that looks worse than the audit shop, but the important detail is that the bugs Leanstral found were novel to already-audited codebases. The audit price does not include finding what the audit missed.
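The per-bug arithmetic above can be reproduced with a short Python sketch. The function name is illustrative, and the inputs are the article's own assumptions (100-140 total hours, $150/hour, 5 bugs across 57 repos), not measured data.

```python
def cost_per_finding(hours: float, hourly_rate: float, findings: int) -> float:
    """Fully loaded engineering cost divided by the number of confirmed findings."""
    if findings <= 0:
        raise ValueError("findings must be positive")
    return hours * hourly_rate / findings

# Assumed inputs, taken from the paragraph above.
REPOS, BUGS, RATE = 57, 5, 150

repos_per_bug = REPOS / BUGS
per_bug_low = cost_per_finding(100, RATE, BUGS)
per_bug_high = cost_per_finding(140, RATE, BUGS)

print(f"{repos_per_bug:.1f} repos per bug")
print(f"${per_bug_low:,.0f}-${per_bug_high:,.0f} per bug")
```

Swapping in your own hour estimate and loaded rate is the quickest way to see how sensitive the per-bug figure is to the harness effort.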

Where the Free API Actually Costs You

"Free API" is a marketing headline, not an operational reality. A serious verification workflow reveals four real cost lines:

Rate limits. Free tiers cap concurrent requests, which stretches a wall-clock timeline. A 500-property sweep that a paid API can chew through in 45 minutes may take 20-30 hours on the free tier.
Lean 4 harness engineering. Leanstral operates as a multi-turn prover that iterates against Lean compiler feedback. Getting that loop set up on an arbitrary codebase means writing translations from Rust invariants to Lean specifications — the biggest hidden cost.
Human triage of false leads. The model reports failed proof attempts as candidate bugs; a human still has to distinguish "your spec is wrong" from "your code is wrong."
GPU cost if you self-host. Running the 119B/6B active model at reasonable throughput needs one 80GB H100 or two 48GB L40S GPUs — roughly $2-4/hour on cloud pricing, or a fixed capex hit if on-prem.
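To make the harness and triage items concrete, here is a minimal Lean 4 sketch of what a hand-written specification might look like. The `clamp` function is a hypothetical stand-in for a Rust `fn clamp(x: u32, lo: u32, hi: u32) -> u32`, modeled over `Nat` and ignoring overflow. It is not taken from Mistral's harness, and it assumes a recent Lean 4 toolchain where the `omega` tactic is available.

```lean
-- Hypothetical Lean model of a Rust clamp function (an assumption for illustration).
def clamp (x lo hi : Nat) : Nat :=
  if x < lo then lo else if hi < x then hi else x

-- Specification: the result never exceeds `hi`, provided the bounds are ordered.
theorem clamp_le_hi (x lo hi : Nat) (h : lo ≤ hi) : clamp x lo hi ≤ hi := by
  unfold clamp
  split
  · exact h                    -- branch x < lo: result is lo, and lo ≤ hi by h
  · split
    · exact Nat.le_refl hi     -- branch hi < x: result is hi
    · omega                    -- branch ¬ hi < x: result is x, and x ≤ hi
```

If the hypothesis `h : lo ≤ hi` is dropped, the statement is false (take `x = 0`, `lo = 5`, `hi = 3`) and the proof attempt fails. Deciding whether that failure means the spec needs the precondition or the code needs a guard is exactly the triage work described above.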

Realistic Per-Property Cost With Leanstral 1.5

Assume a 40k-line Rust codebase with 30 formal properties you want to verify. Rough estimates:

Cost line	Manual audit	Leanstral 1.5 loop
External vendor fee	$60,000	$0
LLM API	$0	$0 (free) or $400 (paid tier throughput)
Engineer time (harness + triage)	$3,000	$8,000-12,000
Wall-clock	4-6 weeks	1-2 weeks
Cost per property	$2,100	$270-400

Roughly a 5-8x cost reduction, and a 3-4x wall-clock reduction. The comparison is not perfectly apples-to-apples — an audit shop delivers a reputation-backed report you can hand a regulator, and Leanstral delivers machine-checked proofs that an engineer still has to interpret — but for regression verification, invariant checking, and pre-audit hardening, the ratio is real.
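The table's bottom row follows from the rows above it. A minimal Python sketch of that calculation, using the table's figures and the free-tier API cost of $0 (the function name is illustrative):

```python
def cost_per_property(vendor_fee: float, api_cost: float,
                      engineer_cost: float, properties: int) -> float:
    """Sum the cost lines from the table and divide by the property count."""
    return (vendor_fee + api_cost + engineer_cost) / properties

PROPERTIES = 30  # property count assumed in the worked example

manual = cost_per_property(60_000, 0, 3_000, PROPERTIES)
loop_low = cost_per_property(0, 0, 8_000, PROPERTIES)    # low engineer-time estimate
loop_high = cost_per_property(0, 0, 12_000, PROPERTIES)  # high engineer-time estimate

print(f"manual audit: ${manual:,.0f} per property")
print(f"Leanstral loop: ${loop_low:,.0f}-${loop_high:,.0f} per property")
print(f"reduction: {manual / loop_high:.2f}x to {manual / loop_low:.2f}x")
```

Passing 400 as `api_cost` for the paid tier raises the upper bound to about $413 per property, which leaves the ratio at roughly 5x.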

The Bugs Leanstral Actually Found

Mistral was careful to note that the five bugs uncovered in the 57-repo sweep came from codebases that had received prior human review. That is the critical selling point for cost-conscious teams: the model is not competing with an unaudited codebase, it is competing with the residual bug population that survives audits. Two of the five bugs were in Rust projects — the industry's canonical "provably safe" language — and one involved an incorrect assumption about lifetime bounds that a human reviewer had signed off on.

This mirrors what agentic testing research (like Dan Luu's recent Galapagos notes) has been arguing: LLMs are not competitive with fuzzing on cold-start bug discovery, but they are extremely leveraged at closing the gap in properties humans failed to spec correctly.

How to Fold Leanstral 1.5 Into an Existing Budget

Pilot on a bounded module first. Pick a single 3-5k line library (a crypto primitive, a parser, a state machine) with well-understood invariants. Time-box the harness setup to one engineer-week.
Run Leanstral before commissioning a paid audit, not after. A pre-audit sweep either finds low-hanging bugs cheaply or gives the audit team a scoped starting point.
Budget for the harness, not the tokens. The Lean 4 translations are 70-80% of your true cost.
Self-host if free-tier throughput is your bottleneck. An H100 rental for a 5-day sweep costs ~$400; skipping the rate-limit wait is worth it for any project on a deadline.
Do not skip human review of proof attempts. A "verified" property is only as strong as the specification it was verified against.
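The self-hosting figure in the list above can be sanity-checked with a few lines of Python. This sketch assumes the GPU node runs around the clock for the whole sweep and uses the $2-4/hour cloud range quoted earlier; actual rental pricing varies by provider.

```python
HOURS_PER_DAY = 24

def rental_cost(days: float, hourly_rate: float) -> float:
    """Cost of keeping one rented GPU node running continuously."""
    return days * HOURS_PER_DAY * hourly_rate

sweep_low = rental_cost(5, 2.0)
sweep_high = rental_cost(5, 4.0)
print(f"5-day sweep: ${sweep_low:,.0f}-${sweep_high:,.0f}")
```

The ~$400 estimate sits inside that range, toward the upper end.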

The Meta-Point

For a year the argument against "LLMs for formal verification" has been that the domain is too rigorous, the proof search too specialized, and the model outputs too hallucination-prone to trust. Leanstral 1.5 collapses two of the three objections: proof search now works at benchmark-saturating levels, and the outputs are machine-checked by Lean rather than trusted directly. What remains is the harness engineering — which is a solvable, one-time cost per codebase.

Any team spending more than $40k/year on external audits should have a Leanstral pilot on the roadmap this quarter. The free tier gets you the proof-of-value at zero API cost; the moment it earns its keep, the paid tier pays for itself with the wall-clock reduction alone.

Frequently Asked Questions

How much does Leanstral 1.5 cost to use?

The Mistral-hosted API is currently free with rate limits, and the weights are Apache-2.0 licensed on Hugging Face. If you self-host, expect roughly $2-4/hour of GPU cost on cloud (one H100 or two L40S). The much larger real cost is engineer time to set up the Lean 4 harness for your codebase, typically $8,000-12,000 for a mid-sized repo.

Can Leanstral 1.5 replace a manual security audit?

Not entirely — an audit shop delivers a reputation-backed report and negotiates findings. Leanstral is best positioned as a pre-audit hardening sweep or as a continuous verification loop in CI, where the free tier plus a small engineering investment can drop cost-per-property from ~$2,100 in a manual audit to $270-400.

What kind of bugs does Leanstral 1.5 actually find?

In Mistral's own 57-repo evaluation, it uncovered five previously unknown bugs, including invariants missed in already-audited Rust projects. It is stronger at proving properties (or disproving them via failed proof attempts) than at cold-start bug hunting, where fuzzing still dominates.

Is Leanstral 1.5 competitive with commercial theorem provers?

For Lean 4-based formal verification, its miniF2F saturation and 87%/34% FATE-H/FATE-X scores match or exceed commercial closed-source systems reported in the last 12 months. The advantage of Leanstral is the open weights: teams can fine-tune on domain-specific spec languages, which is impossible with closed APIs.

What is the biggest hidden cost of adopting Leanstral 1.5?

The Lean 4 harness. Getting from a Rust or TypeScript codebase to a set of Lean specifications the model can iterate on typically consumes 70-80% of a project's actual budget. If your team has no Lean experience, plan for a 2-4 week ramp on top of the verification work itself.
