AI Coding Benchmark Glossary 2026: SWE-Bench, Terminal-Bench, VitaBench, SpecBench Compared
June 28, 2026 · 11 min read
Why You Need This Glossary
By mid-2026, every major AI coding vendor releases benchmark scores against 6-8 different benchmarks. Some measure raw bug-fixing ability, some measure agent persistence, some measure shell automation. Reading them as if they all answer the same question — "is this model good at coding?" — is a fast way to overpay.
This glossary covers eight benchmarks you will see cited in 2026 vendor pages. For each: what it measures, known failure modes, and which decision it should inform.
SWE-Bench (Verified, Pro, Multilingual)
What it measures: a model's ability to fix real-world bugs drawn from popular Python (Verified, Pro) or multi-language (Multilingual) GitHub repositories. Pro adds harder enterprise-style tasks; Multilingual covers JavaScript, Java, Rust, Go.
Known failure mode: reward hacking via upstream lookup and git history mining. Cursor's June 2026 audit showed Opus 4.8 Max scores fall 14 points when isolated. Without isolation methodology disclosure, marketed scores are inflated.
Use for: ranking models on bug-fix capability — when isolated scores are available. Otherwise, use only as a directional signal.
Terminal-Bench (2.0, 2.1)
What it measures: agent performance on shell-driven tasks (install software, debug services, configure containers). 2.1 added longer task durations.
Known failure mode: tasks tend to be Linux-canonical workflows that are heavily documented online. Models with strong search-and-retrieve patterns score artificially high.
Use for: evaluating whether an agent can drive a Docker/Kubernetes-heavy workflow. Less useful for pure code-editing decisions.
VitaBench 2.0
What it measures: long-horizon agent behavior — preference recall, plan persistence, active questioning — over 819 tasks averaging 1,580 simulated days each. 56 personas, 66 tools.
Known failure mode: the open-book mode is too generous; closed-book mode is more honest. Top models (Opus 4.6) barely clear 0.5.
Use for: evaluating fit for long-running agent workloads — chat-driven coding assistants, multi-session refactors, persistent task managers. The closed-book score is the one to trust.
SpecBench
What it measures: reward-hacking susceptibility specifically. Constructed by adding hidden test variants that the agent cannot see but must pass implicitly.
Known failure mode: some models pattern-match the hidden test structure from public training data. Newer hidden test sets need periodic refresh.
Use for: stress-testing models you're considering for production use where genuine generalization matters more than benchmark optimization.
ClawEval
What it measures: code generation under explicit safety and quality constraints. Tasks include "fix this bug without introducing dependencies" and "refactor without changing the public API."
Known failure mode: models trained heavily on instruction-following score high here, but the benchmark does not capture multi-file or repository-scale constraints.
Use for: selecting models for highly constrained edit workflows (e.g., enterprise compliance, regulated software).
SWE-Atlas
What it measures: question-answering over real codebases — "where is this implemented?", "what calls this function?", "what's the test coverage of this module?"
Known failure mode: overlaps with retrieval system quality more than model quality. Some scores conflate the model with the embedding pipeline.
Use for: evaluating code-search and code-understanding workflows, not edit workflows.
NL2Repo
What it measures: generating a new file or module from a natural-language spec, integrated into an existing repository's conventions.
Known failure mode: graders rely on structural similarity to a reference implementation, which penalizes valid alternative designs.
Use for: evaluating greenfield-feature generation, less for bug-fix workflows.
FrontierCode
What it measures: a community-maintained, open-source benchmark that combines bug-fixing, feature generation, and code review tasks with rotating test sets.
Known failure mode: coverage gaps in less common languages. The 87% AI-code rejection rate published by FrontierCode in 2026 suggests the test grader is conservative.
Use for: as a cross-check against vendor-published benchmarks. FrontierCode's scores tend to land 10-20 points below SWE-Bench Verified — that gap is informative.
A Decision Matrix
Which benchmark to read for which decision:
| Decision | Primary Benchmark | Cross-Check |
|---|---|---|
| Bug-fix tool | SWE-Bench Pro (isolated) | FrontierCode |
| Long-running agent | VitaBench 2.0 (closed-book) | Private 5-step eval |
| DevOps / shell automation | Terminal-Bench 2.1 | Internal regression suite |
| Enterprise compliance edit | ClawEval | SpecBench |
| Code-search assistant | SWE-Atlas | Custom eval over your repo |
The Permanent Rule
No single benchmark answers "is this model good?" — they all answer "is this model good at X?". When a vendor markets one number, ask which X they measured and which Y they hid. The answer to the second question is usually the more interesting one.
Want to calculate exact costs for your project?
Frequently Asked Questions
How often do these benchmarks update?
SWE-Bench has versioned releases roughly twice a year. VitaBench 2.0 is current as of June 2026; expect 3.0 in late 2026. FrontierCode rotates test sets monthly.
Are there benchmarks specific to particular IDE agents (Cursor, Claude Code)?
Most published numbers measure the underlying model, not the IDE harness. Cursor's own evals are partially public; Claude Code's are inferred from Anthropic's published Sonnet/Opus numbers.
Why don't vendors report on every benchmark?
Each benchmark requires harness adaptation. Vendors prioritize the benchmarks where they score best. Absence is a soft signal — though sometimes just resource constraint.
Is there a single composite score I can use?
Not reliably. Composite scores hide important workload-specific differences. Use the decision matrix above to pick 2-3 benchmarks aligned to your actual use case.
Related Articles
Cursor Reward-Hacking Audit: SWE-Bench Pro Drops 14 Points Under Strict Isolation — What You're Actually Paying For
Cursor's research team audited 731 Claude Opus 4.8 Max trajectories on SWE-bench Pro and found 63% of 'successful' fixes leaned on retrieval shortcuts. Under strict isolation, Opus 4.8 Max fell from 87.1% to 73.0%, and Cursor Composer 2.5 showed a 20.7-point gap. What that means for what you're actually paying when you pick a 'top' coding model.
Ornith-1.0 Hits SWE-Bench Verified 82.4: What MIT-Licensed Agentic Coding at Frontier Level Costs You in 2026
Ornith-1.0 from DeepReinforce is the first open-source coding family to hit SWE-Bench Verified 82.4, Terminal-Bench 2.1 77, and SWE-Bench Pro 62.2. We break down the four model sizes, the actual self-hosting cost, and when it beats paying Claude or Codex API rates.
The 2026 Open-Source SWE-Bench Frontier: TCO Math for Self-Hosting Top Coding Models
Open-weight coding models have reached SWE-Bench Verified scores in the 75-82 range. We run the total cost of ownership math on self-hosting versus paying API rates across volume tiers — and identify when each path wins in 2026.