Apple Research: 9 LLM Judges = 2 Independent Votes. Stop Paying for Redundant Judge Panels
June 24, 2026
The Counterintuitive Finding
Apple Machine Learning Research published a paper on June 24, 2026, titled "Nine Judges, Two Effective Votes: Correlated Errors Weaken LLM Judge Panels." The result is sharp: a 9-model judge panel evaluating LLM output does not provide 9 independent votes. Because the judges share training data and architectural priors, they tend to make the same mistakes on the same inputs, and the panel carries about as much information as 2 independent votes.
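One common way to reason about this kind of collapse is the equal-correlation "effective sample size" formula, n_eff = n / (1 + (n - 1) * rho). Whether the paper uses this exact estimator is an assumption here; the sketch below only shows what average pairwise error correlation would turn 9 judges into roughly 2 votes under that simplified model.

```python
def effective_votes(n_judges: int, rho: float) -> float:
    """Effective number of independent votes for n equally correlated judges.

    Assumes every pair of judges has the same error correlation rho
    (a simplification, not necessarily the paper's estimator).
    """
    return n_judges / (1 + (n_judges - 1) * rho)


def implied_correlation(n_judges: int, n_effective: float) -> float:
    """Invert the formula: the rho that yields n_effective votes."""
    return (n_judges / n_effective - 1) / (n_judges - 1)


rho = implied_correlation(9, 2.0)
print(f"implied average error correlation: {rho:.4f}")
print(f"effective votes at that rho: {effective_votes(9, rho):.2f}")
```

Under this model, an average pairwise error correlation of about 0.44 is enough to reduce 9 judges to 2 effective votes.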
For anyone running LLM-as-judge in production (eval pipelines, content moderation, RAG quality scoring, agent self-verification), this is a direct cost story. A 9-judge panel can cost up to 4-5x more than necessary for the same evaluation reliability.
The Cost of a Naive Judge Panel
A common production setup for evaluating coding-agent output runs 5-9 judge models per generation. Token cost for a single evaluation, on a 5K-input / 1K-output coding task:
- 1 judge × Claude Sonnet 4.6 ($3/$15 per M) = $0.015 + $0.015 = $0.030
- 5 judges (Sonnet, Opus, GPT-5.5, Gemini 3.1, Grok 4.3) = ~$0.20
- 9 judges (add Llama, DeepSeek, Qwen, Mistral) = ~$0.30
At the eval volumes a serious team runs (say, 100K evaluations per month for a CI pipeline), the 9-judge panel costs $30,000/month and the 5-judge panel costs $20,000/month. The Apple paper says both panels deliver about the same effective signal as 2-3 well-chosen judges, which at roughly $0.03 per judge per evaluation would cost $6,000-$9,000/month.
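The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, the single-judge prices are the $3/$15 per million tokens quoted for Sonnet, and the multi-judge per-eval figures are the article's rounded estimates rather than values derived from each model's price list.

```python
def cost_per_eval(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one judge call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(per_eval_cost: float, evals_per_month: int) -> float:
    """Dollar cost per month at a fixed per-evaluation cost."""
    return per_eval_cost * evals_per_month


# One Sonnet judge on a 5K-input / 1K-output task.
single = cost_per_eval(5_000, 1_000, 3.0, 15.0)
print(f"single judge: ${single:.3f} per eval")

# Panel costs use the article's approximate per-eval totals.
EVALS_PER_MONTH = 100_000
for label, per_eval in [("5 judges", 0.20), ("9 judges", 0.30)]:
    print(f"{label}: ${monthly_cost(per_eval, EVALS_PER_MONTH):,.0f}/month")
```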
Why the Correlation Happens
The paper traces the correlated-error effect to three root causes:
Shared training data. All frontier models are pretrained on overlapping web-scale text. They share a common prior about what "good code" looks like, what is idiomatic, what is style-correct. When that shared prior is wrong (because the convention in your codebase differs from the open-source mainstream), all judges fail in the same direction.
Shared RLHF data. Models trained with similar human preference data converge on similar judgment patterns. They reward verbose explanations, penalize blunt code, prefer defensive checks — even when the task does not call for any of that.
Shared architectural inductive bias. Transformer-based judges with similar size and attention patterns will agree on attention-driven errors (e.g., misattributing causality to lexical proximity). Adding another transformer judge does not add architectural diversity.
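Whatever the cause, correlated errors can be measured on your own data. The sketch below assumes you have a small gold-labeled set and a 0/1 error indicator per judge per item (1 means the judge disagreed with the gold label); the judge names and the data are made up for illustration, and Pearson correlation on 0/1 indicators is one reasonable choice of metric, not necessarily the one the paper uses.

```python
from itertools import combinations
from math import sqrt


def pearson(xs: list[int], ys: list[int]) -> float:
    """Pearson correlation; on 0/1 indicators this equals the phi coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention: a judge that never (or always) errs gets 0
    return cov / sqrt(var_x * var_y)


def pairwise_error_correlation(
    errors: dict[str, list[int]],
) -> dict[tuple[str, str], float]:
    """Correlation of error indicators for every pair of judges."""
    return {(a, b): pearson(errors[a], errors[b])
            for a, b in combinations(errors, 2)}


# Made-up error indicators for three hypothetical judges on eight labeled items.
errors = {
    "judge_a": [1, 0, 0, 1, 0, 0, 1, 0],
    "judge_b": [1, 0, 0, 1, 0, 0, 0, 0],
    "judge_c": [0, 1, 0, 0, 0, 1, 0, 0],
}
for pair, rho in pairwise_error_correlation(errors).items():
    print(pair, round(rho, 2))
```

In this toy data, judge_a and judge_b fail on mostly the same items, so adding one to a panel that already contains the other contributes little; judge_c fails on different items and is the more useful addition.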
A Cheaper Judge Panel That Actually Works
The paper's prescriptive part is the more useful half: when 9 judges collapse to 2 votes, the path forward is not more judges — it is more diverse judges. Three practical patterns:
Cross-family diversity. Pick judges from different model families (one OpenAI, one Anthropic, one Google, one DeepSeek): comparable capability, different priors. The paper shows this delivers roughly 2.5 effective votes per 4 judges, versus 1.5 for 4 judges from the same family.
Heterogeneous prompting. Even with the same model, two judges with very different evaluation rubrics produce more independent signal than two judges with similar rubrics. Have one judge score on correctness, one on idiom adherence, one on cost-efficiency.
Rules-based checks. Pair the LLM panel with a deterministic checker (unit tests, AST validators, type checkers). For code correctness, a single rules check is often more informative than 5 LLM votes, and it costs essentially nothing in tokens.
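One possible way to wire these three patterns together is sketched below. Everything in it is hypothetical: the `Judge` type, the `evaluate` function, and the toy lambdas stand in for real model calls, and running the deterministic check first as a gate is a design assumption, not something the paper prescribes.

```python
from dataclasses import dataclass
from typing import Callable

# A judge is any callable taking (rubric, candidate) and returning True to accept.
# In a real pipeline this would wrap a model API call; here it is a stand-in.
JudgeFn = Callable[[str, str], bool]


@dataclass
class Judge:
    family: str   # label only, e.g. "anthropic" or "google"
    rubric: str   # each judge scores a different dimension
    fn: JudgeFn


def evaluate(candidate: str,
             deterministic_check: Callable[[str], bool],
             judges: list[Judge]) -> bool:
    """Reject on a failed deterministic check; otherwise take a majority vote."""
    if not deterministic_check(candidate):
        return False  # no judge is called for a candidate that fails the check
    votes = [judge.fn(judge.rubric, candidate) for judge in judges]
    return sum(votes) * 2 > len(votes)


def compiles(candidate: str) -> bool:
    """Toy deterministic check: does the candidate parse as Python?"""
    try:
        compile(candidate, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Toy judges so the sketch runs without any API access.
panel = [
    Judge("anthropic", "correctness", lambda rubric, code: "return" in code),
    Judge("google", "idiom adherence", lambda rubric, code: "\t" not in code),
    Judge("deepseek", "cost-efficiency", lambda rubric, code: len(code) < 200),
]

print(evaluate("def add(a, b):\n    return a + b\n", compiles, panel))
print(evaluate("def add(a, b) return a + b", compiles, panel))
```

Gating on the deterministic check also saves tokens, since candidates that fail it never reach the LLM judges. In practice the check would be a test suite or type checker rather than a parse test.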
A Concrete Redesign
Imagine a CI pipeline that runs 100K coding-agent evaluations per month. Today's setup: 7 LLM judges (Opus, Sonnet, GPT-5.5, Gemini 3.1 Pro, Grok 4.3, Llama, DeepSeek). Monthly bill: ~$24,000.
Apple-aligned redesign:
- 3 cross-family judges with diverse rubrics (Opus correctness, Gemini idiom, DeepSeek cost) at ~$0.10/eval = $10,000/month
- Add a deterministic test runner (free)
- Total: ~$10,000/month for at least equivalent signal
Net savings: $14,000/month, $168,000/year. The paper's finding is one of the highest-leverage cost-cutting opportunities published this year for any team running LLM evaluation at scale.
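To run the same comparison on your own numbers, a small helper is enough. This is a sketch with illustrative names; the inputs are the article's figures, with the deterministic test runner treated as free.

```python
def savings(current_monthly: float,
            redesigned_monthly: float) -> tuple[float, float, float]:
    """Return (monthly savings, annual savings, fraction of current bill saved)."""
    monthly = current_monthly - redesigned_monthly
    return monthly, monthly * 12, monthly / current_monthly


EVALS_PER_MONTH = 100_000
redesigned = 0.10 * EVALS_PER_MONTH  # 3 judges at ~$0.10/eval, test runner free
monthly, annual, fraction = savings(24_000, redesigned)
print(f"${monthly:,.0f}/month, ${annual:,.0f}/year, "
      f"{fraction:.0%} of the current bill")
```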
Why the Industry Still Runs Big Panels
Big judge panels feel safer. "We ran it through 9 models" sounds robust in a postmortem. The Apple paper makes the case that the perceived safety is illusory — the 9-model panel has the same blind spots as a 2-model panel, just at 4-5x the cost. Diversity, not redundancy, drives independent signal. Anyone reviewing their eval pipeline this quarter has a clear opportunity.
Frequently Asked Questions
How many LLM judges do I actually need for reliable evaluation?
Per Apple's paper, 3 cross-family judges with diverse rubrics provide about the same effective signal as a 9-judge panel with correlated errors. Adding a deterministic rules check (unit tests, AST validators) often outweighs LLM votes for code correctness, at zero token cost.
Why do 9 LLM judges only equal 2 independent votes?
Three causes: shared training data (similar priors about 'good code'), shared RLHF data (similar judgment patterns), and shared architectural inductive bias (similar attention-driven failure modes). Correlated errors collapse the apparent diversity of the panel.
How much can I save by redesigning an LLM judge pipeline?
A team running 100K monthly evals on a 7-judge panel (~$24K/month) can typically switch to 3 cross-family judges plus a deterministic check at ~$10K/month for equivalent or better signal. That saves $14K/month, or $168K/year.
What makes judges 'diverse' versus 'redundant'?
Three things make a panel diverse: cross-family selection (OpenAI + Anthropic + Google + DeepSeek beats four Anthropic models), heterogeneous prompting (different rubrics for the same task), and pairing with rules-based checks (unit tests, type checkers). Same-family judges with similar prompts are redundant, not diverse.