Fable 5 Hits 16.1% on Remote Labor Index — What a 6x Jump in 8 Months Means for Coding Costs

By Eric Bush · July 4, 2026 · 9 min read

The Headline Number

The Remote Labor Index (RLI) is one of the more grounded AI benchmarks: 240 real freelance projects, sourced from actual marketplaces, worth $144,000 in total contract value. Each project is graded by human clients on whether the delivered work meets professional quality. The July 2 leaderboard, released this week, has Claude Fable 5 at a 16.1% pass rate. Eight months ago, the best system in the field scored 2.5%. That is a 6.4x jump.

For context, here are the current top scores:

| Model | RLI pass rate | Notes |
|---|---|---|
| Claude Fable 5 | 16.1% (worst-case 14.6%) | 218/240 projects evaluated after US access restrictions |
| Claude Opus 4.8 | 8.3% | Full 240 evaluated |
| GPT-5.5 | 6.3% | AI judge over-rated by ~15% vs human judgment |
| Gemini 3 Pro | 1.25% | Underperforms older Gemini variants |
| Best system, November 2025 | 2.5% | Reference point for 6.4x growth |

The Cost Number Nobody Publishes

Benchmarks report pass rate. They rarely report the dollars spent to get there. Based on the reported RLI methodology, each Fable 5 attempt on a full project consumes 500K to 1.5M tokens across the trajectory, weighted mostly toward input because of long context windows and repeated tool calls. Using current Fable 5 pricing:

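  • Cost per attempt (mid-estimate): ~$18 in API charges alone.
  • Total spend to complete 240 attempts: ~$4,300.
  • Attempts that "passed": about 39 if 16.1% is applied to all 240 projects, or about 35 if it is applied to the 218 projects Fable 5 was actually evaluated on.
  • Cost per successful completion: ~$110 in API charges (~$4,300 / 39; the 218-project basis gives nearly the same figure, ~$3,900 / 35).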

The average human contract value across the dataset is $600 ($144,000 over 240 projects). Assuming the projects Fable succeeded on are priced near that average, Fable 5 is delivering successful projects at roughly 18% of the freelancer's price: a large saving, but far from free.
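The arithmetic behind these figures can be reproduced with a short sketch. It uses only the numbers quoted in this section (~$18 per attempt, 240 attempts, a 16.1% pass rate, $144,000 of total contract value). The function name `run_economics` and its return keys are hypothetical, chosen for illustration, and the inputs are the article's estimates rather than measured billing data.

```python
def run_economics(cost_per_attempt, attempts, pass_rate, avg_contract_value):
    """Summarise API spend for a benchmark run.

    All inputs are estimates from the article, not measured billing data.
    """
    total_spend = cost_per_attempt * attempts
    passes = round(attempts * pass_rate)      # whole projects that passed
    cost_per_success = total_spend / passes   # failed attempts are paid for too
    return {
        "total_spend": total_spend,
        "passes": passes,
        "cost_per_success": cost_per_success,
        "share_of_contract": cost_per_success / avg_contract_value,
    }


# Figures from the text: ~$18 per attempt, 240 attempts, 16.1% pass rate,
# and $144,000 of contract value spread over 240 projects.
article = run_economics(18.0, 240, 0.161, 144_000 / 240)
print(f"${article['cost_per_success']:.2f} per success, "
      f"{article['share_of_contract']:.1%} of average contract value")
```

The key design choice is dividing total spend, including failed attempts, by the number of passes. Dividing only the cost of the passing attempts would understate what each successful project actually costs.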

The 6.4x Curve Won't Last

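Extrapolating naively from 2.5% → 16.1% in 8 months breaks down quickly: holding the 6.4x multiplier constant would push the score past 100% within another 8 months, and even a straight-line fit (about 1.7 points per month) implies roughly 36% by mid-2027. Neither is likely to hold because the remaining projects are systematically harder: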

  • The 16% Fable completes are the ones with clear specs, no ambiguous scope, and no missing artifacts.
  • The other 84% typically require: interpreting vague briefs, clarifying with the client, negotiating scope changes, or making judgment calls about acceptance criteria.
  • Those are exactly the categories current agents fail at most consistently, independent of raw capability.

Expect the curve to flatten sharply as it climbs into the 30-40% band, because unlocking those projects will require agent architecture changes (real client dialogue, adaptive replanning) rather than pure model improvements.
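To see how fragile a naive extrapolation is, one option is to fit both a straight line and a constant growth multiplier to the two observed scores (2.5% and 16.1%, 8 months apart) and project each forward. This is a minimal sketch, not a forecast: the function names are hypothetical, and two data points cannot tell the growth models apart.

```python
import math


def linear_projection(start, end, months_elapsed, months_ahead):
    """Extend the straight line through the two observed scores."""
    slope = (end - start) / months_elapsed  # percentage points per month
    return end + slope * months_ahead


def geometric_projection(start, end, months_elapsed, months_ahead):
    """Extend a constant month-over-month growth multiplier."""
    monthly_factor = (end / start) ** (1 / months_elapsed)
    return end * monthly_factor ** months_ahead


def months_to_reach(target, start, end, months_elapsed):
    """Months until the geometric projection reaches `target`."""
    monthly_factor = (end / start) ** (1 / months_elapsed)
    return math.log(target / end) / math.log(monthly_factor)


START, END, ELAPSED = 2.5, 16.1, 8  # pass rates in percent, 8 months apart

linear_12 = linear_projection(START, END, ELAPSED, 12)
geometric_12 = geometric_projection(START, END, ELAPSED, 12)
months_to_100 = months_to_reach(100.0, START, END, ELAPSED)

print(f"linear, 12 months ahead:    {linear_12:.1f}%")
print(f"geometric, 12 months ahead: {geometric_12:.1f}%")
print(f"geometric reaches 100% in:  {months_to_100:.1f} months")
```

Under these assumptions the straight line gives 36.5% twelve months out, while the constant multiplier exceeds 100% in under eight months. A pass rate cannot exceed 100%, so the geometric reading is impossible and neither fit should be treated as a prediction.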

Why AI Judges Over-Rate

The RLI paper flagged that AI-based grading over-rates model performance, and the gap is largest for GPT-5.5 (roughly 15 percentage points inflated compared to human evaluators). This matters because most vendor-published benchmarks use AI judges. Two takeaways:

  1. When you see a benchmark number from a vendor's own materials, discount it 10-15 points before comparing to reality.
  2. Fable 5's RLI number is more trustworthy than most because RLI uses paying clients, not other LLMs, as judges.

What This Means for Your Coding Budget

Three practical implications:

  • Do not budget as if Fable will complete every task on the first try. Plan for 3-6 attempts on non-trivial work. Multiply your token estimate accordingly.
  • Use Fable 5 on projects with clear specs, not vague ones. Its edge is largest on the well-scoped 16%; on the 84% you are burning tokens to fail.
  • Track cost per successful outcome, not cost per API call. A workflow that costs $8 per API call but succeeds 60% of the time works out to about $13.33 per success, which is cheaper than one that costs $3 per call and succeeds 15% of the time ($20 per success).
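The comparison in the last bullet is a one-line calculation, sketched here with a retry estimate alongside it. The helper names are hypothetical. The retry estimate assumes every attempt succeeds independently with the same probability, which real agent runs may not satisfy.

```python
def cost_per_success(cost_per_call, success_rate):
    """Expected spend per successful outcome, counting failed calls."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate


def expected_attempts(success_rate):
    """Mean attempts until the first success.

    Assumes independent attempts with a constant success rate
    (a geometric distribution); this is a modelling assumption.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


pricey_but_reliable = cost_per_success(8.0, 0.60)
cheap_but_flaky = cost_per_success(3.0, 0.15)

print(f"$8 call at 60% success: ${pricey_but_reliable:.2f} per success")
print(f"$3 call at 15% success: ${cheap_but_flaky:.2f} per success")
print(f"attempts expected at 25% success: {expected_attempts(0.25):.0f}")
```

Under this model, a budget of 3-6 attempts corresponds to per-attempt success rates of roughly 17% to 33%. Multiplying a single-attempt token estimate by `expected_attempts` gives a first-order budget figure.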

The Gemini 3 Pro Anomaly

Gemini 3 Pro's 1.25% score is worth its own line. It scored lower than older Gemini models on RLI, despite scoring higher on Google's own benchmarks. The most credible explanation is that Gemini 3 Pro is optimized for instruction-following and short-form completion, not for the open-ended multi-step trajectories a freelance project demands. If your workflow is chatbot-shaped, Gemini 3 Pro remains competitive; if it is agent-shaped, RLI suggests otherwise.

Recommendation

RLI is currently the most useful public benchmark for anyone paying real money to run coding agents on real projects. Track the leaderboard quarterly. Use its numbers as ceiling estimates when forecasting budgets, and remember the 84% failure tail — the cost of retries dwarfs the cost of individual API calls.

Frequently Asked Questions

What is the Remote Labor Index (RLI)?

A benchmark measuring what percentage of 240 real paid freelance projects (worth $144K total) an AI agent can complete to professional quality. Unlike most benchmarks, RLI uses paying clients as judges rather than other LLMs, which reduces the AI-judge over-rating problem.

What is Claude Fable 5's RLI score?

A 16.1% pass rate on the 218 of 240 projects it could be evaluated on after US access restrictions. The worst-case figure of 14.6% counts the 22 unevaluated projects as failures. The headline score is 6.4x the best system's score eight months earlier and roughly 2x the next best model (Opus 4.8 at 8.3%).

What is the actual cost per successful project completion for Fable 5?

Approximately $110 in API charges per completed project, based on 500K-1.5M tokens per attempt at current Fable pricing. Since the average freelance project in the dataset is worth ~$600, Fable delivers successful projects at ~18% of the freelancer's price.

Why does Gemini 3 Pro score only 1.25% on RLI?

The most credible explanation is that Gemini 3 Pro is optimized for short-form instruction-following, not for the open-ended multi-step trajectories a freelance project demands. Its raw benchmark scores can be strong while agent-shaped performance lags.

How should this benchmark affect my AI coding budget?

Three ways: budget for 3-6 attempts on non-trivial work (not one-shot completion), reserve Fable for well-scoped work where its edge is largest, and track cost per successful outcome rather than cost per API call.