JetBrains Picked Codex as Default AI Agent: The Evaluation Methodology That Got It There

June 26, 2026 · 9 min read

A Decision Worth Studying

On June 25, 2026, JetBrains announced that OpenAI Codex is now the recommended default agent in JetBrains AI, the in-IDE assistant integrated across IntelliJ, WebStorm, PyCharm, GoLand, and other JetBrains products. The interesting part of the announcement isn't the choice itself — it's the methodology JetBrains used to pick it.

Most teams shopping for a coding agent in 2026 default to benchmark rankings: pick whoever's highest on SWE-Bench Verified this quarter. JetBrains instead described a two-stage process: systematic real-world benchmark sweeps plus online A/B testing against user behavior. That methodology matters more than the conclusion: the conclusion will change as the field moves, while the methodology generalizes.

Why Benchmark Rankings Aren't Enough

SWE-Bench Verified, Terminal-Bench, AgentArena, and similar public benchmarks are useful for ranking agent capability at a single point in time. They're far less useful for predicting user-perceived quality. The disconnect comes from a few sources:

Benchmark distribution ≠ your team's distribution. SWE-Bench draws from a specific set of GitHub repositories with their own language mix, idiomatic patterns, and bug types. Your team's day-to-day coding tasks have different characteristics. An agent that wins SWE-Bench by 5 points may underperform on the kind of code your team actually writes.

Benchmark optimization is real. Major agent vendors invest engineering effort specifically in benchmark performance. Some of that effort generalizes; some of it is overfit to specific benchmark suites. The headline number drifts upward without proportional improvement in real-task performance.

User experience matters beyond correctness. Latency, conversation quality, error messages, recovery from confusion, willingness to ask for clarification — none of these show up in SWE-Bench scores, but all of them affect whether a developer keeps using the agent the next day.

Stage 1: Systematic Real-World Benchmark Sweeps

JetBrains' first stage is broader and more grounded than a single public benchmark. The team ran multiple candidate agents — Codex, Claude Code, Cursor, and several others — through batteries of real-world tasks drawn from JetBrains users' actual workflows. Tasks were anonymized, organized by language and task type, and scored on multiple dimensions:

  • Code correctness (does the change actually solve the task?)
  • Code quality (idiomatic, maintainable, well-formatted)
  • Test discipline (does the agent run/write tests appropriately?)
  • Tool use (correct file edits, search queries, terminal commands)
  • Token efficiency (cost per completed task)

This is more expensive than running a public benchmark — you need labeled real tasks, evaluator time, and scoring infrastructure — but it produces a more honest picture of which agent performs best on the kinds of tasks your users actually have.
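As a rough illustration of how multi-dimensional scores like these can be rolled up, here is a minimal Python sketch. The weights, the `TaskResult` shape, and the agent names are hypothetical. JetBrains has not published its scoring formula, and each dimension score is assumed to be normalized to the range 0 to 1.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "quality": 0.20,
    "tests": 0.15,
    "tool_use": 0.15,
    "token_efficiency": 0.10,
}


@dataclass
class TaskResult:
    agent: str      # candidate agent name (made up here)
    language: str   # language of the task, e.g. "kotlin"
    scores: dict    # dimension name -> score in [0, 1]


def weighted_score(scores):
    """Collapse per-dimension scores into one number using WEIGHTS."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


def summarize(results):
    """Return the mean weighted score for each (agent, language) pair."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r.agent, r.language)].append(weighted_score(r.scores))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


if __name__ == "__main__":
    demo = [
        TaskResult("agent_a", "kotlin", {
            "correctness": 1.0, "quality": 0.8, "tests": 0.5,
            "tool_use": 1.0, "token_efficiency": 0.6,
        }),
        TaskResult("agent_b", "kotlin", {
            "correctness": 0.5, "quality": 0.5, "tests": 0.5,
            "tool_use": 0.5, "token_efficiency": 0.5,
        }),
    ]
    for key, score in summarize(demo).items():
        print(key, round(score, 3))
```

Grouping by language as well as by agent keeps a strong overall average from hiding a weak spot in one language your users rely on.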

Stage 2: Online A/B Testing Against Real User Behavior

Real-world benchmarks narrow the candidate set. Online A/B testing picks the winner. JetBrains rolled candidate agents out to subsets of JetBrains AI users and measured behavioral signals:

  • Acceptance rate (does the user keep the agent's suggestion?)
  • Retry rate (does the user re-prompt the same task?)
  • Session length (do users spend more or less time with the agent?)
  • User-initiated dismissals (does the user kill the agent mid-task?)
  • Subjective rating (in-IDE thumbs-up/thumbs-down feedback)

A/B testing captures the variables that benchmarks miss: latency feel, perceived quality, error recovery behavior, and the soft signals that determine whether a developer keeps reaching for the agent in their daily workflow.
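To make these signals concrete, here is a minimal Python sketch that turns per-session logs into per-variant rates. The `Session` fields and the variant names are assumptions for illustration; the announcement does not describe JetBrains' telemetry schema or its exact metric definitions.

```python
from dataclasses import dataclass


@dataclass
class Session:
    variant: str         # which agent the user was assigned (hypothetical label)
    suggestions: int     # suggestions shown in the session
    accepted: int        # suggestions the user kept
    tasks: int           # distinct tasks attempted
    retried_tasks: int   # tasks the user re-prompted at least once
    dismissed: bool      # user stopped the agent mid-task


def _rate(numerator, denominator):
    """Safe division: a variant with no data gets a rate of 0.0."""
    return numerator / denominator if denominator else 0.0


def variant_metrics(sessions):
    """Pool counts per variant, then compute acceptance, retry, and dismissal rates."""
    totals = {}
    for s in sessions:
        t = totals.setdefault(s.variant, {
            "shown": 0, "accepted": 0, "tasks": 0,
            "retried": 0, "sessions": 0, "dismissed": 0,
        })
        t["shown"] += s.suggestions
        t["accepted"] += s.accepted
        t["tasks"] += s.tasks
        t["retried"] += s.retried_tasks
        t["sessions"] += 1
        t["dismissed"] += int(s.dismissed)
    return {
        variant: {
            "acceptance_rate": _rate(t["accepted"], t["shown"]),
            "retry_rate": _rate(t["retried"], t["tasks"]),
            "dismissal_rate": _rate(t["dismissed"], t["sessions"]),
        }
        for variant, t in totals.items()
    }
```

This sketch pools counts across sessions rather than averaging per-session rates, so a handful of very short sessions cannot swing the result. Whether pooled or per-user rates are the better choice depends on how unevenly usage is spread across your users.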

Why JetBrains Picked Codex (As of June 2026)

The announcement doesn't disclose every detail, but the published methodology makes the rough picture clear:

Codex performed well across language coverage. JetBrains supports a wide language mix (Java, Kotlin, Python, JavaScript, Go, Rust, C#, PHP, Ruby). Codex appears to perform consistently across this set, whereas some competitors are stronger in a narrower subset.

Token economics are competitive. Codex's per-task token usage in JetBrains' benchmark sweeps was lower than several alternatives, which should translate into fewer rate-limit issues and lower quota burn for JetBrains AI subscribers using Codex.

User behavior signals were positive. In A/B testing, Codex-paired sessions showed higher acceptance rates and lower retry rates than several alternatives.

The "recommended" framing leaves room to change. JetBrains is careful to use "current choice" language, signaling that the recommendation will be revisited as new agents launch and existing agents update.

How to Apply This Methodology to Your Team

You probably don't have JetBrains' user base for A/B testing. You can still use a scaled-down version of the same process:

1. Build a small evaluation set from real tasks. Collect 20-50 representative coding tasks from your team's recent work. Use them as a private benchmark. Run each candidate agent through every task. Score on the dimensions you care about (correctness, style, tests, cost).

2. Pilot with a small user cohort. Give the top two candidates to subsets of your team for one or two weeks each. Track acceptance rates, retry frequency, and qualitative feedback.

3. Commit but revisit. Pick the winner, deploy it broadly, and set a calendar reminder to re-evaluate in 6 months. The frontier shifts quarterly, so your "default" should not be locked in for a year.
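One caution for small pilots: a gap in acceptance rates between two candidates can easily be noise. A minimal sketch of a sanity check, using a standard two-proportion z-test in plain Python, is below. It is not something the JetBrains announcement describes, the counts are invented, and it treats every suggestion as independent, which is a simplification when a few developers generate most of the data.

```python
import math


def two_proportion_z_test(accepted_a, shown_a, accepted_b, shown_b):
    """Return (z, two-sided p-value) for the difference in acceptance rates."""
    p_a = accepted_a / shown_a
    p_b = accepted_b / shown_b
    # Pooled rate under the null hypothesis that both agents are equally good.
    pooled = (accepted_a + accepted_b) / (shown_a + shown_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    if se == 0:
        return 0.0, 1.0  # all accepted or all rejected: no evidence either way
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Same 65% vs 50% acceptance rates at two hypothetical sample sizes.
    print(two_proportion_z_test(13, 20, 10, 20))
    print(two_proportion_z_test(130, 200, 100, 200))
```

With 20 suggestions per agent, a 65% versus 50% split is not distinguishable from chance at the usual 0.05 threshold; with 200 per agent, the same split is. For a small team, that argues for running the pilot long enough to collect a few hundred suggestions per candidate, and for weighting qualitative feedback heavily when you cannot.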

The Hidden Cost JetBrains Is Avoiding

The alternative to systematic evaluation is the default approach: pick whichever agent tops SWE-Bench, deploy it, and hope it works. The hidden costs of that approach are real:

  • If the agent doesn't match your team's task distribution, productivity gains underdeliver.
  • If retry rates are high, your token bill inflates beyond projections.
  • If users dislike the agent, adoption stalls and your seat-cost ROI collapses.
  • Switching agents 6 months in is expensive — config migrations, retraining users, vendor negotiations.

JetBrains spent meaningful engineering effort on evaluation, but the alternative would have been more expensive in expectation.
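The retry-rate point in the list above lends itself to back-of-the-envelope arithmetic. The sketch below assumes each attempt at a task is retried with a fixed, independent probability, so the expected number of attempts per task is 1 / (1 - retry_rate). That model, the function name, and every number in the example are assumptions for illustration, not figures from JetBrains or any vendor.

```python
def expected_monthly_cost(tasks_per_month, tokens_per_attempt,
                          price_per_million_tokens, retry_rate):
    """Estimate monthly token spend when a share of attempts gets retried."""
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    # Geometric model: each attempt is retried with probability retry_rate.
    expected_attempts = 1 / (1 - retry_rate)
    total_tokens = tasks_per_month * tokens_per_attempt * expected_attempts
    return total_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Hypothetical team: 1,000 tasks a month, 50,000 tokens per attempt,
    # 10 currency units per million tokens.
    for rate in (0.0, 0.2, 0.5):
        print(rate, expected_monthly_cost(1000, 50_000, 10.0, rate))
```

Under these made-up inputs, a 20% retry rate raises the bill by 25% over the no-retry projection, and a 50% retry rate doubles it. A budget built from benchmark token counts alone would miss that entirely.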

Bottom Line

JetBrains picked Codex as the default in JetBrains AI through systematic real-world benchmarks plus online A/B testing — not by reading SWE-Bench rankings. The methodology generalizes to any team picking a coding agent: build a small private benchmark, run a pilot, measure user behavior, and commit but plan to revisit. The cost of doing this is real; the cost of skipping it is usually higher.

Frequently Asked Questions

Why is JetBrains' agent selection methodology worth studying?

Because it explicitly goes beyond public benchmarks. SWE-Bench Verified, Terminal-Bench, and similar rankings predict capability but not user-perceived quality. JetBrains combined private real-world task batteries with online A/B testing to measure actual user behavior (acceptance rate, retry rate, session engagement), which is a much better predictor of whether an agent will deliver value in production.

What dimensions did JetBrains evaluate beyond raw correctness?

In the benchmark sweeps: code quality (idiomatic, maintainable), test discipline (does the agent write or run tests appropriately?), tool use (correct file edits, search, terminal commands), and token efficiency (cost per task). In A/B testing: user behavior signals (acceptance rate, retry rate, session length, dismissals, subjective ratings), which also capture softer factors such as latency feel.

Can I apply JetBrains' methodology to my own team's agent selection?

Yes, in a scaled-down form. Build a private benchmark of 20-50 real tasks from your team's recent work, score candidate agents on the dimensions you care about, pilot the top two candidates with a subset of your team for 1-2 weeks each, then commit to the winner with a calendar reminder to re-evaluate every 6 months.

Why did Codex win JetBrains' evaluation in June 2026?

The announcement doesn't disclose every detail, but the reasons suggested by the published methodology are: consistent performance across JetBrains' wide language coverage (Java, Kotlin, Python, JavaScript, Go, Rust, C#, PHP, Ruby), competitive token economics in benchmark sweeps, and positive user behavior signals (higher acceptance rate, lower retry rate) in A/B testing. JetBrains uses "current choice" framing to signal that the recommendation will be revisited as the field evolves.

How often should a team re-evaluate its default AI coding agent?

Every 6 months is a reasonable cadence in 2026. The frontier shifts quarterly with new model releases, and existing agents update their underlying models several times per year. A 6-month re-evaluation cycle balances the cost of switching against the risk of staying on a stale default.
