xAI Grok Build Ships /goal Mode: What Long-Running Autonomous Coding Actually Costs Per Day

June 23, 2026 · 7 min read

What /goal Actually Does

xAI shipped a new /goal mode in Grok Build on June 23, 2026. You give it a one-line objective; the agent plans the approach, breaks it into a checklist, and works through it item by item until the goal is reached and verified. You can issue mid-run instructions to redirect or constrain the work, and the checklist is ticked off as items complete.

That changes the unit of work you pay an AI coding agent for. Instead of paying per turn or per file edit, you're effectively buying continuous autonomous time. Anyone running this overnight or for an 8-hour workday needs a different mental model for budgeting.

The Token Anatomy of an Autonomous Session

An autonomous coding session burns tokens across four buckets:

Plan generation (5-10% of total): Initial decomposition of the goal into a checklist. Single-shot, usually under 5,000 tokens output.

Per-step execution (60-70% of total): Reading code, generating edits, running tests. This scales linearly with the checklist length and the size of files touched.

Verification loops (15-20% of total): The agent checking its own work — running tests, re-reading the changed file, deciding whether to retry. This is where autonomous mode burns the most "invisible" tokens compared to a human-in-the-loop session.

Recovery from failure (5-10% of total): When a step fails, re-planning the next attempt. Heavier on output tokens than input.
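To make the four buckets concrete, here is a minimal sketch that splits a session's total tokens using the midpoint of each range above. The midpoints and the normalization step are assumptions for illustration (the ranges do not have to sum to exactly 100%), and the names `BUCKET_SHARES` and `split_tokens` are hypothetical.

```python
# Share ranges (low, high) per bucket, taken from the breakdown above.
BUCKET_SHARES = {
    "plan_generation": (0.05, 0.10),
    "per_step_execution": (0.60, 0.70),
    "verification_loops": (0.15, 0.20),
    "failure_recovery": (0.05, 0.10),
}


def split_tokens(total_tokens: int) -> dict[str, int]:
    """Allocate total_tokens across buckets using normalized range midpoints."""
    midpoints = {name: (lo + hi) / 2 for name, (lo, hi) in BUCKET_SHARES.items()}
    # The midpoints sum to about 0.975, so rescale them to cover the whole total.
    scale = sum(midpoints.values())
    return {
        name: round(total_tokens * mid / scale) for name, mid in midpoints.items()
    }


if __name__ == "__main__":
    for bucket, tokens in split_tokens(7_500_000).items():
        print(f"{bucket:>20}: {tokens:,} tokens")
```

Treat the output as a planning aid, not a measurement: real sessions will drift toward the verification and recovery buckets when tests are flaky.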

A Realistic 8-Hour Cost Model

Assume an 8-hour autonomous session on Grok 4.3 (priced at $3/$15 per M tokens for input/output) on a mid-sized goal — say, "migrate the user-service from Express 4 to Hono." From observed agentic workloads of similar shape:

  • Total input tokens: ~6M (lots of context re-reads and tool calls)
  • Total output tokens: ~1.5M (edits, plans, verification reasoning)
  • Token cost: $18 (input) + $22.50 (output) = ~$40.50

That's the fully utilized case. Most real sessions don't sustain peak token throughput for 8 hours — there's idle time waiting on tests, builds, and external services. A more typical 8-hour session lands at $20-$30 in raw token cost.
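The arithmetic above is easy to reproduce. The sketch below uses the $3/$15 per-million rates and the token counts quoted in this section; the `utilization` factor is an assumption that models idle time as a simple linear discount, which is one way to arrive at the $20-$30 range.

```python
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (rate quoted above)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (rate quoted above)


def session_cost(input_tokens: int, output_tokens: int, utilization: float = 1.0) -> float:
    """Return the raw token cost in USD for one session.

    utilization is a hypothetical 0..1 factor: the fraction of the session
    spent actually generating tokens rather than waiting on tests or builds.
    """
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return (input_cost + output_cost) * utilization


if __name__ == "__main__":
    full = session_cost(6_000_000, 1_500_000)
    print(f"fully utilized: ${full:.2f}")
    for u in (0.5, 0.75):
        print(f"{u:.0%} utilized:  ${session_cost(6_000_000, 1_500_000, u):.2f}")
```

Under this assumption, the typical $20-$30 session corresponds to roughly 50-75% utilization of the fully loaded case.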

Where Costs Quietly Inflate

Three patterns push autonomous-mode bills above expectations:

Verification recursion. The agent runs tests, sees a failure, re-reads the file, edits, runs tests again. On flaky tests this can loop for hours. A single recursive verification loop on a 10K-token file can burn 200K+ tokens before timing out.

Plan staleness. The agent commits to a plan based on its first read of the codebase. If the plan was wrong, mid-run instructions can correct it — but if you don't intervene, the agent will diligently work down a wrong checklist. That's a category of waste autonomous mode introduces that supervised mode doesn't.

Goal definition fuzziness. "Migrate to Hono" is fuzzy. The agent will keep working until it decides the goal is met. A poorly bounded goal can extend a 4-hour task into 12 hours of marginal cleanup that nobody asked for.
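The verification recursion figure can be sanity-checked with a small estimate. This sketch assumes each retry re-reads the whole file once; `overhead_per_iteration` is a hypothetical knob for test output and reasoning tokens, which the figure above does not break out.

```python
import math


def loop_tokens(file_tokens: int, iterations: int, overhead_per_iteration: int = 0) -> int:
    """Tokens consumed by a retry loop that re-reads one file per iteration."""
    return iterations * (file_tokens + overhead_per_iteration)


def iterations_to_reach(file_tokens: int, token_limit: int, overhead_per_iteration: int = 0) -> int:
    """Number of retries needed before the loop has consumed token_limit tokens."""
    per_iteration = file_tokens + overhead_per_iteration
    return math.ceil(token_limit / per_iteration)


if __name__ == "__main__":
    # A 10K-token file reaches 200K tokens after 20 bare re-reads.
    print(iterations_to_reach(10_000, 200_000))
    # With 5K tokens of test output and reasoning per retry, it takes fewer.
    print(iterations_to_reach(10_000, 200_000, overhead_per_iteration=5_000))
```

The point of the estimate is that a flaky test does not need many retries to become the largest line item in a session.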

Three Cost Controls That Pay For Themselves

Set a token budget per goal. Most autonomous tools (including Grok Build's CLI) accept a max-token-spend flag. Cap each goal at 2-3x your estimated token cost — enough headroom for honest variance, not enough for runaway loops.

Define exit criteria explicitly. Instead of "migrate to Hono," write "migrate to Hono such that all existing tests pass and no route handler exceeds 50 lines." The narrower the exit, the cheaper the run.

Check in every 90 minutes. Autonomous doesn't mean unsupervised. A two-minute review of the checklist every 90 minutes catches plan staleness before it costs you four hours of wrong work.
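If you drive an agent from your own wrapper script, the budget cap can be sketched as a small guard like the one below. This is not Grok Build's flag or API; `TokenBudget` and `BudgetExceeded` are hypothetical names, and the guard assumes your wrapper can see per-call token usage. The default prices are the $3/$15 per-million rates used earlier.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes the cap for a goal."""


class TokenBudget:
    """Tracks spend for one goal and stops the run once the cap is passed."""

    def __init__(self, estimated_cost_usd: float, headroom: float = 2.5,
                 input_price_per_m: float = 3.0, output_price_per_m: float = 15.0):
        # Cap at a multiple of the estimate (2-3x is the range suggested above).
        self.cap_usd = estimated_cost_usd * headroom
        self.spent_usd = 0.0
        self.input_price_per_m = input_price_per_m
        self.output_price_per_m = output_price_per_m

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one call's usage; return remaining budget or raise BudgetExceeded."""
        self.spent_usd += input_tokens / 1_000_000 * self.input_price_per_m
        self.spent_usd += output_tokens / 1_000_000 * self.output_price_per_m
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f} cap"
            )
        return self.cap_usd - self.spent_usd


if __name__ == "__main__":
    budget = TokenBudget(estimated_cost_usd=25.0)  # cap is 2.5x the estimate
    remaining = budget.record(input_tokens=400_000, output_tokens=90_000)
    print(f"remaining: ${remaining:.2f}")
```

Raising an exception rather than returning a flag is a deliberate choice here: a runaway loop should stop the run, not rely on the caller remembering to check.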

When /goal Mode Is Worth It

The break-even versus a supervised session is straightforward: autonomous mode wins when the operator time it frees up is worth more than the token spend plus the time spent reviewing the agent's work. For a senior engineer billing internally at $150/hour, a $20-$40 token bill is worth only about 8 to 16 minutes of their time, so a run that saves ~3 hours per 8-hour session clears the bar with a wide margin. That's a low bar for migrations, refactors, and large-surface mechanical changes, and a high bar for novel architectural work where supervision adds real value.
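A rough sketch of that comparison, using figures from this article: the $150/hour rate, the $40.50 fully utilized token cost, and two-minute check-ins every 90 minutes. It counts only token spend and check-in time on the cost side, which is an assumption; reviewing the final diff and reworking wrong steps are not modeled, and the function names are hypothetical.

```python
def checkin_minutes(session_hours: float, interval_minutes: int = 90,
                    review_minutes: int = 2) -> int:
    """Total minutes spent on periodic check-ins during one session."""
    checkins = int(session_hours * 60 // interval_minutes)
    return checkins * review_minutes


def net_savings(hours_saved: float, hourly_rate: float,
                token_cost: float, review_minutes: float) -> float:
    """Value of operator time saved minus token spend and check-in time, in USD."""
    review_cost = review_minutes * hourly_rate / 60
    return hours_saved * hourly_rate - token_cost - review_cost


if __name__ == "__main__":
    minutes = checkin_minutes(8)  # five check-ins in an 8-hour run
    print(f"check-in time: {minutes} minutes")
    print(f"net savings at 3 hours saved: ${net_savings(3, 150, 40.50, minutes):.2f}")
```

A negative result means the supervised session would have been cheaper under these assumptions.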

Grok Build's /goal mode is the cleanest version of this trade so far, but the same math applies to Claude Code's autonomous loops and Codex CLI's longer-running agents. The unit of work is shifting; the budget should follow.

Frequently Asked Questions

How much does an 8-hour Grok Build /goal session actually cost?

A typical session lands at $20-$30 in raw token cost. A fully utilized 8-hour run on Grok 4.3 reaches about $40: roughly $18 in input tokens plus $22.50 in output tokens at $3/$15 per million.

What makes autonomous coding modes more expensive than expected?

Three drivers: verification recursion (the agent loops on flaky tests), plan staleness (committing to a wrong initial plan), and fuzzy goal definitions that let the agent keep working past the actual finish line. Each can double a session's token spend.

How do I cap the cost of a /goal mode session?

Set a max-token-spend flag at 2-3x your estimated cost, define explicit exit criteria (e.g., 'all tests pass and no handler over 50 lines'), and check in every 90 minutes to catch plan staleness early. These three controls keep most sessions within budget.

When is autonomous mode cheaper than supervised coding?

When the value of the operator's time saved exceeds the token spend plus the time spent reviewing the agent's work. For a senior engineer at a $150/hour internal rate, a run that saves about 3 hours per 8-hour session clears that bar comfortably. The bar is easy to clear for migrations and refactors, harder for novel architectural work.