What Is Per-Render Pricing? AI Video, Image, and Voice API Cost Models Explained

June 25, 2026 · 8 min read

What Is Per-Render Pricing?

Per-render pricing is a billing model in which AI generation APIs charge a fixed (or duration-scaled) price per output, rather than charging based on the number of tokens consumed. A "render" is one generated artifact — a video clip, an image, a voice clip, or sometimes a 3D model.

Unlike per-token pricing (where output length drives cost), per-render pricing is largely independent of token count and instead scales with: duration of generated content (e.g., per second of video), resolution or quality tier, number of inference steps used, and any add-on features (style transfer, audio sync, slow-motion).

Why Per-Render Replaced Per-Token for Media

Per-token pricing made sense for text because tokens correlate with both compute (each token requires a forward pass) and value to the user (longer output is more useful). For generative media, neither correlation holds:

  • Compute is dominated by the diffusion or flow-matching steps, not by output length
  • Output value is per-artifact, not per-byte (a 5-second video is one usable thing, even if it's "less data" than a 60-second one)
  • Users plan budgets in artifacts, not bytes ("I need 100 product videos this month")

Per-render pricing aligns billing with the customer's mental model and the provider's actual cost driver.

Typical Per-Render Pricing Ranges (2026)

Pricing ranges for generative-media APIs in mid-2026 (always verify with the provider's current pricing page):

  • AI Video (5 seconds, 1080p): $0.30-$1.50 per render across Runway, Sora, Pika, FastWan
  • AI Video (5 seconds, 4K): $1.50-$5.00 per render
  • Image (1024x1024): $0.005-$0.05 per render across DALL-E, Midjourney API, Stable Diffusion variants
  • Image (4K, photographic quality): $0.05-$0.30 per render
  • Voice clone (10 seconds): $0.04-$0.20 per render across ElevenLabs, OpenAI Realtime, Grok Voice
  • Voice TTS (1 minute): $0.10-$0.40 per render

These ranges don't include subscription discounts, which can drop per-render rates 30-60% for committed-volume customers.

How Cost Scales With Quality

Per-render pricing usually steps up by quality tier, not linearly:

  • 1080p → 4K is typically 3-5× more expensive (more inference steps, larger latents)
  • 5 seconds → 30 seconds of video is typically 4-7× more expensive (often below the 6× that linear scaling would imply, because of model batching)
  • Cheap-tier image → photo-realistic image is 5-10× more expensive
  • Plain TTS → voice clone is 2-4× more expensive

Where duration pricing comes in below linear, the reason is amortization: providers spread setup cost across longer outputs, so the effective per-second price falls as duration grows. For high-volume use, shifting toward fewer but longer renders is a real cost lever.
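The amortization effect can be illustrated with a toy model. The sketch below assumes a render costs a fixed setup fee plus a flat per-second rate. Both constants are hypothetical, chosen only so that the 5-second price falls inside the 1080p range quoted earlier. They are not taken from any provider's price list.

```python
# Toy model: render price = fixed setup cost + per-second rate * duration.
# SETUP_USD and PER_SECOND_USD are illustrative assumptions, not provider prices.
SETUP_USD = 0.25
PER_SECOND_USD = 0.10


def render_price(duration_s: float) -> float:
    """Price of one render of the given duration under the toy model."""
    return SETUP_USD + PER_SECOND_USD * duration_s


def effective_per_second(duration_s: float) -> float:
    """Total price divided by duration: what each second of output costs."""
    return render_price(duration_s) / duration_s


if __name__ == "__main__":
    for seconds in (5, 10, 30):
        print(f"{seconds:>2}s  total ${render_price(seconds):.2f}  "
              f"per-second ${effective_per_second(seconds):.3f}")
    ratio = render_price(30) / render_price(5)
    print(f"30s render costs {ratio:.2f}x the 5s render")
```

Under these assumed numbers the 30-second render costs about 4.3× the 5-second one rather than 6×, because the setup fee is paid once either way. Real price lists are usually tiered rather than a smooth formula, so treat this as intuition, not a pricing calculator.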

Per-Render vs Per-Token: When Each Wins

Most media-generation work is easier to budget under per-render pricing because costs are predictable, but each model has cases where it fits better:

Per-render wins when: you need fixed-cost-per-artifact accounting, your users care about artifact count rather than token count, you want to hard-cap monthly spend by capping render count, and your generation has consistent inference compute (most diffusion-based media).

Per-token wins when: you need granular cost based on output length, you're generating wildly variable-length outputs (e.g., short prompts producing both 1-line and 100-line responses), your inference compute genuinely correlates with output length (text generation, code generation), or you want to use prompt caching to discount input tokens.
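Capping monthly spend by capping render count is simple enough to sketch. The class below is a minimal, hypothetical example (`RenderCap` and its fields are illustrative names, not part of any provider SDK). It assumes a single flat price per render and works in integer cents to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass
class RenderCap:
    """Tracks renders against a monthly cap derived from a budget.

    Hypothetical helper: assumes one flat per-render price, in integer cents.
    """
    budget_cents: int
    price_per_render_cents: int
    used: int = 0

    def __post_init__(self) -> None:
        if self.price_per_render_cents <= 0:
            raise ValueError("price_per_render_cents must be positive")

    @property
    def max_renders(self) -> int:
        """Largest whole number of renders the budget can pay for."""
        return self.budget_cents // self.price_per_render_cents

    def try_render(self) -> bool:
        """Reserve one render; return False once the cap is reached."""
        if self.used >= self.max_renders:
            return False
        self.used += 1
        return True


if __name__ == "__main__":
    # $500 monthly budget at an assumed $0.75 per render.
    cap = RenderCap(budget_cents=50_000, price_per_render_cents=75)
    print(f"Monthly cap: {cap.max_renders} renders")
```

Calling `try_render()` before each API request turns the dollar budget into a hard stop. With per-token billing the same guard would need an estimate of output length, which is the predictability gap described above.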

Hybrid Pricing You'll Encounter

Some providers blend the two:

Per-render with token-priced prompt. You pay a base price per video render, plus per-token charges for the text prompt that defines it. The token side is small (typically $0.001-$0.01 per render) but adds up at scale.

Per-second voice with per-token transcript. Voice synthesis bills per second of audio, but the input transcript is also tokenized and billed. For very long voiceovers, the transcript cost can be 10-20% of the total.

Per-frame video. Newer video APIs charge per generated frame, then sum across the clip. Effectively per-second pricing with finer granularity, useful for variable-frame-rate use cases.
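The arithmetic behind two of these hybrid schemes can be sketched in a few lines. All function names and rates below are assumptions for illustration. Real providers define their own units, such as per token, per 1k tokens, or per character, so check the pricing page before reusing the formulas.

```python
def render_with_token_prompt(base_usd: float, prompt_tokens: int,
                             usd_per_1k_tokens: float) -> float:
    """Base per-render price plus a per-token charge for the text prompt."""
    return base_usd + (prompt_tokens / 1000) * usd_per_1k_tokens


def token_share(base_usd: float, prompt_tokens: int,
                usd_per_1k_tokens: float) -> float:
    """Fraction of the total render price that comes from prompt tokens."""
    total = render_with_token_prompt(base_usd, prompt_tokens, usd_per_1k_tokens)
    return (total - base_usd) / total


def per_frame_video(duration_s: float, fps: float, usd_per_frame: float) -> float:
    """Per-frame billing: number of frames times the per-frame rate."""
    return duration_s * fps * usd_per_frame


if __name__ == "__main__":
    # Illustrative rates only: $0.75 base, 400-token prompt, $0.01 per 1k tokens.
    one = render_with_token_prompt(0.75, 400, 0.01)
    print(f"One render: ${one:.4f} "
          f"({token_share(0.75, 400, 0.01):.2%} from tokens)")
    print(f"Token charges over 10,000 renders: ${(one - 0.75) * 10_000:.2f}")
    print(f"5s clip at 24 fps, $0.005 per frame: ${per_frame_video(5, 24, 0.005):.2f}")
```

The token charge looks negligible on a single render, which is why it is easy to miss until it is multiplied by monthly volume.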

Budgeting for Per-Render Workloads

Three rules for predicting your monthly per-render bill:

1. Plan for a 30-50% rejection rate. Most workflows generate, review, reject, and re-generate. If you reject a fraction r of your renders, you pay for 1/(1 − r) renders per render you keep, so budget for roughly 1.4-2× the renders you actually keep. Skipping this leads to surprise bills.

2. Pick the quality tier ahead of time. Decide whether each artifact needs to be 4K or 1080p, photo-real or stylized, before generating. Mid-task quality changes are how budgets blow out.

3. Use the cheapest tier as the draft. Generate 5-10 cheap renders to find your composition, then promote one to an expensive-tier final render. This pattern ("draft cheap, finalize expensive") can cut media bills 40-70% in real workflows.
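The first and third rules reduce to short formulas, sketched below. The function names and example prices are hypothetical. The savings calculation assumes that without drafts you would have run every attempt at the premium tier, which may overstate the saving if you would have needed fewer premium attempts.

```python
import math


def renders_to_pay_for(kept: int, rejection_rate: float) -> int:
    """Expected renders to buy in order to keep `kept` of them.

    Assumes every render is rejected with the same probability, so this is
    an expected value rounded up, not a guarantee.
    """
    if not 0 <= rejection_rate < 1:
        raise ValueError("rejection_rate must be in [0, 1)")
    # round() guards against float noise such as 7 / 0.7 landing just above 10.
    return math.ceil(round(kept / (1 - rejection_rate), 9))


def draft_then_finalize_savings(attempts: int, draft_usd: float,
                                premium_usd: float) -> float:
    """Fraction saved by drafting cheaply instead of running all attempts at premium.

    Baseline: `attempts` premium renders.
    Alternative: `attempts` draft renders plus one premium final.
    """
    baseline = attempts * premium_usd
    drafted = attempts * draft_usd + premium_usd
    return 1 - drafted / baseline


if __name__ == "__main__":
    for rate in (0.30, 0.50):
        print(f"Keep 100 at {rate:.0%} rejection: "
              f"buy {renders_to_pay_for(100, rate)} renders")
    # Illustrative image prices: $0.02 draft tier, $0.10 premium tier.
    saved = draft_then_finalize_savings(attempts=5, draft_usd=0.02, premium_usd=0.10)
    print(f"Draft-then-finalize saving: {saved:.0%}")
```

With these example prices, five drafts plus one premium final cost $0.20 against $0.50 for five premium attempts, a 60% saving. The result is sensitive to the price gap between tiers and to how many attempts a composition takes, so plug in your own numbers.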

Bottom Line

Per-render pricing is the right model for generative media — predictable, aligned with user mental models, and clean for budgeting. The pitfalls are quality-tier confusion, rejection-rate underestimation, and hybrid pricing that hides token costs in your render bill. Build budgets in artifacts-per-month, multiply by your real-world rejection rate, and check whether the provider charges per-token on the prompt side. Doing all three keeps per-render workloads predictable.

Frequently Asked Questions

What is per-render pricing?

A billing model where generative-media APIs charge per output artifact (video clip, image, voice clip) rather than per token. Cost scales with duration, resolution, or quality tier — not with how many tokens were consumed during generation.

Why is per-render pricing better than per-token for AI media?

Compute for diffusion-based generation is dominated by inference steps, not output length. Users plan budgets in artifacts ("100 product videos") not tokens. Per-render aligns billing with both the customer's mental model and the provider's actual cost driver.

How much do typical AI media renders cost in 2026?

AI video (5 sec, 1080p): $0.30-$1.50; 4K: $1.50-$5.00. Images (1024x1024): $0.005-$0.05; 4K photo-quality: $0.05-$0.30. Voice clone (10 sec): $0.04-$0.20. Voice TTS (1 min): $0.10-$0.40. Subscription discounts can drop these 30-60%.

How do I budget accurately for per-render API spend?

Plan for a 30-50% rejection rate (most workflows re-generate before keeping a render). Pick the quality tier ahead of time, not mid-task. Draft cheap, finalize expensive: generate 5-10 budget renders to find the composition, then one premium-tier final. This can cut bills 40-70%.
