Voice APIs Converge on Per-Minute Billing
Fourteen of fifteen voice and audio AI companies in the corpus bill on media-minutes as a primary unit. The shift from per-character (batch TTS era) to per-minute (agent call era) reflects the dominant use case moving from content generation to real-time conversational AI.
What's happening — and why
What's happening: the standard billing unit for voice AI has converged on media-minutes (billed per minute or per hour) rather than per-character or per-request units. Fourteen of the fifteen voice companies in the corpus use it as a primary unit.
Why: the dominant voice AI use case shifted. In the batch TTS era, producers turned text into audio files; the natural unit was the characters being spoken. In the agent call era, voice is deployed in real-time phone calls, voicebots, and conversational interfaces — where wall-clock time, not text volume, is the cost driver. Telephony and call-center buyers already think in minutes; per-minute aligns vendor pricing with buyer mental models.
ElevenLabs is the clearest example: it charges per-minute for Conversational AI (agent calls) and per-character for Studio TTS (batch). The two units coexist for the two use cases.
How it works
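As a rough sketch of how the two units can coexist in one billing system, the Python below routes a usage event to a per-minute or a per-character rate depending on its kind. The rates, the `UsageEvent` type, and the `charge` function are hypothetical names and numbers chosen for illustration, not any vendor's published prices or API.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical rates for illustration only (not any vendor's price list).
RATE_PER_MINUTE = Decimal("0.10")    # real-time agent calls
RATE_PER_1K_CHARS = Decimal("0.20")  # batch text-to-speech


@dataclass
class UsageEvent:
    kind: str            # "agent_call" or "batch_tts"
    seconds: int = 0     # wall-clock duration, used for agent calls
    characters: int = 0  # text length, used for batch TTS


def charge(event: UsageEvent) -> Decimal:
    """Price an event in the unit that matches its use case."""
    if event.kind == "agent_call":
        # Real-time lane: cost follows wall-clock time.
        return Decimal(event.seconds) / 60 * RATE_PER_MINUTE
    if event.kind == "batch_tts":
        # Batch lane: cost follows the amount of text spoken.
        return Decimal(event.characters) / 1000 * RATE_PER_1K_CHARS
    raise ValueError(f"unknown usage kind: {event.kind}")


call_cost = charge(UsageEvent(kind="agent_call", seconds=90))
batch_cost = charge(UsageEvent(kind="batch_tts", characters=2500))
```

Under these assumed rates, a 90-second agent call and a 2,500-character batch job are priced independently, each in its own unit, which is the coexistence pattern described for ElevenLabs.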
Evidence over time
14 supporting · 1 counter.
Evidence
| Company | Date | What happened |
|---|---|---|
| bland-ai | Jun 2024 | Billed entirely on media-minutes; phone-agent model fits per-minute naturally |
| elevenlabs | May 2026 | Cut the per-minute rate for Conversational AI (agent calls); retains per-character billing for Studio TTS, so both units coexist. |
| cartesia | Feb 2026 | Voice Agents GA at flat per-minute rate; prior API used credits/requests |
| deepgram | Jan 2025 | Transcription and TTS both per-minute; Nova-2 ASR $0.0043/min, Aura TTS $0.0150/min |
| tavus | Jan 2025 | Entire model is hybrid access fee + pay-as-you-go video minutes; per-minute is the only consumption unit |
| speechmatics | Jun 2025 | Per-hour STT, per-character TTS — both units present; moving toward per-minute for real-time |
| murf-ai | Jun 2026 | Murf API launched with per-character and per-minute lanes; Studio plans cap on minutes |
| rev-ai | Jan 2025 | Pure usage per-minute; transcription billed in 15-second increments |
| krisp | Jun 2025 | Call Center product bills on accent-minutes; per-agent seats plus minute consumption |
| synthesia | May 2026 | Video-minute credits drive all plan tiers; minutes are the primary consumption signal |
| twelve-labs | Jun 2025 | Video understanding billed per video-minute indexed; minutes are the primary query unit |
| wellsaid | Jun 2026 | Annual download quotas expressed as minutes per plan tier; per-seat+minutes model |
| hedra | Dec 2025 | Credits map to video/audio seconds; effectively per-minute billing abstracted through credits |
| fal-ai | Jun 2025 | Audio/video models billed per second of output; effectively per-minute at scale |
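
Because the table mixes per-second, per-minute, and per-hour quotes, it can help to normalize everything to a per-minute equivalent before comparing. The sketch below does that and works through the Deepgram rates quoted in the table. The per-second and per-hour example prices, the 10,000-minute volume, and the function name are assumptions for illustration.

```python
from decimal import Decimal

SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def per_minute_equivalent(price: Decimal, unit: str) -> Decimal:
    """Convert a price quoted per second, minute, or hour to a per-minute price."""
    return price * 60 / SECONDS_PER_UNIT[unit]


# Rates as quoted in the evidence table (USD per minute).
NOVA2_ASR = Decimal("0.0043")
AURA_TTS = Decimal("0.0150")

minutes = 10_000  # assumed monthly volume
asr_cost = NOVA2_ASR * minutes
tts_cost = AURA_TTS * minutes

# Hypothetical quotes in other units, normalized for comparison.
per_second_quote = per_minute_equivalent(Decimal("0.001"), "second")
per_hour_quote = per_minute_equivalent(Decimal("0.30"), "hour")
```

At the quoted rates, 10,000 minutes comes to $43 of transcription and $150 of TTS. The hypothetical $0.001 per second and $0.30 per hour quotes normalize to $0.06 and $0.005 per minute respectively.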
Counterexamples
- lmnt · — — Charges per character for TTS only — no per-minute lane. Serves batch text-to-speech, not agent calls.
- wellsaid · — — Per-seat + annual quota model dilutes the pure per-minute signal; enterprise customers are quota-capped, not metered.
- descript · Jun 2025 — Media hours billed at tier level, not granularly per-minute; subscription model with hour pools
For buyers
Budget voice workloads in minutes, not characters. For batch content generation, characters may still be the efficient unit (WellSaid, LMNT). For agent calls and real-time voice, per-minute is the standard — model your cost on expected call durations and call volumes, not script length.
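A minimal budgeting sketch along these lines multiplies call volume by average call duration by the per-minute rate. The function name, the $0.08 rate, and the call volumes and durations are all assumed numbers, not figures from the corpus.

```python
from decimal import Decimal


def monthly_voice_cost(
    calls_per_month: int,
    avg_call_minutes: Decimal,
    rate_per_minute: Decimal,
) -> Decimal:
    """Estimate monthly spend for per-minute billed agent calls."""
    return calls_per_month * avg_call_minutes * rate_per_minute


RATE = Decimal("0.08")  # hypothetical per-minute rate in USD

# Same call volume, two assumed average durations.
short_calls = monthly_voice_cost(20_000, Decimal("4"), RATE)
long_calls = monthly_voice_cost(20_000, Decimal("6"), RATE)
```

With these assumptions the estimate moves from $6,400 to $9,600 a month when average call length rises from four to six minutes, even though the number of calls and the agent's script are unchanged.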
For vendors
If you are building a voice AI product, per-minute pricing aligns with the call-center and telephony mental model your buyers already use. If you serve both batch TTS and agent use cases, maintain both units (ElevenLabs' model): per-character for Studio, per-minute for Conversational.
Outlook — what to watch
As agent voice becomes the dominant voice AI use case, per-minute will further displace per-character. The holdout (LMNT, characters-only) is a batch-focused product. Watch for per-second granularity appearing in cost-sensitive high-volume deployments.
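To see why billing granularity matters at volume, the sketch below compares the billable duration of one call under per-second, 15-second, and full-minute increments. The 15-second increment mirrors the rev-ai row in the evidence table. Rounding up to the next increment, the $0.06 rate, and the function names are assumptions for illustration.

```python
from decimal import Decimal


def billable_seconds(duration_seconds: int, increment_seconds: int) -> int:
    """Round a duration up to the next billing increment (assumed policy)."""
    increments = -(-duration_seconds // increment_seconds)  # ceiling division
    return increments * increment_seconds


def call_cost(duration_seconds: int, increment_seconds: int,
              rate_per_minute: Decimal) -> Decimal:
    """Cost of one call at a per-minute rate, given the billing increment."""
    billed = billable_seconds(duration_seconds, increment_seconds)
    return Decimal(billed) / 60 * rate_per_minute


RATE = Decimal("0.06")  # hypothetical per-minute rate in USD
DURATION = 61           # a call that runs one second past a minute

billed_by_increment = {
    inc: billable_seconds(DURATION, inc) for inc in (1, 15, 60)
}
cost_15s = call_cost(DURATION, 15, RATE)
cost_60s = call_cost(DURATION, 60, RATE)
```

Under these assumptions a 61-second call is billed as 61, 75, or 120 seconds depending on the increment, so full-minute rounding nearly doubles the charge for that call relative to per-second metering.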
Bottom line
Voice AI billing has converged on media-minutes. Fourteen of the fifteen voice companies in the corpus use it as a primary unit, driven by the shift from batch TTS to real-time agent calls.
FAQ
How do voice AI APIs charge for usage?
Almost universally per-minute or per-hour of audio. Fourteen of fifteen voice companies in the corpus use media-minutes. LMNT is the exception, charging per character for batch TTS.
Why per-minute instead of per-character?
Real-time agent calls — the dominant voice use case — are bounded by wall-clock time, not text volume. Per-minute aligns with telephony buyer mental models and the actual cost driver.
Does ElevenLabs charge per minute or per character?
Both. Conversational AI (agent calls) is priced per minute; Studio TTS (batch text-to-speech) is priced per character. The two billing units coexist for the two use cases.