What is model inference pricing?
Model inference pricing is the billing structure a vendor applies when a customer calls an AI model to generate output — whether that is text tokens from an LLM, an image from a diffusion model, a speech clip from a TTS model, or an embedding vector from an embedding model. “Inference” refers to the forward pass through a trained model: you send an input, the model generates an output, you pay for the compute consumed.
Fifty-five companies in the UsagePricing corpus tag model inference as a primary use case, making it the largest single use-case category. They include frontier model labs (Anthropic, OpenAI, Google), open-model inference platforms (Groq, Fireworks, Together, DeepInfra, Novita), managed serving platforms (Baseten, Anyscale, Modal, Replicate), and API-first specialists (Cohere, Mistral, DeepSeek).
The dominant billing units
Inference pricing uses three primary units, each reflecting a different cost driver:
Tokens (input + output, separately priced): the standard for LLM text APIs. Input tokens are typically cheaper than output tokens because generation is more compute-intensive than encoding. Frontier examples, in USD per 1M input/output tokens: OpenAI GPT-5.5 at $5/$30, Anthropic Claude 3.5 Sonnet at $3/$15, DeepSeek V3 at $0.27/$1.10.
GPU-hours / compute seconds: the standard for dedicated deployments, fine-tuned models, and GPU-cloud platforms. Baseten, Anyscale, Modal, RunPod, and Replicate all meter GPU time directly. Reserved rates typically run 20-40% below on-demand rates.
API calls / requests: the standard for image generation (per image), embedding (per query), and some search APIs. Fal.ai, Clipdrop, Replicate (some models), and Browserbase use per-request metering.
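As a rough illustration of token-based billing, the sketch below prices a single request from its input and output token counts. The rates are the per-1M figures quoted above; the `TokenRate` and `request_cost` names and the 2,000/500 token request are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRate:
    """USD per 1M tokens, with input and output priced separately."""
    input_per_m: float
    output_per_m: float


def request_cost(rate: TokenRate, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the given per-1M-token rates."""
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


# Rates as quoted in this article; check the vendor rate card before relying on them.
RATES = {
    "claude-3.5-sonnet": TokenRate(input_per_m=3.00, output_per_m=15.00),
    "deepseek-v3": TokenRate(input_per_m=0.27, output_per_m=1.10),
}

if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens, 500 output tokens.
    for name, rate in RATES.items():
        cost = request_cost(rate, input_tokens=2_000, output_tokens=500)
        print(f"{name}: ${cost:.6f} per request")
```

Under these assumptions the request costs about $0.0135 on Claude 3.5 Sonnet and about $0.00109 on DeepSeek V3. Because output is priced several times higher than input, output-heavy workloads move the total much more than input-heavy ones.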
Key structural discounts
Batch processing (~50% off): Asynchronous, latency-tolerant workloads earn roughly half off the synchronous rate across Anthropic, OpenAI, Google, Fireworks, Groq, Mistral, and Together. If any meaningful portion of your workload is async, moving it to a batch API roughly halves the token cost of that portion with no model change.
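To see how the batch discount affects a whole bill, the sketch below blends synchronous and batch spend. It assumes a flat 50% batch discount, as described above, and that batch and synchronous traffic have the same token mix; `blended_cost` and the $10,000 example bill are hypothetical.

```python
def blended_cost(sync_cost: float, batch_fraction: float,
                 batch_discount: float = 0.5) -> float:
    """Estimate the bill after moving a fraction of spend to a batch API.

    sync_cost: what the whole workload would cost at synchronous rates (USD).
    batch_fraction: share of that spend (0..1) that can run asynchronously.
    batch_discount: discount on the batch share (0.5 means 50% off).
    """
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    if not 0.0 <= batch_discount <= 1.0:
        raise ValueError("batch_discount must be between 0 and 1")
    return sync_cost * (1.0 - batch_fraction * batch_discount)


if __name__ == "__main__":
    # Hypothetical: a $10,000/month bill with 40% of traffic able to run async.
    print(f"${blended_cost(10_000.0, 0.4):,.2f}")
```

With 40% of spend moved to batch, the overall saving is 20%, not 50%: the discount only applies to the share of the workload that can tolerate the latency.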
Cached-input discounts (50-90% off): Nine corpus vendors publish a discounted rate for repeated input context. Anthropic bills Claude 3.5 Sonnet cache reads at $0.30/1M versus $3/1M for standard input (90% off), although cache writes cost $3.75/1M, a 25% premium over the standard rate; OpenAI offers 50% off cached input; Google 75% off Gemini cached input. Workloads with stable system prompts or large RAG context benefit most.
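The saving from cached input depends on how much of the prompt actually hits the cache. The sketch below computes an effective input rate from a base rate, a cached-input discount, and a cache-hit fraction. The $2.00 base rate and 80% hit fraction are hypothetical, and the model deliberately ignores cache-write surcharges, storage fees, and minimum prefix lengths that some vendors apply.

```python
def effective_input_rate(base_per_m: float, cached_discount: float,
                         cache_hit_fraction: float) -> float:
    """Blend cached and uncached input rates (USD per 1M input tokens).

    base_per_m: standard (uncached) input rate.
    cached_discount: discount on cached input tokens (0.75 means 75% off).
    cache_hit_fraction: share of input tokens served from cache (0..1).
    """
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cached_rate = base_per_m * (1.0 - cached_discount)
    return (cache_hit_fraction * cached_rate
            + (1.0 - cache_hit_fraction) * base_per_m)


if __name__ == "__main__":
    base = 2.00  # hypothetical input rate, USD per 1M tokens
    for discount in (0.50, 0.75):
        rate = effective_input_rate(base, discount, cache_hit_fraction=0.8)
        print(f"{discount:.0%} discount, 80% cache hits: ${rate:.2f} per 1M input")
```

At an 80% hit fraction, a 50% discount brings the hypothetical $2.00 rate to about $1.20 and a 75% discount to about $0.80. The headline discount is only reached when nearly all input tokens are cache hits.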
Volume tiers: Most inference vendors offer volume discounts at a monthly spend or token threshold, usually unlocking at $1k-$10k/month of usage.
Free tiers and onboarding
88% of pure-usage inference vendors offer a free tier — the highest free-tier rate in the corpus. Free tiers typically provide 200K-1M tokens or $5-$10 in credits, sufficient to test models before committing. Exceptions include DeepInfra (no free tier) and some dedicated-GPU platforms.
Pricing transparency
Inference providers are the most transparent segment in the corpus: nearly all publish per-token or per-call rate cards publicly. The exceptions are vendors adding dedicated deployment or enterprise commitment tiers, where rates are sales-quoted. Gated pricing in inference is a signal of either (a) a new product not yet publicly priced or (b) dedicated capacity that requires utilization negotiation.
What to watch
Token prices continue to fall with each model generation: GPT-4o launched 50% cheaper than GPT-4 Turbo; Anthropic cut Opus pricing roughly 3x within the 4.x line; DeepSeek V3 launched at frontier-class performance for $0.27/1M input. Re-baseline your cost model at least twice a year, and check whether a cheaper new model meets your quality bar rather than assuming current rates are fixed.
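A re-baselining pass can be as simple as re-pricing a fixed monthly workload against each candidate model. The sketch below does that with the per-1M rates quoted earlier in this article; the `rebaseline` helper and the 500M input / 100M output workload are hypothetical, and the output says nothing about whether a cheaper model meets your quality bar.

```python
# (input, output) USD per 1M tokens, as quoted in this article.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "DeepSeek V3": (0.27, 1.10),
}


def monthly_cost(price: tuple[float, float], input_m: float, output_m: float) -> float:
    """USD cost of a month's traffic, with volumes given in millions of tokens."""
    input_rate, output_rate = price
    return input_m * input_rate + output_m * output_rate


def rebaseline(current: str, input_m: float, output_m: float):
    """Return (model, monthly cost, saving vs. current) rows, cheapest first."""
    baseline = monthly_cost(PRICES[current], input_m, output_m)
    rows = []
    for model, price in PRICES.items():
        cost = monthly_cost(price, input_m, output_m)
        rows.append((model, cost, 1.0 - cost / baseline))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Hypothetical workload: 500M input and 100M output tokens per month.
    for model, cost, saving in rebaseline("GPT-5.5", input_m=500, output_m=100):
        print(f"{model:<20} ${cost:>9,.2f}  {saving:>6.1%} vs current")
```

Keeping the workload fixed and swapping only the rate table makes it easy to rerun the comparison whenever a vendor publishes new prices.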
| Company | Product | Pricing model | Billing units | Free tier | Verified |
|---|---|---|---|---|---|
| Abacus.AI | AI super-assistant (ChatLLM) plus an enterprise agentic AI platform | seat-based, subscription | seats, credits | No | 2026-06-02 |
| Anthropic | Claude API (token-based) + Claude.ai consumer subscriptions (Free/Pro/Team/Enterprise) | freemium, subscription, seat-based, +1 more | tokens, seats, api-calls | Yes | 2026-05-29 |
| Anyscale | Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units) | pure-usage, commitment, hybrid | gpu-hours, cpu-hours, credits | Yes | 2026-05-29 |
| AssemblyAI | Speech-to-Text & Audio AI APIs | pure-usage | api-calls, tokens | Yes | 2026-05-29 |
| Baseten | ML inference infrastructure: dedicated GPU deployments, Model APIs, and Truss framework | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-29 |
| Bland AI | AI phone call automation platform: inbound and outbound voice agents at scale | hybrid, pure-usage, subscription | api-calls, credits, media-minutes | Yes | 2026-05-29 |
| Cartesia | Real-time voice AI platform (Sonic TTS, voice cloning, voice agents) | freemium, subscription, hybrid, +1 more | credits, requests, api-calls, +1 more | Yes | 2026-05-29 |
| Cerebras | Wafer-scale AI inference cloud and WSE hardware systems | pure-usage, subscription, commitment | tokens, api-calls, gpu-hours | Yes | 2026-05-30 |
| Character.ai | Consumer AI companion and roleplay chat platform | subscription, freemium | active-users | Yes | 2026-05-29 |
| Clipdrop | AI image-editing and generation tools (background removal, upscaling, text-to-image), now part of Jasper | freemium, subscription | requests, credits, api-calls | Yes | 2026-06-05 |
| Cohere | Command, Embed, Rerank APIs | pure-usage | tokens, api-calls, requests | Yes | 2026-05-29 |
| Deepgram | Usage-based speech-to-text, text-to-speech, and voice agent APIs | pure-usage, freemium | media-minutes, tokens, credits, +1 more | Yes | 2026-05-31 |
| DeepInfra | Serverless inference cloud: per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters | pure-usage, commitment | tokens, gpu-hours, requests, +1 more | No | 2026-06-02 |
| DeepSeek | DeepSeek API (V4-Flash + V4-Pro models, 1M context) with token-based pricing and aggressive cache discounts | freemium, pure-usage | tokens, api-calls | Yes | 2026-06-05 |
| Descript | AI-powered audio and video editing | hybrid, freemium | seats, credits, media-minutes | Yes | 2026-05-31 |
| ElevenLabs | Voice AI platform across ElevenCreative, ElevenAgents, and ElevenAPI | subscription, pure-usage, hybrid | characters, credits, media-minutes, +1 more | Yes | 2026-05-28 |
| Fal | Generative-media inference platform: serverless per-output model APIs plus dedicated GPU compute | pure-usage | gpu-hours, requests, media-minutes | No | 2026-06-01 |
| Fireworks AI | Generative AI inference platform: serverless per-token, on-demand GPU, fine-tuning, batch API | pure-usage, hybrid, commitment | tokens, gpu-hours, requests | Yes | 2026-05-30 |
| Freepik | AI creative suite: image, video, audio generation plus a 200M+ stock library | subscription, hybrid, pure-usage, +1 more | seats, credits, api-calls | Yes | 2026-06-05 |
| Google | Gemini API & AI Studio | pure-usage, freemium | tokens, requests, api-calls | Yes | 2026-05-29 |
| Groq | GroqCloud: LPU-based ultra-low-latency inference API for Llama, GPT-OSS, Qwen, Whisper, and Mixtral | pure-usage, hybrid, commitment | tokens, requests, api-calls | Yes | 2026-05-29 |
| Hedra | AI video, avatar, image, and audio generation platform (Hedra Studio + API) | subscription, freemium | credits, media-minutes, characters, +1 more | Yes | 2026-06-04 |
| HeyGen | AI avatar and video generation platform | subscription, freemium | credits, seats, api-calls | Yes | 2026-05-30 |
| Higgsfield | AI video and image generation platform with a credit-metered subscription | subscription, freemium | credits, seats | Yes | 2026-06-06 |
| Ideogram | Text-aware AI image generation platform | freemium, subscription, hybrid | credits, api-calls | Yes | 2026-05-31 |
| Jina AI | Search Foundation API (Embeddings, Reranker, Reader, DeepSearch, Classifier) | pure-usage, freemium | tokens, requests, api-calls | Yes | 2026-06-03 |
| Lightning AI | Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool | hybrid, freemium, pure-usage | gpu-hours, cpu-hours, credits, +3 more | Yes | 2026-06-02 |
| LMNT | Low-latency AI text-to-speech (TTS) API with voice cloning | freemium, subscription, hybrid | characters, credits | Yes | 2026-06-04 |
| Midjourney | AI image and video generation via subscription with GPU-hour metering | subscription | gpu-hours, credits | No | 2026-05-29 |
| Mistral AI | Open and commercial LLM APIs | pure-usage, freemium | tokens, seats, api-calls, +2 more | Yes | 2026-05-31 |
| Modal | Serverless compute and GPU platform: per-second billing for Python functions, batch jobs, and model serving | pure-usage, freemium, subscription, +1 more | gpu-hours, cpu-hours, gb-hours, +2 more | Yes | 2026-05-29 |
| Murf AI | AI voice / text-to-speech platform (Murf Studio app + Murf API) | subscription, pure-usage, freemium | media-minutes, seats, credits | Yes | 2026-06-01 |
| Nomic | Nomic Platform (AEC agentic workflows) + Atlas data-exploration app + Nomic Embed embedding/Developer API | hybrid, seat-based, commitment, +1 more | seats, tokens, credits, +2 more | Yes | 2026-06-04 |
| Novita AI | Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API | pure-usage, freemium | tokens, gpu-hours, cpu-hours, +2 more | Yes | 2026-06-02 |
| OpenAI | ChatGPT consumer subscriptions + GPT-5.x API with token-based usage billing | freemium, subscription, seat-based, +1 more | tokens, seats, api-calls, +1 more | Yes | 2026-05-30 |
| OpenPipe | OpenPipe fine-tuning and hosted inference platform (small specialized models / RL for agents) | pure-usage | tokens, cpu-hours | Yes | 2026-06-04 |
| Perplexity AI | AI-native answer engine with citations and multi-model search | freemium, subscription, seat-based, +1 more | seats, tokens, requests, +1 more | Yes | 2026-05-29 |
| Playground | AI image generation and graphic-design studio with a monthly credit pool | freemium, subscription, hybrid | credits, api-calls | Yes | 2026-06-04 |
| Recraft | AI image and vector generation studio plus a per-image generation API | freemium, subscription, hybrid | credits, api-calls, seats | Yes | 2026-06-01 |
| Replicate | Cloud platform for running, fine-tuning, and deploying AI models via REST API | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-30 |
| Rev AI | Pay-as-you-go speech-to-text, transcription, and audio-intelligence APIs | pure-usage, freemium | media-minutes, credits, api-calls | Yes | 2026-06-04 |
| Roboflow | Computer-vision platform (dataset management, model training, deployment) | hybrid, freemium | credits, seats, gpu-hours | Yes | 2026-06-02 |
| RunPod | GPU cloud marketplace: Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage | pure-usage, hybrid, commitment | gpu-hours, storage-gb | No | 2026-05-30 |
| Runway | Video generation and AI editing | subscription, freemium | credits, seats | Yes | 2026-05-31 |
| Speechmatics | Speech-to-text and text-to-speech APIs with per-hour usage pricing | pure-usage, freemium | media-minutes, characters | Yes | 2026-06-04 |
| Suno | AI music generation | subscription, freemium | credits | Yes | 2026-05-31 |
| Synthesia | Enterprise AI video generation | subscription, freemium | credits, media-minutes, seats | Yes | 2026-05-31 |
| Together AI | AI Acceleration Cloud: serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning | pure-usage, hybrid, commitment | tokens, gpu-hours, cpu-hours, +1 more | Yes | 2026-05-29 |
| turbopuffer | Serverless vector and full-text search database on object storage | pure-usage, commitment | storage-gb, vectors-indexed, gb-hours, +1 more | No | 2026-06-04 |
| Twelve Labs | Video understanding foundation models (Marengo for search/embeddings, Pegasus for analysis) delivered as a usage-metered API | pure-usage, freemium, commitment | media-minutes, tokens, requests | Yes | 2026-06-02 |
| Upstash | Upstash (Redis, Vector, QStash, Search, Workflow) | pure-usage, freemium, hybrid | requests, api-calls, vectors-indexed, +3 more | Yes | 2026-06-03 |
| Vast.ai | GPU rental marketplace: on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference | pure-usage, commitment | gpu-hours, storage-gb, bandwidth-gb | No | 2026-06-02 |
| Vectara | Enterprise RAG-as-a-Service and agent platform for trusted, grounded, auditable AI | commitment, subscription | credits, requests, storage-gb | No | 2026-06-02 |
| Voyage AI | Embedding and reranker models (text, code, multimodal) for retrieval and RAG | pure-usage, freemium | tokens, storage-gb | Yes | 2026-06-04 |
| You.com | Web search, contents, research, and finance-research APIs for AI systems | pure-usage, freemium | api-calls, requests, pages-rendered | Yes | 2026-06-01 |
FAQ
What is model inference pricing?
Model inference pricing is the billing structure a vendor applies when a customer calls an AI model to generate output: text tokens, images, embeddings, or audio. You pay for the compute consumed during the forward pass through the model, typically quoted per million tokens, per image, or per GPU-hour.
What is cached-input pricing and how much does it save?
Cached-input pricing is a discounted rate that applies when your prompt reuses a previously processed input prefix (like a fixed system prompt or large RAG context). Discounts range from 50-90% off the standard input rate: Anthropic 90% off cache reads (cache writes carry a 25% premium), OpenAI 50% off, Google 75% off, DeepSeek 74% off. Workloads with stable system prompts benefit most.
When should I use batch pricing instead of real-time inference?
Batch processing (async, non-time-sensitive workloads) earns roughly 50% off the synchronous rate across Anthropic, OpenAI, Google, Fireworks, and Mistral. If any portion of your workload can tolerate a minutes-to-hours latency window (evaluation pipelines, document processing, data enrichment), moving it to a batch API roughly halves the cost of that portion with no model change.
Are inference token prices rising or falling?
Falling with each model generation. GPT-4o launched 50% cheaper than GPT-4 Turbo. Anthropic cut Opus pricing roughly 3x within the 4.x family. DeepSeek V3 landed at $0.27/1M input for frontier-class performance. Re-baseline your cost model at least twice a year rather than assuming current rates are fixed.
Related use cases
- AI Coding Tools Pricing: Pricing for AI-native developer tools, including code editors, completion engines, and agent platforms that write or modify code.
- Code Generation Pricing: Pricing for AI services whose primary output is generated source code, typically measured in tokens, requests, or completed tasks.
- AI Agents Pricing: Pricing for AI agent platforms, products that perform multi-step autonomous tasks on the user's behalf.
- Data Pipeline Pricing: Pricing for data collection, scraping, and pipeline services, platforms that extract, transform, and deliver web data, typically billed per request, per GB, or per record.
- Customer Support AI Pricing: Pricing for AI products that automate customer service, including chatbots, ticket triage, and autonomous resolution agents.
- Web Hosting Pricing: Pricing for platforms that host web applications, typically billed across multiple dimensions (bandwidth, requests, compute, and storage).
- Serverless Functions Pricing: Pricing for serverless function platforms, billed per invocation plus compute time consumed.
- AI UI Generation Pricing: Pricing for AI products that generate UI components or full pages from prompts, typically billed per credit or generation.