Model Inference Pricing: Examples & Companies

55 companies in the corpus
Definition

Model inference pricing is pricing for AI model inference services: APIs and platforms that run trained models on user inputs, typically billed per token, per request, or per GPU-hour.

Also known as: AI Inference Pricing, LLM API Pricing

What is model inference pricing?

Model inference pricing is the billing structure a vendor applies when a customer calls an AI model to generate output — whether that is text tokens from an LLM, an image from a diffusion model, a speech clip from a TTS model, or an embedding vector from an embedding model. “Inference” refers to the forward pass through a trained model: you send an input, the model generates an output, you pay for the compute consumed.

Fifty-five companies in the UsagePricing corpus tag model inference as a primary use case, making it the largest single use-case category. They include frontier model labs (Anthropic, OpenAI, Google), open-model inference platforms (Groq, Fireworks, Together, DeepInfra, Novita), managed serving platforms (Baseten, Anyscale, Modal, Replicate), and API-first specialists (Cohere, Mistral, DeepSeek).

The dominant billing units

Inference pricing uses three primary units, each reflecting a different cost driver:

Tokens (input + output, separately priced): the standard for LLM text APIs. Input tokens are typically cheaper than output tokens because the prompt is processed in a single parallel pass, while each output token needs its own sequential decoding step. Frontier examples: OpenAI GPT-5.5 at $5/$30 per 1M input/output, Anthropic Claude 3.5 Sonnet at $3/$15, DeepSeek V3 at $0.27/$1.10.

GPU-hours / compute seconds: the standard for dedicated deployments, fine-tuned models, and GPU-cloud platforms. Baseten, Anyscale, Modal, RunPod, and Replicate all meter GPU time directly. Reserved rates typically run 20-40% below on-demand rates.

API calls / requests: the standard for image generation (per image), embedding (per query), and some search APIs. Fal, Clipdrop, Replicate (some models), and Browserbase use per-request metering.
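
The three billing units reduce to simple arithmetic. The following is a minimal sketch, not any vendor's SDK: the function names, the `TokenRate` type, and the example workload are assumptions made for illustration, and the per-token rates are the ones quoted above, which may no longer match current rate cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRate:
    """USD per 1M tokens, with input and output priced separately."""
    input_per_m: float
    output_per_m: float


# Rates as quoted in this article; verify against current rate cards.
RATES = {
    "claude-3.5-sonnet": TokenRate(3.00, 15.00),
    "deepseek-v3": TokenRate(0.27, 1.10),
}


def token_cost(rate: TokenRate, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a token-metered workload."""
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


def gpu_cost(hourly_rate: float, seconds: float) -> float:
    """Cost in USD of GPU time metered per second at a quoted hourly rate."""
    return hourly_rate * seconds / 3600


def request_cost(price_per_request: float, requests: int) -> float:
    """Cost in USD of a per-request (per image, per query) workload."""
    return price_per_request * requests


if __name__ == "__main__":
    # Hypothetical month: 200M input tokens and 40M output tokens.
    for name, rate in RATES.items():
        print(f"{name}: ${token_cost(rate, 200_000_000, 40_000_000):,.2f}")
```

Because output tokens carry the higher rate, the output share of a workload often matters more to the bill than the headline input price.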

Key structural discounts

Batch processing (~50% off): Asynchronous, latency-tolerant workloads earn roughly half off the synchronous rate across Anthropic, OpenAI, Google, Fireworks, Groq, Mistral, and Together. If a meaningful portion of your workload is async, moving it to a batch API can roughly halve the token cost of that portion with no model change.
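
How much a batch discount saves depends on how much of the workload can move. A rough sketch follows, assuming a flat 50% batch discount and that spend splits cleanly into sync and async shares; `blended_bill` and its parameters are hypothetical names, not part of any vendor API.

```python
def blended_bill(sync_bill: float, async_share: float, batch_discount: float = 0.50) -> float:
    """Monthly bill in USD after moving the async share of spend to a batch API.

    sync_bill: bill if every request ran at the synchronous rate.
    async_share: fraction of that spend (0 to 1) that tolerates batch latency.
    batch_discount: fraction taken off the synchronous rate (0.50 means half off).
    """
    if not 0.0 <= async_share <= 1.0:
        raise ValueError("async_share must be between 0 and 1")
    if not 0.0 <= batch_discount <= 1.0:
        raise ValueError("batch_discount must be between 0 and 1")
    return sync_bill * (1.0 - async_share * batch_discount)


if __name__ == "__main__":
    # Hypothetical: $10,000/month at sync rates, 40% of spend is batchable.
    print(f"${blended_bill(10_000, 0.40):,.2f}")
```

Under these assumptions, moving 40% of spend to batch trims the total by 20%; the whole bill halves only when the entire workload is async.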

Cached-input discounts (50-90% off): Nine corpus vendors publish a discounted rate for repeated input context. Anthropic bills cache reads on Claude 3.5 Sonnet at $0.30/1M vs $3/1M for uncached input (90% off), with cache writes at $3.75/1M; OpenAI offers 50% off cached input; Google 75% off Gemini cached input. Workloads with stable system prompts or large RAG context benefit most.
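
The benefit of a cached-input discount depends on the share of input tokens that actually hit the cache. The sketch below blends the two rates; it is a simplification that assumes a single flat discount and ignores cache-write surcharges and cache expiry, and `effective_input_rate` is a hypothetical helper name.

```python
def effective_input_rate(base_rate: float, cache_hit_fraction: float, cache_discount: float) -> float:
    """Blended USD per 1M input tokens when part of the input is served from cache.

    base_rate: uncached input rate in USD per 1M tokens.
    cache_hit_fraction: share of input tokens (0 to 1) billed at the cached rate.
    cache_discount: fraction taken off the base rate for cached tokens.
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_rate = base_rate * (1.0 - cache_discount)
    return cache_hit_fraction * cached_rate + (1.0 - cache_hit_fraction) * base_rate


if __name__ == "__main__":
    # Hypothetical: $2.00/1M base rate, 80% of input tokens cached.
    for discount in (0.50, 0.75):
        rate = effective_input_rate(2.00, 0.80, discount)
        print(f"{discount:.0%} discount -> ${rate:.2f} per 1M input tokens")
```

A long, stable system prompt pushes the cache-hit fraction up, which is why those workloads see the largest drop in effective input rate.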

Volume tiers: Most inference vendors offer volume discounts at a monthly spend or token threshold, usually unlocking at $1k-$10k/month of usage.

Free tiers and onboarding

88% of pure-usage inference vendors offer a free tier — the highest free-tier rate in the corpus. Free tiers typically provide 200K-1M tokens or $5-$10 in credits, sufficient to test models before committing. Exceptions include DeepInfra (no free tier) and some dedicated-GPU platforms.

Pricing transparency

Inference providers are the most transparent segment in the corpus: nearly all publish per-token or per-call rate cards publicly. The exceptions are vendors adding dedicated deployment or enterprise commitment tiers, where rates are sales-quoted. Gated pricing in inference is a signal of either (a) a new product not yet publicly priced or (b) dedicated capacity that requires utilization negotiation.

What to watch

Token prices continue to fall with each model generation: GPT-4o launched 50% cheaper than GPT-4 Turbo; Anthropic cut Opus pricing roughly 3x within the 4.x line; DeepSeek V3 launched at frontier-class performance for $0.27/1M input. Re-baseline your cost model at least twice a year, and check whether a cheaper new model meets your quality bar rather than assuming current rates are fixed.
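
A re-baselining pass can be as simple as re-costing the same workload against each candidate and keeping the cheapest one that clears your quality bar. The sketch below assumes you already have a quality score per model from your own evaluation suite; the model names, rates, and scores are placeholders, not real vendor data.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens
    quality: float       # score from your own eval suite (hypothetical 0-1 scale)


def monthly_cost(c: Candidate, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of the workload on candidate c."""
    return (input_tokens * c.input_per_m + output_tokens * c.output_per_m) / 1_000_000


def cheapest_meeting_bar(
    candidates: Sequence[Candidate],
    input_tokens: int,
    output_tokens: int,
    quality_bar: float,
) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose quality meets the bar, or None."""
    eligible = [c for c in candidates if c.quality >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=lambda c: monthly_cost(c, input_tokens, output_tokens))


if __name__ == "__main__":
    # Placeholder candidates; substitute current rate cards and your own scores.
    candidates = [
        Candidate("model-a", 3.00, 15.00, 0.90),
        Candidate("model-b", 0.27, 1.10, 0.82),
        Candidate("model-c", 0.10, 0.40, 0.60),
    ]
    pick = cheapest_meeting_bar(candidates, 200_000_000, 40_000_000, quality_bar=0.80)
    print(pick.name if pick else "no candidate meets the bar")
```

Re-running this with refreshed rates and scores each half-year keeps the cost model tied to what vendors currently charge.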

| Company | Product | Pricing model | Billing units | Free tier | Verified |
| --- | --- | --- | --- | --- | --- |
| Abacus.AI | AI super-assistant (ChatLLM) plus an enterprise agentic AI platform | seat-based, subscription | seats, credits | No | 2026-06-02 |
| Anthropic | Claude API (token-based) + Claude.ai consumer subscriptions (Free/Pro/Team/Enterprise) | freemium, subscription, seat-based, +1 more | tokens, seats, api-calls | Yes | 2026-05-29 |
| Anyscale | Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units) | pure-usage, commitment, hybrid | gpu-hours, cpu-hours, credits | Yes | 2026-05-29 |
| AssemblyAI | Speech-to-Text & Audio AI APIs | pure-usage | api-calls, tokens | Yes | 2026-05-29 |
| Baseten | ML inference infrastructure: dedicated GPU deployments, Model APIs, and Truss framework | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-29 |
| Bland AI | AI phone call automation platform: inbound and outbound voice agents at scale | hybrid, pure-usage, subscription | api-calls, credits, media-minutes | Yes | 2026-05-29 |
| Cartesia | Real-time voice AI platform (Sonic TTS, voice cloning, voice agents) | freemium, subscription, hybrid, +1 more | credits, requests, api-calls, +1 more | Yes | 2026-05-29 |
| Cerebras | Wafer-scale AI inference cloud and WSE hardware systems | pure-usage, subscription, commitment | tokens, api-calls, gpu-hours | Yes | 2026-05-30 |
| Character.ai | Consumer AI companion and roleplay chat platform | subscription, freemium | active-users | Yes | 2026-05-29 |
| Clipdrop | AI image-editing and generation tools (background removal, upscaling, text-to-image), now part of Jasper | freemium, subscription | requests, credits, api-calls | Yes | 2026-06-05 |
| Cohere | Command, Embed, Rerank APIs | pure-usage | tokens, api-calls, requests | Yes | 2026-05-29 |
| Deepgram | Usage-based speech-to-text, text-to-speech, and voice agent APIs | pure-usage, freemium | media-minutes, tokens, credits, +1 more | Yes | 2026-05-31 |
| DeepInfra | Serverless inference cloud: per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters | pure-usage, commitment | tokens, gpu-hours, requests, +1 more | No | 2026-06-02 |
| DeepSeek | DeepSeek API (V4-Flash + V4-Pro models, 1M context) with token-based pricing and aggressive cache discounts | freemium, pure-usage | tokens, api-calls | Yes | 2026-06-05 |
| Descript | AI-powered audio and video editing | hybrid, freemium | seats, credits, media-minutes | Yes | 2026-05-31 |
| ElevenLabs | Voice AI platform across ElevenCreative, ElevenAgents, and ElevenAPI | subscription, pure-usage, hybrid | characters, credits, media-minutes, +1 more | Yes | 2026-05-28 |
| Fal | Generative-media inference platform: serverless per-output model APIs plus dedicated GPU compute | pure-usage | gpu-hours, requests, media-minutes | No | 2026-06-01 |
| Fireworks AI | Generative AI inference platform: serverless per-token, on-demand GPU, fine-tuning, batch API | pure-usage, hybrid, commitment | tokens, gpu-hours, requests | Yes | 2026-05-30 |
| Freepik | AI creative suite: image, video, audio generation plus a 200M+ stock library | subscription, hybrid, pure-usage, +1 more | seats, credits, api-calls | Yes | 2026-06-05 |
| Google | Gemini API & AI Studio | pure-usage, freemium | tokens, requests, api-calls | Yes | 2026-05-29 |
| Groq | GroqCloud: LPU-based ultra-low-latency inference API for Llama, GPT-OSS, Qwen, Whisper, and Mixtral | pure-usage, hybrid, commitment | tokens, requests, api-calls | Yes | 2026-05-29 |
| Hedra | AI video, avatar, image, and audio generation platform (Hedra Studio + API) | subscription, freemium | credits, media-minutes, characters, +1 more | Yes | 2026-06-04 |
| HeyGen | AI avatar and video generation platform | subscription, freemium | credits, seats, api-calls | Yes | 2026-05-30 |
| Higgsfield | AI video and image generation platform with a credit-metered subscription | subscription, freemium | credits, seats | Yes | 2026-06-06 |
| Ideogram | Text-aware AI image generation platform | freemium, subscription, hybrid | credits, api-calls | Yes | 2026-05-31 |
| Jina AI | Search Foundation API (Embeddings, Reranker, Reader, DeepSearch, Classifier) | pure-usage, freemium | tokens, requests, api-calls | Yes | 2026-06-03 |
| Lightning AI | Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool | hybrid, freemium, pure-usage | gpu-hours, cpu-hours, credits, +3 more | Yes | 2026-06-02 |
| LMNT | Low-latency AI text-to-speech (TTS) API with voice cloning | freemium, subscription, hybrid | characters, credits | Yes | 2026-06-04 |
| Midjourney | AI image and video generation via subscription with GPU-hour metering | subscription | gpu-hours, credits | No | 2026-05-29 |
| Mistral AI | Open and commercial LLM APIs | pure-usage, freemium | tokens, seats, api-calls, +2 more | Yes | 2026-05-31 |
| Modal | Serverless compute and GPU platform: per-second billing for Python functions, batch jobs, and model serving | pure-usage, freemium, subscription, +1 more | gpu-hours, cpu-hours, gb-hours, +2 more | Yes | 2026-05-29 |
| Murf AI | AI voice / text-to-speech platform (Murf Studio app + Murf API) | subscription, pure-usage, freemium | media-minutes, seats, credits | Yes | 2026-06-01 |
| Nomic | Nomic Platform (AEC agentic workflows) + Atlas data-exploration app + Nomic Embed embedding/Developer API | hybrid, seat-based, commitment, +1 more | seats, tokens, credits, +2 more | Yes | 2026-06-04 |
| Novita AI | Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API | pure-usage, freemium | tokens, gpu-hours, cpu-hours, +2 more | Yes | 2026-06-02 |
| OpenAI | ChatGPT consumer subscriptions + GPT-5.x API with token-based usage billing | freemium, subscription, seat-based, +1 more | tokens, seats, api-calls, +1 more | Yes | 2026-05-30 |
| OpenPipe | OpenPipe fine-tuning and hosted inference platform (small specialized models / RL for agents) | pure-usage | tokens, cpu-hours | Yes | 2026-06-04 |
| Perplexity AI | AI-native answer engine with citations and multi-model search | freemium, subscription, seat-based, +1 more | seats, tokens, requests, +1 more | Yes | 2026-05-29 |
| Playground | AI image generation and graphic-design studio with a monthly credit pool | freemium, subscription, hybrid | credits, api-calls | Yes | 2026-06-04 |
| Recraft | AI image and vector generation studio plus a per-image generation API | freemium, subscription, hybrid | credits, api-calls, seats | Yes | 2026-06-01 |
| Replicate | Cloud platform for running, fine-tuning, and deploying AI models via REST API | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-30 |
| Rev AI | Pay-as-you-go speech-to-text, transcription, and audio-intelligence APIs | pure-usage, freemium | media-minutes, credits, api-calls | Yes | 2026-06-04 |
| Roboflow | Computer-vision platform (dataset management, model training, deployment) | hybrid, freemium | credits, seats, gpu-hours | Yes | 2026-06-02 |
| RunPod | GPU cloud marketplace: Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage | pure-usage, hybrid, commitment | gpu-hours, storage-gb | No | 2026-05-30 |
| Runway | Video generation and AI editing | subscription, freemium | credits, seats | Yes | 2026-05-31 |
| Speechmatics | Speech-to-text and text-to-speech APIs with per-hour usage pricing | pure-usage, freemium | media-minutes, characters | Yes | 2026-06-04 |
| Suno | AI music generation | subscription, freemium | credits | Yes | 2026-05-31 |
| Synthesia | Enterprise AI video generation | subscription, freemium | credits, media-minutes, seats | Yes | 2026-05-31 |
| Together AI | AI Acceleration Cloud: serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning | pure-usage, hybrid, commitment | tokens, gpu-hours, cpu-hours, +1 more | Yes | 2026-05-29 |
| turbopuffer | Serverless vector and full-text search database on object storage | pure-usage, commitment | storage-gb, vectors-indexed, gb-hours, +1 more | No | 2026-06-04 |
| Twelve Labs | Video understanding foundation models (Marengo for search/embeddings, Pegasus for analysis) delivered as a usage-metered API | pure-usage, freemium, commitment | media-minutes, tokens, requests | Yes | 2026-06-02 |
| Upstash | Upstash (Redis, Vector, QStash, Search, Workflow) | pure-usage, freemium, hybrid | requests, api-calls, vectors-indexed, +3 more | Yes | 2026-06-03 |
| Vast.ai | GPU rental marketplace: on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference | pure-usage, commitment | gpu-hours, storage-gb, bandwidth-gb | No | 2026-06-02 |
| Vectara | Enterprise RAG-as-a-Service and agent platform for trusted, grounded, auditable AI | commitment, subscription | credits, requests, storage-gb | No | 2026-06-02 |
| Voyage AI | Embedding and reranker models (text, code, multimodal) for retrieval and RAG | pure-usage, freemium | tokens, storage-gb | Yes | 2026-06-04 |
| You.com | Web search, contents, research, and finance-research APIs for AI systems | pure-usage, freemium | api-calls, requests, pages-rendered | Yes | 2026-06-01 |

FAQ

What is model inference pricing?

Model inference pricing is the billing structure a vendor applies when a customer calls an AI model to generate output: text tokens, images, embeddings, or audio. You pay for the compute consumed during the forward pass through the model, typically quoted per million tokens, per image, or per GPU-hour.

What is cached-input pricing and how much does it save?

Cached-input pricing is a discounted rate that applies when your prompt reuses a previously processed input prefix (like a fixed system prompt or large RAG context). Discounts range from 50-90% off the standard input rate: Anthropic 90% off for cache reads, OpenAI 50% off, Google 75% off, DeepSeek 74% off. Workloads with stable system prompts benefit most.

When should I use batch pricing instead of real-time inference?

Batch processing (async, non-time-sensitive workloads) earns roughly 50% off the synchronous rate across Anthropic, OpenAI, Google, Fireworks, and Mistral. If any portion of your workload can tolerate a minutes-to-hours latency window (evaluation pipelines, document processing, data enrichment), a batch API can roughly halve the cost of that portion with no model change.

Are inference token prices rising or falling?

Falling with each model generation. GPT-4o launched 50% cheaper than GPT-4 Turbo. Anthropic cut Opus pricing roughly 3x within the 4.x family. DeepSeek V3 landed at $0.27/1M input for frontier-class performance. Re-baseline your cost model at least twice a year rather than assuming current rates are fixed.
