Model Inference Pricing: Examples & Companies

What is it

Model inference pricing is how vendors charge for running trained AI models on user inputs. The APIs and platforms in this category typically bill per token, per request, or per GPU-hour. “Inference” is the forward pass through an already-trained model: you send an input, the model produces an output (text, an image, a transcript, an embedding vector), and you pay for the compute that output consumed.

Model inference is the largest single use-case category in the UsagePricing corpus: 114 companies tag it as a primary use case. They span the full stack, from frontier API labs like OpenAI and Anthropic, to discounted open-model platforms like DeepInfra and Together AI, to raw GPU clouds like RunPod and Lambda, to modality specialists in voice, image, and embeddings.

The defining feature of this category is its price spread — and its speed of deflation. Discounted open-weight models start near DeepInfra’s Llama-3.1-8B at $0.02/1M input tokens; frontier models like Claude Opus 4.8 sit at $5/$25 and OpenAI’s GPT-5.5 at $5/$30 per 1M input/output. Between those poles are dozens of hosted-model platforms competing on the same open weights, which pushes the floor lower with every model generation. Because the underlying rates change quarterly, inference is the one category where a cost model built six months ago is almost always wrong.

The other defining feature is transparency. Inference is the most publicly-priced segment in the corpus — nearly every vendor here ships a public per-token or per-GPU-hour rate card, because their buyers are engineers who compare rates before they ever talk to sales. Gated pricing in inference almost always signals dedicated-capacity or enterprise-commitment tiers, not the standard self-serve API.


How it works

Inference pricing uses three primary billing units, each mapping to a different underlying cost driver.

Billing unit	Where it’s used	Representative rates (2026)
Input + output tokens	LLM text APIs (input and output priced separately, per 1M tokens)	Claude Opus 4.8 $5/$25 · GPT-5.5 $5/$30 · DeepSeek V4-Flash ~$0.14/$0.28 · Cohere Command A $2.50/$10
Per request / per output	Image, embedding, transcription, rerank	Fal Seedream V4 $0.03/image · Voyage voyage-4 $0.06/1M tokens · Cohere Rerank $2/1,000 queries · Groq Whisper Turbo $0.04/hr audio
GPU-hour / GPU-second	Dedicated deployments, GPU clouds	RunPod H100 $2.89/hr · Lambda H100 SXM $3.99/hr · Baseten H100 $0.10833/min (~$6.50/hr) · Replicate H100 $0.001525/sec

The token model carries a structural asymmetry: on most hosted LLM APIs, output tokens cost 4–10× as much as input tokens on the same model, because generation is far more compute-intensive than encoding the prompt. On Cohere Command A that ratio is exactly 4× ($2.50 in / $10 out); on Claude Opus 4.8 it is 5× ($5 / $25). Some discounted open-weight rates sit below that band: DeepSeek V4-Flash at ~$0.14/$0.28 is only 2×. Either way, a chat workload heavy on long completions costs far more per call than a classification workload that emits a single token.

A worked example. Suppose a RAG assistant processes 2,000 input tokens (a retrieved context block) and emits 500 output tokens per query, running 1M queries a month. On Claude Sonnet 4.6 at $3/1M in and $15/1M out, that is (2,000 × 1M ÷ 1M × $3) + (500 × 1M ÷ 1M × $15) = $6,000 input + $7,500 output = $13,500/month. Route the same workload to DeepInfra’s DeepSeek-V3.1 at $0.21/$0.79 and it drops to $420 + $395 = $815/month — a 16× swing driven purely by model choice, before any batch or cache discount applies. Model those tradeoffs on a DeepSeek pricing calculator or an OpenAI pricing calculator against your real prompt shape.
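The arithmetic above can be reproduced with a short Python sketch. The function name `monthly_token_cost` and its parameters are illustrative, not part of any vendor SDK; the rates are the ones quoted in the worked example.

```python
def monthly_token_cost(input_tokens, output_tokens, queries_per_month,
                       input_rate_per_m, output_rate_per_m):
    """Return (input_cost, output_cost, total) in dollars per month.

    Rates are dollars per 1M tokens; token counts are per query.
    """
    input_cost = input_tokens * queries_per_month / 1_000_000 * input_rate_per_m
    output_cost = output_tokens * queries_per_month / 1_000_000 * output_rate_per_m
    return input_cost, output_cost, input_cost + output_cost


# RAG workload from the text: 2,000 input + 500 output tokens, 1M queries/month.
sonnet = monthly_token_cost(2_000, 500, 1_000_000, 3.00, 15.00)
deepseek = monthly_token_cost(2_000, 500, 1_000_000, 0.21, 0.79)
ratio = sonnet[2] / deepseek[2]

print(f"Claude Sonnet 4.6:          ${sonnet[2]:,.2f}/month")
print(f"DeepSeek-V3.1 on DeepInfra: ${deepseek[2]:,.2f}/month")
print(f"Swing: {ratio:.1f}x")
```

Swapping in your own token counts and query volume is the quickest way to see how sensitive the total is to the output side of the prompt shape.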

Two structural discounts sit on top of the base rate on most platforms:

Cached-input pricing (typically 50–80% off): when a prompt reuses a processed prefix (a fixed system prompt, a repeated document, a stable RAG context), the reused tokens bill at a cache-hit rate. Fireworks AI bills cached input at 50% of list; Baseten quotes 50–80% off; DeepSeek’s V4-Flash cache-hit input drops to $0.0028/1M, about one-fiftieth of the ~$0.14 miss rate listed above.
Batch processing (~50% off): asynchronous, latency-tolerant jobs run at roughly half the synchronous rate at Fireworks and Together, and at 33% off on Voyage AI’s embedding batch API. Fireworks stacks the two — cached input at 50%, then another 50% off for batch.
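A minimal sketch of how the two discounts combine on the input side, assuming they stack multiplicatively as described for Fireworks (cached input at 50% of list, then a further 50% off for batch). The function `effective_input_rate` and the $1.00 list rate are hypothetical placeholders; check each vendor's rate card for its actual stacking rules.

```python
def effective_input_rate(list_rate_per_m, cached_fraction,
                         cache_multiplier=0.5, batch_multiplier=1.0):
    """Blended input rate in dollars per 1M tokens.

    cached_fraction  : share of input tokens served from cache (0.0 to 1.0)
    cache_multiplier : price of a cached token relative to list (0.5 = 50% of list)
    batch_multiplier : 1.0 for synchronous calls, 0.5 for a 50% batch discount
    Assumption: the batch discount applies on top of the cached rate.
    """
    cached = cached_fraction * list_rate_per_m * cache_multiplier
    uncached = (1 - cached_fraction) * list_rate_per_m
    return (cached + uncached) * batch_multiplier


# Hypothetical $1.00/1M list rate with 80% of the prompt reused from cache.
sync_rate = effective_input_rate(1.00, 0.8)                         # cache only
batch_rate = effective_input_rate(1.00, 0.8, batch_multiplier=0.5)  # cache + batch
print(f"sync: ${sync_rate:.2f}/1M, batch: ${batch_rate:.2f}/1M")
```

Under these assumptions a fully cached prompt run in batch bills at 25% of list, which is why a stable prefix matters more as the batch share of a workload grows.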

Companies using this

The 114 companies below all tag model inference as a primary use case. The list runs from frontier LLM labs and multi-model routing platforms through GPU clouds, vector databases, and voice, image, and embedding specialists; the table at the end of this page lists each company's pricing model, billing units, and free-tier availability.

Patterns observed

Output-premium token pricing is nearly universal. Every LLM-API vendor in this set prices input and output tokens separately, and on frontier and commercial models output runs 4–10× higher. Cohere Command A is 4× ($2.50/$10), Claude Sonnet 4.6 and Opus 4.8 are 5×, and OpenAI’s GPT-5.5 is 6× ($5/$30). Discounted open-weight rates can sit lower: GPT-OSS-120B is 4× at $0.15/$0.60 but only about 2× at Cerebras’s $0.35/$0.75. The asymmetry is the single most important thing to model, because it means your bill is dominated by how much the model writes, not how much you send.
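To make the asymmetry concrete, the sketch below computes the output-to-input price ratio and the share of the bill that comes from output tokens. The rates are those quoted above; the 2,000-in / 500-out prompt shape and the helper `output_share` are illustrative assumptions.

```python
# (input rate, output rate) in dollars per 1M tokens, as quoted in the text.
RATES = {
    "Cohere Command A": (2.50, 10.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def output_share(input_tokens, output_tokens, input_rate, output_rate):
    """Fraction of a call's cost that comes from output tokens."""
    in_cost = input_tokens * input_rate
    out_cost = output_tokens * output_rate
    return out_cost / (in_cost + out_cost)


for name, (inp, out) in RATES.items():
    share = output_share(2_000, 500, inp, out)
    print(f"{name}: output is {out / inp:.0f}x input, "
          f"{share:.0%} of cost at 2,000 in / 500 out")
```

Even with four times as many input tokens as output tokens, output accounts for half or more of the cost at these ratios.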

Hosted platforms cluster tightly around the open-weight floor, and it keeps dropping. Together AI, Fireworks AI, DeepInfra, Groq, SambaNova, and Cerebras all serve overlapping open models (Llama, Qwen, GPT-OSS, DeepSeek, Kimi) at nearly indistinguishable rates — GPT-OSS-120B runs $0.15/$0.60 at Groq and Together, $0.22/$0.59 at SambaNova, and $0.35/$0.75 at Cerebras. When the model is a commodity, competition moves to throughput (Groq’s LPU silicon, Cerebras’s wafer-scale) and to structural discounts rather than the headline rate.

GPU clouds meter time, at ever-finer granularity. RunPod (H100 $2.89/hr), Lambda (H100 SXM $3.99/hr), Replicate (H100 $0.001525/sec), and Baseten (H100 $0.10833/min, ~$6.50/hr) all bill raw GPU time rather than tokens — and increasingly quote per-second or per-minute, which makes scale-to-zero economics visible. Baseten explicitly markets per-minute pricing so a 4-minute warm burst costs a legible $0.43, granularity that Bedrock and Vertex AI hide behind hourly abstractions.
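Because these vendors quote in different time units, a small normalizer makes them comparable. This sketch assumes billing is strictly proportional to elapsed time, with no minimum charge or rounding, which real invoices may not honor; the helper names are illustrative.

```python
SECONDS_PER_UNIT = {"sec": 1, "min": 60, "hr": 3600}

# (price, unit) pairs for an H100, as quoted in the text.
QUOTES = {
    "RunPod": (2.89, "hr"),
    "Lambda (SXM)": (3.99, "hr"),
    "Baseten": (0.10833, "min"),
    "Replicate": (0.001525, "sec"),
}


def hourly_rate(price, unit):
    """Convert a per-second/minute/hour quote to dollars per hour."""
    return price * 3600 / SECONDS_PER_UNIT[unit]


def burst_cost(price, unit, seconds):
    """Cost of keeping one GPU warm for `seconds`, assuming pro-rata billing."""
    return price / SECONDS_PER_UNIT[unit] * seconds


for vendor, (price, unit) in QUOTES.items():
    print(f"{vendor}: ${hourly_rate(price, unit):.2f}/hr, "
          f"4-minute burst ${burst_cost(price, unit, 240):.2f}")
```

Normalized this way, Replicate's per-second quote works out to about $5.49/hr and Baseten's 4-minute burst to about $0.43, matching the figure in the text.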

Free tiers are the onboarding default, and they’re mostly credit grants. Free access is more common in inference than in any other corpus category. Most vendors — DeepSeek, Fireworks AI, Groq, Cerebras — offer either a free web tier or a starter credit grant. Voyage AI gives the first 200M embedding tokens free; SambaNova hands you $5 of credits that expire in 30 days — a deliberate nudge from trial to pay-as-you-go. The pattern reflects a self-serve, developer-led sales motion where the goal is to get an API key generating tokens as fast as possible.

Pass-through and marketplace models are emerging. OpenRouter routes to 400+ models at each provider’s exact per-token rate with no markup, monetizing instead through a 5.5% credit-purchase fee, and by May 2026 it was processing ~100 trillion tokens/month. Hugging Face Inference Providers does the same, charging the underlying provider’s rate with no HF markup. This layer competes on breadth and routing, not price, since the per-token price is by definition identical to going direct.
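A rough sketch of what the fee-based model means for total spend. It assumes the 5.5% fee is added on top of the credits purchased; the text does not say whether the fee is added or deducted, so treat that as an assumption. The 30% markup is a hypothetical reseller used only for contrast.

```python
CREDIT_FEE = 0.055  # 5.5% credit-purchase fee quoted in the text


def pass_through_cost(token_spend, fee=CREDIT_FEE):
    """Cash needed to fund `token_spend` dollars of usage at provider list rates.

    Assumption: the fee is charged on top of the credit amount purchased.
    """
    return token_spend * (1 + fee)


def marked_up_cost(token_spend, markup):
    """Cost of the same usage through a reseller that marks up every token."""
    return token_spend * (1 + markup)


spend = 1_000.00  # dollars of usage at first-party per-token rates
print(f"direct:              ${spend:,.2f}")
print(f"pass-through + fee:  ${pass_through_cost(spend):,.2f}")
print(f"30% markup reseller: ${marked_up_cost(spend, 0.30):,.2f}")
```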

Counterexamples & variants

Per-request and per-pixel billing where tokens don’t fit. Not everything reduces to a token. Fal bills generative media per output — Seedream V4 at $0.03/image, Veo 3 video at $0.40/second — and per-megapixel for models like Qwen ($0.02/MP). Cohere’s Rerank API bills per query ($2/1,000 queries), not per token, because reranking cost tracks the number of documents scored. Voyage AI’s multimodal models bill on two meters simultaneously — $0.12/1M text tokens and $0.60/1B pixels — with each image clamped between $0.00003 and $0.0012. Any cost model that assumes “inference = tokens” will misprice these entirely.
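The two-meter scheme described for Voyage AI's multimodal models can be sketched as follows. The rates and clamp bounds are the ones quoted above; applying the clamp to each image's dollar cost is an assumption drawn from that description, and the function names are illustrative.

```python
TEXT_RATE_PER_M_TOKENS = 0.12   # dollars per 1M text tokens
PIXEL_RATE_PER_B_PIXELS = 0.60  # dollars per 1B pixels
MIN_IMAGE_COST = 0.00003        # per-image floor
MAX_IMAGE_COST = 0.0012         # per-image ceiling


def image_cost(width, height):
    """Per-image cost: pixel-metered, then clamped to the quoted bounds."""
    raw = width * height / 1e9 * PIXEL_RATE_PER_B_PIXELS
    return min(max(raw, MIN_IMAGE_COST), MAX_IMAGE_COST)


def request_cost(text_tokens, image_sizes):
    """Total cost of one request: text meter plus the sum of image meters."""
    text = text_tokens / 1e6 * TEXT_RATE_PER_M_TOKENS
    return text + sum(image_cost(w, h) for w, h in image_sizes)


print(image_cost(100, 100))      # tiny thumbnail, billed at the floor
print(image_cost(1_000, 1_000))  # 1 MP image, billed pro rata
print(image_cost(4_000, 3_000))  # 12 MP image, billed at the ceiling
print(request_cost(1_000, [(1_000, 1_000), (4_000, 3_000)]))
```

At these rates the floor corresponds to 50,000 pixels and the ceiling to 2,000,000 pixels, so only images between those sizes are billed in proportion to their pixel count.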

GPU-time vendors invert the free-tier norm. The near-universal free-tier pattern breaks on dedicated compute. Lambda has no free GPU tier at all — you pay per-minute for any instance you launch, and persistent storage keeps billing (~$0.20/GiB/month) even on a stopped instance, so orphaned volumes are a real hidden cost. Hugging Face Inference Endpoints require an active subscription and a card on file with no free dedicated GPU. When the marginal cost of serving is a reserved GPU rather than a shared token pool, “free” stops being economical.

Hosted middlemen that price near first-party, and those that don’t. The usual expectation is that a hosted-inference reseller adds a 30–50% markup. Baseten is a notable counterexample — its Model APIs price DeepSeek V3.1 at $0.50/$1.50, within 5–10% of DeepSeek’s own first-party rates. Meanwhile OpenRouter and Hugging Face formalize zero-markup pass-through as the product itself. The variant to watch is the opposite: dedicated-deployment tiers where the token rate disappears entirely and pricing becomes GPU-utilization negotiation, as on Baseten’s per-minute dedicated deployments or Lambda’s committed 1-Click Clusters.

GPU list prices can rise, against the deflation narrative. Token prices fall each generation, but raw GPU rates don’t always follow. Lambda raised its on-demand H100 SXM rate from $2.99 to $3.99/GPU/hr between 2025 and 2026 as hyperscaler and superintelligence demand outstripped capacity — a reminder that the “everything gets cheaper” assumption applies to model efficiency, not to the underlying silicon supply.

What this means for buyers vs vendors

For buyers

Never compare list prices at the model level — compare cost per real workload. The output-token premium means a chatty, long-completion workload and a terse classification workload with the same “price” can differ 10× in practice, so plug your actual input/output token shape into a DeepSeek pricing calculator or OpenAI pricing calculator before committing. Then ask three procurement questions of any inference vendor: (1) does cached-input pricing apply to my prompt structure, and at what discount — a stable system prompt or large RAG context can cut input cost 50–80% at Fireworks AI or Baseten; (2) can any portion of my workload run async for the ~50% batch discount; and (3) if I outgrow the shared API, what does a dedicated deployment cost in GPU-hours, and is that cheaper than tokens at my volume? Re-baseline the whole model at least twice a year — DeepSeek and open-weight platforms routinely reset the floor, and a cheaper new model may already meet your quality bar. Grounding these tradeoffs in the fundamentals of usage-based pricing models and in how to choose the right usage metric makes the comparison far cleaner.
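Question (3) reduces to a break-even calculation, sketched below. The pairing of a ~$6.50/hr H100 with $0.50/$1.50 token rates is illustrative only: the text does not say how many GPUs a given model needs or what throughput one GPU sustains, so `gpu_count` and the achievable tokens per hour are assumptions you must supply from your own benchmarks.

```python
def blended_rate_per_m(input_tokens, output_tokens, input_rate, output_rate):
    """Average dollars per 1M tokens for a given per-query prompt shape."""
    total = input_tokens + output_tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / total


def breakeven_tokens_per_hour(gpu_hourly, blended_rate, gpu_count=1):
    """Sustained tokens/hour above which dedicated GPUs beat per-token billing."""
    return gpu_hourly * gpu_count / blended_rate * 1_000_000


# Prompt shape from the earlier worked example: 2,000 in / 500 out per query.
blended = blended_rate_per_m(2_000, 500, 0.50, 1.50)
breakeven = breakeven_tokens_per_hour(6.50, blended)

print(f"blended rate: ${blended:.2f}/1M tokens")
print(f"break-even:   {breakeven / 1e6:.2f}M tokens/hour "
      f"(~{breakeven / 3600:,.0f} tokens/sec sustained)")
```

If your deployment cannot sustain the break-even throughput around the clock, the shared per-token API remains the cheaper option under these assumptions.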

For vendors

Transparency is the price of entry here — engineers will not adopt an API whose rate card they can’t read, so publish per-token or per-GPU-hour rates and let the self-serve motion do the qualifying. Differentiate on structure, not headline: because open-weight models are a commodity served at near-identical rates by Together AI, Groq, and SambaNova alike, your leverage is throughput, cached-input and batch discounts, and billing granularity (Baseten’s per-minute meter is a genuine differentiator). Free credits or a free web tier are effectively table stakes for onboarding — but structure them as an expiring grant, as SambaNova does with its 30-day $5 credit, to convert trial into pay-as-you-go rather than subsidize idle experimentation. If you resell hosted models, decide deliberately whether you compete on markup or on pass-through: OpenRouter’s zero-markup model with a 5.5% credit fee is a distinct strategy from a marked-up rebrand, and buyers can tell the difference. Whichever you choose, prepaid credits are the dominant funding mechanism in this category — the prepaid credits model guide covers how to structure grants, top-ups, and expiry.

Company	Product	Pricing model	Billing units	Free tier	Verified
01.AI	Yi open-weight models + Yi API + enterprise vertical solutions	pure-usage freemium	tokens api-calls	Yes	2026-06-11
Abacus.AI	AI super-assistant (ChatLLM) plus an enterprise agentic AI platform	seat-based subscription	seats credits	No	2026-06-02
AI21 Labs	Jamba foundation models, Maestro orchestration & enterprise AI	pure-usage freemium	tokens api-calls	Yes	2026-06-11
Aleph Alpha	PhariaAI sovereign-AI platform, specialized models & professional services	commitment subscription	seats tokens credits	No	2026-06-11
Anthropic	Claude API (token-based) + Claude.ai consumer subscriptions (Free/Pro/Team/Enterprise)	freemium subscription seat-based	tokens seats api-calls	Yes	2026-07-06
Anyscale	Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units)	pure-usage commitment hybrid	gpu-hours cpu-hours credits	Yes	2026-05-29
Arize AI	AI & LLM observability (Arize AX + Phoenix OSS)	freemium hybrid	trace-spans gb-ingested	Yes	2026-06-09
AssemblyAI	Speech-to-Text & Audio AI APIs	pure-usage	api-calls tokens	Yes	2026-07-06
Autodesk (Flow Studio, formerly Wonder Dynamics)	AI VFX automation platform (Flow Studio)	subscription freemium hybrid	credits media-minutes seats	Yes	2026-06-16
Baichuan AI	Baichuan & medical M-series LLM APIs	pure-usage freemium	tokens api-calls	Yes	2026-06-11
Baseten	ML inference infrastructure — dedicated GPU deployments, Model APIs, and Truss framework	pure-usage hybrid commitment	gpu-hours tokens requests	Yes	2026-05-29
BentoML	BentoCloud — managed model-serving & inference platform	pure-usage freemium commitment	gpu-hours cpu-hours	Yes	2026-06-15
Bland AI	AI phone call automation platform — inbound and outbound voice agents at scale	hybrid pure-usage subscription	api-calls credits media-minutes	Yes	2026-05-29
Braintrust	LLM evaluation & observability platform	hybrid	tokens storage-gb scores	Yes	2026-07-14
Cartesia	Real-time voice AI platform (Sonic TTS, voice cloning, voice agents)	freemium subscription hybrid	credits requests api-calls	Yes	2026-05-29
Cerebras	Wafer-scale AI inference cloud and WSE hardware systems	pure-usage subscription commitment	tokens api-calls gpu-hours	Yes	2026-05-30
Character.ai	Consumer AI companion and roleplay chat platform	subscription freemium	active-users	Yes	2026-05-29
Chroma	Open-source vector database + Chroma Cloud	pure-usage freemium	storage-gb bandwidth-gb api-calls	Yes	2026-06-09
Clipdrop	AI image-editing and generation tools (background removal, upscaling, text-to-image), now part of Jasper	freemium subscription	requests credits api-calls	Yes	2026-06-05
Cohere	Command, Embed, Rerank APIs	pure-usage	tokens api-calls requests	Yes	2026-05-29
CoreWeave	GPU cloud & AI compute infrastructure	pure-usage commitment	gpu-hours cpu-hours storage-gb	No	2026-06-15
Daily	Real-time voice and video WebRTC APIs (Video SDK + Pipecat Cloud)	pure-usage	media-minutes api-calls	Yes	2026-07-14
Databricks (Mosaic AI)	Mosaic AI — enterprise GenAI & ML on the Data Intelligence Platform	pure-usage commitment	units tokens gpu-hours	Yes	2026-06-15
Deepgram	Usage-based speech-to-text, text-to-speech, and voice agent APIs	pure-usage freemium	media-minutes tokens credits	Yes	2026-05-31
DeepInfra	Serverless inference cloud — per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters	pure-usage commitment	tokens gpu-hours requests	No	2026-07-14
DeepL	AI translation, writing, and translation API	subscription pure-usage hybrid	characters seats documents	Yes	2026-06-16
DeepSeek	DeepSeek API (V4-Flash + V4-Pro models, 1M context) with token-based pricing and aggressive cache discounts	freemium pure-usage	tokens api-calls	Yes	2026-06-05
Descript	AI-powered audio and video editing	hybrid freemium	seats credits media-minutes	Yes	2026-05-31
ElevenLabs	Voice AI platform across ElevenCreative, ElevenAgents, and ElevenAPI	subscription pure-usage hybrid	characters credits media-minutes	Yes	2026-06-30
Essential AI	Enterprise foundation models & data-workflow automation	commitment	units	No	2026-06-11
Fal	Generative-media inference platform — serverless per-output model APIs plus dedicated GPU compute	pure-usage	gpu-hours requests media-minutes	No	2026-06-01
Fireworks AI	Generative AI inference platform — serverless per-token, on-demand GPU, fine-tuning, batch API	pure-usage hybrid commitment	tokens gpu-hours requests	Yes	2026-05-30
Freepik	AI creative suite — image, video, audio generation plus a 200M+ stock library	subscription hybrid pure-usage	seats credits api-calls	Yes	2026-06-05
Gladia	Speech-to-text & audio intelligence API	pure-usage freemium commitment	media-minutes requests	Yes	2026-06-09
Google	Gemini API & AI Studio	pure-usage freemium	tokens requests api-calls	Yes	2026-07-14
Grok	xAI's consumer and business AI assistant	subscription hybrid seat-based	seats tokens messages	Yes	2026-06-16
Groq	GroqCloud — LPU-based ultra-low-latency inference API for Llama, GPT-OSS, Qwen, Whisper transcription, and Orpheus text-to-speech	pure-usage hybrid commitment	tokens requests api-calls	Yes	2026-07-14
Hedra	AI video, avatar, image, and audio generation platform (Hedra Studio + API)	subscription freemium	credits media-minutes characters	Yes	2026-06-04
Helicone	Open-source LLM observability & AI gateway	hybrid freemium	requests logs storage-gb	Yes	2026-06-09
HeyGen	AI avatar and video generation platform	subscription freemium	credits seats api-calls	Yes	2026-05-30
Higgsfield	AI video and image generation platform with a credit-metered subscription	subscription freemium	credits seats	Yes	2026-06-06
Hugging Face	AI model hub, inference endpoints & compute	hybrid seat-based pure-usage	seats gpu-hours cpu-hours	Yes	2026-06-15
Humanloop	LLM evals, prompt management & observability	hybrid freemium	logs datapoints seats	Yes	2026-06-09
Hume AI	Empathic Voice Interface (EVI) + Octave TTS + expression-measurement APIs	hybrid freemium	media-minutes characters api-calls	Yes	2026-06-30
Hyperbolic	GPU cloud marketplace & serverless AI inference	pure-usage commitment	gpu-hours tokens images	Yes	2026-06-15
Ideogram	Text-aware AI image generation platform	freemium subscription hybrid	credits api-calls	Yes	2026-06-15
Inflection AI	Enterprise foundation models (Inflection 3.0) + Pi assistant	pure-usage subscription	tokens gpu-hours seats	No	2026-06-11
Janitor AI	Consumer AI character chat / roleplay platform	freemium subscription	tokens messages	Yes	2026-06-16
Jina AI	Search Foundation API (Embeddings, Reranker, Reader, DeepSearch, Classifier)	pure-usage freemium	tokens requests api-calls	Yes	2026-06-03
Lambda	GPU cloud & AI compute infrastructure	pure-usage commitment	gpu-hours	No	2026-06-09
LanceDB	AI-native multimodal lakehouse	freemium pure-usage commitment	storage-gb vectors-indexed gpu-hours	Yes	2026-06-09
Lightning AI	Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool.	hybrid freemium pure-usage	gpu-hours cpu-hours credits	Yes	2026-06-02
LiveKit	Open-source real-time (WebRTC) communications, LiveKit Cloud & Agents framework	hybrid freemium pure-usage	media-minutes credits bandwidth-gb	Yes	2026-06-30
LMNT	Low-latency AI text-to-speech (TTS) API with voice cloning	freemium subscription hybrid	characters credits	Yes	2026-06-04
Midjourney	AI image and video generation via subscription with GPU-hour metering	subscription	gpu-hours credits	No	2026-05-29
Milvus	Vector database (OSS) + Zilliz Cloud (managed)	pure-usage freemium commitment	gpu-hours storage-gb vectors-indexed	Yes	2026-06-09
MiniMax	Foundation models, Hailuo video & per-token API	pure-usage freemium	tokens seats credits	Yes	2026-06-11
Mistral AI	Open and commercial LLM APIs	pure-usage freemium	tokens seats api-calls	Yes	2026-07-06
Modal	Serverless compute and GPU platform — per-second billing for Python functions, batch jobs, and model serving	pure-usage freemium subscription	gpu-hours cpu-hours gb-hours	Yes	2026-07-14
Moonshot AI	Kimi assistant + Kimi/Moonshot open-weight LLM API	pure-usage freemium	tokens seats api-calls	Yes	2026-06-11
Murf AI	AI voice / text-to-speech platform (Murf Studio app + Murf API)	subscription pure-usage freemium	media-minutes seats credits	Yes	2026-06-01
Nebius	AI cloud & GPU compute infrastructure	pure-usage commitment	gpu-hours cpu-hours storage-gb	No	2026-06-15
Netlify	Web development & deployment platform (Agent Runners / AI)	freemium hybrid pure-usage	credits builds gb-hours	Yes	2026-07-14
Nomic	Nomic Platform (AEC agentic workflows) + Atlas data-exploration app + Nomic Embed embedding/Developer API	hybrid seat-based commitment	seats tokens credits	Yes	2026-06-04
Novita AI	Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API	pure-usage freemium	tokens gpu-hours cpu-hours	Yes	2026-07-06
OctoAI	Generative AI inference platform (acquired by NVIDIA, sunset Oct 2024)	pure-usage	tokens images generations	No	2026-06-15
OpenAI	ChatGPT consumer subscriptions + GPT-5.x API with token-based usage billing	freemium subscription seat-based	tokens seats api-calls	Yes	2026-06-30
OpenPipe	OpenPipe fine-tuning and hosted inference platform (small specialized models / RL for agents)	pure-usage	tokens cpu-hours	Yes	2026-06-04
OpenRouter	Multi-model LLM API routing marketplace	pure-usage freemium	tokens credits requests	Yes	2026-07-14
Paige AI	FDA-cleared AI for cancer pathology — clinical diagnostics + pharma/life-sciences foundation models	subscription hybrid	slides cases	No	2026-06-10
Perplexity AI	AI-native answer engine with citations and multi-model search	freemium subscription seat-based	seats tokens requests	Yes	2026-05-29
Physical Intelligence	Robotics foundation models (Vision-Language-Action policies for robots)	commitment	units	No	2026-06-14
Pinecone	Managed vector database (serverless)	pure-usage hybrid	requests storage-gb vectors-indexed	Yes	2026-06-09
Playground	AI image generation and graphic-design studio with a monthly credit pool	freemium subscription hybrid	credits api-calls	Yes	2026-06-04
PlayHT	Text-to-speech & voice cloning API (PlayAI)	subscription freemium pure-usage	characters words api-calls	Yes	2026-06-09
Poe	Multi-model AI chat subscription (by Quora)	subscription hybrid pure-usage	credits seats messages	Yes	2026-06-16
Portkey	AI gateway & LLMOps governance platform	hybrid freemium	requests logs	Yes	2026-06-10
Predibase	Fine-tuning & serving platform for open-source LLMs	pure-usage freemium	tokens gpu-hours	Yes	2026-06-15
Qdrant	Open-source vector database + Qdrant Cloud	pure-usage freemium	cpu-hours gb-hours storage-gb	Yes	2026-06-09
Recraft	AI image and vector generation studio plus a per-image generation API	freemium subscription hybrid	credits api-calls seats	Yes	2026-07-14
Recursion	AI-enabled drug discovery platform (Recursion OS) — pharma partnerships, internal pipeline & NVIDIA-powered compute	outcome-based commitment	milestones outcomes	No	2026-06-10
Reka AI	Natively multimodal models (Spark, Edge, Flash, Core) + Research & Vision APIs	pure-usage freemium	tokens api-calls requests	Yes	2026-06-11
Replicate	Cloud platform for running, fine-tuning, and deploying AI models via REST API	pure-usage hybrid commitment	gpu-hours tokens requests	Yes	2026-05-30
Resemble AI	AI deepfake detection & watermarking + voice generation APIs	pure-usage	credits media-minutes seats	No	2026-07-14
Retell AI	Conversational voice-agent API platform	pure-usage hybrid	media-minutes messages seats	No	2026-07-14
Rev AI	Pay-as-you-go speech-to-text, transcription, and audio-intelligence APIs	pure-usage freemium	media-minutes credits api-calls	Yes	2026-06-04
Rewind.ai (the original Rewind AI rebranded to Limitless, acquired by Meta)	AI tools aggregator (token-balance) — on the domain once home to the Rewind personal-memory app	freemium pure-usage subscription	tokens credits seats	Yes	2026-06-15
Roboflow	Computer-vision platform (dataset management, model training, deployment)	hybrid freemium	credits seats gpu-hours	Yes	2026-07-14
RunPod	GPU cloud marketplace — Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage	pure-usage hybrid commitment	gpu-hours storage-gb	No	2026-07-14
Runway	Video generation and AI editing	subscription freemium	credits seats	Yes	2026-06-24
SambaNova	SambaNova Cloud inference API & RDU AI systems	pure-usage subscription commitment	tokens	Yes	2026-06-15
Sarvam AI	Sovereign Indic LLM, speech & translation APIs	pure-usage freemium	tokens characters media-minutes	Yes	2026-06-11
Snowflake Cortex	AI functions and model APIs on Snowflake	pure-usage commitment	credits tokens pages-rendered	Yes	2026-07-06
Speechmatics	Speech-to-text and text-to-speech APIs with per-hour usage pricing	pure-usage freemium	media-minutes characters	Yes	2026-07-06
Stability AI	Brand Studio creative platform and open generative media models	subscription hybrid	credits images	Yes	2026-06-11
Suno	AI music generation	subscription freemium	credits	Yes	2026-05-31
Synthesia	Enterprise AI video generation	subscription freemium	credits media-minutes seats	Yes	2026-05-31
Synthflow AI	No-code AI voice-agent builder	hybrid	media-minutes seats	No	2026-06-24
Tempus	Precision-medicine platform — genomic diagnostics, multimodal clinical data licensing & oncology AI apps (NASDAQ: TEM)	hybrid commitment	tests data-licensing	No	2026-06-10
Together AI	AI Acceleration Cloud — serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning	pure-usage hybrid commitment	tokens gpu-hours cpu-hours	Yes	2026-07-14
turbopuffer	Serverless vector and full-text search database on object storage	pure-usage commitment	storage-gb vectors-indexed gb-hours	No	2026-07-14
Twelve Labs	Video understanding foundation models (Marengo for search/embeddings, Pegasus for analysis) delivered as a usage-metered API	pure-usage freemium commitment	media-minutes tokens requests	Yes	2026-06-02
Unstructured	Document ingestion / ETL API	pure-usage freemium	pages-rendered documents	Yes	2026-07-14
Upstash	Upstash (Redis, Vector, QStash, Search, Workflow)	pure-usage freemium hybrid	requests api-calls vectors-indexed	Yes	2026-07-14
Vapi	Voice AI infrastructure for developers	pure-usage hybrid	media-minutes messages seats	No	2026-06-09
Vast.ai	GPU rental marketplace — on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference	pure-usage commitment	gpu-hours storage-gb bandwidth-gb	No	2026-07-14
Vectara	Enterprise RAG-as-a-Service and agent platform for trusted, grounded, auditable AI	commitment subscription	credits requests storage-gb	No	2026-06-02
Vellum	Personal AI assistant (ex LLM application development platform)	hybrid freemium	credits storage-gb	Yes	2026-06-10
Voyage AI	Embedding and reranker models (text, code, multimodal) for retrieval and RAG	pure-usage freemium	tokens storage-gb	Yes	2026-06-04
Weaviate	AI-native vector database (open-source core + Weaviate Cloud managed serverless, dedicated/Enterprise Cloud, BYOC)	pure-usage hybrid commitment	vectors-indexed tokens api-calls	Yes	2026-07-06
Weights & Biases	MLOps experiment tracking, W&B Weave LLM observability/evals, Models registry, and Serverless Inference	freemium hybrid seat-plus-usage	seats storage-gb traces	Yes	2026-07-14
Writer	Enterprise agentic AI platform (Palmyra models, WRITER Agent)	seat-based seat-plus-usage subscription	seats credits tokens	No	2026-06-15
xAI	Grok API and agentic AI stack	pure-usage freemium	tokens api-calls seats	Yes	2026-07-14
You.com	Web search, contents, research, and finance-research APIs for AI systems	pure-usage freemium	api-calls requests pages-rendered	Yes	2026-06-01
Zhipu AI	GLM foundation models, per-token API, and GLM Coding Plan	pure-usage freemium subscription	tokens api-calls seats	Yes	2026-06-11


FAQ

What is model inference pricing?

Model inference pricing is the billing structure a vendor charges when a customer calls a trained AI model to generate output — text tokens, images, embeddings, transcripts, or audio. You pay for the compute consumed during the forward pass, typically quoted per million tokens, per request or image, or per GPU-hour.

How much do inference tokens cost per million?

The spread is enormous and still falling. Open-weight models on discounted platforms start near DeepInfra's Llama-3.1-8B at $0.02/1M input, while frontier models like Claude Opus 4.8 run $5/$25 and OpenAI GPT-5.5 runs $5/$30 per 1M input/output. Output tokens are typically priced 4–10× higher than input tokens on the same model.

What is cached-input pricing and how much does it save?

Cached-input pricing is a discounted rate for prompts that reuse a previously processed input prefix, such as a fixed system prompt or a large RAG context. Discounts typically run 50–80% off the standard input rate. Fireworks bills cached input at 50% of list, Baseten quotes 50–80% off, and DeepSeek's V4-Flash cache-hit input drops to $0.0028/1M, about one-fiftieth of its ~$0.14 miss rate.

When should I use batch pricing instead of real-time inference?

Batch (asynchronous, latency-tolerant) processing earns roughly 50% off the synchronous rate at Fireworks, Together, and other platforms, and 33% off at Voyage AI's embedding API. If any part of your workload (evaluation pipelines, document processing, data enrichment) can tolerate a minutes-to-hours window, the batch API can halve that portion of the token bill with no model change.

Are inference prices billed per token, per request, or per GPU-hour?

All three, depending on modality. LLM text APIs bill per input and output token; image, embedding, and transcription APIs often bill per request or per output (Fal's Seedream at $0.03/image, Cohere Rerank at $2/1,000 queries); and GPU clouds like RunPod, Lambda, and Baseten bill per GPU-hour or GPU-minute for dedicated deployments.
