Sharpens 16 companies · First observed November 2023 · Updated July 2026

Token prices fall as access tiers rise

Quick answer

AI API token prices fall with almost every model generation — frontier rates dropped roughly 10× from the GPT-4 era to the GPT-5 era — while the price of human-facing access climbs toward a $200/mo ceiling. Compute deflates; access inflates.

~10× cheaper per token, GPT-4 → GPT-5 era

What's happening — and why

What's happening: the cost of calling an AI model through its API keeps dropping. Each new model generation tends to launch cheaper than the one it replaces, and vendors add further discounts — prompt caching, batch processing — in between releases.

Why: raw inference is becoming a commodity. Open-weight models set aggressive price floors, serving efficiency improves, and competition is intense, so vendors pass the savings to developers to win volume. They keep their pricing power for human-facing subscriptions instead, where willingness-to-pay is far higher than for a raw token.

How it works

Per-token API price falls each model generation (blue); human-facing access climbs (amber).

Evidence over time

24 supporting · 0 counter

Evidence

Company	Date	What happened
Anthropic	May 2026	Opus rebased from the legacy $15/$75 band to $5/$25 per 1M and Haiku to $1/$5 — a ~3× cut on the flagship tier within the same product line.
Novita AI	Apr 2026	Kimi K2.6 launched at $0.95/$4.00 then trimmed to $0.8/$3.4 by the 2026-06-02 capture — a downward revision within ~6 weeks, mid-generation.
OpenAI	Nov 2023	GPT-4 Turbo launched ~3× cheaper on input and ~2× cheaper on output than the original GPT-4 ($10/$30 vs $30/$60 per 1M).
OpenAI	May 2024	GPT-4o launched ~50% cheaper than GPT-4 Turbo; 4o-mini (2024-07) pushed frontier-ish quality to commodity price.
Google	Nov 2024	Gemini 1.5 token pricing cut ~50–65%.
DeepSeek	May 2024	V2 at $0.14/1M input — a market-shock price that reset expectations.
DeepSeek	Dec 2024	V3 at $0.27/1M — frontier-class performance at commodity rates.
Anthropic	Aug 2024	Prompt caching cut input cost by up to 80% without a model change.
Hume AI	Jun 2026	EVI 1 launched at $0.102/min (March 2024); EVI 2 cut to $0.072/min (Sept 2024); EVI 3 now $0.07/min; EVI 4 MINI $0.04/min — ~60% per-minute deflation across two years of model generations. A clear per-minute token-deflation example beyond text tokens.
Together AI	Jun 2026	Cut managed GPU Cluster rates twice in one week — on-demand HGX H100 $4.79→$3.99/hr and reserved 91–180d H100 to a $3.09/hr floor — after a 2026-06-24 round that also cut serverless model rates and began publishing cached-input discounts. Weekly-cadence deflation at the managed-cluster layer.
Vercel	Jun 2026	Cut its v0 Max Fast model ~3× ($30/$150 → $10/$50 per 1M input/output; cache read $1/1M) and retired the legacy Ultra plan — generational token deflation reaching the application layer's own model lineup.
Hume AI	Jun 2026	Added an in-table model toggle: EVI 4 MINI runs ~half the EVI 3 per-minute overage ($0.035–$0.02 vs $0.07–$0.04) and Octave 2 ~halves Octave 1's per-character overage while doubling minutes per character — a quiet per-unit price cut at every paid tier without moving headline prices.
Anthropic	Jul 2026	Claude Sonnet 5 launched at introductory $2/$10 per 1M through Aug 31, 2026 (batch $1/$5), BELOW the $3/$15 band Sonnet held across four generations since Claude 3 — the standard rate reverts to $3/$15 on Sep 1. The first downward move on the production-standard Sonnet anchor: generational deflation reaching the tier that had been the corpus's most durable price constant.
Speechmatics	Jul 2026	Cut real-time enhanced STT -23% ($0.56→$0.43/hr) — the SKU powering live voice agents — for the third self-serve rate cut since 2023, and added a multilingual Batch Melia 1 floor at $0.129/hr (the new 'from' headline). Per-minute audio-model deflation beyond text tokens.
Lambda Labs	Jun 2026	Raised on-demand GPU rates significantly: H100 SXM from $2.99 to $3.99/hr (+33%), A100 SXM 80GB from $1.79 to $2.79/hr (+56%). GPU on-demand cloud access inflating as demand surges — the opposite of token-API deflation.
You.com	Sep 2025	Top consumer tier repriced ~6.7× — $30 Team replaced by $200 Max.
Character.ai	Feb 2026	Introduced mid-chat ads for free users — monetisation rising, not falling.
Anthropic	Jun 2026	Opened a new top API band — Claude Fable 5 / Mythos 5 at $10/$50 per 1M, double Opus 4.8's $5/$25 — while leaving the existing Opus/Sonnet/Haiku bands unchanged. The frontier ceiling re-inflated 2× even as the rest of the ladder held; deflation is per-tier, and the most-capable tier resets upward.
Mistral AI	Jul 2026	The corpus's first frontier-lab token increase: Mistral Small 4 rose 50% on input / 100% on output ($0.1/$0.3 → $0.15/$0.6 per 1M) and OCR doubled to $4/1K pages — a cost-sensitive model repriced UP mid-life. Flagship Medium 3.5 ($1.5/$7.5) and Large 3 ($0.5/$1.5) held. Deflation is not monotonic per SKU: a vendor can raise a specific under-priced model even as the generational trend falls.
xAI	Jul 2026	Shipped grok-4.5 at $2 in / $6 out per 1M — ABOVE the still-live grok-4.3 ($1.25/$2.50), the first time xAI priced a new flagship higher than its predecessor rather than continuing its multi-year per-token markdown. It keeps grok-4.3 available and reserves the premium for higher intelligence — the frontier ceiling re-inflating while the prior tier holds.
DeepInfra	Jul 2026	The deflation poster child reversed: raised dedicated GPU-hour rates 16-32% (H100 +23%, H200 +23%, B200 +32%, B300 +16%) and lifted DeepSeek-V3.1 tokens from $0.21/$0.79 to $0.25/$0.95 per 1M — the first across-the-board increase from a vendor whose reputation in cost-sensitive communities was built on relentless public price cuts. Even a commodity-inference vendor can move its whole card up when Blackwell supply and demand tighten.
Google	Jul 2026	Added the AlphaEvolve agent to Vertex AI priced as the base Gemini model rate PLUS a 2x agent surcharge (3x all-in) — Gemini 3.1 Pro Preview $6/$36 per 1M vs the $2/$12 base. Headline Gemini token rates were unchanged; the increase is an explicit agent-premium multiplier layered on top, the frontier re-inflating via an agent surcharge rather than a token hike.
Together AI	Jul 2026	Deflation continues at the infra layer: cut on-demand Dedicated Inference — HGX H100 $6.49→$5.49/hr (-15%) and HGX B200 $11.95→$8.99/hr (-25%) — while launching a Provisioned Throughput (PTU) SKU at $0.05/PTU-min. Managed-GPU access keeps falling even as the model-intelligence ceiling rises elsewhere.
RunPod	Jul 2026	Trimmed two Secure Cloud Pod rates — H100 SXM (80GB) $3.29→$2.99/hr (-9%) and RTX Pro 6000 (96GB) — pressing its cost-leader positioning as Blackwell-generation supply expands. Bare GPU-hour deflation persists at the discount-cloud layer.
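
The percentage and multiple figures quoted in the table can be re-derived from the before/after rates. A minimal Python sketch, using only prices that appear in the rows above; the helper names `pct_change` and `fold_cut` are illustrative and not part of any vendor's API:

```python
def pct_change(old: float, new: float) -> float:
    """Signed percentage change from the old price to the new price."""
    return (new - old) / old * 100.0


def fold_cut(old: float, new: float) -> float:
    """Ratio of old to new price; values above 1 mean the price fell."""
    return old / new


# Before/after rates copied from the evidence rows above.
moves = {
    "Lambda H100 SXM on-demand ($/hr)": (2.99, 3.99),
    "Together HGX B200 dedicated ($/hr)": (11.95, 8.99),
    "Speechmatics real-time enhanced ($/hr)": (0.56, 0.43),
    "Anthropic Opus input ($/1M)": (15.00, 5.00),
}

for label, (old, new) in moves.items():
    print(f"{label}: {pct_change(old, new):+.1f}%, old/new = {fold_cut(old, new):.2f}")
```

Keeping both measures side by side helps when a row mixes them: a "~3× cut" and a "-67%" move describe the same change.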

Trivia

DeepSeek V2's May 2024 launch at $0.14 per million input tokens reset the entire industry's price-floor expectations for coding-capable models — within 18 months every major frontier vendor had cut per-token prices at least once. No prior software category has experienced a 10-100× cost compression within two years of reaching mainstream adoption, making token deflation structurally unlike any historical software pricing precedent.
Anthropic's August 2024 prompt-caching launch produced the largest single-day effective price reduction in the corpus (up to 80% off cached input) without any model change. It shows that token deflation arrives through infrastructure optimisation (caching, batching) as well as through model-generation cuts, and that the two mechanisms compound. If the discounts stack, a buyer using both batch (50% off) and caching (up to 80% off cached portions) pays as little as 10% of list price on cached input (0.5 × 0.2 = 0.1), so workloads dominated by a stable system prompt approach a 90% reduction in effective input cost.
The scissors dynamic — token API prices falling while consumer subscription ceilings rise — is the defining pricing paradox of the 2024-2026 AI market. You.com's 6.7× consumer repricing (from $30 Team to $200 Max) and Character.ai's ad insertion for free users both happened while the underlying token rates kept deflating. The divergence shows that AI products have successfully decoupled human-facing access pricing from the compute costs beneath them.
Anthropic's June 2026 Fable 5 / Mythos 5 launch is the corpus's clearest proof that token deflation is per-tier, not per-vendor: the company opened a new $10/$50 flagship band — exactly double Opus 4.8's $5/$25 — without cutting or raising any existing tier. Three weeks earlier (2026-05-30) it had rebased Opus itself ~3× downward from the legacy $15/$75. So inside one product line and one month, the established tiers deflated while the 'best-available' ceiling re-inflated 2×, confirming that what falls is the price of a fixed capability level, not the price of the frontier.
Together AI cut its on-demand H100 rate twice inside a single week ending 2026-06-30 ($4.79 → $3.99/hr), and its $3.09/hr reserved floor now sits BELOW Lambda Labs' $3.99 on-demand rate logged just three weeks earlier — the first time in the corpus that a managed-cluster vendor's reserved price has undercut a peer's on-demand price, evidence that the 'GPU scarcity inflates access' counter-current is breaking down under managed-cluster competition.
The $3/$15 Claude Sonnet band was the single most durable price constant in the corpus — held flat across four model generations from Claude 3 through Sonnet 4.6 while capability climbed. On 2026-07-06 Sonnet 5 finally bent it, launching at an introductory $2/$10 through Aug 31 before reverting to $3/$15. The anchor didn't break permanently — it dipped — but even a temporary intro cut on the corpus's most stable price is the clearest sign that generational deflation now reaches the tiers vendors most want to keep stable.
The same 2026-07-06 batch that pushed Sonnet's anchor down also produced the corpus's first frontier-lab token *increase*: Mistral raised Small 4 by 50% on input and 100% on output. Two frontier labs moved their token prices in opposite directions on the same date — the cleanest proof that token deflation is a dominant trend, not a law: a specific under-priced SKU can still be repriced up mid-life.
DeepInfra broke its own brand on 2026-07-14: the vendor whose reputation in cost-sensitive communities was built on relentless public price CUTS raised dedicated GPU-hour rates 16-32% across the board (H100 +23%, B200 +32%) and lifted its popular DeepSeek-V3.1 tokens from $0.21/$0.79 to $0.25/$0.95 — the corpus's first case of a commodity-inference vendor raising its WHOLE card rather than one under-priced SKU, evidence that even the deflation poster child can be forced up when Blackwell-generation supply and demand tighten.
xAI priced a new flagship ABOVE its predecessor for the first time on 2026-07-14: grok-4.5 launched at $2/$6 per 1M versus the still-live grok-4.3 at $1.25/$2.50, reversing xAI's multi-year per-token markdown and keeping the cheaper model live so the premium is reserved for the higher-intelligence tier. Combined with Google's AlphaEvolve 3×-all-in agent surcharge the same day, the frontier ceiling re-inflated via BOTH a higher flagship and an agent multiplier while Together (−15%/−25%) and RunPod kept cutting GPU-hour rates beneath them.
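
The batch-plus-caching arithmetic from the prompt-caching item above can be sketched in Python. This assumes the two discounts multiply, which the source does not confirm for any particular vendor. `cached_share` is a hypothetical parameter for the fraction of input tokens served from cache, and the 80% and 50% defaults are the figures quoted in this section:

```python
def effective_input_price(
    list_price: float,
    cached_share: float,
    cache_discount: float = 0.80,
    batch_discount: float = 0.50,
) -> float:
    """Blended input price per 1M tokens after caching and batch discounts.

    Assumption: the discounts stack multiplicatively. Not every vendor
    guarantees this, so check the rate card before relying on it.
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    cached = cached_share * list_price * (1.0 - cache_discount)
    uncached = (1.0 - cached_share) * list_price
    return (cached + uncached) * (1.0 - batch_discount)


LIST_PRICE = 3.00  # $/1M input, the Sonnet band quoted in this section
for share in (0.0, 0.5, 0.9, 1.0):
    price = effective_input_price(LIST_PRICE, share)
    saving = 1.0 - price / LIST_PRICE
    print(f"cached share {share:.0%}: ${price:.3f}/1M ({saving:.0%} saving)")
```

Under these assumptions the saving tops out at 90% when every input token is a cache hit, and is lower for any workload with an uncached portion.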

For buyers

Re-baseline token-cost assumptions at least twice a year — a model you priced six months ago is usually cheaper now, and a newer model may be cheaper and better. Don't extrapolate the deflation to seats: model the workload as raw metered tokens (falling) separately from per-seat access (rising), and pick the cheaper envelope deliberately.
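
Comparing the two envelopes can be sketched as a small calculation. This is an illustration, not a procurement model: the $5/$25 rates are the Opus band from the evidence table, $200 is the access ceiling cited in this article, the monthly token volumes are hypothetical, and the comparison assumes a seat would actually cover the same workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    input_per_million: float   # $ per 1M input tokens
    output_per_million: float  # $ per 1M output tokens


def metered_monthly_cost(rates: TokenRates, input_tokens: int, output_tokens: int) -> float:
    """Monthly API bill for a given token volume at the given rates."""
    return (
        input_tokens / 1_000_000 * rates.input_per_million
        + output_tokens / 1_000_000 * rates.output_per_million
    )


def cheaper_envelope(
    rates: TokenRates, input_tokens: int, output_tokens: int, seat_price: float
) -> tuple[str, float]:
    """Return whichever envelope costs less for this workload, with its cost."""
    metered = metered_monthly_cost(rates, input_tokens, output_tokens)
    if metered < seat_price:
        return ("metered", metered)
    return ("seat", seat_price)


opus_band = TokenRates(input_per_million=5.00, output_per_million=25.00)
SEAT_PRICE = 200.00  # the ~$200/mo access ceiling cited in this article

# Hypothetical per-user monthly volumes, for illustration only.
light = cheaper_envelope(opus_band, 20_000_000, 2_000_000, SEAT_PRICE)
heavy = cheaper_envelope(opus_band, 60_000_000, 6_000_000, SEAT_PRICE)
print(light)
print(heavy)
```

Re-running the comparison whenever a rate card changes is the practical form of re-baselining: the crossover volume moves every time token prices fall.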

For vendors

To ride deflation you need per-model rate cards that can change without breaking contracts, plus structural discounts (caching, batch) to cut effective price without touching headline rates. To monetise access, a premium seat tier with priority or uncapped usage captures the value that falling tokens leave on the table.
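
One way to get rate cards that can change without breaking contracts is to keep them append-only with effective dates, so a contract points at the card rather than at a fixed number. A minimal sketch under that assumption: the `RateCard` and `RateVersion` names are hypothetical, and the dates and prices are the Sonnet 5 introductory and standard rates cited in the evidence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RateVersion:
    effective_from: date
    input_per_million: float
    output_per_million: float


class RateCard:
    """Append-only price history for one model (hypothetical design)."""

    def __init__(self, model: str) -> None:
        self.model = model
        self._versions: list[RateVersion] = []

    def publish(self, version: RateVersion) -> None:
        """Add a new version; it must start after the latest one."""
        if self._versions and version.effective_from <= self._versions[-1].effective_from:
            raise ValueError("new rates must take effect after the latest version")
        self._versions.append(version)

    def rate_on(self, day: date) -> RateVersion:
        """Return the version in force on the given day."""
        current = None
        for version in self._versions:  # kept in ascending date order by publish()
            if version.effective_from <= day:
                current = version
        if current is None:
            raise LookupError(f"no published rate for {self.model} on {day}")
        return current


card = RateCard("claude-sonnet-5")
card.publish(RateVersion(date(2026, 7, 6), 2.00, 10.00))  # introductory rate
card.publish(RateVersion(date(2026, 9, 1), 3.00, 15.00))  # standard rate resumes

print(card.rate_on(date(2026, 8, 15)))
print(card.rate_on(date(2026, 9, 1)))
```

Because old versions are never edited, usage can be billed at the rate in force on the day it occurred, and a price cut is just one more `publish` call.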

Outlook — what to watch

Expect another leg down: open-weight models (DeepSeek, Llama-class) keep resetting the floor, and caching/batch discounts are spreading. The deflation only stalls if frontier capability stops commoditising — watch whether any vendor can hold a premium token price on a genuinely differentiated model. Access tiers still have room to climb.

Bottom line

Compute is deflating relentlessly while access inflates. Every frontier vendor in the corpus has cut per-token prices at least once; the value is migrating from the token to the seat.

FAQ

Are AI API prices going up or down?

Mostly down. Every frontier vendor in the corpus has cut per-token API prices at least once, usually at each model release, and layered caching and batch discounts on top. The exceptions are consumer subscriptions, which are rising, and occasional upward repricings of a specific SKU or a new top tier.

Why are ChatGPT and Claude subscriptions getting more expensive if tokens are cheaper?

The token (raw compute) and the seat (human access) sit on different curves. Vendors pass compute deflation to API developers but capture value from power users through premium ~$200/mo access tiers.

How often should I re-check my model costs?

At least twice a year. A model you priced six months ago is usually cheaper now, and a newer one is often cheaper and better: Claude 3.5 Sonnet beat Claude 3 Opus at one-fifth of the price ($3/$15 vs $15/$75 per 1M).
