How much does Groq cost per month?

Groq has no monthly subscription fee — you pay only for the tokens you consume on serverless inference, plus per-hour audio transcription, per-use built-in tools, and per-session agentic infrastructure. A small chat application using Llama 3.1 8B at 30M input + 10M output tokens would cost roughly $2.30/month — Groq's per-token rates are among the lowest in the market for small-model inference.

What are Groq's per-token rates for popular models?

Per 1M tokens (input/output): Llama 3.1 8B Instant $0.05/$0.08 at 840 TPS; GPT OSS 20B $0.075/$0.30 at 1,000 TPS; Llama 4 Scout $0.11/$0.34 at 594 TPS; Qwen3 32B $0.29/$0.59 at 662 TPS; GPT OSS 120B $0.15/$0.60 at 500 TPS; Llama 3.3 70B Versatile $0.59/$0.79 at 394 TPS. Throughput numbers are published alongside prices to make the latency-cost trade-off transparent.

Does Groq have a free tier?

Yes — Groq's free tier offers rate-capped access to most models for experimentation. Throughput limits apply (low requests-per-minute and tokens-per-minute caps), with sustained production usage requiring a paid plan. Free tier credit-card-on-file is not required for evaluation.

How does Groq's LPU differ from GPU-based inference?

Groq's LPU (Language Processing Unit) is bespoke silicon designed exclusively for inference. Unlike GPUs which use HBM (high-bandwidth memory) and branch prediction, the LPU uses on-die SRAM and deterministic execution. The trade-off is lower per-chip memory capacity but vastly higher single-stream throughput — enabling 800+ tokens/second on Llama 3 8B and 500+ TPS on 120B-class models.

What does Groq charge for audio transcription and text-to-speech?

Transcription is billed per hour: Whisper Large v3 (higher accuracy) at $0.111 per hour and Whisper Large v3 Turbo (faster, slightly lower accuracy) at $0.04 per hour, both with a 10-second minimum per request and a 2.8× spread reflecting the accuracy-versus-speed trade-off. Text-to-speech arrived in July 2026 and is billed per 1 million characters instead: Canopy Labs Orpheus V1 English at $22.00 per 1M characters and Orpheus Arabic (Saudi) at $40.00 per 1M characters — pairing with Whisper to give Groq a full speech-in, speech-out audio round-trip on the LPU.

How are Groq's built-in tools priced?

Per-use pricing for agentic tools: web search at $5–$8 per 1,000 requests (depending on depth), website visits at $1 per 1,000 requests, code execution at $0.18/hour of compute, and browser automation at $0.08/hour (added July 2026). Cached input on serverless gets a 50% discount; Batch API gets a 50% discount for asynchronous workloads. The transparent per-use tool pricing is one of the most legible in the market.

Groq Pricing

AI Summary

Groq runs a pure-usage per-token serverless inference API on its proprietary LPU silicon, with published rates including Llama 3.1 8B Instant at $0.05 input / $0.08 output per 1M tokens (840 TPS), GPT OSS 20B at $0.075/$0.30, Llama 4 Scout at $0.11/$0.34, Qwen3 32B at $0.29/$0.59, GPT OSS 120B at $0.15/$0.60, and Llama 3.3 70B Versatile at $0.59/$0.79.
Audio transcription priced per hour: Whisper Large v3 at $0.111/hour (higher accuracy), Whisper Large v3 Turbo at $0.04/hour (faster), with 10-second minimum billing per request.
Cached input tokens get a 50% discount versus standard rates with no extra fee. Batch API offers 50% lower cost for asynchronous large-scale workloads.
Built-in agentic tools are priced per-use: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour, and browser automation at $0.08/hour. Text-to-speech launched as a new SKU billed per 1M characters — Canopy Labs Orpheus V1 English at $22.00 and Orpheus Arabic (Saudi) at $40.00 per 1M characters — completing a full audio round-trip (Whisper ASR in, Orpheus TTS out) on the LPU. Built-in agentic tools are priced per-use: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour, and browser automation at $0.08/hour. Text-to-speech launched as a new SKU billed per 1M characters — Canopy Labs Orpheus V1 English at $22.00 and Orpheus Arabic (Saudi) at $40.00 per 1M characters — completing a full audio round-trip (Whisper ASR in, Orpheus TTS out) on the LPU.
Marketing emphasizes 'linear and predictable' pricing with no hidden costs or infrastructure surprises. Free tier offers limited rate-capped access; Enterprise plans add custom SLAs and capacity reservations.
Founded 2016 by Jonathan Ross (original Google TPU engineer); raised $640M Series D in August 2024 led by BlackRock at $2.8B valuation; Saudi PIF-led round in 2025 brought valuation to $6.9B+. Built bespoke LPU silicon for inference.

Pricing summary

Groq 2026 — LPU per-token inference + per-hour audio + per-use tools

Free rate-capped tier; serverless from $0.05/$0.08 (Llama 3.1 8B); Whisper Turbo $0.04/hr; 50% Batch + cached

Free tier

$0 + rate caps

Developers, evaluation, low-volume prototypes

Developer

Per token (varies)

Latency-sensitive production AI apps

Annual commit

Enterprise

Custom

Sustained high-throughput workloads

Per-token serverless

From $0.05 /1M (Llama 8B)

Production token inference

Audio + TTS + Tools

From $0.04 /hr (Whisper Turbo)

Transcription, speech, agentic tools

LPU silicon enables 800+ TPS on Llama 3 8B and 500+ TPS on 120B-class models. Cached input (50%) and Batch API (50%) discounts apply on top of base rates. Whisper Turbo is 2.8× cheaper than Large v3; Orpheus text-to-speech bills per 1M characters.

About

Groq is a Mountain View-based AI inference company founded in April 2016 by Jonathan Ross — the original engineer behind Google’s Tensor Processing Unit (TPU). The product is GroqCloud, an inference API built on Groq’s proprietary LPU (Language Processing Unit) silicon. The LPU was designed exclusively for inference workloads: deterministic execution, on-die SRAM (no HBM), and architectural choices that prioritize single-stream throughput over GPU-style parallel batch efficiency. The result is published throughput of 800+ tokens/second on Llama 3 8B and 500+ TPS on 120B-class models, with per-token pricing that holds competitive with GPU-based serverless inference.

By 2026 Groq serves a mix of latency-sensitive production AI customers (voice assistants, real-time agentic systems, customer-facing chatbots) and high-throughput batch workloads. The company raised a $640M Series D in August 2024 led by BlackRock at $2.8B valuation, with Tiger Global, Samsung Catalyst, KDDI, and others participating; a 2025 round led by Saudi PIF brought valuation to $6.9B. The Saudi partnership was paired with a strategic commitment to build Groq’s largest data center in Saudi Arabia, signaling the company’s intent to compete on global inference capacity, not just per-token economics.

Groq competes with Fireworks AI, Together AI, Baseten, and Replicate for the managed-inference market, plus first-party providers (OpenAI, Anthropic) for general-purpose API customers. Its differentiation is bespoke LPU silicon (the only commercial inference platform built on inference-specific hardware), TPU-pioneer founder credibility (Jonathan Ross designed the original TPU), and the highest published throughput-per-token in the category — making Groq the canonical choice for latency-critical AI workloads where time-to-first-token matters more than raw cost.

Pricing summary : How Groq’s LPU + per-token + tools stack works

Groq runs a per-token serverless inference API priced by model, with throughput (TPS) published alongside each rate so customers can evaluate the latency-cost trade-off in a single view. The model catalog spans Llama (3.1 8B Instant at $0.05/$0.08 input/output per 1M tokens, 3.3 70B Versatile at $0.59/$0.79, 4 Scout at $0.11/$0.34), GPT OSS (20B at $0.075/$0.30, 120B at $0.15/$0.60, plus the new Safeguard 20B at $0.075/$0.30), Qwen3 (32B at $0.29/$0.59, 3.6 27B at $0.60/$3.00), Kimi K2 ($1.00/$3.00), Whisper for audio transcription, and — new in 2026 — Canopy Labs Orpheus text-to-speech billed per 1M characters. Audio transcription is priced per hour transcribed with a 10-second minimum per request; TTS is priced per 1M characters. Enterprise-only models (Minimax M2.5, Qwen3-VL 32B) are contact-us.

Cached input tokens get a 50% discount on standard rates (e.g. GPT OSS 20B cached input at $0.0375 vs $0.075 uncached), and the Batch API offers a 50% discount for asynchronous workloads. Built-in agentic tools have transparent per-use pricing: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour, and browser automation at $0.08/hour. Free tier provides rate-capped access for evaluation; Enterprise plans add custom SLAs, dedicated LPU capacity reservations, volume discount commits, and VPC deployment.

This single-SKU plus per-tool structure — token / hour-audio / per-use-tool — is one of the most legible usage-based rate cards in AI inference middleware. The marketing positioning explicitly emphasizes “linear and predictable” pricing as a core value proposition.

What makes this different: Groq’s LPU silicon delivers throughput that GPU-based competitors cannot match at the same price point. Llama 3.1 8B at 840 TPS is roughly 4–10× faster than Llama 8B on Fireworks or Together at comparable prices — and the latency advantage compounds for multi-turn agentic workflows where time-to-first-token determines user experience.

Pricing by product

Serverless per-token inference (with published TPS)

Model	Input ($/1M)	Output ($/1M)	Throughput
Llama 3.1 8B Instant 128k	$0.05	$0.08	840 TPS
GPT OSS 20B 128k	$0.075	$0.30	1,000 TPS
GPT OSS Safeguard 20B	$0.075	$0.30	1,000 TPS
Llama 4 Scout (17Bx16E) 128k	$0.11	$0.34	594 TPS
GPT OSS 120B 128k	$0.15	$0.60	500 TPS
Qwen3 32B 131k	$0.29	$0.59	662 TPS
Llama 3.3 70B Versatile 128k	$0.59	$0.79	394 TPS
Qwen 3.6 27B 131k	$0.60	$3.00	500 TPS
Kimi K2 Instruct (moonshotai)	$1.00	$3.00	—

Enterprise-only models (contact sales for pricing): Minimax M2.5, Qwen3-VL 32B.

Text-to-speech (per 1M characters)

Model	Rate per 1M characters	Characters/s
Canopy Labs Orpheus V1 English	$22.00	100
Canopy Labs Orpheus Arabic (Saudi)	$40.00	100

Audio transcription (per hour, 10-second minimum)

Model	Rate per hour	Speed factor	Notes
Whisper Large v3	$0.111	217x	Higher accuracy, slower
Whisper Large v3 Turbo	$0.04	228x	Faster, slightly lower accuracy

Built-in agentic tools

Tool	Rate	Unit
Web search (basic)	$5	per 1,000 requests
Web search (advanced)	$8	per 1,000 requests
Website visits	$1	per 1,000 requests
Code execution	$0.18	per hour of compute
Browser automation	$0.08	per hour

Prompt caching (cached input discount)

Model	Uncached input ($/1M)	Cached input ($/1M)	Output ($/1M)
Kimi K2 Instruct 0905	$1.00	$0.50	$3.00
GPT OSS 120B	$0.15	$0.075	$0.60
GPT OSS 20B	$0.075	$0.0375	$0.30

Discount mechanics

Mechanism	Discount	Applies to
Cached input tokens	50% off input rate	Serverless models with prefix re-use (no extra fee; only on cache hit)
Batch API	50% off standard rate	Asynchronous workloads (24-hour to 7-day processing window)

Sales motions across products: PLG / self-serve for Free and Developer tier; sales-led for Enterprise annual commits, dedicated LPU capacity, VPC deployments, and enterprise-only models. All prices accessed 2026-07-14 from groq.com/pricing.

Hidden costs : What Groq customers actually pay beyond the per-token rate

Archetype A: Voice assistant startup on Llama 3.1 8B Instant

A latency-critical voice assistant startup serving ~200K requests/day at average 1K input + 250 output tokens, using Whisper Turbo for upstream audio transcription:

Line item	Monthly cost
Input tokens (6M/day × 30 = 180M, Llama 8B at $0.05/1M)	$9
Output tokens (1.5M/day × 30 = 45M, Llama 8B at $0.08/1M)	$3.60
Whisper Turbo transcription (avg 4 hours audio/day × 30 × $0.04)	$4.80
Web search tool (occasional, 5K requests/mo × $0.005)	$25
Estimated total	~$42/month

For latency-critical voice workloads, Groq’s per-token rates plus Whisper Turbo deliver one of the lowest published total costs in the category. The same workload on a comparable GPU-based platform would typically cost 2–4× more — and would not match the latency.

Archetype B: Mid-market team running mixed Llama 70B + GPT OSS 120B with agentic tools

A 50-person team building an internal AI assistant using Llama 3.3 70B Versatile for general queries and GPT OSS 120B for reasoning-heavy tasks, with frequent agentic tool usage:

Line item	Monthly cost
Llama 3.3 70B (30M input + 10M output)	$17.70 + $7.90 = $25.60
GPT OSS 120B (10M input + 5M output)	$1.50 + $3.00 = $4.50
Cached input savings (40% cache hit on system prompts)	-$7
Web search ($5/1K × 40K requests/mo)	$200
Code execution (50 hours × $0.18)	$9
Estimated total	~$232/month

Built-in tool usage dominates the bill at this scale — and the transparency of the tool rate card lets finance teams forecast tool spend accurately. Most platforms either bundle tools into opaque tier pricing or force customers onto third-party tool providers, making forecasting harder.

Want to estimate your own Groq bill? Use the Groq pricing calculator to model token spend by model, audio transcription hours, and built-in tool usage.

Pricing evolution : Groq’s pricing history from LPU prototype to commercial inference category

Cadence

Quarter	Price changes	Product / SKU additions	Notes
2016 Q2	0	1	Groq founded; LPU silicon design begins
2023 Q1	0	1	GroqCloud public launch with Llama 2 + Mistral
2024 Q1	0	1	Llama 3 day-one support; published TPS rates
2024 Q3	0	0	Series D ($640M) at $2.8B valuation
2025 Q1	0	1	Whisper Large v3 + Turbo audio SKUs launched
2025 Q2	1	1	Batch API + cached input 50% discounts launched
2025 Q3	0	0	Saudi PIF investment at $6.9B valuation
2026 Q1	0	1	Built-in tools per-use pricing published
2026 Q3	0	3	Orpheus text-to-speech SKU (per 1M chars), Browser Automation tool ($0.08/hr), new models (GPT OSS Safeguard 20B, Qwen 3.6 27B, Kimi K2 + enterprise-only Minimax M2.5, Qwen3-VL 32B)

Tracked range: 2016 Q2–2026 Q3. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

2023-03-22 — GroqCloud public launch with Llama 2 + Mistral; LPU silicon as the structural differentiator.
2024-02-19 — Llama 3 day-one support at 800+ TPS for the 8B variant; established Groq as the throughput leader.
2025-03-04 — Whisper Large v3 ($0.111/hr) and Whisper Large v3 Turbo ($0.04/hr) for audio transcription with 10-second minimum billing.
2025-05-13 — Batch API and cached input 50% discounts launched; brought Groq to parity with Fireworks and Anthropic on discount mechanics.
2026-01-22 — Built-in tools per-use pricing published: web search $5–$8/1K, website visits $1/1K, code execution $0.18/hour.
2026-07-14 — Text-to-speech launched as a new billing dimension: Canopy Labs Orpheus V1 English at $22.00/1M characters, Orpheus Arabic (Saudi) at $40.00/1M characters — pairing with Whisper ASR to close a full audio round-trip on the LPU. A Browser Automation built-in tool joined the per-use rate card at $0.08/hour, and the model catalog grew (GPT OSS Safeguard 20B $0.075/$0.30, Qwen 3.6 27B $0.60/$3.00, Kimi K2 $1.00/$3.00 with $0.50 cached, plus enterprise-only Minimax M2.5 and Qwen3-VL 32B). Core per-token rates for Llama, GPT OSS, and Qwen3 held stable — a packaging-and-catalog expansion, not a repricing.

What’s unique : Groq’s distinctive pricing mechanics

1. Throughput (TPS) published alongside per-token rates. Groq is one of the only inference platforms that publishes throughput (tokens per second) next to every per-token rate. For latency-critical workloads (voice agents, real-time interactive UIs), this transparency lets customers evaluate the latency-cost trade-off in a single view rather than running benchmark calls. Most competitors hide TPS behind documentation pages or marketing claims.

2. Bespoke LPU silicon as the structural cost moat. The LPU is the only commercial inference platform built on inference-specific silicon (rather than repurposed GPU). This delivers throughput that GPU-based competitors cannot match at comparable prices — making Groq’s $0.05/$0.08 Llama 3.1 8B rate genuinely hard to replicate without similar silicon investment.

3. Per-use agentic tool pricing in line-item form. Web search ($5–$8/1K), website visits ($1/1K), and code execution ($0.18/hour) are priced individually rather than bundled into opaque tiers. For finance teams forecasting agent-loop costs, the transparent tool rate card reduces budget uncertainty in a way that bundled-tools competitors cannot match.

4. Full audio round-trip (Whisper ASR in, Orpheus TTS out) on one silicon platform. The 2.8× price spread ($0.111/hr vs $0.04/hr) between Whisper Large v3 and Whisper Large v3 Turbo lets customers pick accuracy-vs-speed explicitly on transcription. As of 2026-07-14, the Orpheus text-to-speech launch ($22.00/1M chars English, $40.00/1M chars Arabic) closes the loop — speech-to-text and text-to-speech now run on the same LPU, so a low-latency voice agent can do both legs without leaving Groq or bolting on a second vendor. Note that TTS introduces a fourth billing unit (per 1M characters) alongside per-token, per-hour, and per-request — each modality is metered in its own native unit rather than being force-fit into tokens.

5. Cached input + Batch API discount mechanics in standard rate card. Both discounts apply at 50% off base rates, bringing Groq to parity with Fireworks, OpenAI, and Anthropic. The discount mechanics are the same across all eligible models — making them legible to customers without requiring per-model discount lookup. This discount-as-default architecture is becoming canonical for token-priced inference.

Strengths & weaknesses

Strengths	Weaknesses
TPS published alongside per-token rates — best latency-cost transparency in category	Discounts (50% cached, 50% batch) don’t stack — pick one
Bespoke LPU silicon delivers 4–10× higher TPS than GPU competitors at similar rates	Free tier rate caps may be restrictive for early-stage evaluation
Per-use agentic tool pricing in line-item form (web search, visits, code exec)	Audio transcription only via Whisper (no other ASR model options)
Two-variant audio (Large v3 vs Turbo) lets customers pick accuracy-vs-speed	Model catalog smaller than Fireworks/Together (no dedicated GPU or per-image SKUs)
Full audio round-trip on one platform since 2026-07-14 (Whisper ASR in, Orpheus TTS out) — no second speech vendor for voice agents	TTS is Orpheus-only and ships just two voices/languages (English + Arabic-Saudi) — narrower than dedicated TTS providers
TPU-pioneer founder credibility (Jonathan Ross)	LPU silicon production scaling is a structural risk — supply must keep pace with demand
”Linear and predictable” pricing positioning is genuinely accurate	No fine-tuning SKU published — customers must use other platforms for custom-trained models

Billing UX : Groq’s account controls and payment experience

Self-serve signup — Sign up at console.groq.com with email; free tier access enabled immediately, no credit card required.
Per-token usage console — Real-time view of per-model token consumption (input, output, cached) and per-request latency metrics.
TPS transparency in console — Per-request TPS shown alongside latency, letting developers verify throughput SLAs in real time.
Spend alerts — Configurable email alerts at $X spend per period; hard spend caps available on Developer credit-card plans.
Payment methods — Credit card on self-serve Developer; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
Cached input transparency — API responses include cached input token count separately, letting customers verify cache-hit rates per request.
Annual commit pricing — Enterprise customers receive volume discounts in exchange for annual usage commitments and dedicated LPU capacity.
Multi-region availability — US regions standard; international expansion (including Saudi Arabia data center) underway.
Audit logging + RBAC — Workspace-level RBAC on Pro+; SOC 2 audit-log exports on Enterprise via webhook or S3 delivery.

Strategic wins : Why Groq’s pricing decisions worked

1. Bespoke LPU silicon as the structural cost moat

By investing in inference-specific silicon rather than repurposing GPU, Groq created a structural cost advantage that GPU-based competitors cannot replicate without comparable silicon investment. The LPU’s 800+ TPS on Llama 3 8B at $0.05/$0.08 is not just a marketing claim — it is the canonical proof that inference-specific silicon outperforms training-derived GPU silicon for sustained inference workloads.

2. TPS published alongside per-token rates removed evaluation friction

By publishing throughput (TPS) next to every per-token rate, Groq made the latency-cost trade-off legible without benchmark calls. For latency-critical workloads (voice agents, interactive UIs), this transparency converts more self-serve customers and reduces sales-led overhead. Most competitors hide TPS behind benchmark blog posts or marketing claims.

3. TPU-pioneer founder credibility as the silicon-trust anchor

Jonathan Ross’s role as the original Google TPU engineer gives Groq unusual credibility on silicon architecture — making the LPU’s throughput claims believable in a way that pure-engineering teams cannot replicate. For enterprise procurement leaders evaluating bespoke-silicon inference, the founder credentials act as a trust multiplier that distinguishes Groq from generic inference middleware.

4. Per-use agentic tool pricing preempted bundled-tool competitor lock-in

By publishing per-use rates for web search, website visits, and code execution as line items on the rate card, Groq let customers forecast agent-loop costs without negotiating contracts or accepting opaque tool bundling. This transparent agentic pricing is a meaningful competitive advantage as agentic workflows scale and tool usage becomes the dominant cost line.

Areas to improve : Gaps in Groq’s pricing approach

1. Cached input + Batch API discounts should stack

Fireworks lets cached input (50%) and Batch API (50%) discounts stack — making batched cached workloads land at 25% of standard. Groq’s discounts do not stack — customers must pick one. Adding stacking would close a meaningful competitive gap with Fireworks for RAG and agent-loop workloads at scale.

2. No published fine-tuning SKU

Customers wanting custom-trained models must use Fireworks, Together, Anyscale, or first-party model providers. Adding a fine-tuning SKU (even at limited model coverage initially) would let Groq capture the model-customization workload that currently goes elsewhere. The LPU’s deterministic execution may make fine-tuning architecturally awkward, but customers want the option.

3. Free tier rate caps may be too restrictive

Free tier rate limits are tight enough that meaningful evaluation requires a credit card and Developer-tier upgrade. Loosening free-tier caps (or offering a $5–$10 trial credit) would let self-serve developers reach proof-of-value without paid commitment and likely accelerate self-serve revenue.

4. Audio transcription only via Whisper

Whisper is the dominant open-source ASR model, but customers wanting Deepgram-comparable accuracy or specialized vertical models (medical, legal) must use other providers. Adding a second ASR model option (or a transcription quality tier above Whisper) would broaden the audio TAM and reduce vendor sprawl for audio-heavy workloads.

Monetization stack & signals : how Groq builds & buys its revenue engine

Buys 0 Builds 3

The read — where the monetization investment is going

Groq builds the revenue engine behind its own price: the token meter, progressive/region-aware usage billing with org-wide spend caps, and project-level cost allocation are all first-party in the GroqCloud console — no metering or FinOps vendor named. The only unconfirmed piece is the PSP behind self-serve card/ACH/SEPA checkout.

Stack — build vs buy

Builds in-house · 3

In-house token metering Metering Docs Jun 2026

“Groq meters consumption by tokens across models; customers "monitor your usage and charges in near real-time directly within your Groq Cloud dashboard" (spend tracking refreshes "every 10-15 minutes" per the spend-limits docs) — a first-party meter, no third-party metering vendor named.”
In-house usage billing (progressive thresholds + spend caps) Billing Docs 1 Docs 2 Jun 2026

“Bespoke billing logic: an invoice "is automatically triggered and payment is deducted when your cumulative usage reaches specific thresholds: $1, $10, $100, $500, and $1,000," after which the account moves to monthly arrears; India uses "$1, $10, and then $100 recurring," with a $0.50 minimum — region-specific rules a generic billing SKU does not ship.”
In-house project-level cost allocation Cost & FinOps Docs Jun 2026

“Projects give per-project cost centers: "track spending, usage patterns, and resource consumption at the project level. This granular visibility enables accurate cost allocation, budget management, and ROI analysis" — first-party showback/chargeback, not a bought FinOps tool.”

Unconfirmed · 1

Payments Payments inferred Docs Jun 2026

“Self-serve checkout accepts "credit cards (Visa, MasterCard, American Express, Discover), United States bank accounts, and SEPA debit accounts" — card + ACH + SEPA rails plus India recurring thresholds imply a payment orchestrator, but no PSP (Stripe etc.) is named anywhere in the docs.”

Signals reviewed Jun 2026 · derived from product docs

Key takeaways

Bespoke silicon as the structural moat. Groq’s LPU is the only commercial inference platform built on inference-specific silicon. The throughput advantage (800+ TPS on Llama 3 8B) is genuinely hard to replicate without comparable silicon investment — a durable competitive moat that pure-software optimization cannot match.
Throughput transparency converts latency-critical buyers. Publishing TPS alongside per-token rates makes the latency-cost trade-off legible for voice-agent, interactive-UI, and real-time agentic workloads where time-to-first-token determines user experience. Inference platforms targeting latency-sensitive workloads should publish TPS on the rate card.
Per-use agentic tool pricing is the next transparency frontier. As agent loops scale, web search and code execution costs become the dominant bill line. Platforms that publish per-tool rates in line-item form win finance-team trust over competitors that bundle tools into opaque tiers.
A full audio round-trip on one platform. Two-variant Whisper transcription (Large v3 vs Turbo, a 2.8× spread wide enough to matter and narrow enough to be a real choice) captures both quality- and cost-sensitive ASR workloads; the July 2026 Orpheus text-to-speech launch adds the return leg, so a voice agent can transcribe and synthesize on the same LPU without a second speech vendor. The TTS SKU is metered per 1M characters — its own native unit rather than a token proxy.
TPU-pioneer founder credibility is the most durable silicon-platform trust anchor. Jonathan Ross’s role as the original Google TPU engineer gives Groq unusual credibility on architectural claims — distinguishing it from generic inference middleware and converting silicon-skeptical enterprise procurement leaders.

UBP implications

Throughput (TPS) as a published metric on token-priced rate cards lets usage-based platforms compete on latency-cost rather than just absolute per-token price. Inference platforms that hide TPS behind documentation lose latency-sensitive buyers who cannot self-qualify.
Per-use line-item tool pricing is becoming the canonical structure for agentic UBP. Bundled-tool pricing creates forecasting uncertainty that finance teams reject; transparent per-tool rates win procurement trust. The July 2026 additions (browser automation at $0.08/hour, Orpheus TTS at $22–$40 per 1M characters) show the model scaling cleanly to new modalities — Groq now meters four distinct billing units (tokens, hours, requests, characters), each in its native unit, without repricing the existing catalog. Metering each modality in its own unit rather than normalizing everything to tokens keeps every SKU independently forecastable.
Bespoke silicon as a UBP moat is the structural cost advantage that pure-software platforms cannot replicate. Groq’s LPU is the canonical example — but the precedent suggests that other inference-specific silicon investments (Cerebras, Tenstorrent, Etched) will reshape competitive dynamics over the coming decade.

Sources

Groq pricing page (accessed 2026-05-29)
Groq docs — models catalog (accessed 2026-05-29)
Groq blog — Llama 3 launch (accessed 2026-05-29)
Groq blog — Series D announcement (accessed 2026-05-29)
Groq Saudi Arabia partnership (accessed 2026-05-29)
Related infra blueprint — Cerebras
Related infra blueprint — Fireworks AI
Blueprint corpus index

Bottom line

Groq priced its inference API around three structural ideas: bespoke LPU silicon that delivers throughput GPU-based competitors cannot match (800+ TPS on Llama 3 8B at $0.05/$0.08), TPU-pioneer founder credibility (Jonathan Ross designed the original Google TPU) that justifies the silicon investment, and transparent per-use agentic tool pricing (web search at $5–$8/1K, website visits at $1/1K, code execution at $0.18/hour) that makes Groq one of the most legible commercial inference platforms in the market. Marketing positioning explicitly emphasizes “linear and predictable” pricing — and the rate card structure makes the claim genuinely accurate.

For AI engineering teams running latency-critical voice agents, real-time interactive UIs, and high-throughput batch workloads, Groq delivers throughput and per-token economics that GPU-based platforms cannot match. The remaining gaps (no fine-tuning SKU, audio limited to Whisper, restrictive free tier, non-stacking discounts) are TAM-expansion problems rather than structural pricing flaws.

Compare with peers via the blueprint corpus, or model your own spend with the Groq pricing calculator.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment. Major events use a filled marker; minor adjustments use a faded one.

Text-to-Speech SKU + Browser Automation Tool + New Models

Jul 2026

Groq added a text-to-speech product line billed per 1M characters (Canopy Labs Orpheus V1 English at $22.00/1M chars, Orpheus Arabic Saudi at $40.00/1M chars), a new Browser Automation built-in tool at $0.08/hour, and several new models (GPT OSS Safeguard 20B $0.075/$0.30, Qwen 3.6 27B $0.60/$3.00, Kimi K2 $1.00/$3.00 with $0.50 cached input) plus enterprise-only Minimax M2.5 and Qwen3-VL 32B. Core per-token rates for Llama, GPT OSS, and Qwen3 held stable.

Text-to-Speech SKU + Browser Automation Tool + New Models screenshot 1

Text-to-Speech SKU + Browser Automation Tool + New Models screenshot 2

Built-In Tools Pricing: Web Search, Visits, Code Execution

Jan 2026

Groq published per-use pricing for built-in agentic tools: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour. Made Groq one of the few platforms with transparent line-item billing on agentic tool usage.

Saudi PIF Investment at $6.9B Valuation

Aug 2025

Groq raised a Saudi PIF-led round at a $6.9B post-money valuation, with the funding earmarked for international data center expansion and additional LPU production. The Saudi investment was paired with a strategic partnership to build Groq's largest data center in Saudi Arabia.

Batch API + Cached Input Discounts

May 2025

Groq launched a Batch API at 50% discount for asynchronous workloads and added cached input token discounts at 50% off standard rates. Brought Groq into parity with Fireworks, OpenAI, and Anthropic on standard inference discount mechanics.

Whisper Large v3 + Turbo Variants at Per-Hour Pricing

Mar 2025

Groq added Whisper Large v3 ($0.111/hour transcribed) and Whisper Large v3 Turbo ($0.04/hour transcribed) for audio transcription. Per-hour billing with 10-second minimum per request established the audio SKU structure that remains canonical.

Series D ($640M) at $2.8B Valuation

Aug 2024

Groq raised a $640M Series D led by BlackRock at a $2.8B post-money valuation. Tiger Global, Samsung Catalyst, KDDI, and others participated. The round funded LPU silicon production scaling and the GroqCloud Enterprise tier launch.

Llama 3 Day-One Support at Industry-Leading TPS

Feb 2024

Groq launched Llama 3 support on day one with throughput exceeding 800 tokens/second for the 8B variant — the fastest published inference rate for any commercial Llama 3 endpoint at launch. Per-token pricing held competitive with serverless GPU competitors.

GroqCloud Public Launch

Mar 2023

Groq launched GroqCloud — its inference API — with Llama 2 and Mistral models at competitive per-token rates. The launch positioning emphasized speed-of-inference (hundreds of TPS) rather than absolute cheapness, marketing LPU silicon as a structural advantage over GPU-based inference middleware.

Groq Founded

Apr 2016

Jonathan Ross (original engineer behind Google's TPU) founded Groq to build bespoke inference silicon. The company spent 2016–2022 designing the LPU (Language Processing Unit) architecture before launching GroqCloud — the commercial inference API — in early 2023.

Trivia

· Groq's Llama 3.1 8B Instant at 840 tokens/second is one of the fastest published throughput rates for any 8B-class model in commercial inference — the LPU silicon architecture is designed specifically for inference rather than training, which is what makes the speed and pricing combination possible.
· Groq was founded in 2016 by Jonathan Ross, the original engineer behind Google's TPU (Tensor Processing Unit) — making it the rare commercial AI inference platform built on bespoke silicon designed by the same engineer who pioneered modern ML accelerators.
· Groq's LPU (Language Processing Unit) deliberately avoids the GPU model: each chip has deterministic execution, no HBM (uses on-die SRAM), and no GPU-style branch prediction — the trade-off is lower per-chip memory but vastly higher single-stream throughput.

Questions & answers

How much does Groq cost per month?: Groq has no monthly subscription fee — you pay only for the tokens you consume on serverless inference, plus per-hour audio transcription, per-use built-in tools, and per-session agentic infrastructure. A small chat application using Llama 3.1 8B at 30M input + 10M output tokens would cost roughly $2.30/month — Groq's per-token rates are among the lowest in the market for small-model inference.
What are Groq's per-token rates for popular models?: Per 1M tokens (input/output): Llama 3.1 8B Instant $0.05/$0.08 at 840 TPS; GPT OSS 20B $0.075/$0.30 at 1,000 TPS; Llama 4 Scout $0.11/$0.34 at 594 TPS; Qwen3 32B $0.29/$0.59 at 662 TPS; GPT OSS 120B $0.15/$0.60 at 500 TPS; Llama 3.3 70B Versatile $0.59/$0.79 at 394 TPS. Throughput numbers are published alongside prices to make the latency-cost trade-off transparent.
Does Groq have a free tier?: Yes — Groq's free tier offers rate-capped access to most models for experimentation. Throughput limits apply (low requests-per-minute and tokens-per-minute caps), with sustained production usage requiring a paid plan. Free tier credit-card-on-file is not required for evaluation.
How does Groq's LPU differ from GPU-based inference?: Groq's LPU (Language Processing Unit) is bespoke silicon designed exclusively for inference. Unlike GPUs which use HBM (high-bandwidth memory) and branch prediction, the LPU uses on-die SRAM and deterministic execution. The trade-off is lower per-chip memory capacity but vastly higher single-stream throughput — enabling 800+ tokens/second on Llama 3 8B and 500+ TPS on 120B-class models.
What does Groq charge for audio transcription and text-to-speech?: Transcription is billed per hour: Whisper Large v3 (higher accuracy) at $0.111 per hour and Whisper Large v3 Turbo (faster, slightly lower accuracy) at $0.04 per hour, both with a 10-second minimum per request and a 2.8× spread reflecting the accuracy-versus-speed trade-off. Text-to-speech arrived in July 2026 and is billed per 1 million characters instead: Canopy Labs Orpheus V1 English at $22.00 per 1M characters and Orpheus Arabic (Saudi) at $40.00 per 1M characters — pairing with Whisper to give Groq a full speech-in, speech-out audio round-trip on the LPU.
How are Groq's built-in tools priced?: Per-use pricing for agentic tools: web search at $5–$8 per 1,000 requests (depending on depth), website visits at $1 per 1,000 requests, code execution at $0.18/hour of compute, and browser automation at $0.08/hour (added July 2026). Cached input on serverless gets a 50% discount; Batch API gets a 50% discount for asynchronous workloads. The transparent per-use tool pricing is one of the most legible in the market.