All companies
technology

Groq pricing

groq.com facts checked analysis reviewed
Quick summary
Product
GroqCloud — LPU-based ultra-low-latency inference API for Llama, GPT-OSS, Qwen, Whisper, and Mixtral
Industry
technology
Commits
Available (annual)
In this page
AI Summary
  • Groq runs a pure-usage per-token serverless inference API on its proprietary LPU silicon, with published rates including Llama 3.1 8B Instant at $0.05 input / $0.08 output per 1M tokens (840 TPS), GPT OSS 20B at $0.075/$0.30, Llama 4 Scout at $0.11/$0.34, Qwen3 32B at $0.29/$0.59, GPT OSS 120B at $0.15/$0.60, and Llama 3.3 70B Versatile at $0.59/$0.79.
  • Audio transcription priced per hour: Whisper Large v3 at $0.111/hour (higher accuracy), Whisper Large v3 Turbo at $0.04/hour (faster), with 10-second minimum billing per request.
  • Cached input tokens get a 50% discount versus standard rates with no extra fee. Batch API offers 50% lower cost for asynchronous large-scale workloads.
  • Built-in agentic tools are priced per-use: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour. Makes Groq one of the few platforms with transparent line-item pricing on agentic tool usage.
  • Marketing emphasizes 'linear and predictable' pricing with no hidden costs or infrastructure surprises. Free tier offers limited rate-capped access; Enterprise plans add custom SLAs and capacity reservations.
  • Founded 2016 by Jonathan Ross (original Google TPU engineer); raised $640M Series D in August 2024 led by BlackRock at $2.8B valuation; Saudi PIF-led round in 2025 brought valuation to $6.9B+. Built bespoke LPU silicon for inference.
Pricing summary
Groq 2026 — LPU per-token inference + per-hour audio + per-use tools
Free rate-capped tier; serverless from $0.05/$0.08 (Llama 3.1 8B); Whisper Turbo $0.04/hr; 50% Batch + cached
Free tier
$0 + rate caps
Developers, evaluation, low-volume prototypes
Annual commit
Enterprise
Custom
Sustained high-throughput workloads
Per-token serverless
From $0.05 /1M (Llama 8B)
Production token inference
Audio + Tools
From $0.04 /hr (Whisper Turbo)
Transcription + agentic tools
LPU silicon enables 800+ TPS on Llama 3 8B and 500+ TPS on 120B-class models. Cached input (50%) and Batch API (50%) discounts apply on top of base rates. Whisper Turbo is 2.8× cheaper than Large v3 for lower-accuracy use cases.

About

Groq is a Mountain View-based AI inference company founded in April 2016 by Jonathan Ross — the original engineer behind Google’s Tensor Processing Unit (TPU). The product is GroqCloud, an inference API built on Groq’s proprietary LPU (Language Processing Unit) silicon. The LPU was designed exclusively for inference workloads: deterministic execution, on-die SRAM (no HBM), and architectural choices that prioritize single-stream throughput over GPU-style parallel batch efficiency. The result is published throughput of 800+ tokens/second on Llama 3 8B and 500+ TPS on 120B-class models, with per-token pricing that holds competitive with GPU-based serverless inference.

By 2026 Groq serves a mix of latency-sensitive production AI customers (voice assistants, real-time agentic systems, customer-facing chatbots) and high-throughput batch workloads. The company raised a $640M Series D in August 2024 led by BlackRock at $2.8B valuation, with Tiger Global, Samsung Catalyst, KDDI, and others participating; a 2025 round led by Saudi PIF brought valuation to $6.9B. The Saudi partnership was paired with a strategic commitment to build Groq’s largest data center in Saudi Arabia, signaling the company’s intent to compete on global inference capacity, not just per-token economics.

Groq competes with Fireworks AI, Together AI, Baseten, and Replicate for the managed-inference market, plus first-party providers (OpenAI, Anthropic) for general-purpose API customers. Its differentiation is bespoke LPU silicon (the only commercial inference platform built on inference-specific hardware), TPU-pioneer founder credibility (Jonathan Ross designed the original TPU), and the highest published throughput-per-token in the category — making Groq the canonical choice for latency-critical AI workloads where time-to-first-token matters more than raw cost.


Pricing summary : How Groq’s LPU + per-token + tools stack works

Groq runs a per-token serverless inference API priced by model, with throughput (TPS) published alongside each rate so customers can evaluate the latency-cost trade-off in a single view. The model catalog spans Llama (3.1 8B Instant at $0.05/$0.08 input/output per 1M tokens, 3.3 70B Versatile at $0.59/$0.79, 4 Scout at $0.11/$0.34), GPT OSS (20B at $0.075/$0.30, 120B at $0.15/$0.60), Qwen3 (32B at $0.29/$0.59), Mixtral, and Whisper for audio transcription. Audio is priced per hour transcribed with a 10-second minimum per request.

Cached input tokens get a 50% discount on standard rates, and the Batch API offers a 50% discount for asynchronous workloads. Built-in agentic tools have transparent per-use pricing: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour. Free tier provides rate-capped access for evaluation; Enterprise plans add custom SLAs, dedicated LPU capacity reservations, volume discount commits, and VPC deployment.

This single-SKU plus per-tool structure — token / hour-audio / per-use-tool — is one of the most legible usage-based rate cards in AI inference middleware. The marketing positioning explicitly emphasizes “linear and predictable” pricing as a core value proposition.

What makes this different: Groq’s LPU silicon delivers throughput that GPU-based competitors cannot match at the same price point. Llama 3.1 8B at 840 TPS is roughly 4–10× faster than Llama 8B on Fireworks or Together at comparable prices — and the latency advantage compounds for multi-turn agentic workflows where time-to-first-token determines user experience.


Pricing by product

Serverless per-token inference (with published TPS)

ModelInput ($/1M)Output ($/1M)Throughput
Llama 3.1 8B Instant 128k$0.05$0.08840 TPS
GPT OSS 20B$0.075$0.301,000 TPS
Llama 4 Scout$0.11$0.34594 TPS
GPT OSS 120B 128k$0.15$0.60500 TPS
Qwen3 32B$0.29$0.59662 TPS
Llama 3.3 70B Versatile 128k$0.59$0.79394 TPS

Audio transcription (per hour, 10-second minimum)

ModelRate per hourNotes
Whisper Large v3$0.111Higher accuracy, slower
Whisper Large v3 Turbo$0.04Faster, slightly lower accuracy

Built-in agentic tools

ToolRateUnit
Web search$5 – $8per 1,000 requests (depth-dependent)
Website visits$1per 1,000 requests
Code execution$0.18per hour of compute

Discount mechanics

MechanismDiscountApplies to
Cached input tokens50% off input rateServerless models with prefix re-use
Batch API50% off standard rateAsynchronous, time-flexible workloads

Sales motions across products: PLG / self-serve for Free and Developer tier; sales-led for Enterprise annual commits, dedicated LPU capacity, and VPC deployments. All prices accessed 2026-05-29 from groq.com/pricing.


Hidden costs : What Groq customers actually pay beyond the per-token rate

Archetype A: Voice assistant startup on Llama 3.1 8B Instant

A latency-critical voice assistant startup serving ~200K requests/day at average 1K input + 250 output tokens, using Whisper Turbo for upstream audio transcription:

Line itemMonthly cost
Input tokens (6M/day × 30 = 180M, Llama 8B at $0.05/1M)$9
Output tokens (1.5M/day × 30 = 45M, Llama 8B at $0.08/1M)$3.60
Whisper Turbo transcription (avg 4 hours audio/day × 30 × $0.04)$4.80
Web search tool (occasional, 5K requests/mo × $0.005)$25
Estimated total~$42/month

For latency-critical voice workloads, Groq’s per-token rates plus Whisper Turbo deliver one of the lowest published total costs in the category. The same workload on a comparable GPU-based platform would typically cost 2–4× more — and would not match the latency.

Archetype B: Mid-market team running mixed Llama 70B + GPT OSS 120B with agentic tools

A 50-person team building an internal AI assistant using Llama 3.3 70B Versatile for general queries and GPT OSS 120B for reasoning-heavy tasks, with frequent agentic tool usage:

Line itemMonthly cost
Llama 3.3 70B (30M input + 10M output)$17.70 + $7.90 = $25.60
GPT OSS 120B (10M input + 5M output)$1.50 + $3.00 = $4.50
Cached input savings (40% cache hit on system prompts)-$7
Web search ($5/1K × 40K requests/mo)$200
Code execution (50 hours × $0.18)$9
Estimated total~$232/month

Built-in tool usage dominates the bill at this scale — and the transparency of the tool rate card lets finance teams forecast tool spend accurately. Most platforms either bundle tools into opaque tier pricing or force customers onto third-party tool providers, making forecasting harder.

Want to estimate your own Groq bill? Use the Groq pricing calculator to model token spend by model, audio transcription hours, and built-in tool usage.


Pricing evolution : Groq’s pricing history from LPU prototype to commercial inference category

Cadence

QuarterPrice changesProduct / SKU additionsNotes
2016 Q201Groq founded; LPU silicon design begins
2023 Q101GroqCloud public launch with Llama 2 + Mistral
2024 Q101Llama 3 day-one support; published TPS rates
2024 Q300Series D ($640M) at $2.8B valuation
2025 Q101Whisper Large v3 + Turbo audio SKUs launched
2025 Q211Batch API + cached input 50% discounts launched
2025 Q300Saudi PIF investment at $6.9B valuation
2026 Q101Built-in tools per-use pricing published

Tracked range: 2016 Q2–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

  • 2023-03-22 — GroqCloud public launch with Llama 2 + Mistral; LPU silicon as the structural differentiator.
  • 2024-02-19 — Llama 3 day-one support at 800+ TPS for the 8B variant; established Groq as the throughput leader.
  • 2025-03-04 — Whisper Large v3 ($0.111/hr) and Whisper Large v3 Turbo ($0.04/hr) for audio transcription with 10-second minimum billing.
  • 2025-05-13 — Batch API and cached input 50% discounts launched; brought Groq to parity with Fireworks and Anthropic on discount mechanics.
  • 2026-01-22 — Built-in tools per-use pricing published: web search $5–$8/1K, website visits $1/1K, code execution $0.18/hour.

What’s unique : Groq’s distinctive pricing mechanics

1. Throughput (TPS) published alongside per-token rates. Groq is one of the only inference platforms that publishes throughput (tokens per second) next to every per-token rate. For latency-critical workloads (voice agents, real-time interactive UIs), this transparency lets customers evaluate the latency-cost trade-off in a single view rather than running benchmark calls. Most competitors hide TPS behind documentation pages or marketing claims.

2. Bespoke LPU silicon as the structural cost moat. The LPU is the only commercial inference platform built on inference-specific silicon (rather than repurposed GPU). This delivers throughput that GPU-based competitors cannot match at comparable prices — making Groq’s $0.05/$0.08 Llama 3.1 8B rate genuinely hard to replicate without similar silicon investment.

3. Per-use agentic tool pricing in line-item form. Web search ($5–$8/1K), website visits ($1/1K), and code execution ($0.18/hour) are priced individually rather than bundled into opaque tiers. For finance teams forecasting agent-loop costs, the transparent tool rate card reduces budget uncertainty in a way that bundled-tools competitors cannot match.

4. Two-variant audio transcription (Whisper Large v3 vs Turbo). The 2.8× price spread ($0.111/hr vs $0.04/hr) between Whisper Large v3 and Whisper Large v3 Turbo lets customers pick accuracy-vs-speed explicitly. For high-volume transcription workloads (call centers, podcast processing), the Turbo rate is meaningfully cheaper than first-party Whisper API rates while preserving acceptable accuracy.

5. Cached input + Batch API discount mechanics in standard rate card. Both discounts apply at 50% off base rates, bringing Groq to parity with Fireworks, OpenAI, and Anthropic. The discount mechanics are the same across all eligible models — making them legible to customers without requiring per-model discount lookup. This discount-as-default architecture is becoming canonical for token-priced inference.


Strengths & weaknesses

StrengthsWeaknesses
TPS published alongside per-token rates — best latency-cost transparency in categoryDiscounts (50% cached, 50% batch) don’t stack — pick one
Bespoke LPU silicon delivers 4–10× higher TPS than GPU competitors at similar ratesFree tier rate caps may be restrictive for early-stage evaluation
Per-use agentic tool pricing in line-item form (web search, visits, code exec)Audio transcription only via Whisper (no other ASR model options)
Two-variant audio (Large v3 vs Turbo) lets customers pick accuracy-vs-speedModel catalog smaller than Fireworks/Together (no dedicated GPU or per-image SKUs)
TPU-pioneer founder credibility (Jonathan Ross)LPU silicon production scaling is a structural risk — supply must keep pace with demand
”Linear and predictable” pricing positioning is genuinely accurateNo fine-tuning SKU published — customers must use other platforms for custom-trained models

Billing UX : Groq’s account controls and payment experience

  • Self-serve signup — Sign up at console.groq.com with email; free tier access enabled immediately, no credit card required.
  • Per-token usage console — Real-time view of per-model token consumption (input, output, cached) and per-request latency metrics.
  • TPS transparency in console — Per-request TPS shown alongside latency, letting developers verify throughput SLAs in real time.
  • Spend alerts — Configurable email alerts at $X spend per period; hard spend caps available on Developer credit-card plans.
  • Payment methods — Credit card on self-serve Developer; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
  • Cached input transparency — API responses include cached input token count separately, letting customers verify cache-hit rates per request.
  • Annual commit pricing — Enterprise customers receive volume discounts in exchange for annual usage commitments and dedicated LPU capacity.
  • Multi-region availability — US regions standard; international expansion (including Saudi Arabia data center) underway.
  • Audit logging + RBAC — Workspace-level RBAC on Pro+; SOC 2 audit-log exports on Enterprise via webhook or S3 delivery.

Strategic wins : Why Groq’s pricing decisions worked

1. Bespoke LPU silicon as the structural cost moat

By investing in inference-specific silicon rather than repurposing GPU, Groq created a structural cost advantage that GPU-based competitors cannot replicate without comparable silicon investment. The LPU’s 800+ TPS on Llama 3 8B at $0.05/$0.08 is not just a marketing claim — it is the canonical proof that inference-specific silicon outperforms training-derived GPU silicon for sustained inference workloads.

2. TPS published alongside per-token rates removed evaluation friction

By publishing throughput (TPS) next to every per-token rate, Groq made the latency-cost trade-off legible without benchmark calls. For latency-critical workloads (voice agents, interactive UIs), this transparency converts more self-serve customers and reduces sales-led overhead. Most competitors hide TPS behind benchmark blog posts or marketing claims.

3. TPU-pioneer founder credibility as the silicon-trust anchor

Jonathan Ross’s role as the original Google TPU engineer gives Groq unusual credibility on silicon architecture — making the LPU’s throughput claims believable in a way that pure-engineering teams cannot replicate. For enterprise procurement leaders evaluating bespoke-silicon inference, the founder credentials act as a trust multiplier that distinguishes Groq from generic inference middleware.

4. Per-use agentic tool pricing preempted bundled-tool competitor lock-in

By publishing per-use rates for web search, website visits, and code execution as line items on the rate card, Groq let customers forecast agent-loop costs without negotiating contracts or accepting opaque tool bundling. This transparent agentic pricing is a meaningful competitive advantage as agentic workflows scale and tool usage becomes the dominant cost line.


Areas to improve : Gaps in Groq’s pricing approach

1. Cached input + Batch API discounts should stack

Fireworks lets cached input (50%) and Batch API (50%) discounts stack — making batched cached workloads land at 25% of standard. Groq’s discounts do not stack — customers must pick one. Adding stacking would close a meaningful competitive gap with Fireworks for RAG and agent-loop workloads at scale.

2. No published fine-tuning SKU

Customers wanting custom-trained models must use Fireworks, Together, Anyscale, or first-party model providers. Adding a fine-tuning SKU (even at limited model coverage initially) would let Groq capture the model-customization workload that currently goes elsewhere. The LPU’s deterministic execution may make fine-tuning architecturally awkward, but customers want the option.

3. Free tier rate caps may be too restrictive

Free tier rate limits are tight enough that meaningful evaluation requires a credit card and Developer-tier upgrade. Loosening free-tier caps (or offering a $5–$10 trial credit) would let self-serve developers reach proof-of-value without paid commitment and likely accelerate self-serve revenue.

4. Audio transcription only via Whisper

Whisper is the dominant open-source ASR model, but customers wanting Deepgram-comparable accuracy or specialized vertical models (medical, legal) must use other providers. Adding a second ASR model option (or a transcription quality tier above Whisper) would broaden the audio TAM and reduce vendor sprawl for audio-heavy workloads.


Key takeaways

  1. Bespoke silicon as the structural moat. Groq’s LPU is the only commercial inference platform built on inference-specific silicon. The throughput advantage (800+ TPS on Llama 3 8B) is genuinely hard to replicate without comparable silicon investment — a durable competitive moat that pure-software optimization cannot match.

  2. Throughput transparency converts latency-critical buyers. Publishing TPS alongside per-token rates makes the latency-cost trade-off legible for voice-agent, interactive-UI, and real-time agentic workloads where time-to-first-token determines user experience. Inference platforms targeting latency-sensitive workloads should publish TPS on the rate card.

  3. Per-use agentic tool pricing is the next transparency frontier. As agent loops scale, web search and code execution costs become the dominant bill line. Platforms that publish per-tool rates in line-item form win finance-team trust over competitors that bundle tools into opaque tiers.

  4. Two-variant audio transcription (Whisper Large v3 vs Turbo) captures both quality-sensitive and cost-sensitive workloads with a single SKU surface. The 2.8× price spread between variants is wide enough to matter and narrow enough to be a real choice rather than a trick.

  5. TPU-pioneer founder credibility is the most durable silicon-platform trust anchor. Jonathan Ross’s role as the original Google TPU engineer gives Groq unusual credibility on architectural claims — distinguishing it from generic inference middleware and converting silicon-skeptical enterprise procurement leaders.


UBP implications

  1. Throughput (TPS) as a published metric on token-priced rate cards lets usage-based platforms compete on latency-cost rather than just absolute per-token price. Inference platforms that hide TPS behind documentation lose latency-sensitive buyers who cannot self-qualify.

  2. Per-use line-item tool pricing is becoming the canonical structure for agentic UBP. Bundled-tool pricing creates forecasting uncertainty that finance teams reject; transparent per-tool rates win procurement trust.

  3. Bespoke silicon as a UBP moat is the structural cost advantage that pure-software platforms cannot replicate. Groq’s LPU is the canonical example — but the precedent suggests that other inference-specific silicon investments (Cerebras, Tenstorrent, Etched) will reshape competitive dynamics over the coming decade.


Sources


Bottom line

Groq priced its inference API around three structural ideas: bespoke LPU silicon that delivers throughput GPU-based competitors cannot match (800+ TPS on Llama 3 8B at $0.05/$0.08), TPU-pioneer founder credibility (Jonathan Ross designed the original Google TPU) that justifies the silicon investment, and transparent per-use agentic tool pricing (web search at $5–$8/1K, website visits at $1/1K, code execution at $0.18/hour) that makes Groq one of the most legible commercial inference platforms in the market. Marketing positioning explicitly emphasizes “linear and predictable” pricing — and the rate card structure makes the claim genuinely accurate.

For AI engineering teams running latency-critical voice agents, real-time interactive UIs, and high-throughput batch workloads, Groq delivers throughput and per-token economics that GPU-based platforms cannot match. The remaining gaps (no fine-tuning SKU, audio limited to Whisper, restrictive free tier, non-stacking discounts) are TAM-expansion problems rather than structural pricing flaws.

Compare with peers via the blueprint corpus, or model your own spend with the Groq pricing calculator.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment. Major events use a filled marker; minor adjustments use a faded one.

Built-In Tools Pricing: Web Search, Visits, Code Execution

Groq published per-use pricing for built-in agentic tools: web search at $5–$8 per 1,000 requests, website visits at $1 per 1,000 requests, code execution at $0.18/hour. Made Groq one of the few platforms with transparent line-item billing on agentic tool usage.

Built-In Tools Pricing: Web Search, Visits, Code Execution screenshot 1
Built-In Tools Pricing: Web Search, Visits, Code Execution screenshot 2

Saudi PIF Investment at $6.9B Valuation

Groq raised a Saudi PIF-led round at a $6.9B post-money valuation, with the funding earmarked for international data center expansion and additional LPU production. The Saudi investment was paired with a strategic partnership to build Groq's largest data center in Saudi Arabia.

Batch API + Cached Input Discounts

Groq launched a Batch API at 50% discount for asynchronous workloads and added cached input token discounts at 50% off standard rates. Brought Groq into parity with Fireworks, OpenAI, and Anthropic on standard inference discount mechanics.

Whisper Large v3 + Turbo Variants at Per-Hour Pricing

Groq added Whisper Large v3 ($0.111/hour transcribed) and Whisper Large v3 Turbo ($0.04/hour transcribed) for audio transcription. Per-hour billing with 10-second minimum per request established the audio SKU structure that remains canonical.

Series D ($640M) at $2.8B Valuation

Groq raised a $640M Series D led by BlackRock at a $2.8B post-money valuation. Tiger Global, Samsung Catalyst, KDDI, and others participated. The round funded LPU silicon production scaling and the GroqCloud Enterprise tier launch.

Llama 3 Day-One Support at Industry-Leading TPS

Groq launched Llama 3 support on day one with throughput exceeding 800 tokens/second for the 8B variant — the fastest published inference rate for any commercial Llama 3 endpoint at launch. Per-token pricing held competitive with serverless GPU competitors.

GroqCloud Public Launch

Groq launched GroqCloud — its inference API — with Llama 2 and Mistral models at competitive per-token rates. The launch positioning emphasized speed-of-inference (hundreds of TPS) rather than absolute cheapness, marketing LPU silicon as a structural advantage over GPU-based inference middleware.

Groq Founded

Jonathan Ross (original engineer behind Google's TPU) founded Groq to build bespoke inference silicon. The company spent 2016–2022 designing the LPU (Language Processing Unit) architecture before launching GroqCloud — the commercial inference API — in early 2023.

Trivia
  • · Groq's Llama 3.1 8B Instant at 840 tokens/second is one of the fastest published throughput rates for any 8B-class model in commercial inference — the LPU silicon architecture is designed specifically for inference rather than training, which is what makes the speed and pricing combination possible.
  • · Groq was founded in 2016 by Jonathan Ross, the original engineer behind Google's TPU (Tensor Processing Unit) — making it the rare commercial AI inference platform built on bespoke silicon designed by the same engineer who pioneered modern ML accelerators.
  • · Groq's LPU (Language Processing Unit) deliberately avoids the GPU model: each chip has deterministic execution, no HBM (uses on-die SRAM), and no GPU-style branch prediction — the trade-off is lower per-chip memory but vastly higher single-stream throughput.

Questions & answers

How much does Groq cost per month?
Groq has no monthly subscription fee — you pay only for the tokens you consume on serverless inference, plus per-hour audio transcription, per-use built-in tools, and per-session agentic infrastructure. A small chat application using Llama 3.1 8B at 30M input + 10M output tokens would cost roughly $2.30/month — Groq's per-token rates are among the lowest in the market for small-model inference.
What are Groq's per-token rates for popular models?
Per 1M tokens (input/output): Llama 3.1 8B Instant $0.05/$0.08 at 840 TPS; GPT OSS 20B $0.075/$0.30 at 1,000 TPS; Llama 4 Scout $0.11/$0.34 at 594 TPS; Qwen3 32B $0.29/$0.59 at 662 TPS; GPT OSS 120B $0.15/$0.60 at 500 TPS; Llama 3.3 70B Versatile $0.59/$0.79 at 394 TPS. Throughput numbers are published alongside prices to make the latency-cost trade-off transparent.
Does Groq have a free tier?
Yes — Groq's free tier offers rate-capped access to most models for experimentation. Throughput limits apply (low requests-per-minute and tokens-per-minute caps), with sustained production usage requiring a paid plan. Free tier credit-card-on-file is not required for evaluation.
How does Groq's LPU differ from GPU-based inference?
Groq's LPU (Language Processing Unit) is bespoke silicon designed exclusively for inference. Unlike GPUs which use HBM (high-bandwidth memory) and branch prediction, the LPU uses on-die SRAM and deterministic execution. The trade-off is lower per-chip memory capacity but vastly higher single-stream throughput — enabling 800+ tokens/second on Llama 3 8B and 500+ TPS on 120B-class models.
What does Groq charge for audio transcription?
Whisper Large v3 (higher accuracy) at $0.111 per hour of audio transcribed; Whisper Large v3 Turbo (faster, slightly lower accuracy) at $0.04 per hour transcribed. Both bill with a 10-second minimum per request to ensure profitable short-clip handling. The 2.8× price spread between Turbo and standard reflects the accuracy-versus-speed trade-off.
How are Groq's built-in tools priced?
Per-use pricing for agentic tools: web search at $5–$8 per 1,000 requests (depending on depth), website visits at $1 per 1,000 requests, code execution at $0.18/hour of compute. Cached input on serverless gets a 50% discount; Batch API gets a 50% discount for asynchronous workloads. The transparent per-use tool pricing is one of the most legible in the market.