How much does Baseten cost per month?

Baseten has no monthly fee on the Basic tier — you pay only for the GPU minutes your dedicated deployments run plus per-token usage on Model APIs. A small production deployment running an A10G GPU 24/7 would cost roughly $870/month ($0.02012 × 60 × 24 × 30); the same model on a scale-to-zero pattern serving ~10 requests/hour might cost under $100/month.

What GPU types does Baseten support and at what price?

Baseten supports T4 ($0.01052/min), L4 ($0.01414/min), A10G ($0.02012/min), A100 80GB ($0.06667/min), H100 MIG 40GB ($0.0625/min), H100 80GB ($0.10833/min), and B200 180GB ($0.16633/min). All rates are per-minute and billed only while the replica is serving traffic.

Does Baseten have a free tier?

Yes — the Basic tier carries no monthly fee. New accounts also receive free credits for experimentation. Basic users get SOC 2 Type II and HIPAA-compliant deployments, in-app and email support, and access to all instance types — they only pay for actual GPU minutes consumed.

How do Baseten's Model APIs compare to dedicated deployments?

Model APIs are multi-tenant per-token endpoints for popular open-weight models (DeepSeek, Kimi, Nemotron, Llama) — billed at $0.30–$4.00 per million tokens depending on the model. Dedicated deployments are single-tenant per-minute GPU rentals where you control the model and runtime. Model APIs are cheaper for low-volume use; dedicated deployments are cheaper at sustained high QPS or for proprietary models.

What does Baseten Enterprise / Mission Critical SLA include?

Enterprise tier adds custom SLAs (Mission Critical at 99.95% uptime), self-hosted (BYOC) deployment option in your AWS/GCP/Azure VPC, data-residency controls, advanced RBAC, custom regions, and dedicated Slack/Zoom support. Pricing is quote-based and typically tied to annual usage commitments at volume discounts.

Does Baseten charge for idle replicas?

No — Baseten's pricing is explicitly per active minute, and idle replicas in scale-to-zero state incur no charge. The trade-off is cold-start latency on the first request after idle; the platform offers configurable warm-pool windows for latency-sensitive workloads at the cost of slightly higher idle billing.

Baseten Pricing

AI Summary

Baseten runs a pure-usage GPU-minute billing model for dedicated model deployments plus a separate per-token Model APIs catalog — both pay-as-you-go from the Basic tier with no monthly minimum.
Dedicated deployment per-minute rates as of 2026: T4 $0.01052, L4 $0.01414, A10G $0.02012, A100 80GB $0.06667, H100 MIG 40GB $0.0625, H100 80GB $0.10833, B200 180GB $0.16633 — all billed only for active inference time, with scale-to-zero idle replicas free.
Model APIs charge per million tokens: DeepSeek V3.1 at $0.50/$1.50, DeepSeek V4 at $1.74/$3.48, Kimi K2.6 at $0.95/$4.00, NVIDIA Nemotron 3 Super at $0.30/$0.75 — with cached-input pricing 50–80% off the standard rate for prompts re-using prefixes.
New accounts receive free credits; the Basic tier carries no monthly fee and is SOC 2 Type II + HIPAA-compliant out of the box.
Pro and Enterprise tiers add priority GPU access, dedicated compute reservations, higher rate limits, custom SLAs, BYOC self-hosted deployment, data-residency controls, and Slack/Zoom support — priced via annual usage commitments rather than published tier rates.
Baseten raised a $75M Series C in February 2024 led by IVP at an ~$825M post-money valuation; a Series D round was reported in late 2025 at >$2B.

Pricing summary

Baseten 2026 — Pay-per-minute GPU + per-token Model APIs

Basic (no monthly fee) → Pro (volume discounts) → Enterprise (Mission Critical SLA, BYOC)

Basic

$0 /mo + usage

Developers, startups, prototypes

Quote-based

Pro

Volume discounts

Growth-stage AI products, sustained workloads

Annual commit

Enterprise

Custom

Regulated industries, mission-critical inference

GPU per-minute

From $0.01052 /min (T4)

Dedicated single-tenant deployments

Model APIs

From $0.10 /1M tokens

Per-token multi-tenant endpoints

No monthly fee on Basic. Dedicated GPU billed per minute (active inference only); idle replicas in scale-to-zero state are free. Volume discounts on Pro and Enterprise require quote-based commitments.

About

Baseten is a San Francisco-based ML infrastructure company founded in 2019 by Tuhin Srivastava, Amir Haghighat, and Philip Howes — all formerly at Gumroad. The product is a managed inference platform that lets ML and product teams deploy custom or open-source models behind production-grade endpoints without operating a GPU fleet, Kubernetes cluster, or autoscaler. The interface is opinionated: model code is packaged with Truss (Baseten’s open-source framework), deployed to dedicated single-tenant GPU instances, and exposed via REST or gRPC endpoints with automatic scale-to-zero, configurable warm-pool windows, and per-replica observability.

By 2026 Baseten serves Writer, Descript, Patreon, Robust Intelligence, Picnic Health, and roughly a thousand other paying customers spanning enterprise compliance-sensitive workloads (HIPAA-regulated healthcare AI, financial-services NLP) and high-growth AI-native startups serving sub-second inference SLAs. The company raised a $75M Series C in February 2024 led by IVP at an ~$825M post-money valuation, with a later Series D round at >$2B reported in late 2025.

Baseten competes with hyperscaler inference platforms (AWS Bedrock, Google Vertex AI, Azure ML), specialized inference clouds (Fireworks AI, Together AI, Replicate), and serverless GPU providers (Modal, RunPod). Its differentiation is the combination of per-minute transparent GPU billing, model-agnostic deployment (any PyTorch or TensorFlow model, not just a predefined catalog), and a Truss-based developer experience that handles cold-start optimization, request batching, and concurrency tuning without per-model engineering effort.

Pricing summary : How Baseten’s per-minute GPU + per-token APIs stack works

Baseten runs two parallel pricing surfaces that share the same platform credits balance. Dedicated deployments are billed by GPU-minute on whichever instance type the customer selects (T4 through B200), with no markup for the platform layer beyond the published rate — the price is the same whether the model is open-source, a customer’s proprietary checkpoint, or a fine-tuned variant. The rate card carries a Minute/Hour toggle so the same rate is shown at either granularity, and a separate on-demand Training rate card and CPU-instance rate card sit alongside the GPU tables. Model APIs are multi-tenant per-token endpoints for a curated catalog of popular open-weight models (GLM, Kimi, NVIDIA Nemotron, DeepSeek, GPT OSS), priced per million input/output tokens with a published cache-input column.

The Basic tier carries no monthly subscription — customers pay only for active usage, with free credits on signup to encourage experimentation. Pro and Enterprise tiers do not change the per-minute or per-token rates publicly; instead they unlock volume discounts via annual commitments, priority GPU access, dedicated compute reservations, custom SLAs, and BYOC deployment. This dual-track pricing — transparent self-serve consumption plus quote-based commitments — mirrors the pure-usage + commitment hybrid increasingly common across AI infrastructure companies.

What makes this different: Baseten publishes per-minute pricing rather than per-hour, which makes scale-to-zero economics legible. A model that bursts for 4 minutes per request on a $0.10833/min H100 is billed for exactly those 4 minutes — the kind of granular cost-modeling that AWS Bedrock and Vertex AI deliberately hide behind aggregate per-month bills.

Pricing by product

Dedicated GPU deployments (per-minute, single-tenant)

Baseten’s pricing page ships a Minute / Hour toggle on the dedicated-deployment rate card, so both columns below are published rates read directly from the page — not derived. Billing is always metered to the minute; the per-hour figures are the same rate viewed at hourly granularity.

Instance	VRAM	Per-minute rate	Per-hour rate	Best for
T4	16 GiB	$0.01052	$0.6312	Low-cost embeddings, small model inference
L4	24 GiB	$0.01414	$0.8484	Mid-range models, video inference
A10G	24 GiB	$0.02012	$1.2072	7B–13B model inference, image gen
A100 80GB	80 GiB	$0.06667	$4.00	30B–70B model inference, training
H100 MIG 40GB	40 GiB	$0.0625	$3.75	Cost-optimized H100 partition
H100 80GB	80 GiB	$0.10833	$6.50	Frontier model inference, low-latency serving
B200 180GB	180 GiB	$0.16633	$9.98	Largest open models, multi-model serving

CPU-only instances are also billed per minute for pre/post-processing and CPU model serving: 1×2 (1 vCPU / 2 GiB) $0.00058, 1×4 $0.00086, 2×8 $0.00173, 4×16 $0.00346, 8×32 $0.00691, 16×64 (16 vCPU / 64 GiB) $0.01382 per minute.

Model APIs (per-token, multi-tenant)

Prices are per 1M tokens. The pricing page publishes a dedicated Cache Input column — cached prompt-prefix tokens are billed at a large discount to the standard input rate.

Model	Input ($/1M)	Cache input ($/1M)	Output ($/1M)
GLM 5.2	$1.40	$0.26	$4.40
GLM 5.1	$1.30	$0.26	$4.30
GLM 5	$0.95	$0.20	$3.15
GLM 4.7	$0.60	$0.12	$2.20
Kimi K2.7 Code	$0.95	$0.16	$4.00
Kimi K2.6	$0.95	$0.16	$4.00
Kimi K2.5	$0.60	$0.12	$3.00
NVIDIA Nemotron 3 Ultra	$0.60	$0.12	$2.40
NVIDIA Nemotron 3 Super	$0.30	$0.06	$0.75
DeepSeek V4	$1.74	$0.145	$3.48
GPT OSS 120B	$0.10	—	$0.50

Tier features (non-usage)

Tier	Monthly fee	Volume discount	SLA	Self-host (BYOC)	Support
Basic	$0	None	Best-effort	No	Email + in-app chat
Pro	Quote	Yes (annual commit)	Higher rate limits	No	Slack + Zoom
Enterprise	Quote	Yes (annual commit)	Mission Critical (99.95%)	Yes	24/7 dedicated

Sales motions across products: PLG / self-serve for Basic (per-minute GPU/CPU + per-token Model APIs + on-demand Training), sales-led for Pro and Enterprise commits and BYOC.

Hidden costs : What Baseten customers actually pay beyond the base GPU rate

Archetype A: AI-native startup running one 7B model on A10G with bursty traffic

A growth-stage startup serving ~50,000 requests/day, average 800ms inference time, with traffic concentrated in business hours:

Line item	Monthly cost
A10G compute (4h/day active × 30 × $0.02012 × 60)	$145
Warm-pool retention (avoid cold starts business hours, ~6h/day)	$72
Outbound bandwidth (negligible at this scale)	<$5
Estimated total	~$220/month

The warm-pool premium roughly doubles compute cost — but eliminates 3–8 second cold starts that would push p95 latency above customer SLAs. Most production deployments end up paying a 30–60% warm-pool premium over pure scale-to-zero.

Archetype B: Mid-market team running mixed Model APIs + dedicated H100

A team using Model APIs for general-purpose Q&A and a dedicated H100 for a fine-tuned model:

Line item	Monthly cost
DeepSeek V3.1 Model API (50M input + 15M output)	$47.50
Cached input savings on V3.1 (40% of input cached)	-$5
Dedicated H100 80GB (8h/day × 30 × $0.10833 × 60)	$1,560
Mission Critical SLA add-on (Enterprise)	Quote
Estimated total	~$1,600/month + SLA quote

H100 dedicated compute dominates the bill — and the customer is paying for warm-pool retention to maintain low latency. The cached input discount on Model APIs (50–80% off) is meaningful only for workloads with repetitive prompt prefixes (RAG, agent loops); ad-hoc query workloads see negligible cache hits.

Want to estimate your own Baseten bill? Use the Baseten pricing calculator to model dedicated GPU cost by instance and active minutes, plus Model API spend by token volume.

Pricing evolution : Baseten’s pricing history from no-code serving to per-minute GPU transparency

Cadence

Quarter	Price changes	Product / SKU additions	Notes
2019 Q1	0	1	Company founded; initial no-code model-serving product
2022 Q2	1	1	Series B; Truss open-sourced; pricing shifted to per-minute GPU
2023 Q4	0	1	Model Library launched — one-click open-source model deployments
2024 Q1	0	0	Series C ($75M IVP-led); no public price changes
2024 Q3	0	1	Model APIs (multi-tenant per-token) launched
2024 Q4	1	0	H100 80GB published at $0.10833/min as a transparency play
2025 Q1	0	1	B200 instances added at $0.16633/min
2025 Q3	0	1	Self-host (BYOC) deployment option launched
2026 Q1	1	0	Cached input pricing added to Model APIs (50–80% off standard input)

Tracked range: 2019 Q1–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

2022-04-19 — Truss open-sourced; pricing pivoted from seat-based no-code SaaS to per-minute GPU consumption — the defining structural choice that still shapes Baseten’s positioning.
2023-11-10 — Model Library launched with pre-deployed open-source models (Stable Diffusion, Whisper, Llama 2, Mistral); per-minute rate of underlying instance, no model markup.
2024-08-05 — Model APIs launched, adding pure pay-per-token multi-tenant endpoints to complement per-minute dedicated.
2024-11-18 — H100 published at $0.10833/min — Baseten leaned into transparency as differentiation against opaque hyperscaler markup.
2025-09-22 — Self-host / BYOC option launched for enterprises with strict data-residency or compute-cost requirements.
2026-02-14 — Cached input pricing added to Model APIs; brought parity with first-party caching offerings from OpenAI and Anthropic.

What’s unique : Baseten’s distinctive pricing mechanics

1. Per-minute (not per-hour) billing makes scale-to-zero economics visible. Most cloud GPU offerings — AWS Sagemaker, Vertex AI, Azure ML — bill per hour with billing-minute rounding. Baseten’s per-minute granularity means a 90-second burst on an H100 costs $0.16, not a rounded-up $6.50/hour. This makes usage forecasting materially more accurate for bursty workloads and exposes the real cost of warm-pool retention versus pure scale-to-zero.

2. No model markup on dedicated deployments — only the GPU rate. When a customer runs Llama 3.3 70B on an A100 via Baseten’s Model Library, they pay $0.06667/min — the same A100 rate as if they were running their own proprietary model. Most managed inference platforms charge a per-token markup on hosted open-source models on top of the underlying compute cost. Baseten’s flat per-minute rate makes the platform layer free of tier-based pricing distortion.

3. Dual-SKU split: per-minute dedicated AND per-token multi-tenant in the same product. Baseten lets customers use Model APIs (per-token, multi-tenant) for low-volume general-purpose queries and dedicated deployments (per-minute, single-tenant) for sustained high-QPS or proprietary models — in the same workspace, with the same billing balance. Most competitors force a choice between platforms (Replicate vs Fireworks, for example). This hybrid usage model reduces vendor sprawl for AI-native teams.

4. Cached input pricing applies to multi-tenant Model APIs. Cached-input discounts (50–80% off) are normally exclusive to first-party providers (OpenAI, Anthropic) where the cache is implementation-controlled. Baseten ships cached input on hosted DeepSeek and Kimi endpoints — meaning customers using Baseten’s Model APIs as a DeepSeek proxy actually get caching parity with first-party DeepSeek without negotiating a separate contract.

5. Self-host (BYOC) as an enterprise unlock without abandoning the platform. Baseten’s BYOC option lets large enterprises retain Truss-based deployment, scale-to-zero, observability, and Baseten’s control plane — but use their own AWS, GCP, or Azure GPU reservations and their own VPC for data residency. This platform-license + customer-compute model is unusual in inference middleware and addresses both compliance and committed-spend savings simultaneously.

Strengths & weaknesses

Strengths	Weaknesses
Transparent per-minute GPU pricing across all instance types (T4–B200)	Pro and Enterprise tier prices are not published — must contact sales for discounts
No model markup on dedicated deployments — the GPU rate is the price	Per-minute H100 rate ($6.50/hr equiv) is 1.5–2× raw AWS on-demand H100 list
Scale-to-zero with configurable warm-pool eliminates idle billing	Cold-start latency on scale-to-zero (3–8s) requires warm-pool tuning for latency SLAs
Truss framework supports any PyTorch or TensorFlow model, not a fixed catalog	Mixing Model APIs and dedicated deployments requires manual cost-modeling — no unified spend forecast
SOC 2 Type II + HIPAA out of the box on Basic tier	Mission Critical SLA (99.95%) is Enterprise-only — no published SLA on Basic
BYOC option preserves Baseten platform while using customer-owned GPU reservations	Geographic regions limited compared to hyperscalers; APAC residency requires Enterprise + BYOC

Billing UX : Baseten’s account controls and payment experience

Self-serve signup — Sign up at app.baseten.co with email; free credits applied automatically. No credit card required to deploy first model.
Per-minute billing transparency — Console shows per-deployment minute-level usage history; each replica has a real-time cost meter.
Scale-to-zero configuration — Per-deployment settings for minimum replicas (0 = scale-to-zero), maximum replicas (autoscaling cap), and warm-pool retention window (0–60 minutes).
Model APIs unified balance — Token consumption across all Model API models bills against the same workspace credit balance as dedicated deployments.
Spend alerts — Configurable threshold alerts at $X spend per workspace per period; sent via email and webhook.
Payment methods — Credit card and ACH on Basic; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
Annual commit discounting — Pro and Enterprise customers receive volume discounts in exchange for annual usage commitments; over-commitment overages bill at standard per-minute or per-token rates.
Audit logging + RBAC — Workspace-level RBAC on Pro+; SOC 2 audit-log exports on Enterprise via webhook or S3 delivery.
No usage cap on Basic — Customers can run unlimited GPU minutes on Basic with credit card on file; no rate limit on dedicated deployments. Model APIs have published per-minute and per-day request limits per workspace.

Strategic wins : Why Baseten’s pricing decisions worked

1. Per-minute transparency as the GTM wedge against hyperscaler opacity

Baseten’s decision to publish per-minute H100 and B200 rates created a sharp competitive contrast against AWS Bedrock, Vertex AI, and Azure ML — all of which embed inference costs inside aggregate monthly bills with no per-request visibility. For AI engineering leaders forecasting unit economics, Baseten’s transparent rate card lets them model cost-per-inference and amortize against revenue-per-query before signing a contract. This transparency-as-positioning is unusual in B2B infrastructure and likely accelerated mid-market adoption.

2. Truss as an open-source moat for dedicated-deployment pricing

By open-sourcing Truss in 2022, Baseten created a free-to-use packaging framework that became the de-facto standard for model deployment among ML teams. Customers who built on Truss for local development naturally migrated to Baseten for production hosting — making per-minute dedicated pricing the path of least resistance. This PLG via developer tooling drove growth without paid acquisition spend through 2023–2024.

3. Dual SKU (per-minute + per-token) caught both ends of the workload spectrum

Most inference platforms force a choice: pay per-token for multi-tenant (Replicate, Fireworks) or pay per-hour for dedicated (Modal, RunPod). Baseten’s combined offering — per-minute dedicated AND per-token Model APIs in one workspace — captures both bursty low-volume queries (cheap on Model APIs) and sustained high-QPS production (cheap on dedicated). The multi-dimensional usage model maximizes wallet share within each customer.

4. BYOC as the enterprise upsell unlock without abandoning the platform

Baseten’s 2025 launch of self-hosted (BYOC) deployment let large enterprises keep Truss, scale-to-zero, and observability while using their own GPU reservations and VPC. This addresses the two largest enterprise blockers — data residency and committed-spend optimization — without forcing the customer onto a different product. It is a textbook enterprise pricing unlock that captures large customers who would otherwise self-build on Kubernetes.

Areas to improve : Gaps in Baseten’s pricing approach

1. Pro tier pricing should be published

Today the per-minute and per-token rates are identical on Basic, Pro, and Enterprise — the tier difference is volume discount magnitude and support level. By hiding the Pro discount schedule behind a sales call, Baseten loses self-serve growth-stage customers who want to model expansion economics without a sales conversation. Publishing a tiered volume-discount schedule (e.g., 10% off above $5K/mo, 20% off above $25K/mo) would let mid-market customers self-qualify into Pro without sales friction.

2. Cold-start cost is invisible until you hit it

Scale-to-zero is positioned as a cost-saver — and it is — but the latency cost of cold starts (3–8 seconds on a large model) requires customers to either tune warm-pool retention manually or accept latency SLA misses. The pricing page does not show warm-pool premium calculations or cold-start probability curves; customers discover this only after deploying. Adding a cold-start economics calculator to the pricing page would set expectations and likely increase Pro tier conversions for latency-sensitive workloads.

3. No unified spend forecast across Model APIs + dedicated

Customers using both Model APIs and dedicated deployments must manually combine per-token and per-minute forecasts. Baseten’s console shows historical spend by SKU but does not project future spend across mixed workload types. A unified usage forecasting view — token volume + active GPU minutes projected forward — would materially reduce bill-shock anxiety for cost-sensitive AI engineering teams.

4. Pricing page lacks egress and bandwidth detail

For high-volume inference workloads serving large image, audio, or video payloads, network egress can become a meaningful cost line. Baseten’s pricing page does not break out bandwidth pricing — customers learn the rate from invoices. Making egress pricing explicit (and ideally bundling a generous free egress allowance) would reduce a recurring source of surprise bills.

Monetization stack & signals : how Baseten builds & buys its revenue engine

Buys 3 Builds 1 9 open roles

The read — where the monetization investment is going

Baseten builds the meter but buys the billing layer: its in-house Frontier Gateway tracks usage per API key, then ships it out-of-band to an Orb-backed billing system. A growing RevOps/deal-desk org on Salesforce is wiring up the sales-led enterprise commit motion.

Stack — build vs buy

Builds in-house · 1

Frontier Gateway (in-house metering & rate-limiting) In-house build Blog May 2026

“Token or character consumption is tracked per API key. Usage data is sent out-of-band to allow you to plug directly into your billing provider, without affecting inference performance. Enforce token or request-based limits per API key to prevent abuse and protect other users from noisy neighbor effects.”

Buys (vendor) · 3

Orb Billing Job post Feb 2026

“Build and evolve our billing platform and integrations (including Orb), ensuring correctness, auditability, and a high-trust experience for customers and internal teams.”
Salesforce CRM Job post 1 Job post 2 Apr 2026

“Familiarity with Salesforce and how CRM data intersects with comp and territory workflows.”
dbt Data platform inferred Job post May 2026

“Experience building clean, reusable data models and semantic layers in dbt.”

Open roles in the revenue & lifecycle org — 9

View open roles

Software Engineer - Billing and Internal Tooling MonetizationBilling engineering May 12, 2026
Cost Analytics Lead Billing engineering May 12, 2026
Strategic Finance, GTM Deal desk seen Apr 28, 2026
Head of Legal Operations Deal desk seen Apr 28, 2026
GTM Engineer RevOps seen Apr 22, 2026
Field Operations & Incentives Manager RevOps seen Apr 22, 2026
Account Executive - AI Native: Strategic Customer success seen Apr 21, 2026
AI Solutions Engineer Customer success seen Apr 21, 2026
Applied AI Inference - Forward Deployed Engineer Customer success seen Apr 21, 2026
+7 more matched roles

Signals reviewed Jun 2026 · derived from public job posts, engineering blogs

Job postings fill and close over time — once a posting is filled we keep it as a dated citation (the quoted evidence remains); use View open roles for current listings.

Key takeaways

Per-minute billing is the new transparency floor for inference infrastructure. Baseten’s per-minute GPU rate card made cost-modeling tractable for AI engineering teams in a way per-hour hyperscaler billing never did. Inference platforms targeting cost-conscious AI-native customers should publish per-second or per-minute rates as a baseline expectation.
Open-source tooling is the most efficient PLG channel for infrastructure pricing. Baseten’s open-sourcing of Truss in 2022 made it the de-facto packaging framework and seeded production migration to per-minute hosted deployments. For usage-based infrastructure products, seeding adoption through OSS via guides like our intro to UBP is dramatically more cost-efficient than paid acquisition.
Dual-track usage pricing captures both ends of the workload spectrum. By offering per-minute dedicated AND per-token Model APIs in one workspace, Baseten captures bursty low-volume customers (where per-token wins) and sustained high-QPS customers (where per-minute wins) without forcing a platform choice. Most competitors leave one segment uncaptured by force-fitting workloads to a single SKU model.
BYOC as enterprise upsell preserves the customer relationship past procurement. The self-host option converts the data-residency or committed-spend objection into an Enterprise contract — keeping the customer on Baseten’s control plane and Truss tooling rather than losing them to in-house Kubernetes. This is one of the cleanest enterprise pricing architectures in AI infrastructure.
Cached-input pricing is the new table stakes for multi-tenant inference. OpenAI and Anthropic established cached input as a 50–80% discount norm; Baseten extending this to hosted DeepSeek and Kimi endpoints removes a switching-cost objection for customers using these models via first-party APIs. Multi-tenant inference platforms without cached input will lose RAG and agent-loop workloads to those that ship it.

UBP implications

Per-minute granularity is the credible commitment for scale-to-zero economics. Per-hour billing makes scale-to-zero a marketing claim more than a financial reality — rounded-up minimums consume the savings. For usage-based pricing models where idle-state economics are part of the value proposition, billing granularity must match the marketing claim.
Multi-SKU usage billing reduces vendor sprawl but requires unified forecasting. Baseten’s per-minute + per-token dual-SKU model maximizes wallet share within each customer, but customers struggle to forecast mixed-SKU spend. Future usage-based platforms shipping multiple billing dimensions must invest in unified forecasting tools as a first-class product surface, not an afterthought reporting feature.
Cached input is the next default in token-based pricing. When OpenAI, Anthropic, and now hosted-DeepSeek-via-Baseten all offer 50–80% cached-input discounts, the rate card without caching becomes uncompetitive. Token-pricing competitors must either match cached-input rates or differentiate on latency/throughput sufficient to justify the gap.

Sources

Baseten pricing page (accessed 2026-05-29)
Baseten docs — getting started (accessed 2026-05-29)
Baseten Model APIs catalog (accessed 2026-05-29)
Baseten blog — Series C announcement (accessed 2026-05-29)
Truss framework GitHub (accessed 2026-05-29)
Baseten changelog (accessed 2026-05-29)
Related infra blueprint — Fireworks AI
Related infra blueprint — Cerebras

Bottom line

Baseten priced inference infrastructure for the post-hyperscaler era: per-minute GPU rate cards published openly, no markup on hosted open-source models, scale-to-zero with configurable warm pools, and a dual per-minute + per-token model that lets a single workspace cover both low-volume bursty queries and sustained production workloads. The Truss-based developer experience and BYOC enterprise upsell make the platform sticky once adopted.

For AI-native engineering teams cost-modeling production inference, Baseten is the most legible commercial alternative to building on raw Kubernetes — and the transparent rate card is itself a strategic asset. The remaining gaps (hidden Pro discount schedules, cold-start economics not surfaced on pricing pages, no unified mixed-SKU forecasting) are GTM polish problems rather than structural pricing flaws.

Compare with peers via the blueprint corpus, or model your own spend using the Baseten pricing calculator.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment. Major events use a filled marker; minor adjustments use a faded one.

Cached Input Pricing on Model APIs

Feb 2026

Baseten added cached-input pricing to Model APIs — prompts re-using prefixes from prior requests within a session window are billed at 50–80% off the standard input rate (e.g., DeepSeek V3.1 input drops from $0.50 to $0.25 cached). This narrows the gap against first-party caching offerings from OpenAI and Anthropic.

Self-Hosted (BYOC) Deployment Option

Sep 2025

Baseten launched a self-hosted deployment option for enterprises requiring data-residency control or compute-cost optimization via their own AWS, GCP, or Azure GPU reservations. Pricing shifts from per-minute usage to a platform license fee plus customer-supplied compute. Quoted as Enterprise-only.

B200 GPU Availability + Mission Critical SLA Tier

Mar 2025

Baseten added NVIDIA B200 (180GB) at $0.16633/min as its first Blackwell-generation instance, and formalized a Mission Critical SLA tier for enterprise inference workloads with 99.95% uptime guarantees and 24/7 incident response.

H100 GPU Pricing Published at $0.10833/min

Nov 2024

Baseten published transparent per-minute pricing for H100 80GB at $0.10833/min (~$6.50/hour), undercutting hosted inference markups by several major hyperscalers while remaining higher than raw AWS on-demand rates. The transparency was positioned as a competitive differentiator over Bedrock and Vertex AI.

Model APIs (Multi-Tenant) Launched

Aug 2024

Baseten introduced Model APIs — multi-tenant per-token endpoints for popular open-weight models (DeepSeek, Llama 3, Mixtral, Whisper). Pricing: per million input/output tokens, comparable to first-party rates. This added a pure pay-per-token SKU to complement the per-minute dedicated deployment offering.

Series C ($75M) — IVP-Led at $825M Valuation

Feb 2024

Baseten raised a $75M Series C led by IVP at an ~$825M post-money valuation. Customers disclosed at the round: Writer, Descript, Patreon, Robust Intelligence, Picnic Health. Series C funded the launch of Model APIs and Mission Critical SLA tier.

Model Library Launched

Nov 2023

Baseten launched its Model Library — a catalog of pre-deployed open-source models (Stable Diffusion, Whisper, Llama 2, Mistral) that customers could deploy with one click. Pricing: per-minute GPU rate of underlying instance type, no model markup. This established the per-minute, per-instance billing model that remains the core SKU today.

Series B ($20M) — Truss Framework Open-Sourced

Apr 2022

Baseten raised a $20M Series B led by Greylock and South Park Commons at a $90M valuation, and open-sourced Truss — its model-packaging framework that became the standard interface for deploying PyTorch and TensorFlow models on Baseten infrastructure. Pricing pivoted to consumption-based GPU minutes.

Baseten Founded

Jan 2019

Tuhin Srivastava, Amir Haghighat, and Philip Howes (ex-Gumroad) founded Baseten with a focus on letting data scientists deploy ML models without DevOps. Initial product was a no-code model serving interface.

Trivia

· Baseten's $0.10833/minute H100 rate works out to ~$6.50/hour — roughly 1.5–2× AWS on-demand H100 list but Baseten markets the spread as the cost of scale-to-zero plus engineer-free ops.
· Truss, Baseten's open-source model-packaging framework, predates the company's Model APIs by three years — Baseten started as a 'bring your weights, we run the serving stack' product before adding hosted multi-tenant model endpoints in 2024.
· Baseten's $75M Series C in February 2024 was led by IVP at a $825M post-money valuation; customers cited at the round included Writer, Descript, Patreon, and Robust Intelligence.

Questions & answers

How much does Baseten cost per month?: Baseten has no monthly fee on the Basic tier — you pay only for the GPU minutes your dedicated deployments run plus per-token usage on Model APIs. A small production deployment running an A10G GPU 24/7 would cost roughly $870/month ($0.02012 × 60 × 24 × 30); the same model on a scale-to-zero pattern serving ~10 requests/hour might cost under $100/month.
What GPU types does Baseten support and at what price?: Baseten supports T4 ($0.01052/min), L4 ($0.01414/min), A10G ($0.02012/min), A100 80GB ($0.06667/min), H100 MIG 40GB ($0.0625/min), H100 80GB ($0.10833/min), and B200 180GB ($0.16633/min). All rates are per-minute and billed only while the replica is serving traffic.
Does Baseten have a free tier?: Yes — the Basic tier carries no monthly fee. New accounts also receive free credits for experimentation. Basic users get SOC 2 Type II and HIPAA-compliant deployments, in-app and email support, and access to all instance types — they only pay for actual GPU minutes consumed.
How do Baseten's Model APIs compare to dedicated deployments?: Model APIs are multi-tenant per-token endpoints for popular open-weight models (DeepSeek, Kimi, Nemotron, Llama) — billed at $0.30–$4.00 per million tokens depending on the model. Dedicated deployments are single-tenant per-minute GPU rentals where you control the model and runtime. Model APIs are cheaper for low-volume use; dedicated deployments are cheaper at sustained high QPS or for proprietary models.
What does Baseten Enterprise / Mission Critical SLA include?: Enterprise tier adds custom SLAs (Mission Critical at 99.95% uptime), self-hosted (BYOC) deployment option in your AWS/GCP/Azure VPC, data-residency controls, advanced RBAC, custom regions, and dedicated Slack/Zoom support. Pricing is quote-based and typically tied to annual usage commitments at volume discounts.
Does Baseten charge for idle replicas?: No — Baseten's pricing is explicitly per active minute, and idle replicas in scale-to-zero state incur no charge. The trade-off is cold-start latency on the first request after idle; the platform offers configurable warm-pool windows for latency-sensitive workloads at the cost of slightly higher idle billing.