All companies
technology

Baseten pricing

baseten.co facts checked analysis reviewed
Quick summary
Product
ML inference infrastructure — dedicated GPU deployments, Model APIs, and Truss framework
Industry
technology
Commits
Available (annual)
In this page
AI Summary
  • Baseten runs a pure-usage GPU-minute billing model for dedicated model deployments plus a separate per-token Model APIs catalog — both pay-as-you-go from the Basic tier with no monthly minimum.
  • Dedicated deployment per-minute rates as of 2026: T4 $0.01052, L4 $0.01414, A10G $0.02012, A100 80GB $0.06667, H100 MIG 40GB $0.0625, H100 80GB $0.10833, B200 180GB $0.16633 — all billed only for active inference time, with scale-to-zero idle replicas free.
  • Model APIs charge per million tokens: DeepSeek V3.1 at $0.50/$1.50, DeepSeek V4 at $1.74/$3.48, Kimi K2.6 at $0.95/$4.00, NVIDIA Nemotron 3 Super at $0.30/$0.75 — with cached-input pricing 50–80% off the standard rate for prompts re-using prefixes.
  • New accounts receive free credits; the Basic tier carries no monthly fee and is SOC 2 Type II + HIPAA-compliant out of the box.
  • Pro and Enterprise tiers add priority GPU access, dedicated compute reservations, higher rate limits, custom SLAs, BYOC self-hosted deployment, data-residency controls, and Slack/Zoom support — priced via annual usage commitments rather than published tier rates.
  • Baseten raised a $75M Series C in February 2024 led by IVP at an ~$825M post-money valuation; a Series D round was reported in late 2025 at >$2B.
Pricing summary
Baseten 2026 — Pay-per-minute GPU + per-token Model APIs
Basic (no monthly fee) → Pro (volume discounts) → Enterprise (Mission Critical SLA, BYOC)
Basic
$0 /mo + usage
Developers, startups, prototypes
Annual commit
Enterprise
Custom
Regulated industries, mission-critical inference
GPU per-minute
From $0.01052 /min (T4)
Dedicated single-tenant deployments
Model APIs
From $0.30 /1M tokens
Per-token multi-tenant endpoints
No monthly fee on Basic. Dedicated GPU billed per minute (active inference only); idle replicas in scale-to-zero state are free. Volume discounts on Pro and Enterprise require quote-based commitments.

About

Baseten is a San Francisco-based ML infrastructure company founded in 2019 by Tuhin Srivastava, Amir Haghighat, and Philip Howes — all formerly at Gumroad. The product is a managed inference platform that lets ML and product teams deploy custom or open-source models behind production-grade endpoints without operating a GPU fleet, Kubernetes cluster, or autoscaler. The interface is opinionated: model code is packaged with Truss (Baseten’s open-source framework), deployed to dedicated single-tenant GPU instances, and exposed via REST or gRPC endpoints with automatic scale-to-zero, configurable warm-pool windows, and per-replica observability.

By 2026 Baseten serves Writer, Descript, Patreon, Robust Intelligence, Picnic Health, and roughly a thousand other paying customers spanning enterprise compliance-sensitive workloads (HIPAA-regulated healthcare AI, financial-services NLP) and high-growth AI-native startups serving sub-second inference SLAs. The company raised a $75M Series C in February 2024 led by IVP at an ~$825M post-money valuation, with a later Series D round at >$2B reported in late 2025.

Baseten competes with hyperscaler inference platforms (AWS Bedrock, Google Vertex AI, Azure ML), specialized inference clouds (Fireworks AI, Together AI, Replicate), and serverless GPU providers (Modal, RunPod). Its differentiation is the combination of per-minute transparent GPU billing, model-agnostic deployment (any PyTorch or TensorFlow model, not just a predefined catalog), and a Truss-based developer experience that handles cold-start optimization, request batching, and concurrency tuning without per-model engineering effort.


Pricing summary : How Baseten’s per-minute GPU + per-token APIs stack works

Baseten runs two parallel pricing surfaces that share the same platform credits balance. Dedicated deployments are billed by GPU-minute on whichever instance type the customer selects (T4 through B200), with no markup for the platform layer beyond the published per-minute rate — the price is the same whether the model is open-source, a customer’s proprietary checkpoint, or a fine-tuned variant. Model APIs are multi-tenant per-token endpoints for a curated catalog of popular open-weight models (DeepSeek, Kimi, Nemotron, Llama, Whisper), priced per million input/output tokens with cached-input discounts.

The Basic tier carries no monthly subscription — customers pay only for active usage, with free credits on signup to encourage experimentation. Pro and Enterprise tiers do not change the per-minute or per-token rates publicly; instead they unlock volume discounts via annual commitments, priority GPU access, dedicated compute reservations, custom SLAs, and BYOC deployment. This dual-track pricing — transparent self-serve consumption plus quote-based commitments — mirrors the pure-usage + commitment hybrid increasingly common across AI infrastructure companies.

What makes this different: Baseten publishes per-minute pricing rather than per-hour, which makes scale-to-zero economics legible. A model that bursts for 4 minutes per request on a $0.10833/min H100 costs $0.43 per burst — the kind of granular cost-modeling that AWS Bedrock and Vertex AI deliberately hide behind aggregate per-month bills.


Pricing by product

Dedicated GPU deployments (per-minute, single-tenant)

InstanceVRAMPer-minute ratePer-hour equivBest for
T416 GiB$0.01052$0.63Low-cost embeddings, small model inference
L424 GiB$0.01414$0.85Mid-range models, video inference
A10G24 GiB$0.02012$1.217B–13B model inference, image gen
A100 80GB80 GiB$0.06667$4.0030B–70B model inference, training
H100 MIG 40GB40 GiB$0.0625$3.75Cost-optimized H100 partition
H100 80GB80 GiB$0.10833$6.50Frontier model inference, low-latency serving
B200 180GB180 GiB$0.16633$9.98Largest open models, multi-model serving

Model APIs (per-token, multi-tenant)

ModelInput ($/1M)Cached input ($/1M)Output ($/1M)
DeepSeek V3.1$0.50$0.25$1.50
DeepSeek V4$1.74$0.145$3.48
Kimi K2.6$0.95$0.16$4.00
NVIDIA Nemotron 3 Super$0.30$0.06$0.75

Tier features (non-usage)

TierMonthly feeVolume discountSLASelf-host (BYOC)Support
Basic$0NoneBest-effortNoEmail + in-app chat
ProQuoteYes (annual commit)Higher rate limitsNoSlack + Zoom
EnterpriseQuoteYes (annual commit)Mission Critical (99.95%)Yes24/7 dedicated

Sales motions across products: PLG / self-serve for Basic (per-minute + per-token), sales-led for Pro and Enterprise commits and BYOC. All prices accessed 2026-05-29 from baseten.co/pricing.


Hidden costs : What Baseten customers actually pay beyond the base GPU rate

Archetype A: AI-native startup running one 7B model on A10G with bursty traffic

A growth-stage startup serving ~50,000 requests/day, average 800ms inference time, with traffic concentrated in business hours:

Line itemMonthly cost
A10G compute (4h/day active × 30 × $0.02012 × 60)$145
Warm-pool retention (avoid cold starts business hours, ~6h/day)$72
Outbound bandwidth (negligible at this scale)<$5
Estimated total~$220/month

The warm-pool premium roughly doubles compute cost — but eliminates 3–8 second cold starts that would push p95 latency above customer SLAs. Most production deployments end up paying a 30–60% warm-pool premium over pure scale-to-zero.

Archetype B: Mid-market team running mixed Model APIs + dedicated H100

A team using Model APIs for general-purpose Q&A and a dedicated H100 for a fine-tuned model:

Line itemMonthly cost
DeepSeek V3.1 Model API (50M input + 15M output)$47.50
Cached input savings on V3.1 (40% of input cached)-$5
Dedicated H100 80GB (8h/day × 30 × $0.10833 × 60)$1,560
Mission Critical SLA add-on (Enterprise)Quote
Estimated total~$1,600/month + SLA quote

H100 dedicated compute dominates the bill — and the customer is paying for warm-pool retention to maintain low latency. The cached input discount on Model APIs (50–80% off) is meaningful only for workloads with repetitive prompt prefixes (RAG, agent loops); ad-hoc query workloads see negligible cache hits.

Want to estimate your own Baseten bill? Use the Baseten pricing calculator to model dedicated GPU cost by instance and active minutes, plus Model API spend by token volume.


Pricing evolution : Baseten’s pricing history from no-code serving to per-minute GPU transparency

Cadence

QuarterPrice changesProduct / SKU additionsNotes
2019 Q101Company founded; initial no-code model-serving product
2022 Q211Series B; Truss open-sourced; pricing shifted to per-minute GPU
2023 Q401Model Library launched — one-click open-source model deployments
2024 Q100Series C ($75M IVP-led); no public price changes
2024 Q301Model APIs (multi-tenant per-token) launched
2024 Q410H100 80GB published at $0.10833/min as a transparency play
2025 Q101B200 instances added at $0.16633/min
2025 Q301Self-host (BYOC) deployment option launched
2026 Q110Cached input pricing added to Model APIs (50–80% off standard input)

Tracked range: 2019 Q1–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

  • 2022-04-19 — Truss open-sourced; pricing pivoted from seat-based no-code SaaS to per-minute GPU consumption — the defining structural choice that still shapes Baseten’s positioning.
  • 2023-11-10 — Model Library launched with pre-deployed open-source models (Stable Diffusion, Whisper, Llama 2, Mistral); per-minute rate of underlying instance, no model markup.
  • 2024-08-05 — Model APIs launched, adding pure pay-per-token multi-tenant endpoints to complement per-minute dedicated.
  • 2024-11-18 — H100 published at $0.10833/min — Baseten leaned into transparency as differentiation against opaque hyperscaler markup.
  • 2025-09-22 — Self-host / BYOC option launched for enterprises with strict data-residency or compute-cost requirements.
  • 2026-02-14 — Cached input pricing added to Model APIs; brought parity with first-party caching offerings from OpenAI and Anthropic.

What’s unique : Baseten’s distinctive pricing mechanics

1. Per-minute (not per-hour) billing makes scale-to-zero economics visible. Most cloud GPU offerings — AWS Sagemaker, Vertex AI, Azure ML — bill per hour with billing-minute rounding. Baseten’s per-minute granularity means a 90-second burst on an H100 costs $0.16, not a rounded-up $6.50/hour. This makes usage forecasting materially more accurate for bursty workloads and exposes the real cost of warm-pool retention versus pure scale-to-zero.

2. No model markup on dedicated deployments — only the GPU rate. When a customer runs Llama 3.3 70B on an A100 via Baseten’s Model Library, they pay $0.06667/min — the same A100 rate as if they were running their own proprietary model. Most managed inference platforms charge a per-token markup on hosted open-source models on top of the underlying compute cost. Baseten’s flat per-minute rate makes the platform layer free of tier-based pricing distortion.

3. Dual-SKU split: per-minute dedicated AND per-token multi-tenant in the same product. Baseten lets customers use Model APIs (per-token, multi-tenant) for low-volume general-purpose queries and dedicated deployments (per-minute, single-tenant) for sustained high-QPS or proprietary models — in the same workspace, with the same billing balance. Most competitors force a choice between platforms (Replicate vs Fireworks, for example). This hybrid usage model reduces vendor sprawl for AI-native teams.

4. Cached input pricing applies to multi-tenant Model APIs. Cached-input discounts (50–80% off) are normally exclusive to first-party providers (OpenAI, Anthropic) where the cache is implementation-controlled. Baseten ships cached input on hosted DeepSeek and Kimi endpoints — meaning customers using Baseten’s Model APIs as a DeepSeek proxy actually get caching parity with first-party DeepSeek without negotiating a separate contract.

5. Self-host (BYOC) as an enterprise unlock without abandoning the platform. Baseten’s BYOC option lets large enterprises retain Truss-based deployment, scale-to-zero, observability, and Baseten’s control plane — but use their own AWS, GCP, or Azure GPU reservations and their own VPC for data residency. This platform-license + customer-compute model is unusual in inference middleware and addresses both compliance and committed-spend savings simultaneously.


Strengths & weaknesses

StrengthsWeaknesses
Transparent per-minute GPU pricing across all instance types (T4–B200)Pro and Enterprise tier prices are not published — must contact sales for discounts
No model markup on dedicated deployments — the GPU rate is the pricePer-minute H100 rate ($6.50/hr equiv) is 1.5–2× raw AWS on-demand H100 list
Scale-to-zero with configurable warm-pool eliminates idle billingCold-start latency on scale-to-zero (3–8s) requires warm-pool tuning for latency SLAs
Truss framework supports any PyTorch or TensorFlow model, not a fixed catalogMixing Model APIs and dedicated deployments requires manual cost-modeling — no unified spend forecast
SOC 2 Type II + HIPAA out of the box on Basic tierMission Critical SLA (99.95%) is Enterprise-only — no published SLA on Basic
BYOC option preserves Baseten platform while using customer-owned GPU reservationsGeographic regions limited compared to hyperscalers; APAC residency requires Enterprise + BYOC

Billing UX : Baseten’s account controls and payment experience

  • Self-serve signup — Sign up at app.baseten.co with email; free credits applied automatically. No credit card required to deploy first model.
  • Per-minute billing transparency — Console shows per-deployment minute-level usage history; each replica has a real-time cost meter.
  • Scale-to-zero configuration — Per-deployment settings for minimum replicas (0 = scale-to-zero), maximum replicas (autoscaling cap), and warm-pool retention window (0–60 minutes).
  • Model APIs unified balance — Token consumption across all Model API models bills against the same workspace credit balance as dedicated deployments.
  • Spend alerts — Configurable threshold alerts at $X spend per workspace per period; sent via email and webhook.
  • Payment methods — Credit card and ACH on Basic; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
  • Annual commit discounting — Pro and Enterprise customers receive volume discounts in exchange for annual usage commitments; over-commitment overages bill at standard per-minute or per-token rates.
  • Audit logging + RBAC — Workspace-level RBAC on Pro+; SOC 2 audit-log exports on Enterprise via webhook or S3 delivery.
  • No usage cap on Basic — Customers can run unlimited GPU minutes on Basic with credit card on file; no rate limit on dedicated deployments. Model APIs have published per-minute and per-day request limits per workspace.

Strategic wins : Why Baseten’s pricing decisions worked

1. Per-minute transparency as the GTM wedge against hyperscaler opacity

Baseten’s decision to publish per-minute H100 and B200 rates created a sharp competitive contrast against AWS Bedrock, Vertex AI, and Azure ML — all of which embed inference costs inside aggregate monthly bills with no per-request visibility. For AI engineering leaders forecasting unit economics, Baseten’s transparent rate card lets them model cost-per-inference and amortize against revenue-per-query before signing a contract. This transparency-as-positioning is unusual in B2B infrastructure and likely accelerated mid-market adoption.

2. Truss as an open-source moat for dedicated-deployment pricing

By open-sourcing Truss in 2022, Baseten created a free-to-use packaging framework that became the de-facto standard for model deployment among ML teams. Customers who built on Truss for local development naturally migrated to Baseten for production hosting — making per-minute dedicated pricing the path of least resistance. This PLG via developer tooling drove growth without paid acquisition spend through 2023–2024.

3. Dual SKU (per-minute + per-token) caught both ends of the workload spectrum

Most inference platforms force a choice: pay per-token for multi-tenant (Replicate, Fireworks) or pay per-hour for dedicated (Modal, RunPod). Baseten’s combined offering — per-minute dedicated AND per-token Model APIs in one workspace — captures both bursty low-volume queries (cheap on Model APIs) and sustained high-QPS production (cheap on dedicated). The multi-dimensional usage model maximizes wallet share within each customer.

4. BYOC as the enterprise upsell unlock without abandoning the platform

Baseten’s 2025 launch of self-hosted (BYOC) deployment let large enterprises keep Truss, scale-to-zero, and observability while using their own GPU reservations and VPC. This addresses the two largest enterprise blockers — data residency and committed-spend optimization — without forcing the customer onto a different product. It is a textbook enterprise pricing unlock that captures large customers who would otherwise self-build on Kubernetes.


Areas to improve : Gaps in Baseten’s pricing approach

1. Pro tier pricing should be published

Today the per-minute and per-token rates are identical on Basic, Pro, and Enterprise — the tier difference is volume discount magnitude and support level. By hiding the Pro discount schedule behind a sales call, Baseten loses self-serve growth-stage customers who want to model expansion economics without a sales conversation. Publishing a tiered volume-discount schedule (e.g., 10% off above $5K/mo, 20% off above $25K/mo) would let mid-market customers self-qualify into Pro without sales friction.

2. Cold-start cost is invisible until you hit it

Scale-to-zero is positioned as a cost-saver — and it is — but the latency cost of cold starts (3–8 seconds on a large model) requires customers to either tune warm-pool retention manually or accept latency SLA misses. The pricing page does not show warm-pool premium calculations or cold-start probability curves; customers discover this only after deploying. Adding a cold-start economics calculator to the pricing page would set expectations and likely increase Pro tier conversions for latency-sensitive workloads.

3. No unified spend forecast across Model APIs + dedicated

Customers using both Model APIs and dedicated deployments must manually combine per-token and per-minute forecasts. Baseten’s console shows historical spend by SKU but does not project future spend across mixed workload types. A unified usage forecasting view — token volume + active GPU minutes projected forward — would materially reduce bill-shock anxiety for cost-sensitive AI engineering teams.

4. Pricing page lacks egress and bandwidth detail

For high-volume inference workloads serving large image, audio, or video payloads, network egress can become a meaningful cost line. Baseten’s pricing page does not break out bandwidth pricing — customers learn the rate from invoices. Making egress pricing explicit (and ideally bundling a generous free egress allowance) would reduce a recurring source of surprise bills.


Key takeaways

  1. Per-minute billing is the new transparency floor for inference infrastructure. Baseten’s per-minute GPU rate card made cost-modeling tractable for AI engineering teams in a way per-hour hyperscaler billing never did. Inference platforms targeting cost-conscious AI-native customers should publish per-second or per-minute rates as a baseline expectation.

  2. Open-source tooling is the most efficient PLG channel for infrastructure pricing. Baseten’s open-sourcing of Truss in 2022 made it the de-facto packaging framework and seeded production migration to per-minute hosted deployments. For usage-based infrastructure products, seeding adoption through OSS via guides like our intro to UBP is dramatically more cost-efficient than paid acquisition.

  3. Dual-track usage pricing captures both ends of the workload spectrum. By offering per-minute dedicated AND per-token Model APIs in one workspace, Baseten captures bursty low-volume customers (where per-token wins) and sustained high-QPS customers (where per-minute wins) without forcing a platform choice. Most competitors leave one segment uncaptured by force-fitting workloads to a single SKU model.

  4. BYOC as enterprise upsell preserves the customer relationship past procurement. The self-host option converts the data-residency or committed-spend objection into an Enterprise contract — keeping the customer on Baseten’s control plane and Truss tooling rather than losing them to in-house Kubernetes. This is one of the cleanest enterprise pricing architectures in AI infrastructure.

  5. Cached-input pricing is the new table stakes for multi-tenant inference. OpenAI and Anthropic established cached input as a 50–80% discount norm; Baseten extending this to hosted DeepSeek and Kimi endpoints removes a switching-cost objection for customers using these models via first-party APIs. Multi-tenant inference platforms without cached input will lose RAG and agent-loop workloads to those that ship it.


UBP implications

  1. Per-minute granularity is the credible commitment for scale-to-zero economics. Per-hour billing makes scale-to-zero a marketing claim more than a financial reality — rounded-up minimums consume the savings. For usage-based pricing models where idle-state economics are part of the value proposition, billing granularity must match the marketing claim.

  2. Multi-SKU usage billing reduces vendor sprawl but requires unified forecasting. Baseten’s per-minute + per-token dual-SKU model maximizes wallet share within each customer, but customers struggle to forecast mixed-SKU spend. Future usage-based platforms shipping multiple billing dimensions must invest in unified forecasting tools as a first-class product surface, not an afterthought reporting feature.

  3. Cached input is the next default in token-based pricing. When OpenAI, Anthropic, and now hosted-DeepSeek-via-Baseten all offer 50–80% cached-input discounts, the rate card without caching becomes uncompetitive. Token-pricing competitors must either match cached-input rates or differentiate on latency/throughput sufficient to justify the gap.


Sources


Bottom line

Baseten priced inference infrastructure for the post-hyperscaler era: per-minute GPU rate cards published openly, no markup on hosted open-source models, scale-to-zero with configurable warm pools, and a dual per-minute + per-token model that lets a single workspace cover both low-volume bursty queries and sustained production workloads. The Truss-based developer experience and BYOC enterprise upsell make the platform sticky once adopted.

For AI-native engineering teams cost-modeling production inference, Baseten is the most legible commercial alternative to building on raw Kubernetes — and the transparent rate card is itself a strategic asset. The remaining gaps (hidden Pro discount schedules, cold-start economics not surfaced on pricing pages, no unified mixed-SKU forecasting) are GTM polish problems rather than structural pricing flaws.

Compare with peers via the blueprint corpus, or model your own spend using the Baseten pricing calculator.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment. Major events use a filled marker; minor adjustments use a faded one.

Cached Input Pricing on Model APIs

Baseten added cached-input pricing to Model APIs — prompts re-using prefixes from prior requests within a session window are billed at 50–80% off the standard input rate (e.g., DeepSeek V3.1 input drops from $0.50 to $0.25 cached). This narrows the gap against first-party caching offerings from OpenAI and Anthropic.

Self-Hosted (BYOC) Deployment Option

Baseten launched a self-hosted deployment option for enterprises requiring data-residency control or compute-cost optimization via their own AWS, GCP, or Azure GPU reservations. Pricing shifts from per-minute usage to a platform license fee plus customer-supplied compute. Quoted as Enterprise-only.

B200 GPU Availability + Mission Critical SLA Tier

Baseten added NVIDIA B200 (180GB) at $0.16633/min as its first Blackwell-generation instance, and formalized a Mission Critical SLA tier for enterprise inference workloads with 99.95% uptime guarantees and 24/7 incident response.

H100 GPU Pricing Published at $0.10833/min

Baseten published transparent per-minute pricing for H100 80GB at $0.10833/min (~$6.50/hour), undercutting hosted inference markups by several major hyperscalers while remaining higher than raw AWS on-demand rates. The transparency was positioned as a competitive differentiator over Bedrock and Vertex AI.

Model APIs (Multi-Tenant) Launched

Baseten introduced Model APIs — multi-tenant per-token endpoints for popular open-weight models (DeepSeek, Llama 3, Mixtral, Whisper). Pricing: per million input/output tokens, comparable to first-party rates. This added a pure pay-per-token SKU to complement the per-minute dedicated deployment offering.

Series C ($75M) — IVP-Led at $825M Valuation

Baseten raised a $75M Series C led by IVP at an ~$825M post-money valuation. Customers disclosed at the round: Writer, Descript, Patreon, Robust Intelligence, Picnic Health. Series C funded the launch of Model APIs and Mission Critical SLA tier.

Model Library Launched

Baseten launched its Model Library — a catalog of pre-deployed open-source models (Stable Diffusion, Whisper, Llama 2, Mistral) that customers could deploy with one click. Pricing: per-minute GPU rate of underlying instance type, no model markup. This established the per-minute, per-instance billing model that remains the core SKU today.

Series B ($20M) — Truss Framework Open-Sourced

Baseten raised a $20M Series B led by Greylock and South Park Commons at a $90M valuation, and open-sourced Truss — its model-packaging framework that became the standard interface for deploying PyTorch and TensorFlow models on Baseten infrastructure. Pricing pivoted to consumption-based GPU minutes.

Baseten Founded

Tuhin Srivastava, Amir Haghighat, and Philip Howes (ex-Gumroad) founded Baseten with a focus on letting data scientists deploy ML models without DevOps. Initial product was a no-code model serving interface.

Trivia
  • · Baseten's $0.10833/minute H100 rate works out to ~$6.50/hour — roughly 1.5–2× AWS on-demand H100 list but Baseten markets the spread as the cost of scale-to-zero plus engineer-free ops.
  • · Truss, Baseten's open-source model-packaging framework, predates the company's Model APIs by three years — Baseten started as a 'bring your weights, we run the serving stack' product before adding hosted multi-tenant model endpoints in 2024.
  • · Baseten's $75M Series C in February 2024 was led by IVP at a $825M post-money valuation; customers cited at the round included Writer, Descript, Patreon, and Robust Intelligence.

Questions & answers

How much does Baseten cost per month?
Baseten has no monthly fee on the Basic tier — you pay only for the GPU minutes your dedicated deployments run plus per-token usage on Model APIs. A small production deployment running an A10G GPU 24/7 would cost roughly $870/month ($0.02012 × 60 × 24 × 30); the same model on a scale-to-zero pattern serving ~10 requests/hour might cost under $100/month.
What GPU types does Baseten support and at what price?
Baseten supports T4 ($0.01052/min), L4 ($0.01414/min), A10G ($0.02012/min), A100 80GB ($0.06667/min), H100 MIG 40GB ($0.0625/min), H100 80GB ($0.10833/min), and B200 180GB ($0.16633/min). All rates are per-minute and billed only while the replica is serving traffic.
Does Baseten have a free tier?
Yes — the Basic tier carries no monthly fee. New accounts also receive free credits for experimentation. Basic users get SOC 2 Type II and HIPAA-compliant deployments, in-app and email support, and access to all instance types — they only pay for actual GPU minutes consumed.
How do Baseten's Model APIs compare to dedicated deployments?
Model APIs are multi-tenant per-token endpoints for popular open-weight models (DeepSeek, Kimi, Nemotron, Llama) — billed at $0.30–$4.00 per million tokens depending on the model. Dedicated deployments are single-tenant per-minute GPU rentals where you control the model and runtime. Model APIs are cheaper for low-volume use; dedicated deployments are cheaper at sustained high QPS or for proprietary models.
What does Baseten Enterprise / Mission Critical SLA include?
Enterprise tier adds custom SLAs (Mission Critical at 99.95% uptime), self-hosted (BYOC) deployment option in your AWS/GCP/Azure VPC, data-residency controls, advanced RBAC, custom regions, and dedicated Slack/Zoom support. Pricing is quote-based and typically tied to annual usage commitments at volume discounts.
Does Baseten charge for idle replicas?
No — Baseten's pricing is explicitly per active minute, and idle replicas in scale-to-zero state incur no charge. The trade-off is cold-start latency on the first request after idle; the platform offers configurable warm-pool windows for latency-sensitive workloads at the cost of slightly higher idle billing.