
BentoML pricing

bentoml.com · facts checked · analysis reviewed
Quick summary
Product segment
Region
Product: BentoCloud — managed model-serving & inference platform
Industry: technology
Commits: Available (annual)
AI Summary
  • BentoML is an open-source Python framework for building model-serving apps (Bentos); the commercial product is BentoCloud, a managed serverless inference platform.
  • BentoCloud bills compute per second across CPU and GPU instances: CPU (cpu.1) at $0.00001322/sec, NVIDIA T4 (gpu.t4.1) at $0.00014198/sec (~$0.51/hr), L4 ~$0.80/hr, H100 ~$2.65/hr.
  • Three tiers: Starter (free, pay-as-you-go, $10 signup credits), Pro at $1,000/mo plus usage (priority A100/H100/H200, multi-region), and Enterprise (custom, self-host or BYOC in your own VPC).
  • Scale-to-zero means idle deployments cost nothing, and per-second metering avoids paying for a full hour you didn't use — the headline difference from hourly GPU clouds.
  • Usage commitments unlock discounts on Pro and Enterprise; BentoML raised a $9M seed in 2023 (DCM Ventures) and monetizes the open-source framework through BentoCloud.
Pricing summary
BentoCloud 2026 — Plans & compute pricing
Per-second CPU/GPU compute for serverless model serving, with scale-to-zero so idle deployments cost nothing.
Starter | Free + usage | Individuals & teams shipping their first models
Enterprise | Custom | Regulated orgs wanting BYOC / on-prem
Rates as of June 2026 (bentoml.com/pricing). Compute metered per second; verify current rates before committing.

About

BentoML is an open-source Python framework for building model-serving and AI applications — you package a model, its dependencies, and inference logic into a deployable unit the project calls a “Bento.” Founded in 2019 by Chaoyu Yang and Winston Wenyan Yin, both former Databricks engineers, BentoML grew a large open-source following as a standard way to turn trained models into production-ready inference services. The framework itself is free and self-hostable.

The commercial product is BentoCloud, a fully managed serverless platform that deploys, auto-scales, and observes those Bentos as inference endpoints — handling the GPU provisioning, cold starts, and scaling that teams would otherwise build themselves. This is the classic open-core pattern: give away the framework, monetize the managed runtime. BentoML raised a $9M seed round in June 2023 (DCM Ventures, Bow Capital, Firestreak Ventures) to build out BentoCloud.

BentoCloud competes in the serverless GPU-inference category alongside Baseten, Modal, Replicate, and Runpod — platforms that abstract raw GPU clouds (like Lambda or CoreWeave) into deploy-a-model workflows. For current pricing, see BentoCloud’s pricing page.


Pricing summary : How BentoML’s pricing model works

BentoCloud is pure usage-based compute, metered per second, wrapped in a freemium-plus-platform-fee tier structure. You pay for the CPU and GPU instances your deployments actually consume, and because services scale to zero, an idle deployment costs nothing between requests. There are three tiers:

  1. Starter — Free to start, pay-as-you-go. You only pay for compute used, billed monthly to a credit card, and new accounts get $10 in free credits. Includes scale-to-zero, SOC 2 Type II, a monitoring dashboard, real-time logging, and community Slack support.
  2. Pro — $1,000/month plus usage. Adds priority access to high-performance GPUs (A100, H100, H200), unlimited seats and deployments, and multi-region options across US, EU, and APAC. Invoice billing.
  3. Enterprise — Custom-priced. Self-hosting or deployment inside the customer’s own VPC on AWS, GCP, or Azure (BYOC), or on-premises, with dedicated support, SSO, and compliance. Usage commitments unlock discounts.
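
To put the Starter credits in context, here is a minimal sketch of how long $10 lasts at the two per-second rates quoted in this profile. It assumes credits offset compute dollar-for-dollar and that nothing else is billed; the function and variable names are illustrative, not part of any BentoCloud API.

```python
# Figures quoted in this profile (June 2026); verify before relying on them.
SIGNUP_CREDITS_USD = 10.00
RATE_PER_SECOND_USD = {
    "cpu.1": 0.00001322,
    "gpu.t4.1": 0.00014198,
}


def credit_runway_hours(credits_usd: float, rate_per_second_usd: float) -> float:
    """Hours of billed running time that a credit balance covers.

    Assumes credits offset compute dollar-for-dollar and nothing else is billed.
    """
    if rate_per_second_usd <= 0:
        raise ValueError("rate_per_second_usd must be positive")
    return credits_usd / rate_per_second_usd / 3600


for instance, rate in RATE_PER_SECOND_USD.items():
    hours = credit_runway_hours(SIGNUP_CREDITS_USD, rate)
    print(f"{instance}: about {hours:.1f} hours of running time")
```

Under those assumptions the credits cover roughly 210 hours of cpu.1 or a little under 20 hours of a T4, and because of scale-to-zero those hours only count while a replica is actually running.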

What makes this different: Most “deploy a model” platforms bill by the hour or by request. BentoCloud meters dedicated GPU/CPU instances by the second, then layers scale-to-zero on top — so you’re not paying for a warm idle GPU or rounding every short job up to a full hour. The Pro tier’s flat $1,000 platform fee is the price of admission to the best accelerators and multi-region, which is unusual for a self-serve inference product.
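
The rounding effect is easy to see with a short job. The sketch below prices a 90-second burst on a T4 at the per-second rate quoted in this profile, then compares it with a hypothetical provider that charges the same rate but bills every started hour in full. The hourly-rounding provider is an assumption made for illustration, not a description of any specific competitor.

```python
import math

T4_RATE_PER_SECOND_USD = 0.00014198  # gpu.t4.1 rate quoted in this profile


def per_second_cost(runtime_seconds: float, rate_per_second_usd: float) -> float:
    """Cost when only the seconds actually used are billed."""
    return runtime_seconds * rate_per_second_usd


def hourly_rounded_cost(runtime_seconds: float, rate_per_second_usd: float) -> float:
    """Hypothetical comparison: same rate, but each started hour is billed in full."""
    billed_hours = math.ceil(runtime_seconds / 3600)
    return billed_hours * 3600 * rate_per_second_usd


runtime = 90  # seconds
print(f"per-second billing: ${per_second_cost(runtime, T4_RATE_PER_SECOND_USD):.4f}")
print(f"hourly rounding:    ${hourly_rounded_cost(runtime, T4_RATE_PER_SECOND_USD):.4f}")
```

The gap shrinks as jobs approach a full hour and disappears for a job that runs exactly 3,600 seconds, so the benefit is concentrated in short, bursty workloads.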


Pricing by product

On-demand compute rates, metered per second, as of June 2026:

Instance | Spec | Per-second | Approx. /hr | Best for
CPU (cpu.1) | general compute | $0.00001322 | ~$0.05 | Lightweight / preprocessing
NVIDIA T4 (gpu.t4.1) | 16GB VRAM, 8 vCPU | $0.00014198 | ~$0.51 | Small-model & batch inference
NVIDIA L4 | 24GB VRAM, 12 vCPU | - | ~$0.80 | Cost-efficient inference
NVIDIA H100 | 80GB VRAM, 16 vCPU, 200GiB RAM | - | ~$2.65 | LLM / frontier inference
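
The per-second and hourly columns are related by a factor of 3,600. The sketch below checks the two published per-second rates against their approximate hourly figures and, going the other way, derives an implied per-second rate for the L4 and H100. Those implied values come from rounded hourly numbers, so treat them as estimates rather than published rates.

```python
SECONDS_PER_HOUR = 3600

# Published per-second rates quoted in this profile.
PER_SECOND_USD = {"cpu.1": 0.00001322, "gpu.t4.1": 0.00014198}
# Approximate hourly figures quoted in this profile (no per-second rate given).
APPROX_HOURLY_USD = {"NVIDIA L4": 0.80, "NVIDIA H100": 2.65}


def hourly_from_per_second(rate_per_second_usd: float) -> float:
    """Convert a per-second rate to an hourly rate."""
    return rate_per_second_usd * SECONDS_PER_HOUR


def per_second_from_hourly(rate_per_hour_usd: float) -> float:
    """Convert an hourly rate to an implied per-second rate."""
    return rate_per_hour_usd / SECONDS_PER_HOUR


for instance, rate in PER_SECOND_USD.items():
    print(f"{instance}: ${hourly_from_per_second(rate):.4f}/hr")

for instance, hourly in APPROX_HOURLY_USD.items():
    # Derived from a rounded hourly figure: an estimate, not a published rate.
    print(f"{instance}: about ${per_second_from_hourly(hourly):.8f}/sec (implied)")
```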

Tier structure on top of compute:

Tier | Platform fee | Compute | Key mechanics
Starter | Free | Pay-as-you-go, per second | $10 credits, scale-to-zero, credit-card billing
Pro | $1,000/mo | Usage on top | Priority A100/H100/H200, multi-region, invoice
Enterprise | Custom | Usage + commitments | BYOC / on-prem, SSO, committed-use discounts
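
Putting the tiers and the meter together, a monthly bill is roughly the platform fee plus metered compute, less any credits. The sketch below models that for Starter and Pro using the rates quoted in this profile. The usage mix is a made-up example, the assumption that credits only offset compute is mine, and Enterprise is left out because its pricing is custom.

```python
RATE_PER_SECOND_USD = {"cpu.1": 0.00001322, "gpu.t4.1": 0.00014198}
PLATFORM_FEE_USD = {"Starter": 0.0, "Pro": 1000.0}


def monthly_bill(tier: str, usage_seconds: dict, credits_usd: float = 0.0) -> float:
    """Platform fee plus metered compute, with credits applied to compute only."""
    if tier not in PLATFORM_FEE_USD:
        raise ValueError(f"unknown tier: {tier}")
    compute = sum(
        RATE_PER_SECOND_USD[instance] * seconds
        for instance, seconds in usage_seconds.items()
    )
    # Credits cannot push the compute charge below zero or offset the fee.
    return PLATFORM_FEE_USD[tier] + max(compute - credits_usd, 0.0)


# Hypothetical month: 40 running hours on a T4 and 200 on cpu.1.
usage = {"gpu.t4.1": 40 * 3600, "cpu.1": 200 * 3600}

print(f"Starter, first month with credits: ${monthly_bill('Starter', usage, 10.0):.2f}")
print(f"Pro, same usage:                   ${monthly_bill('Pro', usage):.2f}")
```

For a workload this small the compute charge is about $30, so nearly all of the Pro bill is the platform fee.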

Sales motions across products: Starter and Pro are self-serve / PLG (sign up, deploy, pay by card or invoice); Enterprise (BYOC, on-prem, committed-use) is sales-led. A100/H100/H200 capacity is prioritized for Pro and Enterprise.


Hidden costs : What BentoML users actually pay

BentoCloud’s headline meter is clean (per second, scale-to-zero), but a real bill has a few moving parts beyond the GPU rate:

Line item | Cost
GPU compute (e.g. H100) | ~$2.65/hr, metered per second while the replica is running
CPU compute (cpu.1) | $0.00001322/sec, ~$0.05/hr
Pro platform fee | $1,000/mo on top of usage (only if you need Pro GPUs/regions)
Min-replica / always-on | Disabling scale-to-zero keeps a replica warm — and billing — 24/7
Cold-start vs. warm trade-off | Keeping replicas warm cuts latency but removes the scale-to-zero savings

The biggest real-world cost lever is whether you let deployments scale to zero. Scale-to-zero is the headline saving, but latency-sensitive production endpoints often pin a minimum replica count to avoid cold starts, and a pinned GPU replica bills continuously at the per-second rate. On an H100 at ~$2.65/hr, that is roughly $1,900/month per always-on replica (about 720 hours in a 30-day month). The second lever is the $1,000/month Pro fee: it’s worth it once you genuinely need prioritized A100/H100/H200 capacity or multi-region, but it’s pure overhead for a small workload that would fit on Starter.
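
A rough way to size the always-on penalty is to compare a pinned replica with one that only bills while serving traffic. The sketch below does that for an H100 at the approximate hourly rate quoted in this profile, assuming a 30-day month. The 10% active fraction is a hypothetical input, and the model ignores any time billed during cold starts.

```python
H100_HOURLY_USD = 2.65      # approximate rate quoted in this profile
HOURS_PER_MONTH = 24 * 30   # assumes a 30-day month


def always_on_monthly(hourly_usd: float, min_replicas: int = 1) -> float:
    """Monthly cost when min_replicas are pinned and bill around the clock."""
    return hourly_usd * HOURS_PER_MONTH * min_replicas


def scale_to_zero_monthly(hourly_usd: float, active_fraction: float) -> float:
    """Monthly cost when one replica bills only for the fraction of time it runs."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return hourly_usd * HOURS_PER_MONTH * active_fraction


pinned = always_on_monthly(H100_HOURLY_USD)
bursty = scale_to_zero_monthly(H100_HOURLY_USD, active_fraction=0.10)
print(f"pinned H100 replica:       ${pinned:,.2f}/month")
print(f"scale-to-zero, 10% active: ${bursty:,.2f}/month")
```

The pinned figure scales linearly with the minimum replica count, so two always-on H100 replicas would cost about twice as much before any request is served.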

Want to estimate your own BentoCloud bill? Use the BentoML pricing calculator to model your costs based on instance type and runtime.


Pricing evolution : BentoML pricing history and changes

Cadence

Period | Price changes | Product / SKU additions | Notes
2023 | BentoCloud launched | - | $9M seed; pay-per-use managed serving on the open-source framework
2024–2025 | Tiers formalized | Starter / Pro / Enterprise | Per-second CPU/GPU metering, scale-to-zero, multi-region on Pro
2026 Q2 | Rates published | A100/H100/H200 on Pro+ | T4 ~$0.51/hr, L4 ~$0.80/hr, H100 ~$2.65/hr; $10 signup credits

Tracked range: 2023–present (BentoCloud commercial launch onward).

Notable changes

  • 2023 — BentoCloud launches as the managed, serverless commercial layer on top of the open-source BentoML framework, monetizing on pay-per-use compute, backed by a $9M seed.
  • 2024–2025 — Pricing settles into three tiers: free pay-as-you-go Starter, Pro at $1,000/mo plus usage with priority high-end GPUs and multi-region, and custom Enterprise BYOC. Compute metered per second across CPU and GPU instances.
  • June 2026 — On-demand rates in effect: CPU $0.00001322/sec, T4 $0.00014198/sec (~$0.51/hr), L4 ~$0.80/hr, H100 ~$2.65/hr; $10 signup credits; usage commitments unlock discounts.

The through-line is open-core monetization: the framework stays free while BentoCloud captures value on managed compute, with per-second granularity and scale-to-zero as the buyer-friendly hooks and a flat Pro fee gating the premium accelerators.


What’s unique : BentoML’s distinctive pricing mechanics

1. Per-second compute, not per-hour. BentoCloud meters dedicated GPU/CPU instances by the second (a T4 is literally priced as $0.00014198/sec), so short or bursty inference jobs don’t round up to a full billed hour the way they do on most hourly GPU clouds.

2. Scale-to-zero on dedicated serving. Idle deployments drop to zero replicas and stop billing — combining the cost profile of serverless with dedicated-instance performance, which is rare for model serving where teams usually keep GPUs warm.

3. Open-core with a Pro platform fee. The framework is free and self-hostable; BentoCloud monetizes the runtime. The $1,000/mo Pro fee is an explicit gate for priority A100/H100/H200 and multi-region rather than a per-seat or per-request charge.


Strengths & weaknesses

Strengths | Weaknesses
Per-second metering avoids paying for unused hours | $1,000/mo Pro fee is steep for small workloads
Scale-to-zero — idle deployments cost nothing | Premium GPUs (A100/H100/H200) gated behind Pro+
Free open-source framework, free Starter tier, $10 credits | Always-on replicas erase the scale-to-zero savings
Open-core: self-host the framework or buy the managed runtime | Cold starts are the trade-off for scale-to-zero
BYOC / on-prem option for regulated buyers | Smaller GPU catalog than raw GPU clouds at Starter level

Billing UX : BentoML billing controls and transparency

  • Billing controls — Per-second metering with scale-to-zero by default; teams can pin minimum replicas (trading cost for latency). Pro adds a fixed $1,000/mo platform fee; Enterprise uses committed-use discounts.
  • Usage visibility — A monitoring dashboard plus real-time logging are included from the Starter tier, so spend tracks directly to running replicas and instance type.
  • Payment options — Starter is billed monthly to a credit card on total usage; Pro and Enterprise are invoice-billed with terms set in individual order forms.

Strategic wins : Why BentoML’s pricing decisions worked

1. Open-core: free framework as the top of the funnel

By keeping BentoML free and open-source, the company built a large developer base that already packages models in its format — making BentoCloud the path of least resistance when those models go to production. See how AI companies structure pricing.

2. Per-second + scale-to-zero as a trust signal

Metering by the second and dropping idle deployments to zero directly addresses the buyer’s biggest fear in GPU serving — paying for idle accelerators. It’s a usage-aligned meter that lowers the risk of trying the platform. Related: outcome-based pricing trends.

3. A flat Pro fee to capture serious production teams

Rather than nickel-and-diming requests, the $1,000/mo Pro tier cleanly separates hobbyists from teams that need priority H100/H200 and multi-region — a simple value-metric gate. See choosing the right usage metric.


Areas to improve : Gaps in BentoML’s pricing approach

1. The Pro fee is a cliff, not a ramp

Jumping from free Starter to a $1,000/mo Pro fee is a steep step with little in between. A mid-tier (or pay-as-you-go access to better GPUs without the flat fee) would smooth the path for growing teams. See bill shock and cost unpredictability.
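
One way to see the cliff is to ask how much of a Pro bill is the flat fee at different usage levels, and how many GPU hours the fee alone would buy. The sketch below uses the $1,000 fee and the approximate H100 rate quoted in this profile. The compute-spend levels are hypothetical.

```python
PRO_FEE_USD = 1000.0
H100_HOURLY_USD = 2.65  # approximate rate quoted in this profile


def fee_share(compute_spend_usd: float) -> float:
    """Fraction of the total Pro bill that is the flat platform fee."""
    if compute_spend_usd < 0:
        raise ValueError("compute_spend_usd must be non-negative")
    return PRO_FEE_USD / (PRO_FEE_USD + compute_spend_usd)


def fee_in_gpu_hours(hourly_usd: float) -> float:
    """GPU hours that the platform fee alone would pay for at a given hourly rate."""
    return PRO_FEE_USD / hourly_usd


for spend in (200.0, 1000.0, 5000.0):  # hypothetical monthly compute spend
    print(f"compute ${spend:,.0f}/mo -> fee is {fee_share(spend):.0%} of the bill")

print(f"fee equals about {fee_in_gpu_hours(H100_HOURLY_USD):.0f} H100 hours")
```

At low spend the fee dominates the bill, and it only fades into the background once compute spend is several times the fee.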

2. Always-on cost is easy to under-estimate

Scale-to-zero is the headline, but production endpoints that pin replicas to avoid cold starts bill continuously — and that reality isn’t obvious from the per-second sticker. Clearer always-on cost modeling in-console would help.

3. Limited public rate card for premium GPUs

T4, L4, and H100 rates are discoverable, but A100/H200 pricing is effectively gated behind Pro/Enterprise. Publishing the full accelerator rate card would match the transparency buyers get from raw GPU clouds.


Key takeaways

  1. BentoCloud is pure per-second usage pricing on CPU/GPU compute, with a freemium Starter and a paid Pro platform fee. For the underlying model, see the introduction to usage-based pricing.
  2. Per-second metering + scale-to-zero is the buyer-friendly core — you don’t pay for idle GPUs or round short jobs up to an hour.
  3. It’s an open-core business: the BentoML framework is free; BentoCloud monetizes the managed runtime.
  4. The $1,000/mo Pro fee gates priority A100/H100/H200 and multi-region — a clean value-metric step, but a cliff from the free tier.
  5. Always-on replicas are the real hidden cost — pinning a GPU to dodge cold starts removes the scale-to-zero savings.

UBP implications

  1. Per-second metering builds trust in GPU serving. Aligning the meter to actual consumption (down to the second, with scale-to-zero) directly counters the buyer’s fear of paying for idle accelerators — a reusable pattern for any compute-heavy usage business.
  2. Open-core lets the free tier do the selling. A free, widely-adopted framework feeds the paid managed runtime, so the usage meter only has to convert developers who are already invested.
  3. A flat platform fee can cleanly segment usage tiers. BentoCloud’s $1,000/mo Pro fee separates serious teams from hobbyists without complicating the per-second compute meter — a simpler alternative to tiered per-request pricing.


Bottom line

BentoML is a clean example of open-core monetization in AI infra: the model-serving framework is free and widely adopted, while BentoCloud captures value on managed compute. Its pricing is pure usage — CPU/GPU instances metered by the second with scale-to-zero, so idle deployments cost nothing — wrapped in a free Starter, a $1,000/mo Pro tier for priority H100/H200 and multi-region, and custom Enterprise BYOC. The buyer-friendly hooks are per-second billing and scale-to-zero; the costs to watch are the flat Pro fee and always-on replicas that quietly undo the scale-to-zero savings. Browse the pricing blueprint for more fully-researched company profiles, or compare BentoML against other Infrastructure, Compute & MLOps companies.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment, listed newest first.

June 2026: Published instance rates (T4 ~$0.51/hr, L4 ~$0.80/hr, H100 ~$2.65/hr)

On-demand per-second rates in effect: CPU $0.00001322/sec, T4 $0.00014198/sec (~$0.51/hr), L4 ~$0.80/hr, H100 ~$2.65/hr; $10 signup credits; usage commitments unlock discounts.

2024–2025: Per-second compute tiers formalized (Starter / Pro / Enterprise)

BentoCloud settled into a three-tier shape: free pay-as-you-go Starter, Pro at $1,000/mo plus usage with priority A100/H100/H200 and multi-region, and custom Enterprise BYOC. Compute billed per second across CPU and GPU instance types.

2023: BentoCloud commercial launch backed by $9M seed

BentoML raised a $9M seed (DCM Ventures, Bow Capital, Firestreak Ventures) and built out BentoCloud, the managed serverless layer on top of the open-source framework, on a pay-per-use compute model.

Trivia
  • BentoML is open-source and free — the company makes money on BentoCloud, the managed serverless layer that deploys and auto-scales the 'Bentos' you package with the framework.
  • BentoCloud bills compute by the second, not the hour: a T4 GPU is metered at $0.00014198/sec, so you don't pay for a full hour you didn't use.
  • Deployments scale to zero, meaning an idle Starter project can cost nothing between requests — unusual for dedicated GPU serving.

Questions & answers

How does BentoCloud's pricing work?
BentoCloud is pure usage-based: you pay per second for the CPU and GPU instances your deployments consume, with scale-to-zero so idle services cost nothing. A free Starter tier runs pay-as-you-go (with $10 in signup credits), a Pro tier adds a $1,000/month platform fee for priority high-end GPUs and multi-region, and Enterprise is custom-quoted for self-hosting or deployment inside your own cloud (BYOC).
How much does a GPU cost on BentoCloud?
BentoCloud meters per second. An NVIDIA T4 (gpu.t4.1) is $0.00014198/sec (about $0.51/hr), an L4 is roughly $0.80/hr, and an H100 is about $2.65/hr. A plain CPU instance (cpu.1) is $0.00001322/sec. A100, H100 and H200 capacity is prioritized for Pro and Enterprise customers.
Does BentoML have a free tier?
Yes. The Starter tier is free to begin and runs pay-as-you-go — you pay only for compute consumed, billed monthly to a credit card. New accounts also get $10 in free credits. Because deployments scale to zero, an idle Starter project can sit at no cost between requests.
What's the difference between BentoML and BentoCloud?
BentoML is the free open-source Python framework for packaging models and AI apps into deployable 'Bentos.' BentoCloud is the paid managed platform that deploys, auto-scales, and observes those Bentos as serverless inference endpoints. You can run BentoML yourself for free; BentoCloud is how the company monetizes it.