AI Infrastructure & Cloud Pricing: Examples & Companies

19 companies in the corpus
Definition

AI Infrastructure & Cloud Pricing is pricing for AI compute infrastructure: GPU clouds, serverless inference, and training platforms.

Also known as: GPU Cloud Pricing, AI Compute Pricing

What is it

AI Infrastructure & Cloud pricing is pricing for AI compute infrastructure — GPU clouds, serverless inference, and training platforms.

This is the bottom layer of the AI stack: the raw silicon and orchestration that every model, agent, and application ultimately runs on. Vendors in this category sell GPU-hours, inference seconds, and per-output runs rather than finished features. The corpus tracks 19 companies here, from pure GPU marketplaces like Vast.ai and RunPod to serverless inference clouds like Groq, Together AI, and DeepInfra, to per-second compute platforms like Modal and Lightning AI.

What unites them is that the underlying product is fungible — an H100 is an H100 no matter whose data center it sits in — so pricing, not features, is the primary battleground. That economic reality pushes the category toward intensely competitive, transparent unit rates and a wider set of discount levers (spot, batch, reserved, committed spend) than any other product segment in the corpus.

The category also has the highest unit proliferation. A single vendor like DeepInfra meters across four units at once — per token, per image, per GPU-hour, and reserved clusters — because customers arrive with different workloads (chat inference, media generation, training) that each have a natural billing unit. For a deeper grounding in how to pick those units, see the guide on choosing the right usage metric.


How it works

Infra-cloud pricing is built from a small set of billing units, layered with availability tiers and volume discounts. Almost every vendor here is fundamentally pure-usage — you pay for what you consume — and then bolts commitment or hybrid options on top for larger buyers.

Billing unit | What it meters | Example from the corpus
GPU-hour | Rented compute, on-demand or spot | RunPod H100 at $2.89/hr; Vast.ai interruptible from $0.194/hr
GPU-second | Fine-grained serverless compute | Modal H100 at $0.001097/sec; Replicate dedicated H100 at $0.001525/sec
Per 1M tokens | LLM / embedding inference | Groq Llama 3.1 8B at $0.05/$0.08; DeepInfra from $0.02/1M in
Per output | Generative-media runs | fal images from $0.02; Replicate FLUX at $0.025–$0.09/image
Credit pool | Pre-funded compute drawn down by usage | Lightning AI (15 monthly credits ≈ 80 spot GPU-hrs); Anyscale ACUs

On top of those units sit three near-universal discount levers. Availability tiers trade reliability for price: on-demand capacity at a fixed rate, spot or interruptible capacity that can be reclaimed mid-job (Vast.ai’s auction, RunPod’s Community Cloud at ~20–40% below Secure Cloud). Batch discounts trade latency for price — Groq and Together AI both cut serverless inference 50% for asynchronous jobs. Reserved and committed spend trade flexibility for price, locking a discounted rate against an annual commitment.
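As a minimal sketch of how a discount lever applies to a token-metered bill, the snippet below prices one workload at the Groq Llama 3.1 8B rates quoted in the table and again with the 50% batch discount. It assumes the two quoted figures are input and output rates per 1M tokens and that the batch discount applies equally to both, neither of which the source states outright. The function name and the workload size are hypothetical.

```python
from decimal import Decimal

def token_cost(input_tokens, output_tokens, in_rate, out_rate, discount=Decimal("0")):
    """Dollar cost on a per-1M-token meter, with an optional fractional discount."""
    million = Decimal(1_000_000)
    gross = (Decimal(input_tokens) / million) * in_rate \
          + (Decimal(output_tokens) / million) * out_rate
    return gross * (Decimal(1) - discount)

# Rates from the table above, read as dollars per 1M input / output tokens (assumption).
IN_RATE = Decimal("0.05")
OUT_RATE = Decimal("0.08")

# Hypothetical workload: 200M input tokens and 50M output tokens.
realtime = token_cost(200_000_000, 50_000_000, IN_RATE, OUT_RATE)
batch = token_cost(200_000_000, 50_000_000, IN_RATE, OUT_RATE, discount=Decimal("0.5"))

print(f"real-time: ${realtime:.2f}")  # $14.00
print(f"batch:     ${batch:.2f}")     # $7.00
```

The same function covers an availability-tier discount by passing a different `discount`, for example a value between 0.2 and 0.4 for the RunPod Community Cloud range cited above.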

Unit math: A pure GPU-second bill is GPU-seconds × per-second rate. Running a job for 90 minutes on a Modal H100: 5,400 sec × $0.001097 = $5.92. The same job on a fixed-hourly H100 at $2.89/hr (RunPod, billed by the hour) costs 2 hrs × $2.89 = $5.78, but you pay for the rounded-up second hour even if the job finishes early: about $1.45 of that bill covers 30 idle minutes. That rounding waste is exactly the gap per-second billing closes. The hourly bill still comes out slightly lower here only because Modal's effective rate (≈ $3.95/hr) is higher than RunPod's.
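The unit math can be reproduced in a few lines. This is a sketch using the two rates quoted in this section; the function names are hypothetical, and the hourly function assumes billing rounds up to the next whole hour, as the example describes.

```python
from decimal import Decimal

def per_second_cost(runtime_sec, rate_per_sec):
    """Bill for exactly the seconds used."""
    return Decimal(runtime_sec) * rate_per_sec

def hourly_rounded_cost(runtime_sec, rate_per_hour):
    """Bill whole hours, rounding any partial hour up."""
    billed_hours = (runtime_sec + 3599) // 3600  # integer ceiling division
    return Decimal(billed_hours) * rate_per_hour

runtime = 90 * 60  # 90 minutes = 5,400 seconds

modal = per_second_cost(runtime, Decimal("0.001097"))
runpod = hourly_rounded_cost(runtime, Decimal("2.89"))

# Portion of the hourly bill that pays for time the job did not run.
used_hours = Decimal(runtime) / Decimal(3600)
rounding_waste = runpod - used_hours * Decimal("2.89")

print(f"per-second bill: ${modal:.2f}")
print(f"hourly bill:     ${runpod:.2f}")
print(f"rounding waste:  ${rounding_waste:.3f}")
```

Changing `runtime` to a short, bursty job (say 10 minutes) shows the effect more sharply: the hourly bill stays at one full hour while the per-second bill shrinks in proportion.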


Companies using this

Nineteen in-corpus companies sell AI infrastructure and cloud compute, the broadest set in the product-categories axis. They range from raw GPU marketplaces like Vast.ai and RunPod to inference-specialized clouds like Groq, Cerebras, and fal.


Patterns observed

  • Pure-usage is the default; commitment is the upsell. Almost every company here lists pure-usage as its primary model — see fal, Novita AI, and DeepInfra for no-subscription pay-as-you-go. The larger clouds then layer commitment tiers on top: RunPod, Groq, Together AI, Anyscale, Cerebras, and Baseten all carry a commitment tier for enterprise spend.

  • The billing unit keeps shrinking. Hourly GPU rental was the original unit, but the competitive frontier has moved to the second. Modal bills H100 at $0.001097/sec, Replicate at $0.001525/sec, and Lightning AI bills its Studios “by the second” against a credit pool. Per-second billing is now a selling point precisely because it eliminates the rounding waste of hourly rates.

  • Availability tiering is how marketplaces compete. Vast.ai runs an actual auction — interruptible bids from $0.194/hr — and RunPod splits Secure Cloud from a Community Cloud that runs ~20–40% cheaper. The same hardware sells at two or three prices depending on how much reliability the buyer is willing to give up.

  • Batch is the standard latency-for-price trade. Groq and Together AI both advertise a flat 50% discount for asynchronous batch inference, mirroring the batch-API discounts seen at the foundation-model layer. It has become a near-default SKU for any token-metered cloud.

  • Free tiers are starter credits, not free usage. Where a free tier exists it is almost always a credit grant: Modal’s $30/month, Anyscale’s $100 ACU credit, Cerebras’s $10 Developer tier plus a rate-capped free tier. Perpetual free GPU time is not viable when the underlying cost is real silicon.


Counterexamples & variants

The clearest variant is the pure per-output model, which abandons compute-time billing entirely. fal has no seats, no subscription, and no free tier — it charges per generated image (Seedream V4 at $0.03, Flux Kontext Pro at $0.04) or per megapixel. For a buyer this is the opposite of a GPU-hour deal: you never see the hardware or the runtime, only the output you asked for. It works because generative-media inference has a predictable cost per run, but it would fail for long-running training jobs where output is not a meaningful unit.

A second variant is the credit-pool subscription, which looks like SaaS on the surface but is usage underneath. Lightning AI bundles a monthly credit allotment into freemium seat tiers (15 credits ≈ 80 spot GPU-hours on the free plan), then bills per-second compute against those credits plus pay-as-you-go overage. Anyscale does the same with Anyscale Compute Units (ACUs). This blurs the line with the hybrid pricing model and trades the radical transparency of a raw GPU-hour rate for budget predictability.
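A credit pool of this kind can be sketched as a balance that per-second usage draws down, with anything beyond the allotment tracked as overage. The class and its fields are hypothetical, not Lightning AI's or Anyscale's actual accounting. The rate is derived from the free-plan figure above (15 credits ≈ 80 spot GPU-hours, so roughly 0.1875 credits per spot GPU-hour), and overage is kept in credits because the source gives no credit-to-dollar conversion.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class CreditPool:
    balance: Decimal                  # credits left in this month's allotment
    overage: Decimal = Decimal("0")   # credits consumed beyond the allotment

    def draw(self, gpu_seconds: int, credits_per_gpu_hour: Decimal) -> Decimal:
        """Meter a job by the second and draw its cost from the pool."""
        cost = Decimal(gpu_seconds) / Decimal(3600) * credits_per_gpu_hour
        covered = min(cost, self.balance)
        self.balance -= covered
        self.overage += cost - covered  # billed pay-as-you-go on top of the plan
        return cost

# Approximate rate implied by "15 credits ≈ 80 spot GPU-hours" (assumption).
RATE = Decimal(15) / Decimal(80)

pool = CreditPool(balance=Decimal(15))
pool.draw(60 * 3600, RATE)  # 60 GPU-hours, fully covered by the pool
pool.draw(30 * 3600, RATE)  # 30 more GPU-hours, partly covered, rest is overage

print(f"remaining credits: {pool.balance}")
print(f"overage credits:   {pool.overage}")
```

The buyer sees a predictable monthly allotment, while the meter underneath is still per-second usage, which is why this variant sits between subscription and pure-usage pricing.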

The category’s edge case is the inference-specialized chip cloud. Cerebras and Groq sell custom silicon (wafer-scale and LPU respectively) rather than commodity NVIDIA GPUs, so they cannot compete on raw GPU-hour rates the way Vast.ai does. Instead they price per token on the strength of speed — pitching tokens-per-second throughput as the value metric, with hardware contracts reserved for the enterprise tier. It is the one corner of the category where the product is not fungible.


What this means for buyers vs vendors

For buyers

Normalize every quote to a comparable unit before you compare vendors: convert per-second and per-hour rates to the same basis and confirm whether you are billed for actual runtime or rounded-up hours, because per-second clouds like Modal can beat a cheaper headline hourly rate on bursty workloads. Decide upfront how much reliability or latency you will sacrifice for price: spot and interruptible capacity (Vast.ai, RunPod Community Cloud) and batch discounts (Groq, Together AI) can halve your bill if your jobs tolerate interruption or latency. And before signing a reserved or committed-spend deal, model your actual utilization: committed discounts only pay off above a usage floor, and below that floor the pure-usage default is cheaper. The guide on usage invoicing and billing cycles is a useful primer on reading these meters.
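The usage floor for a commitment can be estimated with simple arithmetic. The sketch below is illustrative only: the 30% discount and the 1,000-hour commitment are hypothetical numbers, the $2.89/hr on-demand rate is the RunPod H100 figure quoted earlier, and it assumes hours beyond the commitment bill at the on-demand rate. Real contracts may differ on every one of these points.

```python
from decimal import Decimal

def pay_as_you_go(hours_used, rate):
    """Pure-usage bill: pay only for hours consumed."""
    return Decimal(hours_used) * rate

def committed(hours_committed, hours_used, rate, discount):
    """Committed hours are paid for whether used or not;
    extra hours bill at the on-demand rate (assumption)."""
    extra = max(hours_used - hours_committed, 0)
    return Decimal(hours_committed) * rate * (1 - discount) + Decimal(extra) * rate

def break_even_hours(hours_committed, discount):
    """Usage at which the committed bill equals the pay-as-you-go bill."""
    return Decimal(hours_committed) * (1 - discount)

RATE = Decimal("2.89")      # on-demand $/hr, from the RunPod H100 figure
DISCOUNT = Decimal("0.30")  # hypothetical committed-spend discount
COMMIT = 1000               # hypothetical committed GPU-hours per period

print(f"break-even usage: {break_even_hours(COMMIT, DISCOUNT)} hours")
for used in (600, 900):
    print(used, pay_as_you_go(used, RATE), committed(COMMIT, used, RATE, DISCOUNT))
```

Under these assumptions the break-even utilization is simply one minus the discount, so a 30% discount only pays off if you expect to use more than 70% of what you commit to.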

For vendors

Pure-usage is table stakes in this category: buyers expect to pay from the first second with no minimum, so lead with a transparent unit rate and layer commitment on top rather than gating entry behind a subscription. Compete on the dimension you actually win: marketplaces like Vast.ai win on raw price via auctions and spot capacity, while chip clouds like Cerebras and Groq win on speed and price per token, not per hour. The infrastructure required is non-trivial. Per-second metering, spot reclamation, batch queuing, and credit-pool accounting all demand precise usage tracking, which is covered in the guide on tracking and metering usage events. You can sanity-check competitive token rates against the live pricing calculator. This commitment-on-top-of-usage shape is documented further in the infra commitment as standard trend.

Company | Product | Pricing model | Billing units | Free tier | Verified
Anyscale | Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units) | pure-usage, commitment, hybrid | gpu-hours, cpu-hours, credits | Yes | 2026-05-29
Baseten | ML inference infrastructure: dedicated GPU deployments, Model APIs, and Truss framework | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-29
Browserbase | Browser-agent infrastructure: headless browser sessions, web Search/Fetch APIs, agent identity, runtime, and a model gateway behind one API key | freemium, hybrid, pure-usage | browser-hours, api-calls, requests, +2 | Yes | 2026-06-02
Cerebras | Wafer-scale AI inference cloud and WSE hardware systems | pure-usage, subscription, commitment | tokens, api-calls, gpu-hours | Yes | 2026-05-30
DeepInfra | Serverless inference cloud: per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters | pure-usage, commitment | tokens, gpu-hours, requests, +1 | No | 2026-06-02
E2B | Open-source cloud sandboxes for AI agents: secure, isolated micro-VMs that run LLM-generated code, coding agents, and computer-use workflows | freemium, hybrid | cpu-hours, gb-hours, storage-gb | Yes | 2026-06-02
Fal | Generative-media inference platform: serverless per-output model APIs plus dedicated GPU compute | pure-usage | gpu-hours, requests, media-minutes | No | 2026-06-01
Fireworks AI | Generative AI inference platform: serverless per-token, on-demand GPU, fine-tuning, batch API | pure-usage, hybrid, commitment | tokens, gpu-hours, requests | Yes | 2026-05-30
Groq | GroqCloud: LPU-based ultra-low-latency inference API for Llama, GPT-OSS, Qwen, Whisper, and Mixtral | pure-usage, hybrid, commitment | tokens, requests, api-calls | Yes | 2026-05-29
Lightning AI | Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool. | hybrid, freemium, pure-usage | gpu-hours, cpu-hours, credits, +3 | Yes | 2026-06-02
Modal | Serverless compute and GPU platform: per-second billing for Python functions, batch jobs, and model serving | pure-usage, freemium, subscription, +1 | gpu-hours, cpu-hours, gb-hours, +2 | Yes | 2026-05-29
Novita AI | Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API | pure-usage, freemium | tokens, gpu-hours, cpu-hours, +2 | Yes | 2026-06-02
Replicate | Cloud platform for running, fine-tuning, and deploying AI models via REST API | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-30
RunPod | GPU cloud marketplace: Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage | pure-usage, hybrid, commitment | gpu-hours, storage-gb | No | 2026-05-30
Together AI | AI Acceleration Cloud: serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning | pure-usage, hybrid, commitment | tokens, gpu-hours, cpu-hours, +1 | Yes | 2026-05-29
turbopuffer | Serverless vector and full-text search database on object storage | pure-usage, commitment | storage-gb, vectors-indexed, gb-hours, +1 | No | 2026-06-04
Upstash | Upstash (Redis, Vector, QStash, Search, Workflow) | pure-usage, freemium, hybrid | requests, api-calls, vectors-indexed, +3 | Yes | 2026-06-03
Vast.ai | GPU rental marketplace: on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference | pure-usage, commitment | gpu-hours, storage-gb, bandwidth-gb | No | 2026-06-02
ZenRows | Universal Scraper API, Scraping Browser, and Residential Proxies | hybrid, subscription, pure-usage | requests, api-calls, bandwidth-gb, +2 | Yes | 2026-06-04

FAQ

What is AI infrastructure and cloud pricing?

It is the pricing for the raw AI compute layer — GPU clouds, serverless inference, and training platforms. Vendors charge by the GPU-hour or GPU-second, per token, or per output, almost always on a pure pay-as-you-go basis with spot, reserved, and committed-spend discounts on top.

How much does an H100 GPU cost per hour?

On-demand H100 rates in the corpus cluster around $2–$3 per hour (RunPod lists H100 from $2.89/hr). Per-second clouds like Modal ($0.001097/sec ≈ $3.95/hr) and Replicate ($0.001525/sec ≈ $5.49/hr) bill the same hardware in finer increments, at a higher effective hourly rate, and spot or interruptible capacity (Vast.ai from $0.194/hr for smaller cards) goes far lower.

What is the difference between spot, on-demand, and reserved GPU pricing?

On-demand GPUs run at a fixed hourly rate with guaranteed availability. Spot or interruptible GPUs (Vast.ai auction bids, RunPod Community Cloud) are cheaper but can be reclaimed mid-job. Reserved or committed capacity locks a discounted rate in exchange for a usage commitment, which most clouds — RunPod, Groq, Together AI, Anyscale — offer to enterprise buyers.

Do AI infra clouds bill per hour, per second, or per token?

All three coexist. GPU rentals are billed per hour (Vast.ai, RunPod) or per second (Modal, Replicate, Lightning AI). Serverless inference is billed per million tokens (Groq, Together AI, DeepInfra) or per output such as per image (fal, Replicate). Most clouds run several of these meters in parallel under one account.

Which AI infrastructure providers offer a free tier?

Most do, usually as starter credits rather than perpetual free usage: Modal gives $30/month in credits, Anyscale $100 in ACU credits, and Cerebras a free rate-capped tier plus a $10 Developer tier. Pure-usage clouds like fal, Vast.ai, and DeepInfra skip the free tier and bill from the first dollar.

Why is AI infrastructure pricing so competitive?

The product is fungible — an H100 is an H100 regardless of who racks it — so vendors compete almost entirely on price, capacity, and distribution. That drives the unit proliferation and aggressive spot, batch, and reserved discounts seen across RunPod, Vast.ai, Groq, and Together AI.

Trivia

  • The smallest billing unit in the whole corpus lives here: Modal meters H100 compute at $0.001097 per second, and Replicate prices its dedicated H100 at $0.001525 per second — both bill to the second, not the hour.

  • Vast.ai runs a live auction: the highest bidder gets the machine, so its interruptible (spot) GPUs start from $0.194/hr, making the same silicon 50%+ cheaper than the fixed on-demand rates competitors charge.

  • Batch is the category's standard discount lever — Groq and Together AI both cut serverless inference 50% for asynchronous batch jobs, trading latency for price.
