AI Infrastructure & Cloud Pricing: Examples & Companies

What is it

AI Infrastructure & Cloud pricing covers how vendors charge for AI compute infrastructure: GPU clouds, serverless inference, and training platforms.

This is the bottom layer of the AI stack: the raw silicon and orchestration that every model, agent, and application ultimately runs on. Vendors in this category sell GPU-hours, inference seconds, tokens, per-output runs, and database operations rather than finished features. The corpus tracks 42 companies here, the broadest product category on the site, ranging from pure GPU marketplaces like Vast.ai and RunPod, to serverless inference clouds like Groq, Together AI, and DeepInfra, to per-second compute platforms like Modal and Lightning AI.

The category has widened well beyond bare GPU rental. It now includes hyperscale-class clouds like CoreWeave, Lambda, and Nebius; managed vector databases such as Pinecone, Qdrant, and Weaviate; model gateways like OpenRouter and Portkey; agent sandboxes (E2B, Modal); and the observability layer that watches it all (Helicone, LangSmith, Arize). What unites them is that they all sell consumption of infrastructure, priced by a metered unit rather than a per-seat license.

For the GPU-cloud core the underlying product is fungible — an H100 is an H100 no matter whose data center it sits in — so pricing, not features, is the primary battleground. That economic reality pushes the raw-compute vendors toward intensely competitive, transparent unit rates and a wider set of discount levers (spot, batch, reserved, committed spend) than any other product segment in the corpus. The category also carries the highest unit proliferation: a single vendor like DeepInfra meters across per-token, per-image, per-minute media, and per-GPU-hour units at once, because customers arrive with different workloads that each have a natural billing unit. For a deeper grounding in how to pick those units, see the guide on choosing the right usage metric.

[Figure: One metered H100, three prices set by reliability]

How it works

Infra-cloud pricing is built from a small set of billing units, layered with availability tiers and volume discounts. Almost every raw-compute vendor here is fundamentally pure-usage — you pay for what you consume — and then bolts commitment or hybrid options on top for larger buyers.

Billing unit	What it meters	Example from the corpus
GPU-hour	Rented compute, on-demand or spot	RunPod H100 PCIe at $2.89/hr; Vast.ai interruptible from $0.194/hr; CoreWeave HGX H100 ≈ $6.16/GPU/hr
GPU-second	Fine-grained serverless compute	Modal H100 at $0.001097/sec; Replicate dedicated H100 at $0.001525/sec
Per 1M tokens	LLM / embedding inference	Groq Llama 3.1 8B at $0.05/$0.08 (input/output); DeepInfra Llama 3.1 8B at $0.02/$0.05 (input/output)
Per output	Generative-media runs	fal Seedream V4 at $0.03/image, Flux Kontext Pro at $0.04/image
DB operations	Vector reads/writes + storage	Pinecone ~$16–18/M read units, ~$4–4.50/M write units, ~$0.33/GB-mo
Credit pool	Pre-funded compute drawn down by usage	Lightning AI (credits ≈ spot GPU-hours); Anyscale ACUs

On top of those units sit three near-universal discount levers. Availability tiers trade reliability for price: on-demand capacity at a fixed rate versus spot or interruptible capacity that can be reclaimed mid-job (Vast.ai’s auction, RunPod’s Community Cloud, CoreWeave’s 8-GPU HGX H100 node at $19.71/hr spot vs $49.24/hr on-demand). Batch discounts trade latency for price: Groq and Together AI both cut serverless inference 50% for asynchronous jobs. Reserved and committed spend trade flexibility for price, locking in a discounted rate against a commitment: Together AI’s reserved cluster H100 drops to $3.09/hr, and CoreWeave quotes reserved contracts at up to 60% off on-demand.
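As a rough illustration, the discount implied by a tier can be worked out from a published price pair. The sketch below uses the CoreWeave node prices quoted above (the node holds 8 GPUs); the function name and the $0.10 per 1M tokens base rate for the batch case are hypothetical, chosen only to show the arithmetic.

```python
def discount_pct(list_price: float, discounted_price: float) -> float:
    """Percentage saved relative to the list (on-demand) price."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    return (1 - discounted_price / list_price) * 100


# CoreWeave HGX H100 node: on-demand vs spot, as quoted in this section.
coreweave_spot = discount_pct(49.24, 19.71)

# A flat 50% batch discount applied to a hypothetical token rate.
hypothetical_rate_per_1m = 0.10
batch_rate = hypothetical_rate_per_1m * (1 - 0.50)

print(f"CoreWeave spot discount: {coreweave_spot:.1f}%")
print(f"Batch rate: ${batch_rate:.3f} per 1M tokens")
```

On these figures the spot tier works out to roughly 60% below on-demand, about the same depth as the best reserved discount CoreWeave quotes, but paid for with interruption risk rather than a commitment.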

Unit math: A pure GPU-second bill is GPU-seconds × per-second rate. Running a job for 90 minutes on a Modal H100: 5,400 sec × $0.001097 = $5.92. The same job on a fixed-hourly H100 at $2.89/hr (RunPod, billed by the hour) costs 2 hrs × $2.89 = $5.78. The lower headline rate keeps the hourly bill slightly cheaper here, but 30 of the 120 billed minutes are idle: at actual runtime the job would cost 1.5 hrs × $2.89 ≈ $4.34, so a quarter of the hourly bill pays for unused time. That rounding waste is exactly the gap per-second billing closes.
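The same comparison can be sketched in a few lines of Python. This is a minimal illustration using the rates quoted above; the function names are hypothetical, and the hourly case assumes partial hours are rounded up to the next full hour, as described in the text.

```python
import math


def per_second_cost(runtime_sec: int, rate_per_sec: float) -> float:
    """Bill only for the seconds actually used."""
    return runtime_sec * rate_per_sec


def hourly_rounded_cost(runtime_sec: int, rate_per_hour: float) -> float:
    """Bill whole hours, rounding any partial hour up (assumed policy)."""
    billed_hours = math.ceil(runtime_sec / 3600)
    return billed_hours * rate_per_hour


runtime = 90 * 60  # a 90-minute job, in seconds

modal = per_second_cost(runtime, 0.001097)
runpod = hourly_rounded_cost(runtime, 2.89)
# Portion of the hourly bill that pays for time the job did not use.
idle_spend = runpod - (runtime / 3600) * 2.89

print(f"Per-second: ${modal:.2f}")
print(f"Hourly, rounded up: ${runpod:.2f} (of which ${idle_spend:.2f} is idle time)")
```

Changing `runtime` to a value just over a whole hour, such as 61 minutes, shows where per-second billing pulls ahead even against a lower hourly rate.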

Companies using this

Forty-two in-corpus companies sell AI infrastructure and cloud compute — the broadest set in the product-categories axis. They span raw GPU marketplaces like Vast.ai and RunPod, hyperscale GPU clouds like CoreWeave and Lambda, inference-specialized chip clouds like Groq and Cerebras, and the surrounding layer of vector databases, gateways, sandboxes, and observability that AI applications depend on.

Patterns observed

Pure-usage is the default; commitment is the upsell. Almost every raw-compute company here lists pure-usage as its primary model — see fal, Novita AI, and DeepInfra for no-subscription pay-as-you-go, and Lambda and Nebius for per-GPU-hour clouds that bill from the first hour. The larger clouds then layer commitment on top: RunPod, Groq, Together AI, DeepInfra, and Cerebras all carry a commitment tier for enterprise spend, and CoreWeave earns the bulk of its revenue from sales-quoted reserved contracts.
The billing unit keeps shrinking. Hourly GPU rental was the original unit, but the competitive frontier has moved to the second. Modal bills H100 at $0.001097/sec, Replicate at $0.001525/sec, and Lightning AI bills its Studios “by the second” against a credit pool. Per-second billing is now a selling point precisely because it eliminates the rounding waste of hourly rates. Agent sandboxes like E2B inherit the same per-second compute model for micro-VM run-time.
Availability tiering is how marketplaces compete. Vast.ai runs an actual auction — interruptible bids from $0.194/hr — RunPod splits Secure Cloud from a cheaper Community Cloud, and CoreWeave publishes on-demand and spot side by side. The same hardware sells at two or three prices depending on how much reliability the buyer is willing to give up.
Batch is the standard latency-for-price trade. Groq and Together AI both advertise a flat 50% discount for asynchronous batch inference, mirroring the batch-API discounts seen at the foundation-model layer. It has become a near-default SKU for any token-metered cloud.
Free tiers are starter credits or capped plans, not free compute. Where a free tier exists it is almost always a credit grant or a capped starter: Modal’s $30/month, Anyscale’s $100 ACU credit, Cerebras’s $10 Developer tier plus a rate-capped free tier, and Pinecone’s capped Starter plan (2 GB storage, 1M read units/month). Perpetual free GPU time is not viable when the underlying cost is real silicon.
The adjacent infra layer prices on operations, not silicon. Beyond raw GPUs, the category’s databases, gateways, and observability tools meter their own units. Pinecone bills read units, write units, and per-GB storage; OpenRouter passes through model prices at no markup and monetizes with a 5.5% credit-purchase fee; and Helicone prices log volume with a free tier and a $79 Pro plan. These vendors compete more on features than on fungible-hardware price, but still converge on transparent usage meters.

Counterexamples & variants

The clearest variant is the pure per-output model, which abandons compute-time billing entirely. fal has no seats, no subscription, and no free tier: it charges per generated image (Seedream V4 at $0.03, Flux Kontext Pro at $0.04), per megapixel, or per second of video. For a buyer this is the opposite of a GPU-hour deal: you never see the hardware or the runtime, only the output you asked for. It works because generative-media inference has a predictable cost per run, but it would fail for long-running training jobs where output is not a meaningful unit.

A second variant is the credit-pool subscription, which looks like SaaS on the surface but is usage underneath. Lightning AI bundles a monthly credit allotment into freemium seat tiers, then bills per-second compute against those credits plus pay-as-you-go overage. Anyscale does the same with Anyscale Compute Units (ACUs). This blurs the line with the hybrid pricing model and trades the radical transparency of a raw GPU-hour rate for budget predictability.

A third variant is the percentage-fee gateway, where the vendor sells no compute of its own. OpenRouter routes across 400+ models at pass-through per-token pricing with zero markup and instead charges a 5.5% fee on prepaid credit purchases (minimum $0.80; 5% on crypto), plus a 5% BYOK fee after 1M free requests a month. There is no GPU-hour, no spot tier, and no reserved discount — the meter is the transaction, not the silicon. It is closer to a payments rail than a cloud.
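A small sketch of how the credit-purchase fee might be computed. It assumes the $0.80 minimum acts as a floor on the fee itself, which is one reading of the figures above and not a confirmed billing rule; the function name is hypothetical.

```python
def credit_purchase_fee(
    amount_usd: float, rate: float = 0.055, minimum: float = 0.80
) -> float:
    """Fee on a prepaid credit purchase: a percentage, with an assumed floor."""
    if amount_usd <= 0:
        raise ValueError("amount_usd must be positive")
    return max(minimum, amount_usd * rate)


for amount in (10, 100):
    fee = credit_purchase_fee(amount)
    print(f"${amount} purchase: fee ${fee:.2f} ({fee / amount:.1%} effective)")
```

Under this assumption, purchases below about $14.55 (0.80 / 0.055) pay an effective rate above 5.5%, so small top-ups are proportionally the most expensive.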

The category’s true edge case is the inference-specialized chip cloud. Cerebras and Groq sell custom silicon (wafer-scale and LPU respectively) rather than commodity NVIDIA GPUs, so they cannot compete on raw GPU-hour rates the way Vast.ai does. Instead they price per token on the strength of speed — pitching tokens-per-second throughput as the value metric, with hardware contracts reserved for the enterprise tier. It is the one corner of the raw-compute category where the product is not fungible.

What this means for buyers vs vendors

For buyers

Normalize every quote to a comparable unit before you compare vendors — convert per-second and per-hour rates to the same basis and confirm whether you are billed for actual runtime or rounded-up hours, because per-second clouds like Modal can beat a cheaper headline hourly rate on bursty workloads. Watch for hidden node-level pricing on the big clouds: CoreWeave publishes 8-GPU node rates, so an HGX H100 node listed at $49.24/hr is really about $6.16 per GPU — always divide down before comparing to a single-GPU marketplace like RunPod at $2.89/hr.
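One way to do that normalization is to reduce every quote to dollars per GPU-hour. The sketch below uses the rates quoted in this article; the `Quote` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    vendor: str
    price: float     # headline price in USD
    period_sec: int  # seconds the price covers (1 = per second, 3600 = per hour)
    gpus: int = 1    # GPUs included in the price (8 for an HGX node)

    def per_gpu_hour(self) -> float:
        """Convert the headline price to USD per single GPU per hour."""
        return self.price / self.gpus * (3600 / self.period_sec)


quotes = [
    Quote("RunPod H100 PCIe", 2.89, 3600),
    Quote("Modal H100", 0.001097, 1),
    Quote("Replicate H100", 0.001525, 1),
    Quote("CoreWeave HGX H100 node", 49.24, 3600, gpus=8),
]

# Cheapest first, on a like-for-like basis.
for q in sorted(quotes, key=Quote.per_gpu_hour):
    print(f"{q.vendor:26s} ${q.per_gpu_hour():.2f}/GPU-hr")
```

This compares headline rates only. It does not account for billing granularity, so a per-second vendor with a higher normalized rate can still cost less on short or bursty jobs.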

Decide upfront how much you will sacrifice for price. Spot and interruptible capacity (Vast.ai, RunPod Community Cloud, CoreWeave spot) and batch discounts (Groq, Together AI) can halve your bill if your jobs tolerate interruption or latency, and reserved rates can go further still: Together AI’s reserved cluster H100 is $3.09/hr, below its dedicated rates, and CoreWeave quotes up to 60% off on-demand. Model your actual utilization before signing a committed-spend deal: those discounts only pay off above a usage floor, and below that floor the pure-usage default is cheaper.
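The usage floor can be estimated with a simple break-even calculation. The sketch below assumes a reservation is billed for every hour of the term whether or not the GPU is busy, while on-demand is billed only for hours used; it also assumes a 730-hour month. Function names are hypothetical, and the rates are CoreWeave's per-GPU figure with its maximum quoted 60% discount.

```python
def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of reserved hours you must use for the reservation to win."""
    if on_demand_rate <= 0 or reserved_rate < 0:
        raise ValueError("rates must be positive")
    return reserved_rate / on_demand_rate


def monthly_costs(
    utilization: float, on_demand_rate: float, reserved_rate: float, hours: int = 730
) -> tuple[float, float]:
    """Return (on-demand cost, reserved cost) for one GPU over a month."""
    on_demand = utilization * hours * on_demand_rate  # pay only for hours used
    reserved = hours * reserved_rate                  # pay for every hour
    return on_demand, reserved


on_demand_rate = 6.16                        # USD per GPU-hour
reserved_rate = on_demand_rate * (1 - 0.60)  # best-case 60% discount

threshold = break_even_utilization(on_demand_rate, reserved_rate)
print(f"Reservation pays off above {threshold:.0%} utilization")

for u in (0.25, 0.50):
    od, rs = monthly_costs(u, on_demand_rate, reserved_rate)
    print(f"{u:.0%} utilization: on-demand ${od:,.2f} vs reserved ${rs:,.2f}")
```

A smaller discount raises the threshold: at 30% off, for example, the reservation only wins above 70% utilization.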

Finally, remember the meter changes with the layer. For a vector database like Pinecone your bill is driven by read units, write units, and storage, not GPU-hours — and a $50/month Standard minimum can dominate a small workload. For a gateway like OpenRouter the cost is a percentage on top of pass-through model prices. The guide on usage invoicing and billing cycles is a useful primer on reading these mixed meters, and you can sanity-check competitive token rates against the live pricing calculator.

For vendors

Pure-usage is table stakes for raw compute — buyers expect to pay from the first second with no minimum, so lead with a transparent unit rate and layer commitment on top rather than gating entry behind a subscription. Compete on the dimension you actually win: marketplaces like Vast.ai win on raw price via auctions and spot capacity, hyperscale clouds like CoreWeave win on cluster scale and free egress, and chip clouds like Cerebras and Groq win on speed and price per token, not per hour.

The infrastructure required is non-trivial. Per-second metering, spot reclamation, batch queuing, credit-pool accounting, and multi-unit invoicing all demand precise usage tracking, covered in the guide on tracking and metering usage events. Vendors selling the adjacent layer (vector DBs, gateways, observability) differentiate on features but should still expose a clean, meter-first bill; opaque pricing reads as a red flag next to the radical transparency buyers now expect from the GPU clouds. This commitment-on-top-of-usage shape is documented further in the “infra commitment as standard” trend.

Company	Product	Pricing model	Billing units	Free tier	Verified
Anyscale	Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units)	pure-usage commitment hybrid	gpu-hours cpu-hours credits	Yes	2026-05-29
Arize AI	AI & LLM observability (Arize AX + Phoenix OSS)	freemium hybrid	trace-spans gb-ingested	Yes	2026-06-09
Baseten	ML inference infrastructure — dedicated GPU deployments, Model APIs, and Truss framework	pure-usage hybrid commitment	gpu-hours tokens requests	Yes	2026-05-29
Browserbase	Browser-agent infrastructure: headless browser sessions, web Search/Fetch APIs, agent identity, runtime, and a model gateway behind one API key	freemium hybrid pure-usage	browser-hours api-calls requests	Yes	2026-06-02
Cerebras	Wafer-scale AI inference cloud and WSE hardware systems	pure-usage subscription commitment	tokens api-calls gpu-hours	Yes	2026-05-30
Chroma	Open-source vector database + Chroma Cloud	pure-usage freemium	storage-gb bandwidth-gb api-calls	Yes	2026-06-09
CoreWeave	GPU cloud & AI compute infrastructure	pure-usage commitment	gpu-hours cpu-hours storage-gb	No	2026-06-15
Daily	Real-time voice and video WebRTC APIs (Video SDK + Pipecat Cloud)	pure-usage	media-minutes api-calls	Yes	2026-07-14
DeepInfra	Serverless inference cloud — per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters	pure-usage commitment	tokens gpu-hours requests	No	2026-07-14
E2B	Open-source cloud sandboxes for AI agents — secure, isolated micro-VMs that run LLM-generated code, coding agents, and computer-use workflows	freemium hybrid	cpu-hours gb-hours storage-gb	Yes	2026-06-02
Fal	Generative-media inference platform — serverless per-output model APIs plus dedicated GPU compute	pure-usage	gpu-hours requests media-minutes	No	2026-06-01
Fireworks AI	Generative AI inference platform — serverless per-token, on-demand GPU, fine-tuning, batch API	pure-usage hybrid commitment	tokens gpu-hours requests	Yes	2026-05-30
Gladia	Speech-to-text & audio intelligence API	pure-usage freemium commitment	media-minutes requests	Yes	2026-06-09
Groq	GroqCloud — LPU-based ultra-low-latency inference API for Llama, GPT-OSS, Qwen, Whisper transcription, and Orpheus text-to-speech	pure-usage hybrid commitment	tokens requests api-calls	Yes	2026-07-14
Helicone	Open-source LLM observability & AI gateway	hybrid freemium	requests logs storage-gb	Yes	2026-06-09
Humanloop	LLM evals, prompt management & observability	hybrid freemium	logs datapoints seats	Yes	2026-06-09
Hyperbolic	GPU cloud marketplace & serverless AI inference	pure-usage commitment	gpu-hours tokens images	Yes	2026-06-15
Lambda	GPU cloud & AI compute infrastructure	pure-usage commitment	gpu-hours	No	2026-06-09
LanceDB	AI-native multimodal lakehouse	freemium pure-usage commitment	storage-gb vectors-indexed gpu-hours	Yes	2026-06-09
LangChain	Agent orchestration frameworks + LangSmith platform	hybrid seat-plus-usage freemium	seats traces workflow-executions	Yes	2026-06-10
LangSmith	LLM tracing and evaluation	hybrid seat-plus-usage	seats traces	Yes	2026-06-09
Lightning AI	Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool.	hybrid freemium pure-usage	gpu-hours cpu-hours credits	Yes	2026-06-02
LiveKit	Open-source real-time (WebRTC) communications, LiveKit Cloud & Agents framework	hybrid freemium pure-usage	media-minutes credits bandwidth-gb	Yes	2026-06-30
Milvus	Vector database (OSS) + Zilliz Cloud (managed)	pure-usage freemium commitment	gpu-hours storage-gb vectors-indexed	Yes	2026-06-09
Modal	Serverless compute and GPU platform — per-second billing for Python functions, batch jobs, and model serving	pure-usage freemium subscription	gpu-hours cpu-hours gb-hours	Yes	2026-07-14
Nebius	AI cloud & GPU compute infrastructure	pure-usage commitment	gpu-hours cpu-hours storage-gb	No	2026-06-15
Novita AI	Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API	pure-usage freemium	tokens gpu-hours cpu-hours	Yes	2026-07-06
OpenRouter	Multi-model LLM API routing marketplace	pure-usage freemium	tokens credits requests	Yes	2026-07-14
Pinecone	Managed vector database (serverless)	pure-usage hybrid	requests storage-gb vectors-indexed	Yes	2026-06-09
Portkey	AI gateway & LLMOps governance platform	hybrid freemium	requests logs	Yes	2026-06-10
Qdrant	Open-source vector database + Qdrant Cloud	pure-usage freemium	cpu-hours gb-hours storage-gb	Yes	2026-06-09
Replicate	Cloud platform for running, fine-tuning, and deploying AI models via REST API	pure-usage hybrid commitment	gpu-hours tokens requests	Yes	2026-05-30
RunPod	GPU cloud marketplace — Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage	pure-usage hybrid commitment	gpu-hours storage-gb	No	2026-07-14
Together AI	AI Acceleration Cloud — serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning	pure-usage hybrid commitment	tokens gpu-hours cpu-hours	Yes	2026-07-14
turbopuffer	Serverless vector and full-text search database on object storage	pure-usage commitment	storage-gb vectors-indexed gb-hours	No	2026-07-14
Unstructured	Document ingestion / ETL API	pure-usage freemium	pages-rendered documents	Yes	2026-07-14
Upstash	Upstash (Redis, Vector, QStash, Search, Workflow)	pure-usage freemium hybrid	requests api-calls vectors-indexed	Yes	2026-07-14
Usage AI	Cloud commitment management & savings optimization (AWS / Azure / GCP)	outcome-based pure-usage	outcomes	Yes	2026-06-16
Vast.ai	GPU rental marketplace — on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference	pure-usage commitment	gpu-hours storage-gb bandwidth-gb	No	2026-07-14
Weaviate	AI-native vector database (open-source core + Weaviate Cloud managed serverless, dedicated/Enterprise Cloud, BYOC)	pure-usage hybrid commitment	vectors-indexed tokens api-calls	Yes	2026-07-06
Weights & Biases	MLOps experiment tracking, W&B Weave LLM observability/evals, Models registry, and Serverless Inference	freemium hybrid seat-plus-usage	seats storage-gb traces	Yes	2026-07-14
ZenRows	Universal Scraper API, Scraping Browser, and Residential Proxies	hybrid subscription pure-usage	requests api-calls bandwidth-gb	Yes	2026-06-04


FAQ

What is AI infrastructure and cloud pricing?

It is the pricing for the raw AI compute layer — GPU clouds, serverless inference, vector databases, observability, and gateways that models and agents run on. Vendors charge by the GPU-hour or GPU-second, per token, per output, or per operation, almost always on a pay-as-you-go basis with spot, batch, reserved, and committed-spend discounts on top.

How much does an H100 GPU cost per hour?

On-demand H100 rates in the corpus range from about $2.89 to $6.16 per GPU-hour (RunPod lists H100 PCIe from $2.89/hr; CoreWeave's HGX H100 node works out to about $6.16/GPU/hr). Per-second clouds like Modal ($0.001097/sec ≈ $3.95/hr) and Replicate ($0.001525/sec ≈ $5.49/hr) bill the same hardware in finer increments, and spot or interruptible capacity (Vast.ai from $0.194/hr for smaller cards) goes far lower.

What is the difference between spot, on-demand, and reserved GPU pricing?

On-demand GPUs run at a fixed hourly rate with guaranteed availability. Spot or interruptible GPUs (Vast.ai auction bids, RunPod Community Cloud, CoreWeave spot) are cheaper but can be reclaimed mid-job: CoreWeave's 8-GPU HGX H100 node is $19.71/hr on spot vs $49.24/hr on-demand. Reserved or committed capacity locks in a discounted rate in exchange for a usage commitment, worth up to 60% off at CoreWeave and offered by RunPod, Together AI, Lambda, and Nebius.

Do AI infra clouds bill per hour, per second, or per token?

All of them coexist. GPU rentals are billed per hour (Vast.ai, RunPod, Lambda) or per second (Modal, Replicate, Lightning AI). Serverless inference is billed per million tokens (Groq, Together AI, DeepInfra) or per output such as per image (fal, Replicate). Vector databases meter read/write units and storage (Pinecone), and gateways like OpenRouter take a percentage fee. Most clouds run several of these meters in parallel under one account.

Which AI infrastructure providers offer a free tier?

Many do, usually as starter credits or capped free plans rather than perpetual free compute: Modal gives $30/month in credits, Anyscale $100 in ACU credits, Cerebras a rate-capped free tier plus a $10 Developer tier, and Pinecone and Qdrant offer free starter clusters. Pure-usage clouds like fal, Vast.ai, Lambda, and DeepInfra skip the free tier and bill from the first dollar.

Why is AI infrastructure pricing so competitive?

Much of the product is fungible — an H100 is an H100 regardless of who racks it — so GPU vendors compete almost entirely on price, capacity, and distribution. That drives the unit proliferation and aggressive spot, batch, and reserved discounts seen across RunPod, Vast.ai, CoreWeave, Groq, and Together AI. Adjacent infra like vector DBs and observability differentiate more on features but still trend toward transparent usage rates.
