OctoAI pricing

Quick summary
  • Product: Generative AI inference platform (acquired by NVIDIA, sunset Oct 2024)
  • Industry: technology
  • Commits: None
AI Summary
  • OctoAI (formerly OctoML) was a generative-AI inference platform that NVIDIA acquired on Sept 25, 2024; the standalone commercial product was wound down effective Oct 31, 2024 — this is a post-mortem, not a live rate card.
  • Text generation was billed per 1M tokens: Llama 3 8B at $0.15, Llama 3 70B around $0.90, Mixtral 8x7B at $0.30 input / $0.50 output, Mistral 7B at $0.10 input / $0.25 output, embeddings at $0.05.
  • Media generation (SDXL, SD 1.5, SVD) was usage-metered per image / per second of GPU compute; OctoAI marketed the 'fastest SDXL endpoint' at roughly 3.1s latency.
  • Self-serve accounts got $10 in free credit; OctoStack (private/self-hosted deployment, launched April 2024) and dedicated compute were sales-quoted enterprise contracts.
  • NVIDIA reportedly paid about $165M (up to ~$250M with retention) — roughly 18 cents on the dollar versus OctoAI's ~$900M Series C peak — and shut the public API in about five weeks with no successor product.
Pricing summary
OctoAI — discontinued (acquired by NVIDIA, Oct 2024)
Historical inference tiers. The standalone OctoAI platform was sunset on Oct 31, 2024; these are reconstructed for reference, not current rates.
  • Text Gen (per-token): discontinued, formerly billed per 1M tokens. Historical: LLM API by the token.
  • OctoStack / Enterprise: now part of NVIDIA. Private/self-hosted inference.
Post-mortem reference only — OctoAI is no longer sold standalone. Historical rates reconstructed from contemporaneous reporting; contact NVIDIA for current offerings.

About

OctoAI, originally OctoML, was a generative-AI inference platform founded in 2019 as a University of Washington spinout commercializing the Apache TVM machine-learning compiler. Led by CEO Luis Ceze, the company first sold model-optimization and deployment tooling, then in late 2023 rebranded to OctoAI and pivoted into a hosted inference cloud: developers could call open models (Llama, Mixtral, Mistral, Stable Diffusion) through a usage-metered API at prices and speeds the hyperscalers were not yet matching. The pitch was “run any model, any hardware, fastest and cheapest,” and the company raised roughly $132M in total funding, reaching a reported ~$900M valuation at its 2021 Series C.

The story ends as a post-mortem. On September 25, 2024, NVIDIA acquired OctoAI — reportedly for about $165M (up to ~$250M with retention incentives), roughly 18 cents on the dollar against that ~$900M peak. Within weeks, OctoAI emailed customers that its commercial services would wind down effective October 31, 2024, giving developers about five weeks to migrate off the public text-gen and media-gen APIs. There was no successor “powered by OctoAI” product; CEO Luis Ceze moved to NVIDIA as VP of AI Systems Software, and the asset NVIDIA appeared to value most was OctoStack, OctoAI’s hardware-agnostic private-deployment layer. Today octo.ai redirects to NVIDIA.
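The two headline figures above ("18 cents on the dollar" and "about five weeks") can be checked with simple arithmetic. The sketch below uses only the reported, approximate numbers from this section; the variable names are illustrative, and the notice window is measured from the acquisition date to the cutoff, which is an assumption about how the five weeks were counted.

```python
from datetime import date

# Reported, approximate figures from the text above (in $ millions).
base_price_musd = 165       # reported base acquisition price
max_price_musd = 250        # reported ceiling including retention incentives
peak_valuation_musd = 900   # reported Series C peak valuation

base_ratio = base_price_musd / peak_valuation_musd
max_ratio = max_price_musd / peak_valuation_musd

# Assumption: the window runs from the acquisition date to the API cutoff.
acquired = date(2024, 9, 25)
sunset = date(2024, 10, 31)
notice_days = (sunset - acquired).days

print(f"Base price vs peak valuation: {base_ratio:.1%}")
print(f"With retention incentives:    {max_ratio:.1%}")
print(f"Acquisition to sunset: {notice_days} days (~{notice_days / 7:.1f} weeks)")
```

On the reported numbers, the base price works out to a little over 18% of the peak valuation, and the gap between the two dates is 36 days, consistent with the "about five weeks" framing.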

Because the standalone platform no longer exists, everything below is historical — reconstructed from contemporaneous reporting and third-party pricing comparisons. None of it is purchasable today; for current options you would contact NVIDIA.


Pricing summary : How OctoAI’s pricing model worked

OctoAI was, while it operated, a pure usage-based inference platform — you paid for what you generated, not a per-seat subscription. There were three monetization surfaces:

  1. Text Gen (per token): open LLM endpoints (Llama, Mixtral, Mistral, Code Llama) billed per 1M tokens, with a flat input/output rate on the Llama 3 models and separate input and output rates on Mixtral 8x7B and Mistral 7B. Self-serve, with new accounts getting free signup credit.
  2. Media Gen (per image / per compute-second): Stable Diffusion XL, SD 1.5, and Stable Video Diffusion endpoints, usage-metered by image generated and underlying GPU compute rather than tokens. Unlike text-gen, it supported customer fine-tunes.
  3. OctoStack / dedicated compute: a private, self-hosted inference stack (and custom dedicated capacity) sold as sales-quoted enterprise contracts, with no public rate card.

What makes this different: the afterlife is the story. OctoAI is now a sales-only, NVIDIA-internal asset — the public self-serve rate card was retired in October 2024, and the platform was acquired and sunset in roughly five weeks. We classify it sales-only and treat all dollar figures as historical, because there is no live price to quote and presenting old rates as current would be misleading.


Pricing by product

These are historical (2023-2024) list rates, reconstructed from third-party reporting — not current prices. Text generation was billed per 1M tokens:

| Model (text-gen) | Input / 1M tokens | Output / 1M tokens | Notes |
| --- | --- | --- | --- |
| Llama 3 8B Instruct | $0.15 | $0.15 | Flat input/output rate |
| Llama 3 70B Instruct | ~$0.90 | ~$0.90 | Also reported at $0.765 each |
| Mixtral 8x7B Instruct | $0.30 | $0.50 | Split input/output |
| Mistral 7B Instruct | $0.10 | $0.25 | Split input/output |
| Text embeddings (GTE-Large) | $0.05 | | Per 1M tokens |

Media generation (SDXL, SD 1.5, SVD) was usage-metered per image and/or per second of GPU compute rather than per token; OctoStack and dedicated compute were sales-quoted. New self-serve accounts received $10 in free credit.
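To make the per-1M-token billing concrete, here is a minimal sketch of how a text-gen bill could be estimated from the historical rates above. The model keys, function names, and the example workload are hypothetical, and the way the $10 credit is applied (deducted once from usage) is an assumption, since the source does not describe the credit mechanics.

```python
# Historical per-1M-token rates from the table above: (input, output) in USD.
# The dictionary keys are illustrative labels, not real API model identifiers.
RATES_PER_1M = {
    "llama-3-8b-instruct": (0.15, 0.15),
    "llama-3-70b-instruct": (0.90, 0.90),   # approximate; also reported at 0.765
    "mixtral-8x7b-instruct": (0.30, 0.50),
    "mistral-7b-instruct": (0.10, 0.25),
}
FREE_CREDIT_USD = 10.00  # signup credit reported in the text


def text_gen_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Metered cost in USD for a given token volume."""
    in_rate, out_rate = RATES_PER_1M[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def amount_due(usage_cost: float, credit: float = FREE_CREDIT_USD) -> float:
    """Assumption: the free credit is simply subtracted from metered usage."""
    return max(0.0, usage_cost - credit)


# Hypothetical workload: 40M input tokens and 10M output tokens on Mixtral 8x7B.
cost = text_gen_cost("mixtral-8x7b-instruct", 40_000_000, 10_000_000)
print(f"Metered usage: ${cost:.2f}")
print(f"After credit:  ${amount_due(cost):.2f}")
```

For this workload the metered usage is $17.00 (40 x $0.30 plus 10 x $0.50), leaving $7.00 after the credit under the stated assumption.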

Sales motions across products: historically self-serve/PLG for the token and image APIs (free credit, no sales call) with sales-led OctoStack and dedicated-compute contracts on top. Post-acquisition the entire platform is sales-only and folded into NVIDIA — there is no longer a self-serve motion to buy OctoAI standalone.


Hidden costs : What OctoAI users actually paid (and the real cost of the shutdown)

For a discontinued platform, the largest “hidden cost” is not a line item — it is migration risk. When OctoAI gave customers roughly five weeks to move off the API before the Oct 31, 2024 cutoff, teams that had hardcoded OctoAI endpoints and pricing into production bore the full re-platforming cost: re-pointing to a new provider, re-validating outputs, and absorbing whatever rate delta the replacement charged.
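One way to limit this kind of migration risk is to keep the endpoint and rates in a single configuration object instead of hardcoding them at call sites. The sketch below is illustrative only: the class, the `example` URLs, and the replacement provider's rates are all hypothetical placeholders, not real endpoints or quoted prices. Only the Mixtral 8x7B rates come from the historical rate card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceProvider:
    """One provider's endpoint and per-1M-token rates, kept out of call sites."""
    name: str
    base_url: str               # hypothetical URL, not a real endpoint
    model: str
    input_rate_per_1m: float    # USD per 1M input tokens
    output_rate_per_1m: float   # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_rate_per_1m
                + output_tokens * self.output_rate_per_1m) / 1_000_000


def rate_delta(old: InferenceProvider, new: InferenceProvider,
               input_tokens: int, output_tokens: int) -> float:
    """Extra spend (USD) for the same workload after switching providers."""
    return new.cost(input_tokens, output_tokens) - old.cost(input_tokens, output_tokens)


# Historical Mixtral 8x7B rates from the rate card; everything else is made up.
incumbent = InferenceProvider(
    name="incumbent", base_url="https://incumbent.example/v1",
    model="mixtral-8x7b-instruct",
    input_rate_per_1m=0.30, output_rate_per_1m=0.50)
replacement = InferenceProvider(
    name="replacement", base_url="https://replacement.example/v1",
    model="mixtral-8x7b-instruct",
    input_rate_per_1m=0.60, output_rate_per_1m=0.60)  # placeholder rates

monthly_in, monthly_out = 40_000_000, 10_000_000  # hypothetical workload
delta = rate_delta(incumbent, replacement, monthly_in, monthly_out)
print(f"Monthly rate delta after migration: ${delta:.2f}")
```

With this layout, re-pointing to a new provider is a configuration change, and the rate delta can be estimated before the cutover. Re-validating outputs still has to happen separately.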

The historical metered costs that drove real bills were:

| Line item (historical) | How it was billed |
| --- | --- |
| Text-gen tokens | Per 1M tokens (e.g. Mixtral 8x7B at $0.30 input / $0.50 output) |
| Media-gen images | Per image / per second of GPU compute (SDXL, SD1.5, SVD) |
| Embeddings | $0.05 per 1M tokens |
| Free credit offset | $10 signup credit, then pay-as-you-go |
| OctoStack / dedicated | Sales-quoted contract (no public rate) |

Output tokens cost more than input on the split-rate models (Mixtral 8x7B and Mistral 7B), so chat workloads with long generations skewed toward the output rate. This is the usual asymmetry that surprises teams modeling only the headline input price.
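The input/output asymmetry is easiest to see as a blended rate. The sketch below is a simple weighted average under the assumption that `output_share` (a name introduced here for illustration) is the fraction of all billed tokens that are output tokens. The rates are the historical split rates quoted above.

```python
def blended_rate(input_rate: float, output_rate: float, output_share: float) -> float:
    """Effective USD per 1M tokens when `output_share` of tokens are output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_rate * (1.0 - output_share) + output_rate * output_share


# Historical split rates (input, output) per 1M tokens.
SPLIT_RATES = {
    "Mixtral 8x7B": (0.30, 0.50),
    "Mistral 7B": (0.10, 0.25),
}

for model, (in_rate, out_rate) in SPLIT_RATES.items():
    for share in (0.1, 0.5, 0.9):
        rate = blended_rate(in_rate, out_rate, share)
        print(f"{model}: {share:.0%} output tokens -> ${rate:.3f} per 1M")
```

For Mixtral 8x7B, a workload with 10% output tokens blends to about $0.32 per 1M tokens, while one with 90% output tokens blends to about $0.48, which is 60% above the headline $0.30 input price.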

Want to estimate inference costs the way OctoAI customers had to? Use the OctoAI pricing calculator to model token and image spend, then compare against a live provider before you commit.


Pricing evolution : OctoAI pricing history and changes

Cadence

| Period | Price changes | Product / SKU additions | Notes |
| --- | --- | --- | --- |
| 2023 H2 | Per-token rate card published | Text Gen Solution; rebrand OctoML to OctoAI | $0.15 (Llama 3 8B) to ~$0.90 (70B); $10 free credit |
| 2024 H1 | | OctoStack private deployment | Enterprise sales-quoted; hardware-agnostic |
| 2024 H2 | Rate card retired entirely | Platform sunset | NVIDIA acquired (Sep 25); APIs off Oct 31 |

Tracked range: 2023-2024 (the platform’s full commercial life). All prices historical, reconstructed from contemporaneous reporting.

Notable changes

  • November 2023: OctoML rebrands to OctoAI and launches the per-token Text Gen Solution alongside its existing Media Gen (SDXL/SD1.5/SVD) endpoints. Self-serve, usage-based, $10 free credit. On the later 2024 rate card, headline text rates ran from $0.15 per 1M tokens (Llama 3 8B, a model released in April 2024) up to about $0.90 (70B-class), with Mixtral 8x7B at $0.30 input / $0.50 output and Mistral 7B at $0.10 / $0.25.
  • April 2024: OctoStack launches, a self-hosted/private inference stack across NVIDIA, AMD, and AWS Inferentia hardware, claiming roughly 4x better GPU utilization. This shifted OctoAI’s enterprise story from “call our API” to “run our stack in your environment.”
  • September-October 2024: NVIDIA acquires OctoAI (~$165M-$250M reported) and winds the commercial platform down by Oct 31, 2024. The public rate card disappears; pricing becomes irrelevant because the product is no longer sold.

The trajectory is the lesson: a transparent, aggressively cheap usage-based rate card was not enough to sustain an independent inference cloud once frontier-model economics and hyperscaler/NVIDIA gravity set in. The company was absorbed and its self-serve pricing erased within weeks.


What’s unique : OctoAI’s distinctive pricing mechanics

1. Two metering models under one platform. OctoAI ran both a per-token text-gen meter and a per-image / per-compute-second media-gen meter — pricing each modality on the unit that actually mapped to its cost, rather than forcing images into a token abstraction.

2. Hardware-agnostic enterprise pricing via OctoStack. Instead of only renting its own cloud by the unit, OctoAI sold a private deployment layer that ran across NVIDIA, AMD, and Inferentia — a sales-quoted contract whose value was utilization (the ~4x claim), not a published rate.

3. A rate card with a hard expiry. The most distinctive “mechanic” in hindsight is that the entire pricing surface was switched off on a fixed date after acquisition — a reminder that with a venture-backed inference startup, the rate card is only as durable as the company’s independence.
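The two-meter idea in point 1 can be sketched as two small billing types, one per modality. This is an illustrative model, not OctoAI's billing code: the class names are hypothetical, and the media rates are placeholder values chosen only to make the arithmetic visible, because the source does not give historical per-image or per-second prices. The text rates are the historical Llama 3 8B figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenMeter:
    """Text generation: billed per 1M input and output tokens."""
    input_rate_per_1m: float
    output_rate_per_1m: float

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_rate_per_1m
                + output_tokens * self.output_rate_per_1m) / 1_000_000


@dataclass(frozen=True)
class MediaMeter:
    """Media generation: billed per image and/or per GPU compute-second."""
    rate_per_image: float
    rate_per_compute_second: float

    def charge(self, images: int, compute_seconds: float) -> float:
        return images * self.rate_per_image + compute_seconds * self.rate_per_compute_second


# Historical Llama 3 8B flat rate from the rate card.
llama_3_8b = TokenMeter(input_rate_per_1m=0.15, output_rate_per_1m=0.15)

# PLACEHOLDER rates: these are NOT historical OctoAI prices.
sdxl = MediaMeter(rate_per_image=0.004, rate_per_compute_second=0.001)

text_bill = llama_3_8b.charge(input_tokens=2_000_000, output_tokens=1_000_000)
media_bill = sdxl.charge(images=1_000, compute_seconds=3_100.0)
print(f"Text-gen bill:  ${text_bill:.2f}")
print(f"Media-gen bill: ${media_bill:.2f}")
```

Keeping the meters separate means each modality is charged on its own unit, and either media rate can be set to zero if only one of the two units applies.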


Strengths & weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Transparent per-1M-token rates undercutting hyperscalers | Platform no longer exists: acquired and sunset |
| Free $10 credit lowered self-serve onboarding friction | Only ~5 weeks’ notice before the API went dark |
| Per-modality metering (tokens vs images) | Heavy migration cost dumped on production users |
| OctoStack: hardware-agnostic private deployment | Exited at ~18 cents on the dollar vs peak valuation |
| Fast SDXL endpoint with fine-tune support | No durable, independent pricing to rely on |

Billing UX : OctoAI billing controls and transparency

  • Billing controls — Historically pay-as-you-go on metered usage (tokens / images), with a $10 free credit to start; OctoStack and dedicated compute were invoiced enterprise contracts. Today there are no self-serve billing controls because the standalone product is discontinued.
  • Usage visibility — While live, the OctoAI console exposed per-model token and image usage; that dashboard is gone post-sunset.
  • Payment options — Self-serve card billing for the metered APIs and sales-led invoicing for OctoStack/enterprise — now superseded by NVIDIA’s enterprise procurement, since OctoAI is sales-only and internal to NVIDIA.

Strategic wins : Why OctoAI’s pricing decisions worked (while they lasted)

1. Transparent, cheap per-token pricing as a wedge

By publishing flat per-1M-token rates (Llama 3 8B at $0.15) and handing out free credit, OctoAI made it trivial for developers to try open models without a sales call — the classic usage-based onboarding wedge. See how AI companies structure pricing.

2. Metering each modality on its real cost driver

Pricing text by the token and media by the image / compute-second meant customers paid on the unit that tracked OctoAI’s own GPU cost — a cleaner alignment than forcing everything into one abstraction. Related: outcome-based pricing trends.

3. Moving enterprise value to utilization, not list price

OctoStack repriced the enterprise conversation around GPU utilization (the ~4x claim) rather than a per-unit list rate — exactly the asset that made OctoAI attractive to NVIDIA. See choosing the right usage metric.


Areas to improve : Gaps in OctoAI’s pricing approach

1. Cheap usage rates could not fund an independent inference cloud

OctoAI priced to win developers, but per-token margins in a commoditizing inference market were thin against hyperscaler and NVIDIA scale — cheap rates were a great wedge and a poor moat. See bill shock and cost unpredictability.

2. No durability guarantee for customers’ pricing

A roughly five-week shutdown window left production users to absorb migration cost. Inference vendors that want trust need clearer continuity commitments around their rate card and endpoints.

3. Self-serve transparency, then a sudden sales-only cliff

OctoAI went from open, self-serve pricing to no pricing at all almost overnight after acquisition — a discontinuity that turned its earlier transparency into a liability for anyone who had standardized on it.


Key takeaways

  1. OctoAI was pure usage-based inference, now discontinued — per-1M-token text-gen, per-image media-gen, sales-quoted OctoStack — acquired by NVIDIA in Sept 2024 and sunset Oct 31, 2024. For the underlying model, see the introduction to usage-based pricing.
  2. The historical rate card was aggressively cheap — Llama 3 8B at $0.15, Mistral 7B at $0.10 / $0.25, Mixtral 8x7B at $0.30 / $0.50 per 1M tokens — built to win self-serve developers.
  3. Two meters, one platform — tokens for text, images/compute-seconds for media — each priced on its real cost driver.
  4. The biggest cost ended up being the shutdown — a roughly five-week migration window after the rate card was switched off entirely.
  5. Cheap usage pricing is a wedge, not a moat in commoditizing inference; the broader lesson for the category is that pricing transparency does not by itself sustain an independent vendor.

UBP implications

  1. Match the meter to the modality. OctoAI’s split of per-token text-gen and per-image media-gen is a reusable pattern: bill each product on the unit that maps to its underlying cost rather than forcing one abstraction across everything.
  2. A usage rate card is only as durable as the vendor. Buyers standardizing on a metered API should weigh continuity risk, because a startup’s published prices can vanish on an acquisition timeline measured in weeks.
  3. Transparent low rates win adoption but rarely fund independence. In a commoditizing inference market, cheap per-unit pricing is an excellent onboarding wedge and a weak long-term defense — a caution for any UBP business pricing below its scaled competitors.


Bottom line

OctoAI (formerly OctoML) is a post-mortem, not a live pricing profile: a pure usage-based inference platform — cheap per-1M-token text generation, per-image media generation, and a sales-quoted OctoStack private-deployment layer — that NVIDIA acquired in September 2024 and shut down within weeks, retiring its rate card entirely by October 31, 2024. Its arc is the lesson the category keeps relearning: transparent, aggressively low usage pricing is a superb developer wedge but a poor moat, and a metered rate card is only as durable as the company behind it. Browse the pricing blueprint for more fully-researched company profiles, or compare OctoAI against other AI inference and infrastructure companies.


Pricing timeline : Major events

Each milestone below corresponds to a public pricing change, product launch, or material adjustment, listed newest first.

Standalone commercial platform sunset (Oct 31, 2024)

OctoAI wound down its commercial services effective Oct 31, 2024, giving developers about five weeks to migrate off the public text-gen and media-gen APIs. No successor product; the published rate card was retired entirely.

Acquired by NVIDIA for ~$165M-$250M (Sept 25, 2024)

NVIDIA acquired OctoAI for a reported ~$165M base (up to ~$250M with retention), roughly 18 cents on the dollar versus the ~$900M Series C peak. CEO Luis Ceze joined NVIDIA as VP of AI Systems Software.

OctoStack private-deployment tier added (April 2024)

OctoAI launched OctoStack — a self-hosted/private inference stack running in the customer's own environment across NVIDIA, AMD, and AWS Inferentia, claiming ~4x GPU utilization. Sales-quoted enterprise contract layered on top of the self-serve token/image APIs.

OctoML rebrands to OctoAI; launches per-token Text Gen (November 2023)

OctoAI launched its Text Gen Solution (Llama 2 Chat, Code Llama, Mistral) billed per 1M tokens, alongside an existing Media Gen Solution (SDXL/SD1.5/SVD) billed per image / per second of compute. Self-serve with $10 free credit; OctoML rebranded to OctoAI.

Trivia
  • OctoAI began life as OctoML, a 2019 University of Washington spinout commercializing the Apache TVM compiler project before pivoting into a hosted generative-AI inference platform.
  • NVIDIA reportedly paid about $165M (up to ~$250M with retention), roughly 18 cents on the dollar versus the ~$900M valuation OctoAI raised at in its 2021 Series C.
  • After the acquisition, developers had only about five weeks to migrate before the public API went dark on Oct 31, 2024, and there was no successor 'powered by OctoAI' product.

Questions & answers

Can I still buy OctoAI today?
No. NVIDIA acquired OctoAI (formerly OctoML) on September 25, 2024, and OctoAI wound down its commercial services effective October 31, 2024 — about a five-week migration window. The public text-generation and media-generation APIs were discontinued, octo.ai now redirects to NVIDIA, and there is no successor 'powered by OctoAI' product. Any pricing you find online for OctoAI is historical.
How did OctoAI price text generation?
OctoAI charged per 1M tokens, with a flat input/output rate on the Llama 3 models and split rates on Mixtral and Mistral. Historically (2023-2024): Llama 3 8B at $0.15, Llama 3 70B around $0.90, Mixtral 8x7B at $0.30 input / $0.50 output, Mistral 7B at $0.10 input / $0.25 output, and text embeddings at $0.05 per 1M tokens. New accounts received $10 in free credit.
How did OctoAI price image and media generation?
Media Gen (Stable Diffusion XL, SD 1.5, and Stable Video Diffusion) was usage-metered — billed per image generated and/or per second of GPU compute rather than per token. OctoAI marketed the 'fastest SDXL endpoint' at roughly 3.1 seconds average latency, and unlike its text-gen product it supported customer fine-tunes.
What was OctoStack and how was it priced?
OctoStack, launched in April 2024, was OctoAI's private/self-hosted inference stack for enterprises — it ran in the customer's own environment across NVIDIA, AMD, and AWS Inferentia hardware and claimed about 4x better GPU utilization. It was sold as a sales-quoted enterprise contract with no public rate card, and it was the asset NVIDIA was most interested in.