Baseten Explained: What an AI Inference Platform Does |...

You trained a model, or you grabbed an open-weights one off Hugging Face, and now you need it to answer real requests at 2 a.m. without you babysitting a GPU box. That's the gap nobody warns you about. The model works fine in a notebook. Getting it to serve traffic, autoscale when a customer's launch goes sideways, and not bill you $40 an hour while it sits idle overnight is a completely different problem.

This is the slice of the stack Baseten sells. It's an AI inference and model-deployment platform, founded in 2019 in San Francisco, that runs open-source, custom, and fine-tuned models on GPUs for you. I want to walk through what it actually does, because "inference platform" is one of those phrases that means everything and nothing until you've tried to deploy something yourself.

And the reason it's worth a closer look right now: in June 2026 the company raised a $1.5 billion Series F. That's a lot of money pointed at a problem most people don't think about until they hit it.

The actual problem it solves

Self-hosting a model means you own the whole chain. You pick a GPU, you write a server, you containerize it, you wire up autoscaling, you handle the cold-start when a new instance spins up, and you eat the cost of keeping hardware warm even when no one's calling. Each of those steps has its own way to ruin your week.

Baseten's pitch is that you hand it a packaged model and it runs the chain. Deployments autoscale, and they scale to zero, which means when traffic stops, the GPUs spin down and you're not paying for idle compute. Billing on the dedicated side is per-minute GPU time with no idle charge. There's also a separate Model APIs product billed per-token, which I'll get to.

That scale-to-zero detail matters more than it sounds. A lot of small operators run a model that gets used in bursts. A few hundred requests during business hours, nothing overnight. If you're paying for a reserved GPU around the clock, the math gets ugly fast. Per-minute billing with a true zero floor is the difference between "this is a side project I can afford" and "this is a bill I have to explain."

Truss: the part that's actually open source

The piece you'll touch first is Truss, Baseten's open-source model-packaging framework. The idea is you describe your model and its environment once, and that package becomes deployable. Truss supports the serving stacks people actually use: vLLM, SGLang, TensorRT-LLM, plus the framework-level stuff like transformers, diffusers, PyTorch, and TensorFlow.

That spread matters because it covers both ends. If you're serving a big language model and you want vLLM or TensorRT-LLM for throughput, Truss handles it. If you're running a diffusion model for images or some custom PyTorch thing you built, same packaging path.

A Truss project is basically a directory with your model code and a config that declares the Python deps, the system packages, and the hardware you want. Roughly:

# config.yaml
model_name: my-llm
resources:
  accelerator: H100
  use_gpu: true
python_version: py311
requirements:
  - vllm

# package and push it to Baseten
truss push

Because Truss is open and on GitHub, you can read exactly how the packaging works before you commit to anything. That's a real point in its favor. The packaging format isn't a black box you only see from inside a billing dashboard.

Chains, for when one model isn't the whole job

Real applications are rarely one model call. You've got a retrieval step, then a transcription model, then an LLM, then maybe a re-ranker. Stringing those together so each piece scales independently is its own headache, because the transcription step and the LLM step have wildly different hardware appetites.

Baseten Chains is an SDK for building and deploying compound AI workflows, built on top of Truss. The point is that each stage in the chain can be its own deployment with its own hardware and its own autoscaling, instead of cramming everything into one oversized container that's bottlenecked by its hungriest component. If your transcription step needs a beefy GPU but your orchestration logic is just CPU glue, you don't have to pay GPU rates for the glue.

Model APIs: skip the deployment entirely

There's a faster on-ramp for common open models. Baseten's Model APIs give you OpenAI-compatible endpoints for models like DeepSeek, Qwen, and GLM. You don't package anything. You point your existing OpenAI client at their endpoint, swap the base URL and key, and you're calling DeepSeek instead of GPT.

This part is billed per-token, which is the model you already know from every hosted LLM API. So there are really two billing worlds here, and it's worth keeping them straight:

Product	What you deploy	Billing
Dedicated deployments	Your own packaged model via Truss	Per-minute GPU, no idle charge, scales to zero
Model APIs	Nothing, you call a hosted endpoint	Per-token

If you just want DeepSeek behind an OpenAI-shaped API and you don't care where it runs, Model APIs is the short path. If you've got a fine-tuned or custom model that has to be yours, the dedicated deployment path is the one.

The money, stated carefully

Here's where I have to be precise, because the numbers got reported loosely in a few places.

In January 2026, specifically January 23, Baseten announced a $300 million Series E at a $5 billion valuation. The round was led by IVP and CapitalG. NVIDIA participated and put in $150 million of that $300 million total, so half the round came from NVIDIA, but NVIDIA wasn't the lead. People sometimes flatten that into "NVIDIA led it," which isn't right.

Then on June 22, 2026, the company announced a $1.5 billion Series F. That round was led by Altimeter, Conviction, and Spark Capital, and co-led by Sands Capital and Wellington Management. The valuation is the part to be careful about: it's reported as "up to $13 billion," structured across two tranches at $13 billion and $11 billion. So if you see a flat "$13 billion valuation," that's the top of the range, not a single clean number.

On the business side, the Series F materials cite roughly 20x year-over-year revenue growth and more than 1 billion inference calls a day, running across 87 clusters on 18 clouds. Named customers in that release include Abridge, Clay, Cursor, Lovable, Mercor, and OpenEvidence. Those are AI products people actually use, which is the more useful signal than any valuation. A 20x revenue jump is the kind of number that explains why investors wrote a $1.5 billion check.

So is it for you?

If you're a solo builder or a small shop and you mostly want to call an open model through a familiar API, the per-token Model APIs are probably where you start. No packaging, no GPU math, just an endpoint.

If you've got a custom or fine-tuned model that has to run as your own deployment, the Truss-plus-dedicated-GPU path is the real product, and the scale-to-zero, per-minute billing is the thing that makes it affordable for bursty traffic. The trade-off is that you're now depending on a vendor for the part of the stack you could, in theory, run yourself on a raw cloud GPU. That's a genuine decision, not a no-brainer. You're paying for someone else to own the autoscaling and the cold-starts so you don't have to. Whether that's worth it depends entirely on how much you value not being paged at 2 a.m.

This is the kind of building block I keep coming back to in my book The $97 Launch: you can stand up a real product on free and pay-as-you-go pieces, and only start paying real money once you have real users. The per-token Model APIs fit that shape, since nothing bills until something actually gets called.

Fact-check notes and sources

Founded 2019 in San Francisco; co-founders Tuhin Srivastava (CEO), Amir Haghighat (CTO), Phil Howes, and Pankaj Gupta. Source: Baseten About Us.
Truss is the open-source packaging framework and supports vLLM, SGLang, TensorRT-LLM, transformers, diffusers, PyTorch, and TensorFlow. Source: Truss README on GitHub.
Chains is an SDK for compound AI workflows built on top of Truss. Source: Baseten blog: Chains for production compound AI systems.
Series E: $300M at a $5B valuation, announced January 23, 2026, led by IVP and CapitalG, with NVIDIA contributing $150M of the $300M (a participant, not the lead). Source: Businesswire: Baseten Raises $300M at a $5B Valuation.
Series F: $1.5B announced June 22, 2026, led by Altimeter, Conviction, and Spark Capital, co-led by Sands Capital and Wellington Management; valuation "up to $13B" across two tranches at $13B and $11B. Source: Businesswire: Baseten Raises $1.5 Billion.
~20x year-over-year revenue growth; 1B+ inference calls per day across 87 clusters on 18 clouds; named customers Abridge, Clay, Cursor, Lovable, Mercor, OpenEvidence. Source: Yahoo Finance: Baseten Raises $1.5 Billion.
Note on the config snippet above: it's an illustrative Truss config to show the shape of a packaging file, not a verbatim quote from Baseten's docs.

This post is informational, not investment or vendor-purchasing advice. Baseten and the companies named are referenced as nominative fair use. Funding figures and pricing were accurate as of June 2026; verify current terms with the vendor. No affiliation is implied.

Baseten, Explained: What a $13B Inference Platform Actually Does