Local AI or Cloud APIs for a Small Business? The Honest Cost Math (Part 1 of 5)

This is the first part of a five-part series on the practical AI and web stack for a small or medium business. Each part takes one decision, shows what it looks like with real examples, and is honest about when the move makes sense and when it does not. We start with the question I get asked most: should a small business run AI on its own hardware, or just pay a cloud API per use?

The instinct to "own it" is strong, especially after reading about $700 used graphics cards that run capable models at home. For a hobbyist that is a great project. For a business deciding where the money goes, the math usually points the other way, and it is worth seeing the actual numbers before you spend anything.

What the cloud actually costs at small-business volume

Hosted AI is priced per million tokens, and a token is roughly three quarters of a word. The cheap tiers in mid-2026 are genuinely cheap. Google's Gemini 3 Flash is about $0.50 per million input tokens and $3.00 per million output. OpenAI's GPT-5.4-nano is about $0.20 in and $1.25 out. On the open-model hosts, Groq serves Llama 3.1 8B at roughly $0.05 in and $0.08 out. Anthropic's Claude Haiku 4.5 is $1 in and $5 out, and Claude Sonnet 4.6 is $3 in and $15 out when you want noticeably better writing.

Put a real workload through that. Say a business generates 1,000 drafts a month, each one a customer reply or a product description, with about 600 tokens of context going in and 350 tokens coming out. That is 0.6 million input tokens and 0.35 million output tokens a month, total.

  • On GPT-5.4-nano: about 56 cents a month.
  • On Gemini 3 Flash: about $1.35 a month.
  • On Groq's Llama 3.1 8B: about 6 cents a month.
  • On Claude Haiku, for higher quality: about $2.35 a month. On Claude Sonnet, around $7 a month.
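The arithmetic behind those bullets can be reproduced with a few lines of Python. This is a minimal sketch: the prices and the 1,000-draft workload are the ones quoted above, while the function and variable names are illustrative only.

```python
# Prices in USD per million tokens (input, output), as quoted in this post.
PRICES = {
    "GPT-5.4-nano": (0.20, 1.25),
    "Gemini 3 Flash": (0.50, 3.00),
    "Groq Llama 3.1 8B": (0.05, 0.08),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def monthly_cost(drafts, tokens_in, tokens_out, price_in, price_out):
    """Return the monthly bill in USD for a workload and a price pair."""
    millions_in = drafts * tokens_in / 1_000_000
    millions_out = drafts * tokens_out / 1_000_000
    return millions_in * price_in + millions_out * price_out


# The workload from the post: 1,000 drafts, 600 tokens in, 350 tokens out.
for model, (p_in, p_out) in PRICES.items():
    cost = monthly_cost(1_000, 600, 350, p_in, p_out)
    print(f"{model}: ${cost:.2f} per month")
```

Swapping in your own draft count and token sizes is the quickest way to see whether your workload still lands in the few-dollars range.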

A single-location restaurant rewriting its menu descriptions and answering reservation emails, a storage facility drafting overdue-payment notices, a realtor turning listing notes into polished copy: all of them land in the few-dollars-a-month range. The bill is a rounding error. There is no infrastructure to babysit, no card to fail, no model to update.

What the hardware actually costs

Now the other side. The consensus value pick for running models locally is a used RTX 3090 with 24GB of memory, which runs about $700 to $900 and draws up to 350 watts under load. It is a genuinely good piece of hardware, and at a typical US electricity rate of 17 to 18 cents per kWh it costs only about 6 cents an hour to run flat out. It will happily run an 8B to 32B model with Ollama, and you can even fine-tune on it. A Mac Mini starting at $799 does the quiet, low-power version of the same thing.
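The 6-cents-an-hour figure is just watts times rate. A small sketch, using the 350 W draw and the 17 to 18 cents/kWh rate from the source notes; the 40 hours of monthly full-load use is an assumption picked purely for illustration.

```python
def hourly_power_cost(watts, usd_per_kwh):
    """Cost in USD of running a device at a given draw for one hour."""
    return watts / 1000 * usd_per_kwh


# 350 W and 17 to 18 cents/kWh come from the post and its source notes.
for rate in (0.17, 0.18):
    cost = hourly_power_cost(350, rate)
    print(f"{rate:.2f} USD/kWh -> {cost:.3f} USD per hour")

# Assumption for illustration only: 40 hours of full-load use per month.
monthly = hourly_power_cost(350, 0.175) * 40
print(f"about ${monthly:.2f} per month at 40 full-load hours")
```

Under that assumed usage, the electricity alone comes to about as much as the monthly cloud bills worked out earlier, before the card itself is paid for.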

But notice what you are comparing. Against a $2-a-month cloud bill, an $800 card takes 400 months, more than 33 years, to pay back its purchase price alone. That is before counting the electricity, the time spent setting it up, and the fact that the open models you run at home are usually a step behind the frontier ones the APIs serve. For a business, that is not a close call.
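The payback calculation is simple enough to sketch. The function below is illustrative (its name and the optional electricity parameter are not from any library); it divides the hardware price by whatever the card actually saves each month.

```python
def breakeven_months(hardware_usd, cloud_usd_per_month, power_usd_per_month=0.0):
    """Months until hardware pays for itself, or None if it never does."""
    saving = cloud_usd_per_month - power_usd_per_month
    if saving <= 0:
        # Electricity eats the whole saving, so the card never pays back.
        return None
    return hardware_usd / saving


months = breakeven_months(800, 2.00)
print(f"{months:.0f} months, about {months / 12:.1f} years")
```

Even against the roughly $7-a-month Sonnet bill, the same function gives about 114 months, or nine and a half years, and any electricity cost only pushes that out further.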

So when does local actually win?

Two cases, and they are real.

Privacy and compliance. If you handle data you are not allowed to send to a third party, the cloud bill stops being the deciding factor. A small law firm reviewing privileged documents, a clinic touching patient records, a contractor on work with data-handling clauses: for these, keeping the model on a machine you control can be a requirement, not a preference. Here local is not about saving money; it is about being allowed to do the work at all. If you want the full setup, I wrote it up in running open weights at home.

Very high, steady volume. The cloud is cheap per call, but it is still per call. If you are running not 1,000 generations a month but millions, the two cost lines eventually cross and owning the hardware gets cheaper. Almost no small business is there. Most medium businesses are not either. If you think you might be, measure your real token volume for a month before you buy anything, because the answer is usually "not yet."
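Measuring a month of real volume can be as simple as summing a usage export. The sketch below assumes a CSV with one row per API call; the column names and the sample rows are made up for illustration, so adapt them to whatever your provider's dashboard actually exports.

```python
import csv
import io

# Hypothetical export: one row per API call. Column names are assumptions,
# and the three sample rows are invented to keep the sketch self-contained.
SAMPLE_LOG = """date,input_tokens,output_tokens
2026-06-01,612,340
2026-06-01,580,371
2026-06-02,640,355
"""


def total_tokens(log_text):
    """Sum input and output tokens across all rows of a CSV usage log."""
    total_in = total_out = 0
    for row in csv.DictReader(io.StringIO(log_text)):
        total_in += int(row["input_tokens"])
        total_out += int(row["output_tokens"])
    return total_in, total_out


def cloud_bill(total_in, total_out, price_in, price_out):
    """Price a token total at per-million-token rates, in USD."""
    return total_in / 1_000_000 * price_in + total_out / 1_000_000 * price_out


tokens_in, tokens_out = total_tokens(SAMPLE_LOG)
print(f"{tokens_in} tokens in, {tokens_out} tokens out")
print(f"bill at $0.20/$1.25 per million: ${cloud_bill(tokens_in, tokens_out, 0.20, 1.25):.4f}")
```

With a full month of rows in place of the sample, the resulting bill is the number to set against the hardware price.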

For everything in between, there is a clean middle path: use the cheap cloud tiers for daily work, and rent a GPU by the hour only for the occasional heavy job like fine-tuning. Rental runs about 30 to 70 cents an hour, so you get the capability without the capital expense. The full hardware ladder, from free cloud GPUs to a real rig, is in the free-first home AI hardware guide.
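To put the rental rate in perspective, here is a rough sketch of how many rented GPU hours the price of a card would buy. The $800 price and the 30 to 70 cent range are from this post; the 10-hour job is an assumed figure for illustration, and the function names are hypothetical.

```python
def rental_hours_for_price(hardware_usd, rental_usd_per_hour):
    """How many rented GPU hours the hardware purchase price would buy."""
    return hardware_usd / rental_usd_per_hour


def rental_cost(hours, rental_usd_per_hour):
    """Cost in USD of a single rented job."""
    return hours * rental_usd_per_hour


for rate in (0.30, 0.70):
    hours = rental_hours_for_price(800, rate)
    # Assumption for illustration only: a fine-tuning job that takes 10 hours.
    job = rental_cost(10, rate)
    print(f"${rate:.2f}/hour: {hours:.0f} hours for $800, 10-hour job costs ${job:.2f}")
```

Even at the top of the range, $800 covers more than 1,100 rented hours, which is a lot of occasional heavy jobs.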

How to actually start, cheaply

The practical on-ramp for a small business is not a hardware purchase. It is one cheap-tier API key and a small, well-defined task.

  1. Pick one repetitive writing or sorting job that eats time every week. Drafting replies, summarizing call notes, tagging incoming requests.
  2. Wire it to a cheap model first (a nano, mini, Flash, or Haiku tier). Only move up to a mid tier like Sonnet or Gemini Pro if the output quality is visibly short.
  3. Set a hard monthly spend cap in the provider dashboard so a runaway loop can never surprise you. At these prices you will likely never hit it, but the cap is free peace of mind.
  4. Keep your prompts and any data you feed in as portable text, not locked inside one vendor's console, so switching providers later is a config change, not a rebuild.
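Steps 2 through 4 can be sketched in code without committing to any vendor. Everything named here is hypothetical: the provider and model strings are placeholders, and the guard is a local safety net that complements the spend cap in the provider dashboard rather than replacing it. No real API is called.

```python
from dataclasses import dataclass
from string import Template

# Provider choice lives in plain config, so switching is a config change.
# The provider and model names are placeholders, not real identifiers.
CONFIG = {
    "provider": "example-provider",
    "model": "example-cheap-tier",
    "price_in": 0.20,        # USD per million input tokens
    "price_out": 1.25,       # USD per million output tokens
    "monthly_cap_usd": 5.00,
}

# The prompt is kept as portable plain text, not inside a vendor console.
REPLY_PROMPT = Template(
    "Draft a short, polite reply to this customer message:\n$message"
)


@dataclass
class BudgetGuard:
    """Tracks estimated spend locally and refuses calls past the cap."""
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, tokens_in, tokens_out, price_in, price_out):
        cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
        if self.spent_usd + cost > self.cap_usd:
            raise RuntimeError("monthly spend cap reached")
        self.spent_usd += cost
        return cost


guard = BudgetGuard(cap_usd=CONFIG["monthly_cap_usd"])
prompt = REPLY_PROMPT.substitute(message="Is my order shipping this week?")
# A real integration would send `prompt` to the configured provider here,
# then charge the guard with the token counts the provider reports.
cost = guard.charge(600, 350, CONFIG["price_in"], CONFIG["price_out"])
print(f"estimated cost of this draft: ${cost:.6f}")
```

Keeping prices and model names in one config dictionary means that moving from a nano tier to a mid tier, or to another vendor, touches a few lines rather than the whole script.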

That is the whole move. The point of starting this way is the same idea behind everything I build and the argument of The $97 Launch: spend on the smallest real thing, learn what actually limits you, and only then spend more. For AI, the smallest real thing costs a few dollars a month, not a few hundred up front.

The honest summary

Buy the GPU if the law requires it or your volume genuinely demands it. Otherwise, rent the intelligence by the token, keep your data portable, and put the money you did not spend on hardware into the part of the business only you can build. Next in this series, the same logic applied to where your website and apps live.

Fact-check notes and sources

LLM API prices change often and are quoted per million tokens; treat these as approximate mid-2026 figures and confirm on each provider's page before relying on them.

  • Anthropic Claude pricing (Haiku 4.5 $1/$5, Sonnet 4.6 $3/$15, Opus 4.8 $5/$25 per 1M tokens): Claude pricing.
  • OpenAI (GPT-5.4-nano ~$0.20/$1.25, GPT-5.4-mini ~$0.75/$4.50): OpenAI API pricing.
  • Google Gemini (3 Flash ~$0.50/$3.00, Flash-Lite ~$0.25/$1.50): Gemini API pricing.
  • Open-model hosts (Groq Llama 3.1 8B ~$0.05/$0.08; Together Llama 3.3 70B ~$1.04): Groq pricing, Together AI pricing.
  • Used RTX 3090 (~$700 to $900, 350W) and local-rig economics: GPU listings; US residential electricity ~17 to 18 cents/kWh, EIA. Mac Mini from $799: Apple.

This post is informational and not financial, legal, or compliance advice. Product names and prices are current as of mid-2026 and change; verify before relying on them. No affiliation with the vendors mentioned is implied.


Last updated: April 2026