SMBs running internal AI assistants tend to hit bill shock around month three.
The LLM works. Customers love it. The monthly API bill is $1,200 and climbing. Nobody knows why.
Usually the answer is: every query retrieves 5-8 documents, each ~3000 tokens, then feeds all of it to GPT-4 or Claude Opus. At 1000 queries/day × 24,000 retrieval tokens (the upper end of that range) × $2.50 per 1M input tokens × 30 days, that's $1,800/month on input tokens alone. The math was always going to work out this way; nobody ran it.
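The arithmetic is worth making concrete. A minimal sketch, using the figures from the scenario above:

```python
# Monthly input-token cost for a naive RAG setup (figures from above).
DOCS_PER_QUERY = 8          # upper end of the 5-8 range
TOKENS_PER_DOC = 3_000
QUERIES_PER_DAY = 1_000
PRICE_PER_1M_INPUT = 2.50   # USD per 1M input tokens, GPT-4-class list price
DAYS = 30

tokens_per_query = DOCS_PER_QUERY * TOKENS_PER_DOC            # 24,000
monthly_tokens = tokens_per_query * QUERIES_PER_DAY * DAYS    # 720M
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_1M_INPUT
print(f"${monthly_cost:,.0f}/month")  # $1,800/month
```

Swap in your own query volume and retrieval depth; the structure of the calculation is the same.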
The cheap fix: estimate token counts during content design and pick the right model tier for the content's actual complexity. A FAQ page doesn't need Opus. A legal-analysis page might.
What the LLM Retrieval Cost Estimator does
You paste a URL or article text. The tool:
- Counts characters and words.
- Estimates tokens for five major tokenizer families:
  - OpenAI cl100k (GPT-4)
  - OpenAI o200k (GPT-4o / GPT-5)
  - Anthropic (Claude)
  - Google (Gemini)
  - Meta Llama
- Computes the per-call cost for this page across 10 major models at current April 2026 list prices.
- Computes the $1M-call cost (useful for capacity planning).
- Estimates the RAG chunk count at the standard 512-token chunk size.
- Emits an AI strategy prompt that picks the right model tier and recommends chunk size.
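The token estimates rest on chars-per-token heuristics for English prose. A sketch of the idea, where the ratios are illustrative assumptions, not the estimator's actual calibrated values:

```python
# Rough token estimate per tokenizer family, using chars-per-token
# heuristics for English prose. Ratios here are illustrative assumptions.
CHARS_PER_TOKEN = {
    "openai_cl100k": 4.0,   # GPT-4
    "openai_o200k": 4.2,    # GPT-4o / GPT-5 (slightly more efficient)
    "anthropic": 3.9,       # Claude
    "google": 4.3,          # Gemini
    "meta_llama": 3.8,
}

def estimate_tokens(text: str) -> dict[str, int]:
    n = len(text)
    return {family: round(n / ratio) for family, ratio in CHARS_PER_TOKEN.items()}

page = "A FAQ answer about shipping times. " * 40  # 1,400 characters
print(estimate_tokens(page))
```

For production numbers you would count with the real tokenizer (e.g. OpenAI's tiktoken library) rather than a ratio, but the heuristic is close enough for tier selection.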
The three cost-design decisions
1. Which model tier for which content. Don't use Claude Opus for a FAQ when Haiku works. Don't use GPT-5 for product-description retrieval when GPT-4o-mini is fine. The estimator makes the cost spread visible.
Rule of thumb:
- Factual lookups, FAQ, short-answer retrieval: Gemini Flash, Haiku, GPT-4o-mini
- Narrative explanation, multi-fact synthesis: GPT-4o, Sonnet, Gemini Pro
- Complex reasoning, legal analysis, code review: Opus, GPT-5
Most SMB content is in the first or second tier. Using tier-3 models for tier-1 content is the most common accidental pattern.
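The rule of thumb above can be sketched as a simple lookup. The content-type labels and model names here are assumptions for illustration, not the estimator's actual taxonomy:

```python
# Sketch of a model-tier picker keyed to the rule of thumb above.
TIERS = {
    1: {"models": ["gemini-flash", "claude-haiku", "gpt-4o-mini"],
        "content": {"faq", "factual_lookup", "short_answer"}},
    2: {"models": ["gpt-4o", "claude-sonnet", "gemini-pro"],
        "content": {"narrative", "multi_fact_synthesis"}},
    3: {"models": ["claude-opus", "gpt-5"],
        "content": {"complex_reasoning", "legal_analysis", "code_review"}},
}

def pick_tier(content_type: str) -> int:
    for tier, spec in TIERS.items():
        if content_type in spec["content"]:
            return tier
    return 2  # default to the middle tier when unsure

print(pick_tier("faq"))             # 1
print(pick_tier("legal_analysis"))  # 3
```

Defaulting unknown content to tier 2 rather than tier 3 is the point: the expensive tier should be opt-in, not the fallback.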
2. Which tokenizer to optimize for. If your primary LLM is GPT-4o, the same text produces ~5% fewer tokens than under GPT-4, thanks to the more efficient o200k tokenizer. Gemini's tokenization is ~7% more efficient still. Anthropic's tokenizer is slightly less efficient than either. If you're planning a multi-million-call workload, that 7-12% spread is real money.
3. Chunk size for RAG. Standard 512 tokens is a default. Dense reference content (legal text, API docs) benefits from 256-token chunks for precision. Narrative content (case studies, long-form articles) benefits from 1024-token chunks for coherence. The tool flags content type and suggests the right chunk size.
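A minimal word-based chunker shows how chunk size maps to chunk count. It assumes ~0.75 words per token for English prose (so a 512-token chunk is ~384 words); a real pipeline would split on token boundaries with the target model's tokenizer:

```python
# Naive word-based chunker. WORDS_PER_TOKEN is an assumed heuristic
# for English prose, not a tokenizer-exact figure.
WORDS_PER_TOKEN = 0.75

def chunk(text: str, chunk_tokens: int = 512) -> list[str]:
    words = text.split()
    step = int(chunk_tokens * WORDS_PER_TOKEN)  # 512 tokens -> ~384 words
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

doc = "word " * 1000  # a 1,000-word document
print(len(chunk(doc, 512)))   # 3 chunks at the default size
print(len(chunk(doc, 256)))   # 6 chunks for dense reference content
```

Halving the chunk size roughly doubles the chunk count, which is why the precision-vs-coherence tradeoff matters for retrieval cost.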
The token-inflating content patterns
Four patterns that silently inflate token counts:
Heavy punctuation. Em-dashes, semicolons, ellipses tokenize inefficiently. A paragraph laced with stylistic punctuation can cost 15-20% more tokens than a clean-prose equivalent.
Code blocks. Code is expensive. A 400-word page with two code blocks can cost twice as much in tokens as a 400-word page of prose.
Numerical tables. Tables of figures tokenize by cell — every number becomes multiple tokens. A 20-row table can cost 600-1000 tokens by itself.
Repeated headers / navigation. If the main content extraction doesn't strip navigation (the tool does, but some pipelines don't), every page's retrieval includes the site nav. At scale, that's 5-10% of total spend wasted.
The AI strategy prompt flags these patterns per page and recommends specific fixes.
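The four patterns above lend themselves to cheap heuristic detection. This is a sketch with illustrative thresholds and regexes, not the tool's actual rules:

```python
import re

# Heuristic flags for the four token-inflating patterns above.
# Thresholds and patterns are illustrative assumptions.
def flag_patterns(text: str) -> list[str]:
    flags = []
    punct = len(re.findall(r"[—;…]|\.\.\.", text))
    if punct > len(text.split()) * 0.05:               # heavy stylistic punctuation
        flags.append("heavy_punctuation")
    if text.count("```") >= 2:                         # at least one fenced code block
        flags.append("code_blocks")
    if len(re.findall(r"^\|.*\|$", text, re.M)) >= 5:  # markdown table rows
        flags.append("numerical_tables")
    if re.search(r"(?i)\b(home|about|contact)\b.*\|.*\b(home|about|contact)\b", text):
        flags.append("nav_residue")                    # crude nav-bar signature
    return flags

sample = "Home | About | Contact\n```\nx = 1\n```\nsome prose"
print(flag_patterns(sample))  # ['code_blocks', 'nav_residue']
```

Each flag maps to a concrete fix: rewrite the punctuation, move code behind a link, summarize the table, or fix the extraction pipeline.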
The 3,000-query/month worked example
Scenario: SMB with 100 pages in a RAG knowledge base. Each query retrieves ~3 pages (~1500 tokens each = 4500 tokens). 100 queries/day = 3000/month = 36,000/year.
Cost at different model tiers:
- GPT-4o ($1.25/1M input): 4500 × 3000 × $1.25 / 1M = $17/month on input tokens
- Claude Opus 4.6 ($15/1M): $202/month
- Claude Sonnet 4.6 ($3/1M): $40/month
- Gemini Flash ($0.075/1M): $1/month
The spread is 200x. Most SMBs land somewhere in the Sonnet/GPT-4o band because the quality/cost tradeoff is honest there. Moving to Flash/Haiku saves 90%+ if quality allows.
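The table above reduces to one multiplication per model. Reproducing it, with the April 2026 list prices quoted above:

```python
# Worked example: 3 pages x ~1,500 tokens per query, 3,000 queries/month.
TOKENS_PER_QUERY = 3 * 1_500     # 4,500
QUERIES_PER_MONTH = 3_000
PRICES_PER_1M = {                # USD per 1M input tokens (April 2026 list)
    "gpt-4o": 1.25,
    "claude-opus-4.6": 15.00,
    "claude-sonnet-4.6": 3.00,
    "gemini-flash": 0.075,
}

monthly_tokens = TOKENS_PER_QUERY * QUERIES_PER_MONTH  # 13.5M
for model, price in PRICES_PER_1M.items():
    cost = monthly_tokens / 1_000_000 * price
    print(f"{model:<18} ${cost:,.2f}/month")
```

The unrounded figures ($16.88, $202.50, $40.50, $1.01) match the table; the 200x spread falls straight out of the price ratio between Opus and Flash.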
Related reading
- RAG Readiness Audit — upstream: can your content even be ingested?
- Chunk Retrievability — chunk-level extraction scoring
- Model-Specific Snippet Audit — per-model extraction optimization
- AI Model Recommender — picks the right model tier for a task
Fact-check notes and sources
- OpenAI tokenizer differences (cl100k vs o200k): OpenAI tokenizer docs
- Per-model pricing (April 2026): public pricing pages for each vendor
- Tokenization heuristics (chars-per-token ratios): derived from official tokenizer libraries with representative English prose
This post is informational, not LLM-architecture-consulting advice. Mentions of OpenAI, Anthropic, Google, Meta, AWS Bedrock are nominative fair use. No affiliation is implied.