SMBs running internal AI assistants tend to hit bill shock around month three.
The LLM works. Customers love it. The monthly API bill is $1,200 and climbing. Nobody knows why.
Usually the answer is: every query retrieves 5-8 documents, each ~3000 tokens, then feeds all of it to GPT-4 or Claude Opus. At 1000 queries/day × 24,000 retrieval tokens (the upper end of that range) × $2.50 per 1M input tokens × 30 days, that's $1,800/month on input tokens alone. The math was always going to work out this way; nobody ran it.
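The arithmetic is worth making concrete. A minimal sketch, using the figures from the scenario above:

```python
# Monthly input-token cost for a naive RAG setup (figures from above).
DOCS_PER_QUERY = 8          # upper end of the 5-8 range
TOKENS_PER_DOC = 3_000
QUERIES_PER_DAY = 1_000
PRICE_PER_1M_INPUT = 2.50   # USD per 1M input tokens, GPT-4-class list price
DAYS = 30

tokens_per_query = DOCS_PER_QUERY * TOKENS_PER_DOC            # 24,000
monthly_tokens = tokens_per_query * QUERIES_PER_DAY * DAYS    # 720M
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_1M_INPUT
print(f"${monthly_cost:,.0f}/month")  # $1,800/month
```

Swap in your own query volume and retrieval depth; the structure of the calculation is the same.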
The cheap fix: estimate token counts during content design and pick the right model tier for the content's actual complexity. A FAQ page doesn't need Opus. A legal-analysis page might.
What the LLM Retrieval Cost Estimator does
You paste a URL or article text. The tool:
- Counts characters and words.
- Estimates tokens for five major tokenizer families:
  - OpenAI cl100k (GPT-4)
  - OpenAI o200k (GPT-4o / GPT-5)
  - Anthropic (Claude)
  - Google (Gemini)
  - Meta Llama
- Computes the per-call cost for this page across 10 major models at current April 2026 list prices.
- Computes the $1M-call cost (useful for capacity planning).
- Estimates the RAG chunk count at the standard 512-token chunk size.
- Emits an AI strategy prompt that picks the right model tier and recommends chunk size.
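The token estimates rest on chars-per-token heuristics for English prose. A sketch of the idea, where the ratios are illustrative assumptions, not the estimator's actual calibrated values:

```python
# Rough token estimate per tokenizer family, using chars-per-token
# heuristics for English prose. Ratios here are illustrative assumptions.
CHARS_PER_TOKEN = {
    "openai_cl100k": 4.0,   # GPT-4
    "openai_o200k": 4.2,    # GPT-4o / GPT-5 (slightly more efficient)
    "anthropic": 3.9,       # Claude
    "google": 4.3,          # Gemini
    "meta_llama": 3.8,
}

def estimate_tokens(text: str) -> dict[str, int]:
    n = len(text)
    return {family: round(n / ratio) for family, ratio in CHARS_PER_TOKEN.items()}

page = "A FAQ answer about shipping times. " * 40  # 1,400 characters
print(estimate_tokens(page))
```

For production numbers you would count with the real tokenizer (e.g. OpenAI's tiktoken library) rather than a ratio, but the heuristic is close enough for tier selection.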
The three cost-design decisions
1. Which model tier for which content. Don't use Claude Opus for a FAQ when Haiku works. Don't use GPT-5 for product-description retrieval when GPT-4o-mini is fine. The estimator makes the cost spread visible.
Rule of thumb:
- Factual lookups, FAQ, short-answer retrieval: Gemini Flash, Haiku, GPT-4o-mini
- Narrative explanation, multi-fact synthesis: GPT-4o, Sonnet, Gemini Pro
- Complex reasoning, legal analysis, code review: Opus, GPT-5
Most SMB content is in the first or second tier. Using tier-3 models for tier-1 content is the most common accidental pattern.
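The rule of thumb above can be sketched as a simple lookup. The content-type labels and model names here are assumptions for illustration, not the estimator's actual taxonomy:

```python
# Sketch of a model-tier picker keyed to the rule of thumb above.
TIERS = {
    1: {"models": ["gemini-flash", "claude-haiku", "gpt-4o-mini"],
        "content": {"faq", "factual_lookup", "short_answer"}},
    2: {"models": ["gpt-4o", "claude-sonnet", "gemini-pro"],
        "content": {"narrative", "multi_fact_synthesis"}},
    3: {"models": ["claude-opus", "gpt-5"],
        "content": {"complex_reasoning", "legal_analysis", "code_review"}},
}

def pick_tier(content_type: str) -> int:
    for tier, spec in TIERS.items():
        if content_type in spec["content"]:
            return tier
    return 2  # default to the middle tier when unsure

print(pick_tier("faq"))             # 1
print(pick_tier("legal_analysis"))  # 3
```

Defaulting unknown content to tier 2 rather than tier 3 is the point: the expensive tier should be opt-in, not the fallback.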
2. Which tokenizer to optimize for. If your primary LLM is GPT-4o, the same text produces ~5% fewer tokens than under GPT-4, thanks to the more efficient o200k tokenizer. Gemini's tokenization is ~7% more efficient still. Anthropic's tokenizer is slightly less efficient than either. If you're planning a multi-million-call workload, that 7-12% spread is real money.
3. Chunk size for RAG. Standard 512 tokens is a default. Dense reference content (legal text, API docs) benefits from 256-token chunks for precision. Narrative content (case studies, long-form articles) benefits from 1024-token chunks for coherence. The tool flags content type and suggests the right chunk size.
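A minimal word-based chunker shows how chunk size maps to chunk count. It assumes ~0.75 words per token for English prose (so a 512-token chunk is ~384 words); a real pipeline would split on token boundaries with the target model's tokenizer:

```python
# Naive word-based chunker. WORDS_PER_TOKEN is an assumed heuristic
# for English prose, not a tokenizer-exact figure.
WORDS_PER_TOKEN = 0.75

def chunk(text: str, chunk_tokens: int = 512) -> list[str]:
    words = text.split()
    step = int(chunk_tokens * WORDS_PER_TOKEN)  # 512 tokens -> ~384 words
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

doc = "word " * 1000  # a 1,000-word document
print(len(chunk(doc, 512)))   # 3 chunks at the default size
print(len(chunk(doc, 256)))   # 6 chunks for dense reference content
```

Halving the chunk size roughly doubles the chunk count, which is why the precision-vs-coherence tradeoff matters for retrieval cost.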
The token-inflating content patterns
Four patterns that silently inflate token counts:
Heavy punctuation. Em-dashes, semicolons, ellipses tokenize inefficiently. A paragraph laced with stylistic punctuation can cost 15-20% more tokens than a clean-prose equivalent.
Code blocks. Code is expensive. A 400-word page with two code blocks can cost twice as much in tokens as a 400-word page of prose.
Numerical tables. Tables of figures tokenize by cell — every number becomes multiple tokens. A 20-row table can cost 600-1000 tokens by itself.
Repeated headers / navigation. If the main content extraction doesn't strip navigation (the tool does, but some pipelines don't), every page's retrieval includes the site nav. At scale, that's 5-10% of total spend wasted.
The AI strategy prompt flags these patterns per page and recommends specific fixes.
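The four patterns above lend themselves to cheap heuristic detection. This is a sketch with illustrative thresholds and regexes, not the tool's actual rules:

```python
import re

# Heuristic flags for the four token-inflating patterns above.
# Thresholds and patterns are illustrative assumptions.
def flag_patterns(text: str) -> list[str]:
    flags = []
    punct = len(re.findall(r"[—;…]|\.\.\.", text))
    if punct > len(text.split()) * 0.05:               # heavy stylistic punctuation
        flags.append("heavy_punctuation")
    if text.count("```") >= 2:                         # at least one fenced code block
        flags.append("code_blocks")
    if len(re.findall(r"^\|.*\|$", text, re.M)) >= 5:  # markdown table rows
        flags.append("numerical_tables")
    if re.search(r"(?i)\b(home|about|contact)\b.*\|.*\b(home|about|contact)\b", text):
        flags.append("nav_residue")                    # crude nav-bar signature
    return flags

sample = "Home | About | Contact\n```\nx = 1\n```\nsome prose"
print(flag_patterns(sample))  # ['code_blocks', 'nav_residue']
```

Each flag maps to a concrete fix: rewrite the punctuation, move code behind a link, summarize the table, or fix the extraction pipeline.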
The 3,000-query/month worked example
Scenario: SMB with 100 pages in a RAG knowledge base. Each query retrieves ~3 pages (~1500 tokens each = 4500 tokens). 100 queries/day = 3000/month = 36,000/year.
Cost at different model tiers:
- GPT-4o ($1.25/1M input): 4500 × 3000 × $1.25 / 1M = $17/month on input tokens
- Claude Opus 4.6 ($15/1M): $202/month
- Claude Sonnet 4.6 ($3/1M): $40/month
- Gemini Flash ($0.075/1M): $1/month
The spread is 200x. Most SMBs land somewhere in the Sonnet/GPT-4o band because the quality/cost tradeoff is honest there. Moving to Flash/Haiku saves 90%+ if quality allows.
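The table above reduces to one multiplication per model. Reproducing it, with the April 2026 list prices quoted above:

```python
# Worked example: 3 pages x ~1,500 tokens per query, 3,000 queries/month.
TOKENS_PER_QUERY = 3 * 1_500     # 4,500
QUERIES_PER_MONTH = 3_000
PRICES_PER_1M = {                # USD per 1M input tokens (April 2026 list)
    "gpt-4o": 1.25,
    "claude-opus-4.6": 15.00,
    "claude-sonnet-4.6": 3.00,
    "gemini-flash": 0.075,
}

monthly_tokens = TOKENS_PER_QUERY * QUERIES_PER_MONTH  # 13.5M
for model, price in PRICES_PER_1M.items():
    cost = monthly_tokens / 1_000_000 * price
    print(f"{model:<18} ${cost:,.2f}/month")
```

The unrounded figures ($16.88, $202.50, $40.50, $1.01) match the table; the 200x spread falls straight out of the price ratio between Opus and Flash.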
Related reading
- RAG Readiness Audit — upstream: can your content even be ingested?
- Chunk Retrievability — chunk-level extraction scoring
- Model-Specific Snippet Audit — per-model extraction optimization
- AI Model Recommender — picks the right model tier for a task
Fact-check notes and sources
- OpenAI tokenizer differences (cl100k vs o200k): OpenAI tokenizer docs
- Per-model pricing (April 2026): public pricing pages for each vendor
- Tokenization heuristics (chars-per-token ratios): derived from official tokenizer libraries with representative English prose
This post is informational, not LLM-architecture-consulting advice. Mentions of OpenAI, Anthropic, Google, Meta, AWS Bedrock are nominative fair use. No affiliation is implied.