Part of the Claude Code workflow series. Start with the install primer; then what to do after install; then this post when you've noticed the cost of Claude-everywhere and want a practical model-routing setup.
Claude Code is a tool. Claude itself is a model family. Those are not the same thing. You can absolutely use Claude Code for every task and pay the full Opus bill each month. You can also look at how you actually use it, notice that 30% of your sessions are "classify these 200 items" or "give me a three-word summary" — tasks that don't need the full reasoning of Opus — and route those to a cheaper or faster model.
This isn't about abandoning Claude Code. You stay in Claude Code. You just sometimes ask Claude Code to delegate to another model, or you keep a second CLI open for the cases where a different model fits better.
The four reasons to route away from Opus
1. Cost. Opus per-token pricing is the highest in the lineup because it's the most capable model. For high-volume mechanical tasks (tagging, summarizing, JSON reshuffling), Haiku or Sonnet is 5–20× cheaper and produces output of identical quality for tasks this mechanical.
2. Latency. Inline autocomplete, shell-command suggestion, quick clarification — for these you want sub-second response. Opus is fast, but Codex CLI and GitHub Copilot CLI are usually faster for those specific patterns because they're tuned for shell-speed interaction.
3. Creative variance. Two models trained by different teams on different data produce genuinely different answers. When you want a second opinion on architecture, design, or approach, running the same prompt through Claude and GPT (via Codex CLI) and comparing is higher-signal than asking Claude twice.
4. Specific strengths. Some models are measurably better at certain things. GPT-4 has a reputation for mathematical proofs and certain kinds of structured reasoning. Gemini has extremely long context windows for bulk document analysis. Open-weights models can run locally with zero data egress. The best model for a given job isn't always the one you default to.
The three patterns that work
Pattern 1 — plan in Claude, execute elsewhere
The setup:
- Claude Code for planning, reading, and decision-making. You use Claude Opus's reasoning to produce a detailed plan for a feature.
- Codex CLI (or another tool) for mechanical execution. Once the plan is locked, you feed each step to Codex to implement. Codex is faster per turn and cheaper per token for "here's the plan, now type the code."
When it's worth it: tasks with a complex planning phase and a mechanical execution phase. Refactors, migrations, boilerplate generation. The planning is 20% of the time but 80% of the value; route the 80%-of-time execution to the cheaper model.
When it isn't: tasks with tight feedback loops where planning and execution overlap. Active debugging, exploratory coding. Context-switching between CLIs kills your flow.
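The handoff can be as simple as a file of plan steps piped into the second CLI. A minimal sketch — assuming `claude -p` (non-interactive print mode) and `codex exec` (non-interactive execution) behave as documented for your installed versions, and with a made-up migration task as the example:

```shell
# Hypothetical sketch. Verify `claude -p` and `codex exec` against your
# installed CLI versions before relying on this.

# 1. Plan with Opus: high-reasoning, one-time cost.
claude -p "Plan the migration of src/db from raw SQL to the query builder. \
Output a numbered list of independent steps, one per line." > plan.md

# 2. Execute each step with the cheaper, faster CLI.
while IFS= read -r step; do
  codex exec "Implement this step from the migration plan: $step"
done < plan.md
```

The file in the middle is the point: the plan is a reviewable artifact, so you can edit it by hand before a single line of execution happens.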
Pattern 2 — two models, one prompt, diff the answers
The setup:
- Ask the same architectural question of both Claude and Codex (or whichever two models you have set up).
- Compare the answers. Where they agree, you have high confidence. Where they disagree, you read both arguments and make the call.
When it's worth it: genuinely open design questions. "Should this service be event-driven or request-response?" "Is this schema normalized enough?" A second opinion is cheap (a few bucks) and the disagreement surfaces blind spots a single model's defaults would have papered over.
When it isn't: questions with objectively right answers. "What's the correct regex for RFC 5322 email validation?" One good model gets this right; asking two is wasted tokens.
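In shell terms, the pattern is two one-shot calls and a diff. A sketch with an invented prompt, again assuming both CLIs support non-interactive output to stdout:

```shell
# Hypothetical sketch -- flags assume `claude -p` and `codex exec` print
# their answer to stdout; check your installed versions.
PROMPT='Should the notification service be event-driven or request-response?
Constraints: ~50k messages/day, 3-person team, existing RabbitMQ cluster.'

claude -p "$PROMPT"  > /tmp/answer-claude.md
codex exec "$PROMPT" > /tmp/answer-gpt.md

# Agreement = confidence; disagreement = the part worth reading closely.
diff --side-by-side --width=160 /tmp/answer-claude.md /tmp/answer-gpt.md
```

Reading only the diff, not both answers end to end, is what keeps the second opinion cheap in your time as well as in tokens.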
Pattern 3 — Haiku for volume, Opus for judgment
The setup:
- Haiku for "process these 1,000 items in a loop" tasks. Tag each image. Classify each log entry. Summarize each commit. Cheap, fast, good enough.
- Opus for the judgment calls that come out the other end. "Based on the 1,000 Haiku-generated tags, which category has the weakest coverage and what content should we write?"
When it's worth it: anything with a volume component + a synthesis component. Log analysis, content tagging, bulk summarization, codebase-wide refactor analysis.
When it isn't: single-item reasoning tasks. A one-file review. A one-function refactor. No volume component means no Haiku delegation.
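The two phases can live in one script. A sketch using commit summarization as the volume job — it assumes `claude -p --model <name>` selects the model per invocation and that the `opus` alias resolves to the current Opus model; both are worth verifying against your CLI version:

```shell
# Hypothetical sketch. Assumes `--model` accepts both full model names
# and the `opus` alias; model names match the ones used in this post.

# Volume phase: Haiku, one cheap call per item.
: > commits.summaries
git log --oneline -n 1000 | while IFS= read -r commit; do
  claude -p --model claude-haiku-4-5 \
    "One-line summary of intent, not mechanics: $commit" >> commits.summaries
done

# Judgment phase: Opus, one expensive call over the aggregate.
claude -p --model opus \
  "Here are commit summaries, one per line. Which subsystem churns the most, \
and what does that suggest we refactor first? $(cat commits.summaries)"
```

A thousand Haiku calls plus one Opus call costs a fraction of a thousand Opus calls, and the synthesis step is the only place the extra reasoning was ever needed.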
Practical routing without a custom framework
You don't need LiteLLM or a router layer to do this. The cheapest possible setup:
Three shell aliases, three CLIs:
```shell
# ~/.zshrc (or .bashrc)
alias code-claude='claude'                # Claude Code default
alias code-fast='codex'                   # OpenAI Codex / ChatGPT CLI
alias code-suggest='gh copilot suggest'   # GitHub Copilot CLI for shell commands
```
Now you pick by alias: `code-claude` for the real work, `code-fast` for "quick question, one-shot answer," `code-suggest` for "what's the find/grep/xargs invocation I need here?"
Per-project model preference via environment variable:
Claude Code accepts a model override. If a specific project should use Sonnet by default (cheaper, still capable):
```shell
# .envrc (via direnv) or a project-level script
export ANTHROPIC_MODEL=claude-sonnet-4-6
claude
```
Or Haiku when you want the fastest, cheapest option:

```shell
export ANTHROPIC_MODEL=claude-haiku-4-5
```
Keep Opus for the projects where reasoning matters most; drop to Sonnet or Haiku for volume projects.
Skills that delegate:
A `.claude/skills/bulk-tag.md` skill might look like:
```markdown
# bulk-tag
Tag items in bulk. Uses Haiku for the per-item calls.

Invoke: /skill bulk-tag <input-file> <tag-scheme>

Implementation:
- Read <input-file> line by line
- For each line, call the Anthropic API with model=claude-haiku-4-5 and the tag-scheme prompt
- Collect outputs into <input-file>.tagged
- Report the count of tagged items plus any failures
```
Now `/skill bulk-tag logs.txt severity-taxonomy` handles the volume job without burning Opus tokens. Your main session stays on Opus for everything that isn't bulk.
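The implementation steps the skill describes map onto a plain curl loop. A sketch — the request shape follows the Anthropic Messages API, but verify the version header and model name against current docs before using it; the severity scheme and filenames here are invented for illustration, and it requires `curl` and `jq`:

```shell
# Hypothetical sketch of the skill's implementation. Verify the API version
# header and model name against current Anthropic docs.
INPUT=logs.txt
SCHEME="Classify severity as exactly one of: debug, info, warn, error, fatal."

: > "$INPUT.tagged"
fail=0
while IFS= read -r line; do
  # One cheap Haiku call per item; tiny max_tokens since we want one word back.
  tag=$(curl -sS https://api.anthropic.com/v1/messages \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d "$(jq -n --arg p "$SCHEME Item: $line" \
          '{model: "claude-haiku-4-5", max_tokens: 16,
            messages: [{role: "user", content: $p}]}')" \
    | jq -r '.content[0].text // empty')
  if [ -n "$tag" ]; then
    printf '%s\t%s\n' "$tag" "$line" >> "$INPUT.tagged"
  else
    fail=$((fail + 1))
  fi
done < "$INPUT"
echo "tagged: $(wc -l < "$INPUT.tagged"), failures: $fail"
```

Building the request body with `jq -n --arg` rather than string interpolation keeps log lines with quotes or backslashes from breaking the JSON.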
When a router framework is worth it
If you're routing 5+ models across 20+ task types with rate-limiting, failover, and cost tracking, you've grown out of the alias-plus-env-var setup. At that scale:
- LiteLLM — unified SDK across ~100 LLM providers. Drop-in OpenAI-compatible API. Good for teams.
- OpenRouter — similar, hosted. Pay-as-you-go across providers with a single bill.
- Continue.dev — IDE-side routing with per-task model configs. Works alongside Claude Code for non-CLI workflows.
For most solo developers, aliases + env vars + skills are enough. Don't over-engineer. Switch to a framework only when you've measured that you'd benefit from one.
The honest limits
Model behavior varies week to week. Provider updates can shift output quality subtly without a version bump. A prompt that worked well in Claude last month might produce worse output this month; the fix is usually re-testing against current versions. Don't hardcode assumptions about model comparative strengths that are more than a quarter old.
Cost tracking is per-provider. Claude usage shows up on the Anthropic bill; GPT on the OpenAI bill; Gemini on Google's bill. If you're routing heavily, budget across all of them — and know that "save money by switching to Haiku for X" can be offset if the context switch slows you down by 20%.
Lock-in is real but overstated. You may find yourself building workflows that depend on Claude Code features (skills, subagents, hooks) that don't have 1:1 equivalents in other CLIs. That's fine — keep Claude Code as the host, delegate to other models for specific tasks, don't try to rebuild the whole harness.
What to try next week
- Pick one task type you currently use Claude Opus for that doesn't really need Opus. Good candidates: bulk tagging, one-line summaries, routine file renames, commit-message generation. Route it to Haiku and measure whether output quality drops.
- Set up Codex CLI alongside Claude Code for quick questions. Use Codex for 3 days. Notice which kinds of questions you reach for it on (usually: fast, one-shot, no file context).
- Next time you have an architectural decision, ask the same question in Claude and Codex. Read both. Notice what you learn from the disagreement.
- If you've never measured your Claude token spend by session type, do it this week. /insights shows part of this; the full picture is in your Anthropic billing dashboard. Knowing where the money goes is step one to routing.
Related reading
- AI Terminal Kickstart — install prereq (includes Codex and Copilot setup for the routing pattern above).
- CLI installed — now what? — the overview post.
- Skills vs Rules vs Memory — per-model delegation usually lives in skills.
- /insights-driven iteration — measure before you route; /insights is where you spot the "this should be Haiku" patterns.
- Agent skill marketplace patterns — community skills often include model-routing patterns you can adopt directly.
Fact-check notes and sources
- Anthropic: Effective harnesses for long-running agents — touches on when different capabilities matter per task.
- Paddo: Inside the Claude Code Team's actual workflow tools — mentions multi-model patterns the team itself uses.
- Augment Code: Google Antigravity vs Claude Code — comparison context for when Gemini-via-Antigravity might be the better fit.
- LiteLLM: LiteLLM documentation — for readers who graduate past the alias setup.
- OpenRouter: openrouter.ai — hosted multi-provider routing.
Informational, not engineering consulting advice. Model pricing, capability comparisons, and CLI feature sets evolve. Verify current pricing and capability per-provider before committing a workflow. Mentions of Anthropic, OpenAI, Google, GitHub, LiteLLM, OpenRouter, Continue.dev, and linked publications are nominative fair use. No affiliation is implied.