Anthropic shipped Claude Opus 4.7 on April 16, 2026 at the same sticker price as Opus 4.6: $5 per million input tokens, $25 per million output tokens. The announcement post led with the coding improvements. The community discovered the rest.
Two weeks later, enough independent benchmark data exists to draw a real picture. Not the marketing picture. The full one.
Where Opus 4.7 leads
Coding: clear first place
On SWE-bench Verified, the benchmark that measures real-world GitHub issue resolution, Opus 4.7 scores 87.6%. That's up from 80.8% on Opus 4.6 and ahead of GPT-5.3-Codex at 85.0% and Gemini 3.1 Pro at 80.6%.
On the harder SWE-bench Pro variant, the gap is wider. Opus 4.7 hits 64.3%, compared to GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. For multi-step coding tasks that require understanding a full repository, Opus 4.7 is the strongest model available right now.
Tool use: decisive advantage
On MCP Atlas, which measures how well a model calls external tools in agentic workflows, Opus 4.7 scores 77.3%. This is ahead of every competitor. For developers using Claude Code, Codex, or any tool-calling CLI, this is the benchmark that most closely maps to "can the model actually use the tools I gave it."
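If you'd rather sanity-check tool calling against your own schemas than trust a benchmark, a minimal probe looks something like the sketch below. It uses the standard Messages API `tools` parameter; the tool definition is an illustrative stand-in for whatever your agent actually exposes, and the 4.7 snapshot ID is hypothetical.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Illustrative tool definition; substitute the schemas your agent really exposes.
const tools: Anthropic.Tool[] = [
  {
    name: "get_open_issues",
    description: "List open issues for a GitHub repository",
    input_schema: {
      type: "object",
      properties: {
        repo: { type: "string", description: "owner/name, e.g. acme/api" },
        limit: { type: "number" },
      },
      required: ["repo"],
    },
  },
];

const response = await client.messages.create({
  model: "claude-opus-4-7-20260416", // hypothetical snapshot ID, for illustration only
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "How many open issues does acme/api have?" }],
});

// A model that handles tools well should stop with a tool_use block and valid arguments.
const toolCall = response.content.find((block) => block.type === "tool_use");
console.log(response.stop_reason, JSON.stringify(toolCall, null, 2));
```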
Vision: a different league
Opus 4.7 accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels, which is more than three times the resolution of prior Claude models. One early-access partner testing computer vision for security work reported visual acuity jumping from 54.5% to 98.5%. If your workflow involves screenshots, diagrams, or visual debugging, this matters.
Chatbot Arena: narrow lead
Opus 4.7 in thinking mode leads the LM Arena leaderboard at 1504 Elo. Gemini 3.1 Pro sits at 1493 and GPT-5.4-high at 1482. The gap is small enough that it won't matter for general chat. For the coding-specific Elo, the lead is more meaningful.
Where Opus 4.7 regressed
Long-context retrieval: severe drop
This is the regression that caught the most people off guard. On MRCR v2 (Multi-needle Retrieval in Context at Range), which tests whether a model can find specific information buried in long documents:
- At 1 million tokens: Opus 4.7 scores 32.2%. Opus 4.6 scored 78.3%. That's a 46-point drop.
- At 256K tokens (8-needle retrieval): Opus 4.7 scores 59.2%. Opus 4.6 scored 91.9%. A 33-point drop.
Anthropic's response: they say MRCR stacks distractors in ways that don't reflect real usage, and they point to GraphWalks as a better long-context benchmark where 4.7 improves. That may be true for some workloads. For anyone running RAG pipelines, legal document analysis, or large-codebase navigation that depends on precise retrieval from long context, the MRCR regression maps directly to their failure mode.
If your workflow involves finding a specific function definition in a 200-file codebase, or extracting a specific clause from a 400-page contract, test on your actual data before upgrading.
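One way to do that is a needle check on a document you actually use: plant a single known fact deep in the text, ask each snapshot for it, and compare. A minimal sketch with the standard Messages API; the file name, question, planted answer, and the 4.7 snapshot ID are all illustrative stand-ins.

```typescript
import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// A long document from your own workload (contract, transcript, codebase dump)
// with one known fact planted deep inside it.
const haystack = readFileSync("long-context-sample.txt", "utf8");
const question = "What is the termination notice period in section 14.2?";
const needle = "ninety (90) days"; // the planted fact

async function retrievalCheck(model: string) {
  const response = await client.messages.create({
    model,
    max_tokens: 256,
    messages: [{ role: "user", content: `${haystack}\n\nQuestion: ${question}` }],
  });
  const answer = response.content
    .flatMap((block) => (block.type === "text" ? [block.text] : []))
    .join("");
  return { model, found: answer.includes(needle), answer };
}

// Pin to dated snapshots; the 4.7 ID here is hypothetical.
for (const model of ["claude-opus-4-6-20260316", "claude-opus-4-7-20260416"]) {
  console.log(await retrievalCheck(model));
}
```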
Agentic search: a step backward
On BrowseComp, which measures the model's ability to autonomously search and synthesize web content, Opus 4.7 dropped 4.4 points from Opus 4.6. If your agent workflow involves the model searching the web, reading pages, and compiling results, this matters.
Tokenizer cost: 32 to 34 percent more tokens
Opus 4.7 ships a new v2 tokenizer. For the same input text, it generates 32 to 34 percent more tokens at production scale (10K+ token prompts). The per-token price is unchanged, but your effective cost per prompt is higher.
For a developer running a few sessions a day, this is barely noticeable. For anyone running batch workflows, automated testing, or high-volume API calls, it shows up fast. A prompt that cost $0.50 under Opus 4.6 now costs $0.66 under 4.7. Multiply that across thousands of calls.
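You can measure your own exposure before switching by counting the same prompt under both snapshots. A minimal sketch, assuming the SDK's token-counting endpoint accepts both model IDs; the 4.7 snapshot ID is illustrative.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Use a representative prompt from your real workload, ideally 10K+ tokens,
// since the reported 32-34% inflation shows up at production scale.
const prompt = "...paste a real production prompt here...";

async function countFor(model: string): Promise<number> {
  const { input_tokens } = await client.messages.countTokens({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return input_tokens;
}

const oldTokens = await countFor("claude-opus-4-6-20260316");
const newTokens = await countFor("claude-opus-4-7-20260416"); // hypothetical snapshot ID
const inflation = (newTokens / oldTokens - 1) * 100;

// Input price is unchanged at $5 per million tokens, so cost scales with the token count.
console.log(`4.6: ${oldTokens} tokens  4.7: ${newTokens} tokens  (+${inflation.toFixed(1)}%)`);
console.log(`4.7 input cost for this prompt: $${((newTokens / 1e6) * 5).toFixed(4)}`);
```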
What broke for early adopters
budget_tokens returns 400
Opus 4.6 let you control thinking depth with budget_tokens:
thinking: { type: "enabled", budget_tokens: 32000 }
Opus 4.7 removed this entirely. The only supported mode is adaptive thinking, where the model decides how deeply to think. If your code passes budget_tokens, you get a 400 Bad Request error. No deprecation notice. No warning in the previous release. The parameter just stopped working.
The migration is simple: change to thinking: { type: "enabled" } and remove the budget parameter. But nobody knew they needed to migrate until their production calls started failing.
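For reference, the before and after look like this. The 4.7 snapshot ID is illustrative, and the adaptive-thinking request shape is the one described above, so this assumes an SDK version that accepts `thinking: { type: "enabled" }` without a budget.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Opus 4.6: explicit thinking budget (now rejected with a 400 on 4.7).
const legacy = await client.messages.create({
  model: "claude-opus-4-6-20260316",
  max_tokens: 64000,
  thinking: { type: "enabled", budget_tokens: 32000 },
  messages: [{ role: "user", content: "Refactor this module to remove the cyclic import." }],
});

// Opus 4.7: adaptive thinking only; drop budget_tokens entirely.
const adaptive = await client.messages.create({
  model: "claude-opus-4-7-20260416", // hypothetical snapshot ID
  max_tokens: 64000,
  thinking: { type: "enabled" },
  messages: [{ role: "user", content: "Refactor this module to remove the cyclic import." }],
});

console.log(legacy.usage, adaptive.usage);
```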
Thinking tokens hidden by default
In Opus 4.6, thinking tokens were visible in API responses. In 4.7, they're hidden by default under adaptive thinking. If you had monitoring, logging, or cost-tracking that depended on seeing the thinking token count, it broke silently. You can still access them, but the default behavior changed.
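If your cost tracking keyed on that count, log what the response actually contains instead of assuming the 4.6 default. A minimal sketch; whether and how thinking blocks surface under adaptive thinking should be verified against the current API reference.

```typescript
import type Anthropic from "@anthropic-ai/sdk";

// Defensive logging for responses from Opus 4.7 with adaptive thinking.
function logThinkingUsage(response: Anthropic.Message): void {
  const thinkingBlocks = response.content.filter((block) => block.type === "thinking");

  if (thinkingBlocks.length === 0) {
    // Under adaptive thinking the blocks may simply be absent, so dashboards
    // keyed on them read zero instead of throwing.
    console.warn("No thinking blocks in response; falling back to usage totals.");
  }

  // On current Claude models usage.output_tokens is the billed output total,
  // thinking included; confirm that still holds on 4.7 before trusting it.
  console.log({
    output_tokens: response.usage.output_tokens,
    thinking_blocks: thinkingBlocks.length,
  });
}
```

Call it on the `adaptive` response from the migration sketch above to see what your monitoring will actually receive.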
The 48-hour firestorm
Within two days of release:
- A Reddit post titled "Opus 4.7 is not an upgrade but a serious regression" hit 2,300 upvotes
- A GitHub issue on the claude-code repository documented "worse quality at higher token cost vs 4.6 for production coding workloads"
- Multiple developer blogs published "don't upgrade yet" warnings
- Developers who'd let their CLI auto-update found themselves debugging unexpected 400 errors in production
The developers who pinned to claude-opus-4-6-20260316 (or whichever snapshot they'd tested) were unaffected. Their Monday morning was normal.
The actual upgrade decision
Here's the honest scorecard:
| Dimension | 4.6 → 4.7 | Verdict |
|---|---|---|
| Coding (SWE-bench Verified) | 80.8% → 87.6% | Upgrade |
| Coding (SWE-bench Pro) | 53.4% → 64.3% | Upgrade |
| Tool use (MCP Atlas) | Lower → 77.3% | Upgrade |
| Vision acuity | 54.5% → 98.5% | Upgrade |
| Chatbot Arena Elo | Lower → 1504 | Marginal |
| Long-context retrieval (MRCR 1M) | 78.3% → 32.2% | Do not upgrade |
| Long-context retrieval (MRCR 256K) | 91.9% → 59.2% | Do not upgrade |
| Agentic search (BrowseComp) | Down 4.4 points from 4.6 | Hold |
| Token cost (same text) | Baseline → +32-34% | Cost increase |
| budget_tokens API | Works → 400 error | Breaking change |
If your work is primarily short-to-medium context coding with tool calls, 4.7 is a clear improvement. If your work depends on long-context retrieval, document analysis, or research-heavy agentic search, stay on 4.6 until Anthropic addresses the MRCR regression or you've tested on your own data and confirmed it doesn't affect your specific patterns.
If you're cost-sensitive on high-volume workflows, the tokenizer change alone may be reason to wait.
How to not be an early adopter casualty
The developers who got burned let the model update reach production without testing. The ones who were fine had pinned their version and had a validation routine ready.
The full pre-upgrade validation checklist covers ten steps including independent benchmark checking, community reaction monitoring, and API compatibility testing. The short version:
- Pin your current working model version (see the pinning sketch after this list)
- Check SWE-bench, Chatbot Arena, Aider's leaderboard, and Artificial Analysis before upgrading
- Run your own regression test prompt
- Compare token counts between old and new
- Read the migration guide and the community reaction
- Upgrade deliberately, not automatically
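For the first item on that list, pinning means a dated snapshot ID, never an alias the provider can repoint underneath you. A minimal guard, using the 4.6 snapshot mentioned earlier; the trailing-date regex is just one way to enforce it.

```typescript
// Pin to a dated snapshot you have actually validated, not a moving alias
// (e.g. a "-latest" tag) that an auto-update can silently repoint.
export const PINNED_MODEL = "claude-opus-4-6-20260316";

// Fail fast at startup if someone swaps in an undated model name.
const DATED_SNAPSHOT = /-\d{8}$/;
if (!DATED_SNAPSHOT.test(PINNED_MODEL)) {
  throw new Error(`Model "${PINNED_MODEL}" is not pinned to a dated snapshot`);
}
```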
If you're running a business on these tools, the chapter on AI tool cost management in The $97 Launch covers how to build a stack that doesn't break when a single provider ships a bad update. Search "The $97 Launch" on Amazon Kindle.
Related reading
- How to validate an AI coding model before you trust it — the full 10-step checklist with benchmark sites
- Every time an AI model update broke something — the complete timeline from GPT-4's prime-number collapse to Opus 4.7
- A Markdown file is the best memory layer for your AI coding tool — project context that survives model swaps
- Top AI CLIs and how to use them with our generators — keeping a second CLI warm as a fallback
- Two CLIs, one workflow: Codex alongside Claude Code — the practical routine for running both daily
Fact-check notes and sources
- SWE-bench Verified scores (Opus 4.7: 87.6%, GPT-5.3-Codex: 85.0%, Gemini 3.1 Pro: 80.6%): swebench.com and tokenmix.ai/blog/swe-bench-2026-claude-opus-4-7-wins.
- SWE-bench Pro (Opus 4.7: 64.3%, GPT-5.4: 57.7%, Gemini 3.1 Pro: 54.2%): Anthropic-reported, April 2026. platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7.
- MCP Atlas (77.3%) and Chatbot Arena Elo (1504): llm-stats.com/blog/research/claude-opus-4-7-launch and lmarena.ai.
- Vision acuity (54.5% to 98.5%): Early-access partner testing reported via mindstudio.ai/blog/claude-opus-4-7-vs-4-6-comparison.
- MRCR v2 regression (78.3% to 32.2% at 1M tokens, 91.9% to 59.2% at 256K): blog.wentuo.ai/en/claude-opus-4-7-long-context-regression-en.html and xlork.com/blog/claude-opus-4-7-backlash.
- BrowseComp regression (4.4 points): mindstudio.ai/blog/claude-opus-4-7-vs-4-6-comparison.
- Tokenizer cost increase (32-34%): openrouter.ai/announcements/opus-47-tokenizer-analysis and Hacker News thread news.ycombinator.com/item?id=47816960.
- budget_tokens removal and adaptive thinking migration: platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7 and dev.to.
- Reddit reaction (2,300 upvotes) and GitHub issue: r/ClaudeAI, April 2026, and github.com/anthropics/claude-code/issues/51440.
- Opus 4.7 pricing ($5/$25 per million tokens, unchanged from 4.6): artificialanalysis.ai/articles/opus-4-7-everything-you-need-to-know.
This post is informational, not consulting or financial advice. Mentions of Anthropic, OpenAI, Google, AMD, and all benchmark organizations are nominative fair use. No affiliation is implied.