
GPT-5.5 (Spud) vs Claude Opus 4.7 — What Actually Changed At The CLI, And Why The Lead Isn't Durable


Part of the Claude Code + Codex workflow series. Start with the install primer, the post-install workflow, and the Codex vs Claude Code comparison. This post is the short-term update on what GPT-5.5 and Opus 4.7 actually mean at the terminal.

OpenAI shipped GPT-5.5 (internal codename "Spud") and Anthropic countered with Claude Opus 4.7 inside about a month. If you read the marketing, both are transformational. If you work at a terminal with either CLI open, the story is quieter — the model behaves a bit better in some specific places, the pricing got complicated, and the capacity problem got worse. Here's the honest read.

What "Spud" actually is

"Spud" is OpenAI's internal codename for GPT-5.5, confirmed via Axios reporting. It's a point release between GPT-5 and GPT-6 — closer in spirit to GPT-4 → GPT-4o than to a generation change. The headline claims are incremental reasoning gains, better agentic tool use, and lower cost per token on the frontier tier. Whether those claims hold up under your workload depends on the workload; more on that below.

Anthropic's response — Claude Opus 4.7 — landed within weeks. Opus 4.7's own release notes emphasize stronger long-context coherence (Claude's durable edge), improved agentic harness behavior, and better tool-use stability. Same structural move in the opposite direction: not a new generation, a tuning pass that closes gaps without rewriting the base.

Michael Parekh called the rivalry correctly in his Substack piece: "it isn't durable, it's a cadence story." Both companies are shipping on an IPO-preparation schedule. The pace is driven by bankers asking for momentum metrics, not by a fundamental research unlock. You can feel that at the terminal — each release feels like a delta, not a step.

At the terminal — the delta you can actually feel

A practical read after a week of real work on both CLIs since the releases:

GPT-5.5 via Codex CLI

  • Tool-use coherence on long agent chains is noticeably better. Ten-step plan, file reads between each step — holds the thread better than GPT-5 did.
  • Structured-output adherence improved. JSON-schema-constrained responses are more consistent across many turns.
  • Latency-to-first-token is roughly flat. Not faster; not slower.
  • o-series reasoning tier feels a tick sharper on ambiguous bugs. The kind where you paste a function + stack trace and ask "why is this failing?" — the answer is more often right on the first try.
  • Mini-tier cost dropped slightly for bulk work, which matters if you're running high-volume classification or tagging.
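
The structured-output point above is easy to measure yourself. Here's a minimal sketch of checking JSON-schema adherence across a batch of turns — the schema format, field names, and responses are all illustrative placeholders, not any vendor's API:

```python
# Minimal sketch: check that model output parses as JSON and carries the
# required fields with the expected types. SCHEMA and the sample
# responses are hypothetical, for illustration only.
import json

SCHEMA = {"label": str, "confidence": float}  # hypothetical required fields

def adheres(raw: str, schema: dict) -> bool:
    """True if raw parses as JSON and every required field is present
    with the expected type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(isinstance(obj.get(k), t) for k, t in schema.items())

# Tally adherence over a batch of (hypothetical) responses.
responses = [
    '{"label": "bug", "confidence": 0.91}',
    '{"label": "feature"}',                     # missing field -> fails
    'Sure! Here is the JSON you asked for...',  # prose preamble -> fails
]
rate = sum(adheres(r, SCHEMA) for r in responses) / len(responses)
print(f"adherence: {rate:.0%}")  # prints "adherence: 33%"
```

Run the same batch against both models before and after a release and you have a number instead of a vibe.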

Claude Opus 4.7 via Claude Code

  • Multi-hour sessions hold context meaningfully better. This is where Claude's lead has always been; 4.7 extends it.
  • Subagent review quality improved — the adversarial reviewer pattern produces more useful disagreement than before, less consensus-as-validation.
  • /insights reports get sharper — the friction detection catches more subtle patterns (e.g., approvals fired many times with specific regex shapes). Downstream-useful, not just cosmetic.
  • Rate-limit behavior under load is more graceful — retries back off more intelligently, with fewer hard 429s.
  • Opus token pricing stayed put this release. Anthropic did not follow OpenAI on the mini-tier price cut.
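
If you're wrapping either CLI's API in your own tooling, the retry shape described above — exponential backoff with jitter on 429s — is worth replicating. A sketch, with a stand-in transport rather than either vendor's SDK:

```python
# Exponential backoff with full jitter on rate-limit errors.
# RateLimited and the flaky transport are stand-ins, not a real SDK.
import random
import time

class RateLimited(Exception):
    """Stand-in for an HTTP 429 from the API."""

def call_with_backoff(send, max_retries=5, base=0.5, cap=30.0):
    """Retry send() on RateLimited, sleeping a random delay in
    [0, min(cap, base * 2**attempt)] between attempts (full jitter)."""
    for attempt in range(max_retries):
        try:
            return send()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Fake transport that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited
    return "ok"

print(call_with_backoff(flaky, base=0.05))  # prints "ok" on the third attempt
```

Full jitter (random delay up to the cap, rather than the exact doubled value) is the standard choice because it spreads retries from many clients instead of synchronizing them.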

If you ran a side-by-side task on both CLIs the week before these releases and re-run it today, you'll see the shifts above in most real work. You won't see them in every task; plenty of prompts produce the same output as last month.

Benchmarks moved; the order on your workload probably didn't

Parekh's point about the lead being narrow applies here. GPT-5.5 narrowly beats Opus 4.7 on several public benchmarks; Opus 4.7 narrowly beats GPT-5.5 on others. The winner flips depending on which eval set you run. Translation for daily work: the task-level trade-offs from the Codex vs Claude Code comparison still apply almost unchanged.

If Claude Code was your daily driver for long sessions and complex refactors, Opus 4.7 is still the better fit.

If Codex CLI was your daily driver for fast focused tasks and structured output, GPT-5.5 is still the better fit.

The one workload where I noticed the order flip: hard debugging of subtle logic errors. GPT-5.5's o-series tier pulled ahead enough that it's now my default for "here's the failing test + the function; figure out the real bug" before I even open Claude Code. A quarter ago, Opus was my default for that; now GPT-5.5 o-series goes first, and if it doesn't land, Claude Code is the follow-on with whole-codebase context.

The price story — the real change

The headline release-note gains aren't the biggest shift. The biggest shift is pricing and tiering, and Parekh nailed the shape of it before most people noticed.

OpenAI cut the GPT-5.5 mini-tier per-token rate, but the frontier tier stayed expensive and the rate limits got tighter. Anthropic held Opus pricing, but moved Claude Code's most advanced features behind higher plan tiers — the adversarial subagents pattern that was straightforward on Pro a quarter ago now wants a Team plan for full parallelism.

Parekh's frame — "cadence creates demand the companies can't fully serve, which forces velvet-rope gating users feel as a price shock" — is what's happening. The base tier experience is slightly better. The full-capability experience costs meaningfully more than it did six weeks ago. If your monthly Claude or OpenAI bill jumped in April, it's probably not your usage that changed.

The uncomfortable implication: the product strategy at both companies is now IPO-optimized. Momentum metrics matter more than any individual feature. Expect more frequent incremental releases, more frequent tier reshuffles, and more frequent "we've updated our API limits" emails. Less predictability is the cost of the release cadence you enjoy.

At your terminal — three adjustments worth making this week

1. Re-run your multi-model routing for two specific patterns.

GPT-5.5 mini is now cheap enough for a wider swath of bulk tasks that used to justify Claude Haiku. Commit-message generation, one-shot utility generation, short-answer Q&A that doesn't need context — worth re-evaluating for cost.
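
The re-evaluation is mostly arithmetic. A sketch of a routing table with a cost estimate — the model labels come from this post, but the per-million-token rates below are placeholders, not published pricing; substitute the numbers from the vendors' pricing pages:

```python
# Route task classes to model tiers and estimate monthly spend.
# RATES_PER_MTOK values are PLACEHOLDERS -- check current vendor pricing.
RATES_PER_MTOK = {
    "gpt-5.5-mini": 0.20,
    "claude-haiku": 0.30,
    "claude-opus":  15.00,
}

ROUTES = {
    "commit-message": "gpt-5.5-mini",  # bulk work, no long context needed
    "short-qa":       "gpt-5.5-mini",
    "long-refactor":  "claude-opus",   # long-context work stays on Opus
}

def monthly_cost(task_volumes_mtok: dict) -> float:
    """task_volumes_mtok maps task class -> millions of tokens/month."""
    return sum(
        RATES_PER_MTOK[ROUTES[task]] * mtok
        for task, mtok in task_volumes_mtok.items()
    )

estimate = monthly_cost({"commit-message": 2.0, "long-refactor": 0.5})
print(f"monthly estimate: ${estimate:.2f}")  # prints "monthly estimate: $7.90"
```

Re-running this with last quarter's routing table against this month's rates is the fastest way to see whether a bulk task should move tiers.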

2. Move hard debugging to GPT-5.5 o-series.

If you've been defaulting to Claude Opus for "why is this test failing" debugging, try GPT-5.5 o-series for 10 real cases. Note which wins. My data is anecdotal but the shift was clean enough that I've made it the new default.
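
If you want the tally to be more than a feeling, record each head-to-head outcome and let the count pick the default. The case results below are illustrative placeholders, not my actual ten cases:

```python
# Tally head-to-head debugging outcomes and pick a default model.
# The results list is illustrative, not real measurement data.
from collections import Counter

# For each real case, record which model found the actual bug first.
results = [
    "gpt-5.5-o", "gpt-5.5-o", "opus-4.7", "gpt-5.5-o", "tie",
    "gpt-5.5-o", "opus-4.7", "gpt-5.5-o", "gpt-5.5-o", "tie",
]
tally = Counter(results)
default = max(("gpt-5.5-o", "opus-4.7"), key=lambda m: tally[m])
print(tally.most_common(), "-> new default:", default)
```

Ten cases is a small sample, but a 6–2 split on your own bugs is a better basis for a default than any public benchmark table.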

3. Watch your monthly bill.

The tier changes are real. If you haven't opened your OpenAI + Anthropic billing dashboards in two weeks, open them now. You may be paying meaningfully more for the same workload.

What hasn't changed

The Claude Code workflow series — plan mode, CLAUDE.md, skills, hooks, /insights — still applies identically. Nothing in this release cycle invalidated any of it.

The Codex mini-series workflow — narrower scoped prompts, structured output, step-by-step for hard tasks — still applies identically.

The two-CLI workflow — left pane Claude Code for the main work, right pane Codex for quick side tasks — is still the right setup. GPT-5.5 makes the right pane a bit more useful; Opus 4.7 makes the left pane a bit more reliable. Net: you still want both open.

What I'd watch for next

GPT-6. OpenAI has historically followed a .5 release with a major version within 4–8 months. If the IPO pressure Parekh describes is real, expect that window to compress.

Claude Sonnet 4.7. Anthropic often stages tuning gains across the family — Opus first, Sonnet and Haiku following. The Sonnet 4.7 shift (when it lands) will matter more to cost-sensitive workflows than the Opus shift did.

Open-weights catch-up. Qwen and Gemma both got releases in Q1 2026 that narrowed the frontier gap on specific coding tasks. If proprietary pricing keeps ratcheting, self-hosted Qwen-Coder becomes more attractive for bulk work.

Tier stabilization. The rate-limit churn will eventually settle. "Eventually" is probably on the other side of either IPO announcement. Until then, assume the feature surface of your paid tier could shift mid-month.

The one-paragraph summary

Spud is real, the improvement is real, the lead isn't durable. Opus 4.7's tuning closed most of the gap within weeks. For daily CLI work, the advice from the Codex vs Claude Code comparison still holds — long sessions and complex refactors to Claude Code, fast focused tasks and o-series reasoning to Codex — with the one adjustment that hard debugging has tipped toward GPT-5.5 o-series. The real change this month is the pricing tightening behind the scenes. Watch your bills, not the benchmark tables.

Fact-check notes and sources

  • OpenAI: Introducing GPT-5.5 — the official announcement page. Spud is OpenAI's internal codename per Axios reporting.
  • Michael Parekh's analysis — OpenAI's ChatGPT 5.5 "Spud" vs Anthropic. Source of the "cadence story" framing, the AOL / iPhone-ramp analogy, and the "velvet-rope gating" pricing frame.
  • Anthropic: Claude Opus 4.7 release notes (check the blog for the dated entry). Includes the long-context + agentic-harness improvement claims this post summarizes.
  • Claude Code docs: Changelog — reflects harness + rate-limit behavior changes shipped alongside 4.7.
  • OpenAI platform: Models page — current pricing + rate-limit tiers for GPT-5.5 / GPT-5.5 mini / o-series.

Informational, not investment or procurement advice. Benchmarks referenced reflect the state of public evaluations at time of writing; specific eval-set results shift with each release cycle. Verify current pricing, rate limits, and feature-tier mapping against vendor docs before committing to a workload. Mentions of OpenAI, Anthropic, Claude Code, Codex, GPT, Michael Parekh, Axios, and linked publications are nominative fair use. No affiliation is implied.
