Companion to the Codex mini-series and the Claude Code workflow series. If you've never used either CLI, start with the install primer. This post assumes you've written a few prompts to each family and noticed they behave slightly differently, and you want a real framework for that.
Most "prompt engineering" advice is family-agnostic — clear instructions, examples, specific output format, etc. Those rules are true for both GPT and Claude. The interesting differences are the places where one family rewards a pattern the other is neutral to, or where a prompt that works well with one family needs a specific adjustment to land well with the other.
Below are four differences I've found repeatable across hundreds of prompts, plus a bonus fifth, with before-and-after examples where they apply.
Difference 1 — explicit step-by-step vs implicit reasoning
GPT responds measurably better when you explicitly ask for step-by-step reasoning on complex tasks. Claude tends to do this implicitly; adding the instruction rarely hurts but often doesn't help.
Before (works for Claude, leaves GPT producing shallower output):
Figure out why the checkout flow is double-charging customers and
propose a fix.
After (works for both, noticeably better on GPT):
Figure out why the checkout flow is double-charging customers.
Work through this step by step:
1. Read the checkout submission handler
2. Identify every place a charge is initiated
3. Check for retry logic, race conditions, or duplicate event handlers
4. Propose the minimal fix
Show your reasoning for each step before giving the final fix.
The second prompt is more verbose but extracts meaningfully more careful reasoning from GPT. Claude is less sensitive to the scaffolding but doesn't penalize you for including it.
When it matters most: multi-step logical reasoning, debugging, architectural analysis. When it matters least: straight code generation, explanation tasks, simple lookups.
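If you template prompts in code, the scaffolding above is easy to factor out. A minimal sketch; the function name and structure are my own illustration, not part of any SDK:

```python
# Illustrative helper: wrap a task in explicit step-by-step scaffolding,
# the pattern that GPT rewards most on complex tasks.

def scaffold_steps(task: str, steps: list[str]) -> str:
    """Build a prompt that asks for explicit step-by-step reasoning."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (
        f"{task}\n\n"
        "Work through this step by step:\n"
        f"{numbered}\n\n"
        "Show your reasoning for each step before giving the final answer."
    )

prompt = scaffold_steps(
    "Figure out why the checkout flow is double-charging customers.",
    [
        "Read the checkout submission handler",
        "Identify every place a charge is initiated",
        "Check for retry logic, race conditions, or duplicate event handlers",
        "Propose the minimal fix",
    ],
)
```

Sending the same `prompt` to both families is safe: Claude ignores scaffolding it doesn't need, while GPT uses it.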
Difference 2 — examples outperform rules for GPT; Claude handles abstract rules more gracefully
GPT is a stronger few-shot learner than Claude; Claude is stronger at following abstract style rules stated once.
Stylistic rule (works for Claude, underperforms on GPT):
Generate commit messages in the conventional-commits format.
Use imperative mood. Start with a type like fix:, feat:, chore:.
Keep the subject under 72 chars. Body optional.
Claude absorbs this rule and produces consistent conventional-commits messages across many calls. GPT tries to follow it but drifts — you'll see fix: become Fix: by the fifth call, or the subject lengthen past 72 chars.
Few-shot examples (works for both, noticeably better on GPT):
Generate commit messages matching the style of these examples:
Example 1:
feat: add rate limiting to /api/auth endpoints
Example 2:
fix: prevent double-charge on payment retry (closes #412)
Example 3:
chore: bump typescript to 5.4.0
Now generate 5 commit messages for the following changes:
[...]
GPT is markedly more consistent with examples; Claude is fine with either approach.
Practical translation: when you have a stylistic pattern you want enforced, use three examples for GPT. One is not enough; five is usually overkill. For Claude, a clearly stated stylistic rule can work when three good examples aren't readily available.
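A few-shot prompt like the one above is mechanical enough to generate from a list of examples. A sketch under my own naming assumptions, not any SDK's API:

```python
# Illustrative helper: assemble a few-shot prompt from example outputs,
# the pattern that keeps GPT stylistically consistent across calls.

def few_shot_prompt(instruction: str, examples: list[str], request: str) -> str:
    """Join an instruction, numbered examples, and the actual request."""
    blocks = "\n".join(f"Example {i}:\n{ex}" for i, ex in enumerate(examples, 1))
    return f"{instruction}\n\n{blocks}\n\n{request}"

prompt = few_shot_prompt(
    "Generate commit messages matching the style of these examples:",
    [
        "feat: add rate limiting to /api/auth endpoints",
        "fix: prevent double-charge on payment retry (closes #412)",
        "chore: bump typescript to 5.4.0",
    ],
    "Now generate 5 commit messages for the following changes:\n[...]",
)
```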
Difference 3 — structured output declarations
GPT stays inside a declared output schema across many turns noticeably better than Claude, especially when the schema is specified as JSON schema or Pydantic model. Claude is improving here but was historically weaker.
Weak schema declaration (Claude handles, GPT drifts):
Return the analysis as a list of findings with fields for file, issue, and severity.
Strong schema declaration (works for both, critical for GPT):
Return your analysis as a JSON array with this exact shape:
[
{
"file": "string — relative path from repo root",
"issue": "string — one-sentence description of the problem",
"severity": "critical" | "warning" | "info",
"line": "number — 1-indexed line number in the file"
}
]
Return only the JSON array. No preamble, no trailing explanation.
GPT-family models have been explicitly tuned for structured output (via the response_format API parameter or similar). Taking advantage of that with an explicit schema produces consistent parseable output. Claude benefits too but is more forgiving of under-specified schemas.
Practical translation: if your downstream consumer is a script that parses the output, spend the extra 30 seconds specifying the schema clearly. GPT rewards you more for it, but Claude doesn't penalize you.
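If a script is the consumer, it's worth validating the output against the schema you declared so drift fails loudly instead of silently corrupting downstream data. A minimal stdlib validator for the findings shape above; the helper itself is hypothetical:

```python
import json

# Illustrative validator for the findings schema declared in the prompt.
# Field names mirror the declared shape; the function is my own sketch.

ALLOWED_SEVERITIES = {"critical", "warning", "info"}

def validate_findings(raw: str) -> list[dict]:
    """Parse model output and fail fast if it drifts from the schema."""
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    for f in findings:
        missing = {"file", "issue", "severity", "line"} - f.keys()
        if missing:
            raise ValueError(f"finding missing fields: {missing}")
        if f["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"bad severity: {f['severity']!r}")
        if not isinstance(f["line"], int) or f["line"] < 1:
            raise ValueError("line must be a 1-indexed integer")
    return findings

sample = (
    '[{"file": "src/checkout.ts", '
    '"issue": "charge retried without idempotency key", '
    '"severity": "critical", "line": 88}]'
)
findings = validate_findings(sample)
```

Either family can feed this; the point is that the validator and the prompt declare the same shape, so any drift surfaces as an exception.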
Difference 4 — long-context coherence
Claude holds coherence across very long contexts (50K+ tokens, long multi-turn sessions) more reliably than GPT. GPT degrades noticeably faster on long conversation threads.
You don't see this in one-shot prompts; it shows up when you're 40 turns deep into a session where the model needs to remember a decision made 20 turns ago.
If you're running long Claude Code sessions: the coherence is real and buys you real productivity.
If you're running Codex: break long tasks into shorter sessions, explicitly re-ground at the start of each new session, and don't rely on implicit memory across 30+ turns.
There isn't a "before and after" prompt example for this one — it's a structural constraint, not a prompt-style one. Match your session length to what the model can coherently hold.
Difference 5 (bonus) — hedging and claim confidence
GPT defaults more toward confident-sounding output; Claude defaults more toward hedged output. Neither is universally better; they're different failure modes.
When you want decisive output — a single recommendation, a clear answer to a judgment call — GPT is often closer to your preferred output out of the box. Prompting with "be decisive, commit to one answer, don't hedge" helps both families.
When you want careful output — admitting uncertainty, flagging caveats, noting when you don't know — Claude is often closer out of the box. Prompting with "note any uncertainty explicitly" nudges both.
Know what you want. Either model can produce either output style with the right prompt; the defaults just differ.
Translation examples — take a prompt that works well, adapt it
Starting from a prompt that works well for Claude:
Review this pull request for security issues. Flag anything that looks
like an injection vulnerability, auth bypass, or secret leakage.
Ignore style and performance.
To make this work equally well for GPT:
Review this pull request for security issues. Work through it step by step:
1. Scan for SQL injection, command injection, XSS vectors
2. Scan for auth bypass, authorization-check gaps, session-handling issues
3. Scan for hardcoded secrets, API keys, credentials in config files
4. Ignore style issues and performance issues
Return findings as a JSON array:
[
{ "type": "injection|auth|secret", "file": "...", "line": 42, "severity": "critical|warning", "description": "..." }
]
Return only JSON. No preamble.
Same review, structured for GPT's strengths: explicit steps, structured output schema. Claude would handle either prompt; GPT handles the second one noticeably better.
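Even with "Return only JSON. No preamble.", models from either family occasionally wrap the array in a markdown fence, so the consuming script should tolerate that. A defensive extractor, offered as a hypothetical convenience rather than part of either CLI:

```python
import json

# Tolerant extraction: strip an optional markdown fence, then parse
# the outermost JSON array from the model's reply.

def extract_json_array(text: str) -> list:
    stripped = text.strip()
    if stripped.startswith("```"):
        lines = stripped.splitlines()[1:]  # drop opening fence line
        if lines and lines[-1].strip().startswith("```"):
            lines = lines[:-1]  # drop closing fence line
        stripped = "\n".join(lines)
    start = stripped.index("[")
    end = stripped.rindex("]") + 1
    return json.loads(stripped[start:end])

reply = (
    "```json\n"
    '[{"type": "secret", "file": "config.yml", "line": 3, '
    '"severity": "critical", "description": "hardcoded API key"}]\n'
    "```"
)
findings = extract_json_array(reply)
```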
Starting from a prompt that works well for GPT:
Given these three examples of good commit messages, generate 5 more
in the same style for the following diff: [...]
To make this work equally well for Claude:
Generate 5 commit messages for the following diff, following the
conventional-commits specification: imperative mood, lowercase first
word, type prefix (feat/fix/chore/docs), subject under 72 chars.
Diff: [...]
Same output target. Claude takes the abstract spec; GPT takes the examples. Both produce similar quality with the right framing.
The meta-skill
The meta-skill is reading which family you're currently prompting and adjusting. In my experience, developers running both tools adopt unconscious prompt-switching within a few weeks: you start writing "step by step" when you reach for Codex and implicit prose when you reach for Claude.
A useful explicit exercise for the first month: notice when a prompt lands well or lands weakly, then ask yourself, "if I were using the other tool, which of these four differences would I tune for?" After doing this three or four times, the switching becomes automatic.
What doesn't differ
Plenty of prompt-engineering advice is family-agnostic. These work equally well for both GPT and Claude:
- Clear, specific instructions
- Specifying the persona or role explicitly
- Providing context about the task, audience, and constraints
- Asking the model to check its own work or cite sources
- Breaking up complex tasks into smaller prompts
- Saying what you want, not what you don't want
None of the above changes per family. The four core differences (plus the bonus fifth) in this post are the delta on top of the shared baseline.
Related reading
- AI Terminal Kickstart — install both families.
- Codex CLI after install — Codex-specific first-week habits.
- Codex vs Claude Code — task-by-task comparison.
- Two-CLI workflow — running both CLIs concurrently.
- Multi-model routing — the broader "break out of one model" framing.
Fact-check notes and sources
- OpenAI: Prompt engineering guide, Structured outputs.
- Anthropic: Claude prompting best practices.
- Anthropic: Effective context engineering for AI agents.
- Walturn: Mastering prompt engineering for Claude — Claude-specific practical notes.
- Arize: CLAUDE.md best practices from prompt learning — empirical study on prompt effectiveness.
Informational, not engineering consulting advice. Prompt behavior differs across model versions within each family, and the deltas described here reflect general patterns as of Q2 2026. Test specific prompts on the exact model versions you're using before depending on behavioral claims. Mentions of OpenAI, Anthropic, GPT, Claude, and linked publications are nominative fair use. No affiliation is implied.