Companion to the Claude Code series. Start with the AI Terminal Kickstart if you haven't installed yet. If you're running both Codex and Claude Code, two-CLI workflow covers the split-screen routine. This post is about the Codex-specific day-to-day once you've typed your first prompt and are wondering what to do next.
The Codex CLI and Claude Code look similar from 20 feet away. Both accept a prompt, both read and write files, both run shell commands with your approval, and both can be asked to plan before editing. The differences show up in week one — model-selection nuance, prompt styles the GPT family responds to better, rate-limit and cost mechanics that diverge from Anthropic's, and integration patterns with the rest of the OpenAI product family.
If you've read the Claude Code follow-on, this post covers the analogous territory for Codex. Different tool, different muscle memory.
First session — the shape that works
Codex sessions tend to be shorter than Claude Code sessions by default. A Claude Code session might run for two hours on a single task with many back-and-forths; a Codex session is more often 10–30 minutes for a discrete task. Neither is wrong — it's a cultural difference in how the two tools evolved.
If you're coming from Claude Code, the adaptation to notice is:
- Ask narrower questions. A Claude Code prompt might be "refactor the checkout flow to extract a billing helper." The equivalent Codex-friendly prompt is "here's the billing helper I want extracted; pull `calculateTotals`, `applyDiscount`, and `formatCurrency` out of `checkout.ts` and put them in `billing.ts`, then update the three callers." Same outcome, more specification up front.
- Feed the files explicitly. Claude Code reads the repo autonomously and builds context as it goes. Codex works best when you hand it the files you want it to focus on. Paste contents or reference paths directly; don't assume it'll discover things.
- Iterate in short loops. Prompt → response → apply → verify → next prompt. Each turn fits cleanly in the session; you don't need to keep a long thread alive.
Neither pattern is better universally. Claude Code rewards long coherent sessions; Codex rewards tight scoped ones. Match your interaction style to the tool you're using.
Model selection within the OpenAI ecosystem
Codex doesn't put all OpenAI models behind one brand. Depending on how you've configured your auth, you may have access to multiple:
- GPT-5 / GPT-4o-class frontier models — the strongest general reasoning. Expensive per-token; slower per turn. Use for architecture decisions, tricky refactors, anything where you'd otherwise call a senior engineer.
- GPT-4o-mini / smaller-tier models — faster and cheaper. Noticeably less reasoning depth, but easily capable of "rename this variable across 20 files" or "write a regex for RFC 5322 email validation." Use for volume and mechanical work.
- o-series models (o1, o3, o-reasoning) — tuned for structured, multi-step reasoning. Different latency profile: they spend more time thinking before the first token. Worth reaching for when the task is "figure out why this test is failing" or "design the data model for this new feature."
Practical default: frontier model for the first-pass plan; smaller tier for execution; o-series when you're stuck on a hard bug or a design question. Most daily work lives in the first two; the o-series is the specialty tool.
If your CLI supports a model override flag or env var, get comfortable with it. Toggling per task saves meaningful money over a month.
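The routing default above can be sketched as a small helper. Note the env var name `CODEX_MODEL` and the model IDs here are placeholders, not official names — check your CLI's documentation for the real override flag or variable.

```python
import os

DEFAULT_MODEL = "gpt-4o-mini"  # cheap tier for mechanical work

def pick_model(task_kind: str) -> str:
    """Resolve the model for an invocation: an explicit override wins,
    otherwise route by task type per the defaults described above."""
    override = os.environ.get("CODEX_MODEL")  # assumed env var name, not official
    if override:
        return override
    routing = {
        "plan": "gpt-5",   # frontier: first-pass plans, architecture calls
        "debug": "o3",     # o-series: hard bugs, design questions
        "mechanical": DEFAULT_MODEL,
    }
    return routing.get(task_kind, DEFAULT_MODEL)
```

Wire whatever the real override mechanism is into the same shape: explicit choice beats routing, routing beats the default.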
Prompt shape — what GPT families respond to
GPT and Claude are both capable, but they have different sweet spots in prompt style. Four patterns that work well for Codex specifically:
1. Structured requests with explicit output format. GPT models respond well to "return the result as a JSON array with fields `file`, `change`, `rationale`." They're more likely than Claude to stay inside a declared schema across many turns. If your workflow involves producing structured data, say so up front.
2. Examples over rules. Showing "here are three good diffs and one bad diff; produce more in the style of the good ones" often outperforms "produce clean diffs." GPT models are strong few-shot learners and weaker at absorbing stylistic rules stated abstractly.
3. Step-by-step when the task is complex. "Think through this step by step" is not a magic phrase, but the underlying instruction is correct — for multi-step reasoning, GPT models benefit from explicit decomposition more than Claude does. Don't rely on the model to decompose silently.
4. Short turns over long monologues. GPT models occasionally lose coherence in very long responses. If you want 2,000 words of output, split it into three prompts of 600–800 words each and assemble the pieces. Claude can often hold coherence longer.
The prompt-style-gpt-vs-claude post in this mini-series goes deeper on these patterns with worked examples.
OpenAI-specific cost and rate-limit mechanics
Three things to know in your first week:
Rate limits are per-minute and per-day, per model tier. Frontier models have tighter limits than cheaper tiers. Hitting the limit produces a 429 error, not a graceful degradation. Mitigations: retry with exponential backoff, switch to a lower tier for the bursty part of the workload, or move to a higher tier of paid plan.
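The retry-with-backoff mitigation looks roughly like this — a minimal sketch, assuming your client library surfaces 429s as an exception (the `RateLimitError` class here is a stand-in, not a specific library's type):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever exception your client raises on a 429."""

def call_with_backoff(request_fn, max_retries: int = 5, base: float = 1.0):
    """Retry request_fn on 429s with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            # Waits base*1, base*2, base*4, ... plus jitter so parallel
            # workers don't retry in lockstep.
            time.sleep(base * 2 ** attempt + random.random() * base)
```

The jitter matters more than it looks: without it, a burst of parallel tasks that all hit the limit at once will all retry at once and hit it again.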
Token pricing is asymmetric. Output tokens cost more than input tokens (typically 2–3× more). Prompts that include large context dumps + short responses cost less than prompts that are short but produce long responses. Knowing this changes how you structure prompts — "summarize this 10,000-line file in 100 words" is cheap; "generate a 5,000-line refactor of this 100-line file" is expensive.
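The asymmetry is easy to see with arithmetic. The per-million-token prices below are illustrative placeholders, not OpenAI's actual rates — check the current pricing page:

```python
# Illustrative prices per 1M tokens; placeholders, not real rates.
INPUT_PER_M = 2.50
OUTPUT_PER_M = 7.50  # ~3x input, per the asymmetry described above

def prompt_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the illustrative rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# "Summarize a huge file in 100 words": heavy input, tiny output.
summarize = prompt_cost(input_tokens=120_000, output_tokens=150)
# "Generate a 5,000-line refactor": tiny input, heavy output.
generate = prompt_cost(input_tokens=2_000, output_tokens=60_000)
```

At these rates the 120,000-token summarization request costs less than the 60,000-token generation request — half the total tokens, but on the expensive side of the ledger.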
Prompt caching reduces cost for repeated context. OpenAI's prompt caching (where available for your model tier) lets you pay the input cost once for a stable prefix and reuse it across many turns. Works best with CLAUDE.md-style system prompts that stay constant — set up your session-wide context once, then burn through many short exchanges without re-paying input cost.
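Structurally, the way to benefit from prefix caching is to keep the stable context first and byte-identical across turns. A sketch of that message layout (the exact cache mechanics and minimum prefix lengths vary by model tier — defer to OpenAI's caching docs):

```python
# Stable, expensive context goes first and never changes between turns.
STABLE_PREFIX = [
    {"role": "system",
     "content": "You are a code assistant for this repo. <long conventions here>"},
]

def build_messages(turn_prompt: str, history: list) -> list:
    """Assemble a request so the cacheable prefix stays untouched.
    Any change to the leading tokens invalidates the cached portion,
    so per-turn content is only ever appended after the prefix."""
    return STABLE_PREFIX + history + [{"role": "user", "content": turn_prompt}]
```

The common mistake is injecting something per-turn (a timestamp, a session ID) into the system prompt — that single changing token at the top defeats the cache for the entire prefix.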
Check your OpenAI billing dashboard weekly for the first month. The cost surprises you get in month one calibrate your intuition for month two.
Codex-specific quirks worth knowing in week one
Session state is not persistent by default. Close the CLI, lose the in-memory conversation. If you need cross-session memory, explicitly commit it to a file (/.openai/context.md or similar) and reference it at the start of the next session. Different from Claude Code's Automatic Memory.
File editing tends toward full-rewrite rather than diff-patch. When Codex edits a file, the default model behavior is often to regenerate the whole thing. This works but is slow for large files and risks trivial formatting churn in the diff. Ask explicitly for "minimal diff" or "patch, don't rewrite" when that matters.
Tool-use negotiation is chattier than Claude Code. Codex often narrates its plan ("I'll first read the file, then search for the function, then modify it") before executing. Useful when you want to follow along; annoying when you just want the result. Prompt with "execute without narrating each step" when you prefer speed.
Git integration varies by install path. The official OpenAI CLI, codex via community packages, and various IDE-integrated Codex modes each handle git differently. Some commit automatically; some don't. Find out which behavior yours has before running anything destructive.
Cross-links into the multi-CLI view
If you're running Codex alongside Claude Code (a common pattern), the two-CLI workflow post covers the split-screen routine. The tl;dr: use Codex for short focused tasks and quick questions; use Claude Code for long-running refactors and complex planning. Neither tool is the right answer for every session.
For the honest comparison between the two, Codex vs Claude Code — task-level walks through the task types each one handles better. For where Codex sits relative to ChatGPT CLI and GitHub Copilot CLI (a common point of confusion), three products that sound similar separates them.
Common early mistakes
- Expecting Claude Code's long-session coherence. Codex sessions degrade more quickly at turn 40+. Break up long sessions into shorter focused ones.
- Treating all OpenAI models as equivalent. Frontier for reasoning, mini-tier for mechanical work, o-series for hard bugs. The model choice matters more than in some other ecosystems.
- Not caching stable context. If you're running 20 prompts in a session with the same 3,000-token preamble, enabling prompt caching cuts your input cost by 10× or more. Off by default in some configurations.
- Skipping the billing dashboard check. Codex costs can surprise you. Check weekly for the first month until your intuition calibrates.
What to try this week
- Set up aliases for your two most-used models (frontier and mini-tier) so switching is one keystroke.
- Write your first Codex-tailored prompt: structured output format, 2–3 examples of good responses, explicit step-by-step instruction.
- If you have a recurring task (commit messages, PR descriptions, test-case generation), try it on both the frontier and mini-tier. Measure quality side by side. Commit to the cheaper one if it's good enough.
- Open the OpenAI billing dashboard. Understand where your tokens are going.
Related reading
- AI Terminal Kickstart — install primer (Codex + Claude Code + Copilot CLI + Netlify CLI in one pass).
- CLI installed — now what? (Claude Code version) — the Claude-equivalent of this post. Use as a compare-and-contrast reference.
- Codex vs Claude Code comparison — task-by-task honest assessment of when each wins.
- Prompt style — GPT vs Claude — the nuances of writing prompts that each family responds to well.
- Two-CLI workflow — split-screen pattern for running both simultaneously.
- Multi-model routing — when to reach across the provider boundary deliberately.
Fact-check notes and sources
- OpenAI Platform: API reference and model documentation — the canonical source for model capabilities, pricing, and rate limits.
- OpenAI: Prompt engineering guide — structured-output and few-shot patterns.
- OpenAI: Prompt caching documentation — specific to how OpenAI implements caching (differs from Anthropic's approach).
- Anthropic: Effective context engineering for AI agents — useful for compare-and-contrast with the OpenAI guide even though it's Claude-focused.
- Augment Code: Google Antigravity vs Claude Code — adjacent comparison context that helps frame the OpenAI alternative.
Informational, not engineering consulting advice. The Codex CLI surface, model names, and pricing reflect the state of the OpenAI product family at the time of writing (Q2 2026). Specific command-line flags, model IDs, and CLI behaviors can change between versions; verify against the current OpenAI CLI documentation before depending on any specific command. Mentions of OpenAI, GPT, Codex, ChatGPT, Claude, Anthropic, and linked publications are nominative fair use. No affiliation is implied.