GPT-5.3 Codex dropped on February 2nd. I had been running Claude models as my daily driver for almost a year, and I figured if I was going to have an opinion about the alternatives, I should actually use them. So I ran a proper test: Codex only, for just over 20 days, on real projects, using OpenCode as the agentic CLI. No Claude fallback, no hedging.
The experiment is over. I am back on Claude. This is what happened.
Fair is fair: what Codex does well
I am not going to pretend the experiment was all bad, because it wasn't. Codex writes clean TypeScript. Genuinely clean - tighter than what Claude produces by default, more idiomatic, better variable names. If you handed me a diff from both tools without labels, I could usually identify the Codex output as the neater one.
Context retention was also not a problem. Multi-file tasks, long sessions, picking up where we left off - Codex handled all of it fine. That was one of my concerns going in, and it did not materialize.
So the baseline is: capable model, solid code quality, nothing technically broken about it. That makes the frustrations more interesting, because they were not about the code at all.
The CSV thing
The first thing that made me genuinely angry was file editing. I had a CSV that needed a few columns updated. Simple task. Here is what I asked for, and here is what I got:
Me: Update the email column in users.csv where status = "inactive"
Codex:
1. Writes a Python script to parse and transform the file
2. Runs the script
3. Script fails on a malformed row
4. Rewrites the script to handle it
5. Runs again
6. Works

This was consistent behavior. Not once, not twice - every time I needed to touch a structured file, Codex reached for a script. I started explicitly saying "edit the file directly, do not write a script." It would acknowledge this, and then write a script anyway.
Claude edits files directly. It has the tools to do it. Using a Python script to change three rows in a CSV is like driving to the kitchen to get a glass of water. Technically it works. The overhead is the problem.
Context: I was helping my partner migrate data for her business that month. Orders, WooCommerce imports and exports, ERP exports - a lot of structured files, many of them heavy. I want to be fair: on the large, complex transformations, a Python script is the right call. More auditable, easier to rerun, handles edge cases cleanly. I know that. But Codex reached for a script on everything. The three-row update. The column rename. The filter that touched maybe fifteen lines. Multiply that pattern by twenty sessions and you understand where a significant portion of my frustration came from.
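For scale, the three-row class of task needs nothing heavier than the standard library. A minimal sketch of the direct approach, using the column names from the example above - the replacement address is a placeholder, not anything from my actual data:

```python
import csv
import io


def redact_inactive_emails(csv_text: str, replacement: str = "redacted@example.com") -> str:
    """Rewrite the email column for rows where status is 'inactive'.

    Column names (email, status) match the example task; the
    replacement value is a hypothetical placeholder.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["status"] == "inactive":
            row["email"] = replacement

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

That is the whole "script", and for a three-row change even this is more ceremony than a direct file edit. The point stands: the tool should scale its approach to the size of the task.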
The blueprint problem
I work with blueprints. Before asking an AI to implement something non-trivial, I sketch out the structure - function signatures, data shapes, the general flow. It forces me to think before the AI types, and it gives the tool a clear contract to follow.
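To make "blueprint" concrete: the artifact is usually a handful of stubs like this - data shapes and function signatures with empty bodies. Every name below is hypothetical, invented for illustration:

```python
from dataclasses import dataclass


# A blueprint fixes the data shape up front so the AI has a contract
# to implement against. All names here are hypothetical examples.
@dataclass
class OrderRow:
    sku: str
    quantity: int
    unit_price_cents: int


def parse_export(path: str) -> list[OrderRow]:
    """Read an export file and normalize each row into OrderRow."""
    ...


def total_cents(rows: list[OrderRow]) -> int:
    """Sum quantity * unit_price_cents across all rows."""
    ...
```

The bodies stay empty on purpose. The signatures and shapes are the contract; the implementation is what I am delegating.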
With Claude, this works well. The blueprint is a constraint, and Claude treats it like one. If it disagrees with something in the plan, it says so before touching the code.
Codex would start following the blueprint, then gradually drift from it. Not dramatically - not like it rewrote everything. More like it would silently make a different architectural choice three functions in, without flagging it. I would get to the end of an implementation and notice that the shape of the data structure was different from what I had specified.
When I asked why, the explanation was reasonable. Sometimes even correct. But I did not ask for the decision to be made. I asked for the decision to be surfaced so I could make it.
The yes-man problem
This is the core of it.
Claude, when I propose something questionable, will often say something like: "You want to do X, but I think Y would be better because Z. Want me to proceed with X or should we look at Y?" That friction is valuable. It has caught real mistakes. It has made me slow down when I was about to overbuild something at 11pm.
Codex does not do this. Or rather, it does it rarely and only when the problem is obvious. For anything in the gray area, the default response is agreement followed by implementation. "Sounds good, here's the code."
I want to be precise here: this is not about wanting an AI that lectures me. I do not need that. But there is a meaningful difference between a tool that executes instructions and a tool that thinks with you. Codex is very good at executing instructions. When the instructions are slightly wrong, it executes the slightly wrong instructions cleanly and quickly.
I noticed this most during architecture discussions. With Claude, those conversations feel like genuine back-and-forth - it challenges assumptions, points out edge cases, pushes back on over-engineering. With Codex, I felt like I was talking to someone who had decided their job was to agree and then build.
If you have ever had a direct report who implemented exactly what you said rather than what you meant, you know the feeling.
Cleaner code, more broken builds
Here is the thing about that cleaner code: it compiled wrong more often.
Codex would produce TypeScript that looked elegant but failed strict checks. Or had subtle type mismatches that only showed up at runtime. Or missed an import. The code was aesthetically better but required more iterations to reach a working state.
My rough scoring across the 20 days:
First-pass correctness: Claude, slight edge
Repair loops to green CI: Claude, clear win
Manual steering needed: Claude, clear win
Final diff readability: Codex, slight edge

If you only ever look at the final output, Codex looks comparable or better. If you measure the full path from "implement this" to "it works in production", the picture changes.
Tokens spent on repair loops are not free. Neither is the cognitive load of steering a tool back on track repeatedly. Clean code that needs three iterations to compile is not actually cleaner than messier code that works the first time.
What coming back felt like
The first session back with Claude after 20 days was disorienting in a good way. We were working through something - I do not remember what - and it pushed back on something I said. Explained its reasoning. Offered an alternative.
I had not realized how much I had missed that until it happened. Twenty days of a tool that agreed with everything had made me forget what it felt like to have a collaborator that thought about the problem.
That is not a benchmark score. It is not a metric I can put in a table. But it is real, and it is significant for how I actually work.
The actual verdict
Codex is a good model. If you want a tool that executes what you say, produces clean output, and does not slow you down with questions, it delivers that. Some developers will prefer exactly that. Fast, agreeable, capable.
I am not one of them.
I build alone. That means there is no one else in the room to catch scope creep, bad architectural decisions, or logic that felt right at midnight. An AI workflow can replace some of that - but only if the AI actually challenges me. A tool that agrees with everything is worse than no second opinion at all, because it creates the illusion of having thought something through.
The cleaner code is real. I am still not using Codex.
This is a field report from one developer running one workflow on one set of projects. Model behavior changes fast, and your experience will depend on how you work. If you are curious about the workflow I use with Claude, I wrote about it here.
One thing I want to say clearly: OpenCode is good. It is what kept this experiment running past day three. Honestly, the UI is a step above Claude Code - the terminal experience, multi-session handling, the overall feel. If I am being completely honest: Claude Code is good, OpenCode is better. I would run Claude through it in a heartbeat. Anthropic has explicitly blocked third-party CLIs from hijacking Claude's OAuth though, so Claude Code it is. Make of that what you will.