Anthropic's models haven't been far ahead for a while now; quite a few months at least. Chinese models are roughly equal at 1/6th the cost. Minimax is roughly equal to Opus. Chinese providers also haven't had Anthropic's issues with uptime and variable model quality. The gap with OpenAI isn't huge either, and GLM is a noticeably more compliant model (unsurprisingly, given the hubristic internal culture at Anthropic around safety).
CC is a better implementation and seems to be fairly economical with token usage. That is really the only defining point and, I suspect, Anthropic are going to have a lot of trouble staying relevant with all the product issues.
They were far ahead for a brief period in November/December, which drove the hype cycle that now appears to be collapsing around the company.
You have to retest at least every month; things are moving quickly. Stepfun is releasing soon and seems to have an Opus-level model with a more efficient architecture.
Minimax is nowhere near Opus in my tests, though, oddly, 4.6 felt worse than 4.5 to me. I haven't used Minimax extensively, but I have an API-driven test suite for a product, and even Sonnet 4.6 outperforms it in my testing, unless something changed in the last month.
One example: I have a multi-stage distillation/knowledge-extraction script that takes a Discord channel and answers questions about it. I have a hardcoded 5k-message test set with 20 questions I wrote myself after analyzing it.
In my harness Minimax wasn't even getting half of them right, whereas Sonnet scored 100%. Granted, this isn't code, but my usage on pi felt about the same.
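For the curious, the grading loop is nothing exotic. A minimal sketch of how a harness like this can score models (the endpoint, names, and substring-match grading here are all stand-ins, and the real script does the multi-stage Discord distillation before this step):

    # Minimal sketch of the Q&A scoring loop; all names are hypothetical.
    # Assumes an OpenAI-compatible endpoint so the same harness can hit
    # Minimax, GLM, Sonnet-via-proxy, etc.
    from openai import OpenAI

    client = OpenAI(base_url="https://api.example.com/v1", api_key="sk-...")

    def score_model(model: str, questions: list[dict], context: str) -> float:
        """questions: [{"q": ..., "expected": ...}] -> fraction answered correctly."""
        correct = 0
        for item in questions:
            resp = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system",
                     "content": f"Answer using only this context:\n{context}"},
                    {"role": "user", "content": item["q"]},
                ],
            )
            answer = resp.choices[0].message.content or ""
            # Real grading is more careful; a substring match stands in here.
            if item["expected"].lower() in answer.lower():
                correct += 1
        return correct / len(questions)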
> CC is a better implementation and seems to be fairly economical with token usage. That is really the only defining point and, I suspect, Anthropic are going to have a lot of trouble staying relevant with all the product issues.
What are you using to drive the Chinese models in order to evaluate this? OpenCode?
Some of Claude Code's features, like remote sessions, are far more important than the underlying model for my productivity.
Yes, 100% agree. OpenHands has a self-hosted option; KiloCode and RooCode both have a cloud option. I don't think you can pass a session around with any of them, though. Codex seems to have comparable features, afaik.
CC's tool usage is also significantly ahead imo (doesn't negate the price, but it is something). I have seen issues pairing heavy thinking models (like Minimax) with clients that implement tool usage poorly (like Cline).
CC has delivered significant value over the last six months...but, of course, you can just use CC with OpenRouter.
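The switch is basically just a base-URL override. A minimal sketch using the Anthropic SDK, assuming whatever gateway you use exposes an Anthropic-compatible endpoint (the URL, token, and model name below are placeholders, not real values):

    # Hypothetical sketch: pointing the Anthropic SDK at a gateway instead
    # of api.anthropic.com. Claude Code honors the same kind of override
    # via the ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN env vars, afaik.
    import os
    from anthropic import Anthropic

    client = Anthropic(
        base_url=os.environ.get("ANTHROPIC_BASE_URL", "https://gateway.example.com"),
        api_key=os.environ["ANTHROPIC_AUTH_TOKEN"],  # gateway key, not an Anthropic key
    )

    msg = client.messages.create(
        model="whatever-the-gateway-routes",  # placeholder model name
        max_tokens=256,
        messages=[{"role": "user", "content": "ping"}],
    )
    print(msg.content[0].text)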
I haven't noticed a huge difference with other models, but I agree that is definitely a strength (and CC has better tooling for this). However, I do think there are practical limits to agentic workflows because the output is relatively poor compared to humans. You can generate lots of code, but most of it will be shit.
Agentic workflows do have a place in well-defined, structured tasks...but I don't think that is what most people are trying to use them for.
...and Codex is at least 10x better than Claude at handling compaction. I don't even bother starting a new session when working on a feature; a single compaction is basically unnoticeable. You have to compact several times before you start needing to remind the model about a rule or two.