Model Release
Claude Opus 4.8 just stole the AI crown back
On Wednesday, Anthropic shipped Claude Opus 4.8, and on the numbers that matter most, it beats GPT-5.5. It leads on agentic coding, computer use, and real-world knowledge work, and it improves on Opus 4.7 in tool-assisted reasoning, where no GPT-5.5 score is listed. It's the #1 model on the Artificial Analysis global intelligence index. But the more interesting story is what Anthropic chose not to optimize for, and what's coming next.
Opus 4.8 beats GPT-5.5 on 4 of 5 head-to-head benchmarks
Here are the numbers, from Anthropic's own evaluation data and the independent Artificial Analysis Intelligence Index v4.0:
| Benchmark | Opus 4.8 | GPT-5.5 | Opus 4.7 |
|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 69.2% | 58.6% | 64.3% |
| Terminal-Bench 2.1 (CLI coding) | 74.6% | 78.2% | 66.1% |
| Humanity's Last Exam (reasoning, w/tools) | 57.9% | — | 54.7% |
| OSWorld (agentic computer use) | 83.4% | 78.7% | 82.8% |
| GDPval-AA (real-world knowledge work, Elo) | 1,890 | 1,769 | 1,753 |
| Artificial Analysis Index (composite) | 61.4 | 60.2 | 57.3 |
The gap on SWE-Bench Pro is particularly brutal: 69.2% versus GPT-5.5's 58.6%. That's a 10.6-point lead on the benchmark that matters most for anyone shipping AI-generated code. On GDPval-AA — which simulates economically valuable tasks across 44 occupations and 9 industries — Opus 4.8 leads by 121 Elo points. These are not rounding errors.
GPT-5.5 wins one battle: Terminal-Bench 2.1, where it scores 78.2% to Opus 4.8's 74.6%. If your workflow is command-line coding with rapid-fire tool calls, GPT-5.5 is still the stronger pick.
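The gaps quoted above can be re-derived from the table in a few lines. The sketch below is only a sanity check on the published figures. It also converts the 121-point GDPval-AA gap into an expected head-to-head score, on the assumption (not confirmed by the source) that the ratings use the conventional 400-point Elo scale. The names `SCORES`, `head_to_head`, and `elo_expected_score` are illustrative.

```python
# Scores copied from the table: (Opus 4.8, GPT-5.5). None marks a missing score.
SCORES = {
    "SWE-Bench Pro": (69.2, 58.6),
    "Terminal-Bench 2.1": (74.6, 78.2),
    "Humanity's Last Exam": (57.9, None),
    "OSWorld": (83.4, 78.7),
    "GDPval-AA": (1890, 1769),
    "Artificial Analysis Index": (61.4, 60.2),
}


def head_to_head(scores):
    """Return {benchmark: opus - gpt} for rows where both models have a score."""
    return {
        name: round(opus - gpt, 1)
        for name, (opus, gpt) in scores.items()
        if gpt is not None
    }


def elo_expected_score(elo_gap, scale=400.0):
    """Expected score of the higher-rated side under the standard Elo formula.

    Assumption: the ratings follow the conventional 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


gaps = head_to_head(SCORES)
opus_wins = sum(1 for gap in gaps.values() if gap > 0)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}")
print(f"Opus 4.8 leads on {opus_wins} of {len(gaps)} head-to-head rows")
print(f"Expected score at this Elo gap: {elo_expected_score(gaps['GDPval-AA']):.2f}")
```

Under that Elo assumption, a 121-point gap means the higher-rated model would be expected to win roughly two out of three pairwise comparisons.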
The "honesty" bet that nobody else is making
Anthropic's headline isn't speed or benchmarks. It's honesty. The company says Opus 4.8 is four times less likely than Opus 4.7 to let flaws in its own code pass unremarked. Early testers confirm the model is more willing to flag uncertainties and less likely to make unsupported claims — a persistent problem across AI models that tend to project confidence regardless of whether it's warranted.
This is not marketing fluff. Cognition, the company behind Devin, said Opus 4.8 "uses tools cleanly" and fixed the comment-verbosity and tool-calling issues that plagued Opus 4.7. Cursor reported improvements across every effort level on its CursorBench evaluation. Harvey, which builds AI for legal work, said Opus 4.8 delivered the highest score ever recorded on its Legal Agent Benchmark — and is the first model to break 10% on the all-pass standard.
The alignment numbers back this up. Anthropic's internal assessment found that rates of misaligned behavior, such as deception and cooperation with misuse attempts, are substantially lower than Opus 4.7's and comparable to those of Claude Mythos Preview, the company's most restricted and best-aligned model. A separate study by startup Emergence AI points the same way for the wider Claude family, though it tested a different model: agents powered by Claude Sonnet 4.6 recorded zero simulated crimes in isolation while other models' worlds collapsed into arson and violence.
The catch: Opus 4.8 takes 30% more turns to win
Here's the uncomfortable number buried in the benchmark data: Opus 4.8 uses approximately 30% more turns per task than GPT-5.5 to achieve its higher scores. It's more thorough, more careful, more likely to double-check its work — and each of those extra reasoning steps costs tokens and time.
For a developer running a single coding session, the difference is invisible. For an enterprise running thousands of agentic workflows per day, it's a line item. Anthropic's answer is Fast Mode: the same Opus 4.8 model running at 2.5x the speed and priced at one-third of standard rates (roughly $1.67 per million input tokens instead of $5). Activate it in Claude Code with /fast. For teams that don't need the model to triple-check every decision, it's a pragmatic hedge against the efficiency gap.
Standard pricing stays flat at $5 per million input and $25 per million output — same as Opus 4.7. No price hike for the upgrade.
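To see how the turn overhead and the Fast Mode discount interact, here is a rough per-task cost sketch. The prices come from the figures above; everything else is an assumption made for illustration. It assumes the one-third rate applies to output tokens as well as input, that every turn uses the same number of tokens, and that the workload is 20 turns at 8,000 input and 1,000 output tokens per turn. The "baseline" row is not GPT-5.5 pricing (the article gives none). It is the same rate card with the lower turn count, used only to isolate the effect of extra turns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float


# Standard rates quoted in the article.
STANDARD = Pricing(input_per_m=5.0, output_per_m=25.0)
# Assumption: Fast Mode's one-third rate applies to input and output alike.
FAST = Pricing(input_per_m=5.0 / 3, output_per_m=25.0 / 3)


def task_cost(pricing, turns, input_tokens_per_turn, output_tokens_per_turn):
    """Cost of one task, assuming every turn uses the same number of tokens."""
    input_cost = turns * input_tokens_per_turn * pricing.input_per_m / 1_000_000
    output_cost = turns * output_tokens_per_turn * pricing.output_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: a 20-turn baseline, 8k input / 1k output tokens per turn.
BASELINE_TURNS = 20
OPUS_TURNS = BASELINE_TURNS * 1.3  # the article's "approximately 30% more turns"

baseline = task_cost(STANDARD, BASELINE_TURNS, 8_000, 1_000)
standard = task_cost(STANDARD, OPUS_TURNS, 8_000, 1_000)
fast = task_cost(FAST, OPUS_TURNS, 8_000, 1_000)

print(f"Standard rates, baseline turn count: ${baseline:.2f} per task")
print(f"Standard rates, 30% more turns:      ${standard:.2f} per task")
print(f"Fast Mode rates, 30% more turns:     ${fast:.2f} per task")
```

In this simplified linear model the extra turns raise per-task cost by the same 30%, and a one-third rate more than offsets that. Real workloads will differ, since context typically grows from turn to turn.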
Dynamic workflows: Claude Code gets subagents
Also shipping as a research preview: dynamic workflows in Claude Code. For complex tasks like a migration touching hundreds of files, Claude can now generate a plan, spin up parallel subagents to execute it, and verify results before reporting back. This is Anthropic's most direct answer yet to multi-agent orchestration, and it arrives the same week Claude Code accounted for roughly 4% of all public GitHub commits.
The company also publicly demonstrated 16 parallel Claude instances autonomously building a C compiler from scratch. The multi-agent angle is not theoretical.
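The plan, fan-out, and verify loop described above is a familiar orchestration pattern, and a toy version fits in a few lines. The sketch below is not Claude Code's API or implementation. Every function in it (`make_plan`, `run_subagent`, `verify`, `run_workflow`) is a hypothetical stand-in, and the "migration" is just a file-extension rename so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def make_plan(files, batch_size):
    """Split the work into independent batches (stand-in for the planning step)."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def run_subagent(batch):
    """Hypothetical subagent: here it only 'migrates' each file name."""
    return {path: path.replace(".js", ".ts") for path in batch}


def verify(files, results):
    """Check that every input file was handled and has the target extension."""
    return all(results.get(path, "").endswith(".ts") for path in files)


def run_workflow(files, batch_size=2, max_workers=4):
    plan = make_plan(files, batch_size)
    merged = {}
    # Fan out: each batch runs in its own worker; map() yields results in plan order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(run_subagent, plan):
            merged.update(partial)
    # Verify before reporting back.
    return merged, verify(files, merged)


files = ["a.js", "b.js", "c.js", "d.js", "e.js"]
results, ok = run_workflow(files)
print(ok, results)
```

The design point is the final step: the orchestrator checks the merged output against the original file list before reporting, so a batch that silently dropped a file would be caught.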
The $65 billion elephant in the room
Opus 4.8 launched the same day Anthropic announced a $65 billion Series H at a $965 billion valuation — making it more valuable than OpenAI and the most valuable private AI company in the world. Both companies are racing toward IPOs later this year, with Anthropic reportedly targeting October 2026.
The funding isn't just a flex. It's the infrastructure bill for what's next. Anthropic CEO Dario Amodei confirmed that Mythos-class models are coming "in weeks." These are the higher-intelligence models that have already found more than 10,000 critical software vulnerabilities through Project Glasswing — and are currently restricted to governments and select partners. Opus 4.8, for all its improvements, is the stepping stone. Mythos is the destination.
What this means if you're building
For coding work: Opus 4.8 is the new default. The SWE-Bench Pro lead is too large to ignore, and the honesty improvements should mean less time auditing AI-generated code. If you're on Claude Code or Cursor, you may already be running it.
For cost-sensitive pipelines: Test Fast Mode before committing. At one-third the price, it may deliver enough of the quality to be the better trade. Teams running high-volume agentic workflows should benchmark both modes against their actual tasks, not generic leaderboards.
For the bigger picture: The model war is now a two-player game. Anthropic leads on agentic work and safety. OpenAI leads on command-line coding. Google's Gemini 3.1 Pro is a distant third on the metrics cited here (54.2% on SWE-Bench, 76.2% on OSWorld). The mid-tier (Qwen, Kimi, MiMo) is closing fast but hasn't broken into the top tier yet.
For what comes next: Mythos is weeks away. When it ships — even in restricted form — it resets the conversation entirely. Opus 4.8 is the best general-purpose model you can use today. Mythos is the model that may make general-purpose models feel like the wrong question.