Models

The frontier just converged: GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro in May 2026

For the first time in the modern era of large models, the three frontier systems sit inside a four-point band on every composite benchmark that matters. Picking a model is no longer about “the smartest one.” It is about fit.

What changed

Google's Gemini 3.1 Pro arrived in February. Anthropic shipped Claude Opus 4.7 on April 16, and OpenAI followed with GPT-5.5 in late April. On the standard composite leaderboards — reasoning, coding, vision, agentic tool use, and long-context retrieval — all three now sit within four points of each other. The era of one obviously-best general model is over.

Behind the convergence is a quieter cause: every leading lab now defaults to “thinking” architectures that allocate variable compute to hard problems. Reasoning is no longer a toggled mode. It is built in.

What separates them in practice

Claude Opus 4.7 — best for hard, multi-step engineering

Opus 4.7 hits 64.3% on SWE-bench Pro, ahead of GPT-5.5 (57.7%) and Gemini 3.1 Pro (54.2%). The new xhigh effort tier, sitting between high and max, gives it deeper reasoning without the full compute bill. Its file-system memory has improved enough that long-running coding agents can keep meaningful scratchpads across turns. The 1M context window now ships at standard pricing — no long-context premium.

If your workload is multi-file refactors, agentic engineering, or anything that requires the model to plan-act-verify across an hour of work, Opus 4.7 is the one we reach for.
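
For concreteness, here is a minimal sketch of requesting the xhigh tier through the Anthropic Python SDK. The model id and the effort field are assumptions taken from this article, not confirmed API surface, which is why the sketch passes the field through extra_body rather than as a named parameter. Check the current docs before relying on either.

```python
# Hypothetical sketch: selecting Opus 4.7's xhigh effort tier.
# "claude-opus-4-7" and the "effort" field are assumed names, not
# confirmed API surface.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",         # assumed model id
    max_tokens=4096,
    extra_body={"effort": "xhigh"},  # assumed tier between "high" and "max"
    messages=[{
        "role": "user",
        "content": "Plan a multi-file refactor of the auth module, then list the steps.",
    }],
)
print(response.content[0].text)
```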

GPT-5.5 — best for breadth and computer use

GPT-5.5 is the one model that is credibly competitive in coding, computer use, reasoning, and knowledge work simultaneously. It tops no single benchmark, but it almost never falls outside the top three on any of them. For teams that want one model to power a chat app, a coding assistant, and a browser-driving agent without managing three integrations, GPT-5.5 is still the safest default.

Gemini 3.1 Pro — best for cheap long-context and multimodal

Gemini 3.1 Pro lags slightly on coding but pulls ahead on multimodal video understanding and ultra-long-context retrieval. Its input-token pricing at scale is meaningfully lower than either of the other two. If your job involves stuffing hours of video, hundreds of PDFs, or a full code repository into a single prompt and asking questions, Gemini is the value pick.
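
As a rough illustration of that workflow, here is a sketch using the google-genai Python SDK's Files API to attach large artifacts to a single prompt. The gemini-3.1-pro model id is assumed from this article, and the filenames are placeholders.

```python
# Minimal sketch of a long-context multimodal request, assuming the
# google-genai SDK keeps its current shape; "gemini-3.1-pro" is an
# assumed model id, and both filenames are placeholders.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload large artifacts once via the Files API instead of inlining them.
# (Large videos may need a short wait until processing finishes; polling
# is omitted here for brevity.)
video = client.files.upload(file="all_hands_recording.mp4")
spec = client.files.upload(file="design_spec.pdf")

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed id
    contents=[
        video,
        spec,
        "Where does the recorded discussion contradict the written spec?",
    ],
)
print(response.text)
```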

The asterisk: Claude Mythos Preview

Anthropic announced Claude Mythos Preview in early April with an unusual caveat: it triggered the company’s ASL-4 safety protocol during evaluation and will not be released publicly or via the standard API. Mythos is reportedly the first model to cross the 10-trillion-parameter threshold using a Mixture-of-Experts architecture. Practical impact for builders today: zero. Strategic signal: capability is still climbing faster than the rate at which labs can ship it.

The Chinese open-weights pressure

Inside a 12-day window in April, four Chinese labs released open-weights models — Z.ai’s GLM-5.1, MiniMax M2.7, Moonshot’s Kimi K2.6, and DeepSeek V4 — that land at roughly the same agentic-engineering ceiling as the Western frontier at less than a third of the cost. Four of the top five video models by Elo are now Chinese-built. OpenAI shut Sora down in March. The frontier is no longer geographically singular.

How to actually pick

Most teams we talk to now run two of these in production behind a thin router. The router checks task type, latency budget, and cost ceiling, then picks. Nobody is brand-loyal anymore. That, more than any benchmark, is the real shift of 2026.
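
Here is a sketch of what such a router can look like. The model names, prices, and latency figures are illustrative assumptions standing in for whatever your own benchmarks say, not quoted rates.

```python
# A minimal sketch of the "thin router" pattern described above.
# All profiles below are illustrative assumptions, not real pricing.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    strengths: set[str]     # task types this model handles well
    usd_per_mtok_in: float  # input price, $ per million tokens (assumed)
    p50_latency_s: float    # typical end-to-end latency (assumed)

MODELS = [
    ModelProfile("claude-opus-4-7", {"coding", "agentic"}, 15.0, 40.0),
    ModelProfile("gpt-5.5", {"chat", "computer-use", "coding"}, 10.0, 20.0),
    ModelProfile("gemini-3.1-pro", {"long-context", "video"}, 2.0, 15.0),
]

def route(task_type: str, latency_budget_s: float, cost_ceiling: float) -> ModelProfile:
    """Pick the cheapest model that matches the task and fits both budgets."""
    candidates = [
        m for m in MODELS
        if task_type in m.strengths
        and m.p50_latency_s <= latency_budget_s
        and m.usd_per_mtok_in <= cost_ceiling
    ]
    if not candidates:  # fall back to the generalist when nothing fits
        return next(m for m in MODELS if m.name == "gpt-5.5")
    return min(candidates, key=lambda m: m.usd_per_mtok_in)

print(route("video", latency_budget_s=30.0, cost_ceiling=5.0).name)  # gemini-3.1-pro
```

The point is the shape, not the numbers: the profile table is the part you re-benchmark as models ship; the selection logic barely changes.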