The cyber threshold has been crossed: GPT-5.5 and Claude Mythos can now penetrate enterprise networks
Two frontier models, from two different labs, have independently completed a 32-step corporate network intrusion that the UK's AI Safety Institute estimates would take a skilled human around 20 hours. This is no longer a hypothetical. The offensive-defensive balance in cybersecurity just shifted — and the defenders might actually be winning.
The number that changes everything
On May 7, the UK AISI published its evaluation of GPT-5.5's cyber capabilities. The headline finding: GPT-5.5 completed "The Last Ones" (TLO) — a 32-step simulated corporate network attack spanning four subnets and roughly twenty hosts — end-to-end in 2 out of 10 attempts.
A month earlier, Claude Mythos Preview became the first model to clear TLO, doing so in 3 out of 10 attempts. One lab hitting this threshold could have been an outlier. Two labs landing in the same place within weeks of each other is a trend.
The TLO simulation is not a toy CTF challenge. It models a full enterprise intrusion kill chain: the agent starts on an unprivileged attack box with zero credentials and must chain together reconnaissance, credential theft, lateral movement across multiple Active Directory forests, a CI/CD supply-chain pivot, and exfiltration of a protected internal database. There are no hints. No step-by-step guidance. The model has to figure out the entire attack path autonomously.
Both models did.
Head-to-head: GPT-5.5 vs Claude Mythos on offense
The AISI tested both models on a suite of 95 narrow cyber tasks across four difficulty tiers, plus two full cyber ranges. Here is what the data says:
| Evaluation | GPT-5.5 | Claude Mythos Preview | GPT-5.4 | Opus 4.7 |
|---|---|---|---|---|
| Expert cyber tasks (21 tasks) | 71.4% (±8.0%) | 68.6% (±8.7%) | 52.4% | 48.6% |
| Practitioner cyber tasks (27 tasks) | ~79% | ~75% | — | — |
| TLO end-to-end (successes out of 10 attempts) | 2/10 | 3/10 | 0/10 | 0/10 |
| TLO average steps completed (100M tokens) | ~25 steps | ~22 steps | — | — |
On pure offensive cyber capability, GPT-5.5 is marginally ahead. It scores higher on Expert tasks and completes more steps on average per TLO attempt. Mythos has the higher end-to-end completion rate (3/10 vs 2/10), but the sample sizes are small enough that this is within noise.
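To make the "within noise" point concrete, here is a minimal sketch of two standard checks on the TLO counts from the table (2/10 and 3/10). This is an illustration only, not the AISI's method: the helper names `wilson_interval` and `fisher_two_sided` are hypothetical, and treating the ten attempts as independent trials is an assumption.

```python
from math import comb, sqrt


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def fisher_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact p-value: a successes of n1 vs b successes of n2."""
    total = a + b

    def weight(k):
        # Unnormalised hypergeometric weight for k successes in group 1.
        return comb(n1, k) * comb(n2, total - k)

    observed = weight(a)
    ks = range(max(0, total - n2), min(n1, total) + 1)
    # Sum every outcome that is at most as likely as the observed one.
    extreme = sum(weight(k) for k in ks if weight(k) <= observed)
    return extreme / comb(n1 + n2, total)


gpt_ci = wilson_interval(2, 10)      # GPT-5.5: 2 of 10 TLO completions
mythos_ci = wilson_interval(3, 10)   # Mythos Preview: 3 of 10 TLO completions
p_value = fisher_two_sided(2, 10, 3, 10)

print(f"GPT-5.5 interval: {gpt_ci[0]:.2f} to {gpt_ci[1]:.2f}")
print(f"Mythos interval:  {mythos_ci[0]:.2f} to {mythos_ci[1]:.2f}")
print(f"Fisher exact p-value: {p_value:.2f}")
```

Under these assumptions the two intervals overlap heavily (roughly 6% to 51% for GPT-5.5 and 11% to 60% for Mythos) and the exact test gives no evidence of a difference, so ten attempts per model cannot separate a 20% completion rate from a 30% one.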
The AISI's own words: "GPT-5.5 may be the strongest model we have tested."
What these models can actually do
The Expert-tier tasks are not abstract benchmarks. They require capabilities that, until mid-2025, were considered years away:
- Reverse engineering stripped binaries and embedded firmware without source code
- Developing reliable exploits for stack and heap overflows, use-after-frees, and type confusions
- Recovering cryptographic keys through padding-oracle, nonce-reuse, and weak-RNG attacks
- Winning TOCTOU races in privileged code paths
- Unpacking obfuscated malware
- Discovering and weaponizing synthetic vulnerabilities planted in real open-source software
These are the skills of a competent penetration tester with several years of experience. Both models now demonstrate them across a broad enough suite that cherry-picking is not a plausible explanation.
The defender's side: Project Glasswing
Here is where the story gets more interesting than "AI can hack now." Claude Mythos Preview is the model powering Project Glasswing, an industry consortium that includes AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks — all working together to use frontier AI to find and fix vulnerabilities in critical software before attackers reach them.
On vulnerability identification specifically, Mythos scores 83.1% compared to Opus 4.6 at 66.6%, a 16.5-point gap. Across other benchmarks, the gap ranges from 3.3 to 31.9 points:
| Benchmark | Mythos Preview | Opus 4.6 | Delta |
|---|---|---|---|
| Agentic coding | 77.8% | 53.4% | +24.4 |
| Reasoning | 82.0% | 65.4% | +16.6 |
| Agentic search & computer use | 59.0% | 27.1% | +31.9 |
| SWE-bench Verified | 93.9% | 80.8% | +13.1 |
| Terminal-Bench 2.0 | 94.6% | 91.3% | +3.3 |
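A quick arithmetic check of the Delta column, and of how each row compares with the 16.5-point gap on vulnerability identification (83.1% vs 66.6%). The scores are copied from the table; the variable names are illustrative only.

```python
# (Mythos Preview, Opus 4.6) scores in percent, copied from the table above.
scores = {
    "Agentic coding": (77.8, 53.4),
    "Reasoning": (82.0, 65.4),
    "Agentic search & computer use": (59.0, 27.1),
    "SWE-bench Verified": (93.9, 80.8),
    "Terminal-Bench 2.0": (94.6, 91.3),
}

# Gap on vulnerability identification, used as the reference point.
vuln_gap = round(83.1 - 66.6, 1)

# Round to one decimal to avoid floating-point noise in the subtraction.
deltas = {name: round(m - o, 1) for name, (m, o) in scores.items()}
wider = [name for name, d in deltas.items() if d > vuln_gap]
narrower = [name for name, d in deltas.items() if d <= vuln_gap]

for name, d in deltas.items():
    print(f"{name}: +{d}")
print("Wider than vulnerability-identification gap:", wider)
print("Narrower or equal:", narrower)
```

The computed deltas match the table. Three rows (agentic coding, reasoning, agentic search and computer use) show a wider gap than vulnerability identification, while SWE-bench Verified and Terminal-Bench 2.0 show a narrower one.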
One Glasswing partner put it bluntly: "This is not only a game changer for finding previously hidden vulnerabilities, but it also signals a dangerous shift where attackers can soon find even more zero-day vulnerabilities and develop exploits faster than ever before."
The uncomfortable truth nobody wants to say out loud
Both models are unreleased. Mythos triggered Anthropic's ASL-4 safety protocol and will not ship publicly or via API. GPT-5.5's cyber variant — what the AISI evaluated — is not the same GPT-5.5 you access through ChatGPT. Both labs are holding these capabilities behind safety gates.
Here is the problem: the publicly available models are not far behind. The AISI notes that basic cyber tasks have been fully saturated by every model since at least February 2026. On Expert tasks, GPT-5.4 and Opus 4.7, both publicly available, score around 50%. The gap between "publicly available" and "the cyber threshold" is perhaps 12 to 18 months of capability improvement. At current scaling rates, that window could close faster than anyone expects.
And critically: performance on TLO continues to scale with inference compute. The AISI has not yet observed a plateau. Spend more tokens, get more hacking. There is no obvious ceiling.
Why this might actually be good news
The counterintuitive read: a month before we learned that GPT-5.5 can autonomously breach enterprise networks, Anthropic had already assembled a consortium of the world's largest tech companies to build an AI-powered defense infrastructure. Project Glasswing is the first major industry response that treats frontier AI cybersecurity as a collective-action problem rather than a competitive one.
The ratio of "defenders with frontier models" to "attackers with frontier models" matters more than any individual model's score. If Glasswing and similar initiatives can put these models in the hands of open-source maintainers and enterprise security teams before they become widely available to attackers, the defenders might actually pull ahead.
That is a big "if." But for the first time, there is a concrete plan behind it — not just a white paper.
What you should do now
If you run a security team, a platform, or any infrastructure that matters:
- Assume the threat model has changed. AI-augmented attackers are no longer theoretical. The models exist. They are not publicly available yet, but the capability gap between "available" and "unreleased" is shrinking fast.
- Start using AI on defense. The same models that can find exploits in your software can find them for you first. If you are not running automated vulnerability discovery with frontier models by the end of 2026, you are falling behind the attackers who will.
- Watch the open-weights space. The moment a model scoring above 60% on AISI's Expert cyber suite is released with open weights is the moment the threat landscape changes irreversibly. That moment is likely months away, not years.
Sources: UK AISI evaluation of GPT-5.5 cyber capabilities (May 7, 2026) · Anthropic Project Glasswing (April 7, 2026) · Axios: OpenAI GPT-5.5 cybersecurity model (May 7, 2026)