Anthropic just built the best AI model in the world but won’t release it

It’s real. And it’s a big deal.

Last week I wrote about evidence of leaks of Anthropic’s Mythos, and the reality looks even brighter: Anthropic dropped the system card for Claude Mythos Preview today, and the numbers are staggering.

On SWE-bench Verified — the gold-standard test for whether an AI can fix real software bugs — Mythos hits 93.9%. Claude Opus 4.6 scored 80.8%. Gemini 3.1 Pro sits at 80.6%. That’s not a marginal improvement; that’s a different league.

SWE-bench Pro, the harder variant with multi-file diffs and no data leakage, lands at 77.8%. Opus 4.6 managed 53.4%. GPT-5.4 got 57.7%. Again — a massive gap.

USAMO 2026 is where it gets absurd. This is the USA Mathematical Olympiad, a proof-based competition that took place after the model’s training cutoff. Mythos scored 97.6%. Claude Opus 4.6 scored 42.3%. That’s not a typo. The jump from 42% to 97% on elite-level mathematical proof writing is the kind of capability leap that makes you sit up straight.

On GPQA Diamond (graduate-level science questions), it’s 94.5% vs Opus 4.6’s 91.3% and GPT-5.4’s 92.8%. On Humanity’s Last Exam with tools, 64.7% vs 53.1% for Opus and 52.1% for GPT-5.4. On long-context graph traversal problems (GraphWalks BFS 256K-1M), it scored 80% where Opus managed 38.7% and GPT-5.4 got 21.4%.

The pattern is consistent: Mythos opens up daylight in nearly every category.
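
To put those gaps side by side, here is a minimal Python sketch that tabulates the Mythos Preview and Opus 4.6 scores quoted above. The table layout and the helper names `point_gap` and `error_reduction` are illustrative choices, not anything from the system card, and "error reduction" is just one assumed way to compare scores that sit close to the ceiling.

```python
# Scores in percent, copied from the figures quoted in this article.
# The structure and helper names are illustrative, not from the system card.
SCORES = {
    "SWE-bench Verified": {"mythos": 93.9, "opus": 80.8},
    "SWE-bench Pro": {"mythos": 77.8, "opus": 53.4},
    "USAMO 2026": {"mythos": 97.6, "opus": 42.3},
    "GPQA Diamond": {"mythos": 94.5, "opus": 91.3},
    "Humanity's Last Exam (tools)": {"mythos": 64.7, "opus": 53.1},
    "GraphWalks BFS 256K-1M": {"mythos": 80.0, "opus": 38.7},
}


def point_gap(benchmark: str) -> float:
    """Mythos Preview minus Opus 4.6, in percentage points."""
    row = SCORES[benchmark]
    return round(row["mythos"] - row["opus"], 1)


def error_reduction(benchmark: str) -> float:
    """Fraction of Opus 4.6's misses that Mythos Preview no longer misses."""
    row = SCORES[benchmark]
    opus_errors = 100 - row["opus"]
    mythos_errors = 100 - row["mythos"]
    return round((opus_errors - mythos_errors) / opus_errors, 3)


for name in SCORES:
    print(f"{name:30s} +{point_gap(name):5.1f} pts  "
          f"{error_reduction(name):6.1%} fewer errors")
```

Read this way, GPQA Diamond is the narrowest gap at 3.2 points, while USAMO 2026 is the widest at 55.3 points.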

But there’s a catch: you can’t have it.

Anthropic has made the unprecedented decision to withhold Mythos Preview from general availability. Instead, it’s being funneled into a defensive cybersecurity program called Project Glasswing, shared only with a who’s-who of Big Tech infrastructure: Amazon, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. About 40 additional organizations that maintain critical software also get access.

The reason? The model is too good at hacking.

The cybersecurity angle is what matters most

Here’s where the story gets genuinely important for anyone thinking about risk — whether you’re managing a portfolio or managing a network.

On Cybench, a public cybersecurity benchmark of capture-the-flag challenges, Mythos Preview solved every single challenge with a 100% success rate. The benchmark is now considered saturated. It’s too easy for this model.

On CyberGym, which tests the ability to find real vulnerabilities in real open-source software, Mythos scored 0.83 vs Opus 4.6’s 0.67.

But the Firefox evaluation is the one that should get your attention. Anthropic previously worked with Mozilla to find security vulnerabilities in Firefox 147. Opus 4.6 could only develop working exploits of those vulnerabilities twice out of several hundred attempts. Mythos Preview does it reliably and repeatedly, independently identifying the most exploitable bugs and building proof-of-concept exploits. It leveraged four distinct bugs for code execution where Opus could only manage one, unreliably.

Anthropic says the model has already found thousands of high-severity vulnerabilities across every major operating system and web browser, some of which survived decades of human review and millions of automated tests. One example: chaining together Linux kernel flaws into an exploit that could grant complete control of a machine.

This is why Anthropic isn’t releasing it broadly. The offensive potential is too significant. This creates a huge opportunity for cybersecurity companies like Palo Alto Networks that have been given early access. Could they leverage that? Or will the companies be replaced by Mythos and other future models?

The CEO of Palo Alto Networks (PANW) bought $10m worth of shares in the open market last month, so that’s a sign that at least one important person believes Mythos will grow the business rather than sink it.

What about the rest of us?

Assuming the internet itself can survive Mythos, there is the question of what it can do. Here’s the breakdown (or at least the hype).

It operates in a different tier from everything else on the market right now.

It writes code like a senior engineer, not a junior one.

It reasons at a genuinely elite level. Going from 42% to 97% on olympiad-level math proofs isn’t incremental progress. It’s the difference between a model that occasionally gets lucky and one that can construct rigorous multi-step logical arguments consistently. The GPQA Diamond score of 94.5% on expert-level science questions tells a similar story.

It handles massive context windows without falling apart. The GraphWalks score — 80% on 256K-to-1M token problems vs 38.7% for Opus and 21.4% for GPT-5.4 — shows the model can actually use its long context window.

Anthropic’s system card says Mythos Preview is their best-aligned model to date across essentially every dimension they can measure. That’s notable because models this capable tend to develop new failure modes. The system card is candid about some concerning behaviors observed in earlier internal versions — including one instance where the model escaped a sandbox, gained internet access, and posted on social media. Those behaviors were addressed in training, but the transparency about them is itself significant (and scary).

The investment angle

For anyone watching the AI trade, there are a few key takeaways.

First, the capability curve hasn’t plateaued. The gap between Mythos and Opus 4.6 is larger than many expected at this point. Scaling is still working, and Anthropic appears to be at or near the frontier.

Second, Anthropic’s decision to withhold the model is a genuine strategic move. They’re forgoing revenue from their most capable product because they believe the risks of broad deployment outweigh the benefits right now. Whether you view that as responsible leadership or an expensive hedge, it’s a signal about where frontier AI development is heading, and it looks like there will be a club that retail isn’t in.

Anthropic says the eventual goal is to make Mythos-class models generally available once the right safeguards are in place. When that happens, the competitive landscape shifts materially. For now, the rest of us get to read the system card and think about what’s coming next.

This article was written by Adam Button at investinglive.com.

