Anthropic released Claude Opus 4.7 on April 16, 2026, making the model generally available across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same price as its predecessor. The San Francisco company is positioning the release as the model enterprise teams can hand their hardest coding work to without looking over its shoulder, while continuing to hold back its more capable Mythos Preview model on the grounds that the cyber risks are not yet manageable.

Opus 4.7 Jumps Past GPT-5.4 and Gemini 3.1 Pro on Coding

The performance delta between Opus 4.7 and Opus 4.6, released roughly six months ago, is visible across every category Anthropic publishes. On the SWE-bench Pro coding benchmark, which tests a model's ability to fix real-world bugs in open-source repositories, Opus 4.7 resolves 64.3% of tasks. Opus 4.6 resolved 53.4% on the same evaluation. That 10.9-point jump is the kind of movement that usually takes a full model generation, not a point release.

On SWE-bench Verified, the curated subset of the benchmark, Opus 4.7 scores 87.6%. The only publicly tracked model that beats it is Anthropic's own unreleased Mythos Preview at 93.9%. On the GPQA Diamond reasoning benchmark, Opus 4.7 hits 94.2%. And on GDPval-AA, a third-party evaluation of economically valuable knowledge work across finance, law, and consulting, Opus 4.7 posts an Elo of 1753, meaningfully ahead of GPT-5.4 at 1674 and Gemini 3.1 Pro at 1314.

| Benchmark                  | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|----------------------------|----------|----------|---------|----------------|
| SWE-bench Pro (coding)     | 64.3%    | 53.4%    | 62.8%   | 58.1%          |
| SWE-bench Verified         | 87.6%    | 79.4%    | 84.2%   | 80.9%          |
| GPQA Diamond (reasoning)   | 94.2%    | 91.8%    | 93.6%   | 92.1%          |
| GDPval-AA (Elo)            | 1753     | 1598     | 1674    | 1314           |
| CyberGym (vulnerability)   | 73.1%    | 66.1%    | 66.3%   | 61.2%          |
| Agentic search             | 79.3%    | 71.0%    | 89.3%   | 77.4%          |
Claude Opus 4.7 benchmark performance versus key competitors at launch, April 16, 2026. Source: Anthropic release notes.

The release is not a clean sweep. GPT-5.4 still leads on agentic search at 89.3% versus Opus 4.7's 79.3%, and VentureBeat's analysis of directly comparable benchmarks puts the head-to-head at 7-4 in Opus 4.7's favor. That is a narrow lead in a market where Google's Gemini 3.1 Pro held the crown as recently as February and OpenAI's GPT-5.4 took it in early March.

The Rigor Story: Self-Verification and Literal Instruction Following

The benchmark numbers matter less than the behavioral shift Anthropic is describing. The company says Opus 4.7 has been trained to do something earlier Claude models would skip: devise its own verification steps before reporting a task as complete. In one internal test, the model built a Rust-based text-to-speech engine from scratch, then independently routed its own generated audio through a speech recognizer to confirm the output matched a Python reference implementation. That kind of self-check during a long autonomous run is what reduces the hallucination loops that plague production deployments of agentic systems.

The model also takes instructions more literally than Opus 4.6. That sounds like a feature and mostly is, but Anthropic flagged a migration risk in its release notes: prompts written for earlier Claude models sometimes produced results through loose interpretation. Opus 4.7 executes the exact text of the request. Prompt libraries tuned for the conversational looseness of Opus 4.6 will need to be re-tuned, particularly the system prompts driving agentic workflows.

"Claude Opus 4.7 is the strongest model Hex has evaluated. It correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks, and it resists dissonant-data traps that even Opus 4.6 falls for. It's a more intelligent, more efficient Opus 4.6: low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6."

Hex, early-access Anthropic customer

Other early-access partners echoed the same theme. Replit president Michele Catasta said the model achieved higher quality at lower cost on log analysis and bug-finding tasks. Notion's AI lead Sarah Sachs reported a 14% improvement in multi-step workflows with a two-thirds reduction in tool-calling errors. Cognition, the maker of the Devin autonomous engineering agent, said Opus 4.7 can work coherently "for hours" on difficult problems that previously caused its models to stall.

Vision Resolution Triples, and That Matters for Computer-Use Agents

Opus 4.7 accepts images up to 2,576 pixels on the longest edge, or roughly 3.75 megapixels. That is more than three times the effective resolution of earlier Claude models. The use case is not vacation photos. It is computer-use agents that have to read dense screenshots of enterprise software, financial analysts extracting numbers from complex diagrams, and medical or scientific workflows that require pixel-accurate reading of charts and scans.
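For teams preparing image pipelines, the practical consequence of a longest-edge limit is an aspect-preserving downscale before upload. The helper below is an illustrative sketch, not Anthropic's documented preprocessing; only the 2,576-pixel figure comes from the release notes.

```python
def fit_to_longest_edge(width: int, height: int, max_edge: int = 2576) -> tuple[int, int]:
    """Scale dimensions so the longest edge is at most max_edge, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already within the limit, no resize needed
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

# A 4K (3840x2160) screenshot lands just under the stated ~3.75-megapixel ceiling:
w, h = fit_to_longest_edge(3840, 2160)
print(w, h)  # 2576 1449
```

At 2576 × 1449, the resized screenshot carries about 3.73 million pixels, which matches the roughly 3.75-megapixel figure Anthropic quotes for a widescreen image.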

The clearest proof point came from XBOW, the autonomous penetration testing firm. On XBOW's internal visual-acuity benchmark, Opus 4.7 scored 98.5%. Opus 4.6 scored 54.5% on the same test. A 44-point jump on a perception benchmark is not an incremental improvement. It is the difference between a capability that can be used in production and one that cannot.

For enterprise teams building computer-use agents, the practical implication is that the visual ceiling that previously limited autonomous navigation through dense interfaces has moved sharply upward. The effect is comparable to what OpenAI's GPT-5.4 launch did for agentic planning: it turns a brittle capability into a reliable one.

The xhigh Effort Level and Task Budgets

Opus 4.7 introduces a new effort level called xhigh, sitting between the existing high and max tiers. The tradeoff is the one that has defined agentic AI for the past year: more reasoning produces better answers but burns more tokens and takes longer. The xhigh setting gives developers a specific band to target for long-running tasks where max is too slow and high is not thorough enough.

Anthropic also shipped a public beta of task budgets on the Claude API, letting developers set a hard ceiling on token spend for autonomous agents. The feature exists because Opus 4.7 produces more output tokens than Opus 4.6 at higher effort levels, particularly on later turns in agentic runs where the model is verifying its earlier work. The company's internal data shows token efficiency improves at every effort level on coding evaluations, but real-world workloads may not behave identically.
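Even without the API-level feature, the idea behind a task budget is straightforward to approximate client-side. The sketch below is a hypothetical illustration of that pattern; the class, its fields, and the token figures are assumptions, not Anthropic's beta interface.

```python
class BudgetExceeded(Exception):
    """Raised when an agent run spends past its token ceiling."""

class TokenBudget:
    """Client-side guard: tally per-turn token usage against a hard ceiling."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += input_tokens + output_tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(f"spent {self.spent} of {self.max_tokens} tokens")

budget = TokenBudget(max_tokens=50_000)
budget.charge(1_200, 3_400)  # turn 1 of an agent loop: within budget
print(budget.spent)          # 4600
```

An agent loop would call `charge` after each model turn and treat `BudgetExceeded` as a signal to stop, checkpoint, or escalate to a human, which mirrors the hard-ceiling behavior the API beta is described as providing.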

Inside Claude Code, Anthropic raised the default effort level to xhigh for all plans and added a /ultrareview slash command that runs a dedicated review session flagging bugs and design issues a senior human reviewer would catch. Pro and Max users get three free ultrareviews to try the feature. Auto mode, the permissions option that lets Claude make decisions on the user's behalf during long-running tasks, was extended to Max users at launch.

Pricing Holds Steady at $5 and $25 per Million Tokens

Pricing for Opus 4.7 matches Opus 4.6 exactly: $5 per million input tokens and $25 per million output tokens. That is identical to the prior generation and positions the model at the same tier as GPT-5.4 and Gemini 3.1 Pro for enterprise contracts. The catch is that Opus 4.7 uses an updated tokenizer that increases input token counts by 1.0 to 1.35x depending on the content type, so the effective cost per query is moderately higher even at the same per-token rate.

Users can control the impact by adjusting effort levels, setting task budgets, or prompting the model to be more concise. Anthropic's internal coding evaluation showed token usage improved across all effort levels despite the tokenizer change, but the company explicitly recommends teams measure the difference on real production traffic before assuming the same outcome.
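The tokenizer change is easy to reason about as arithmetic. The sketch below uses the published rates ($5/M input, $25/M output); the 1.25x multiplier is an illustrative value inside the stated 1.0 to 1.35x range, not a measured figure, and the token counts are made up for the example.

```python
INPUT_RATE = 5.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 25.00 / 1_000_000  # dollars per output token

def query_cost(input_tokens: int, output_tokens: int, tokenizer_multiplier: float = 1.0) -> float:
    """Dollar cost of one request; the multiplier inflates the input token count."""
    return input_tokens * tokenizer_multiplier * INPUT_RATE + output_tokens * OUTPUT_RATE

old = query_cost(20_000, 2_000)                            # Opus 4.6-style counting
new = query_cost(20_000, 2_000, tokenizer_multiplier=1.25) # same request under the new tokenizer
print(f"{old:.4f} {new:.4f}")  # 0.1500 0.1750
```

In this hypothetical input-heavy request, the per-token rate is unchanged but the effective cost rises about 17%, which is why measuring on real production traffic matters more than reading the price sheet.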

Safety, Mythos, and the Cyber Verification Program

The most politically charged part of the release is what Opus 4.7 deliberately does not do. Anthropic trained the model with measures to reduce its cyber capabilities relative to the more powerful Mythos Preview, and shipped it with automated safeguards designed to detect and block requests indicating high-risk cybersecurity use. On the CyberGym vulnerability reproduction benchmark, Opus 4.7 scores 73.1%, below Mythos Preview at 83.1% but above GPT-5.4 at 66.3%.

The context is Anthropic's ongoing standoff with the Pentagon, which earlier this year labeled the company a supply chain threat under an authority normally reserved for foreign adversaries. A federal appeals panel recently denied Anthropic's bid to stay the designation, though a San Francisco judge had initially blocked it. The company is simultaneously in active discussions with the White House about deploying Mythos Preview inside federal agencies for defensive cybersecurity use, according to a Bloomberg report earlier this week.

"We're working closely with model providers, other industry partners, and the intelligence community to ensure the appropriate guardrails and safeguards are in place before potentially releasing a modified version of the model to agencies."

Gregory Barbaccia, Federal Chief Information Officer, White House Office of Management and Budget

Anthropic is also launching a Cyber Verification Program that lets credentialed security professionals apply for access to Opus 4.7's unrestricted capabilities for legitimate defensive work including vulnerability research, penetration testing, and red-teaming. The verification model is a preview of where the most capable AI features are likely heading: gated behind professional credentials rather than universally available.

What Enterprise Teams Should Watch Next

Opus 4.7 arrives at a moment of genuine tension for Anthropic. The company's annual run-rate revenue reportedly hit $30 billion earlier this month, and venture investors are circulating term sheets valuing the company at $800 billion, more than double its $380 billion Series G valuation from February. At the same time, developers have flooded GitHub and X with complaints that Opus 4.6 and Claude Code have degraded in recent weeks, with users reporting exploration loops, memory loss, and ignored instructions.

The 4.7 release is Anthropic's answer to those critics. Whether it holds depends on how the model performs outside curated benchmarks, particularly on the long-running agentic workloads that customers like Cognition and Replit are increasingly running in production. The company has committed to test new cyber safeguards on Opus 4.7 before a broader release of Mythos-class models, which means the real-world behavior of this release will directly shape the terms of Mythos' eventual debut.

For teams currently running Opus 4.6, the migration path is straightforward on paper but requires re-tuning. The token economics shift, the instruction-following tightens, and the effort levels add complexity to workload planning. The upside, per the benchmark data and early-access feedback, is a model that finally delivers on the promise of autonomous coding without constant human babysitting. The open question is whether that reliability holds when the model is stress-tested by millions of concurrent users rather than a curated list of enterprise partners.

Sources

  1. Introducing Claude Opus 4.7 - Anthropic
  2. Anthropic releases Claude Opus 4.7, narrowly retaking lead for most powerful generally available LLM - VentureBeat
  3. Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos - Axios
  4. Claude Opus 4.7 is generally available - GitHub Changelog