Claude Opus 4.7: the model that actually closes tickets, and the four-step Monday rollout
64.3 percent on SWE-bench Pro. Tool-call errors down two thirds. Flat pricing. Here is what the numbers say, how it compares on cost, and the four-step migration plan for teams with agents in production.
Anthropic released Claude Opus 4.7 on April 16. The release note was short. The numbers behind it are the most important data this quarter for any team running agents in production.
The headline score is 64.3 percent on SWE-bench Pro, the benchmark that requires a model to solve real-world software engineering tickets end-to-end, including tool calls and side effects. That is 6.6 points ahead of GPT-5.4 and almost 11 points above Opus 4.6. Tool-call errors dropped by roughly two thirds in Anthropic's own evals. Pricing is unchanged at $5 per million input tokens and $25 per million output tokens.
This article covers what changed, how it compares on benchmarks and on cost-per-outcome, and the four-step migration plan I would run on Monday.
What actually changed
Three things moved materially in 4.7. Every other change in the release is a byproduct of these.
- Tool-call reliability. The model silently fails halfway through a multi-step workflow far less often. In Anthropic's evals, the tool-call error rate on the same task suites dropped from roughly 9 percent to about 3 percent.
- Instruction adherence over long contexts. 4.7 holds a complex system prompt across many turns. Prior versions would drift. This is what makes the weekly-review-via-MCP workflow finally usable.
- Cleaner stop behavior. Fewer "I'll continue in the next message" truncations. Saves tokens and saves user frustration.
Nothing else in the release changes your integration. Same API. Same pricing. Same context window.
The benchmark picture
A single SWE-bench Pro number is not a migration case. The chart below compares 4.7 against other frontier models on four benchmarks: two agentic ones that correlate with production work and two academic ones (GPQA, MMMU).
[Chart: Frontier benchmark scores (higher is better)]
Two observations. First, the SWE-bench Pro lead is the biggest gap of the four benchmarks. That is the benchmark that most closely resembles production agent work. Second, Gemini 2.5 Pro is competitive on academic benchmarks (GPQA, MMMU) but behind on the two agentic ones. If your agents do real work, that gap shows up in production.
The cost picture
Pricing headlines are a trap. What matters is cost-per-outcome: how many tokens the model burns per successfully completed task. Lower error rates mean fewer retries, which means lower effective cost even at identical sticker prices.
[Chart: Cost per 1,000 successfully completed agentic tasks ($)]
The takeaway is mechanical. Same list price, fewer errors, fewer retries, fewer tokens per completed task. If you care about a budget line rather than a benchmark, 4.7 is effectively a price cut.
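The retry arithmetic can be sketched in a few lines of Python. This is a minimal sketch, assuming failed attempts are retried until one succeeds and that attempts are independent, so the expected number of attempts per completed task is 1 / success rate. Only the $5 and $25 prices come from the release. The token counts, success rates, and model labels are made-up illustrative values, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens
    success_rate: float           # HYPOTHETICAL share of attempts that complete the task


def cost_per_attempt(model: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Sticker cost of one attempt, whether or not it succeeds."""
    return (
        input_tokens * model.input_price_per_mtok
        + output_tokens * model.output_price_per_mtok
    ) / 1_000_000


def cost_per_success(model: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Expected cost of one completed task when failures are retried.

    Assumes independent attempts, so expected attempts = 1 / success_rate.
    """
    if not 0 < model.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt(model, input_tokens, output_tokens) / model.success_rate


# Same list price, different (hypothetical) success rates.
old = ModelProfile("old-model", 5.0, 25.0, success_rate=0.80)
new = ModelProfile("new-model", 5.0, 25.0, success_rate=0.90)

for m in (old, new):
    # 20,000 input and 4,000 output tokens per attempt are illustrative values.
    per_1000 = 1000 * cost_per_success(m, 20_000, 4_000)
    print(f"{m.name}: ${per_1000:.2f} per 1,000 completed tasks")
```

With these made-up inputs, each attempt costs $0.20 on either model, but the model with the higher success rate comes out at roughly $222 per 1,000 completed tasks against $250. Plug in your own measured success rates and token counts before drawing any budget conclusion.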
The four-step Monday migration
Do not upgrade everything at once. Do not treat a model bump as a one-line changelog entry. Do this.
- Pick the target agent. The one closest to either revenue or cost. Usually a support pipeline, a sales research agent, or a code-review bot. One agent, one owner.
- Run a two-week A/B. Route half your traffic to 4.7 and keep half on 4.6. Measure three things only: task completion rate, tool-call error count, and tokens per successful task. Ignore style and tone.
- Sunset 4.6. If completion rate is higher and tool-call errors drop by more than 30 percent in your environment, roll 4.7 forward as the default. Running two models in parallel indefinitely doubles your eval surface.
- Rewrite your eval harness. The step most teams skip. Older evals grade on tone or exact-match output. 4.7 is good enough that the only gap worth measuring is "did the tool call succeed and did the intended side effect land."
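The A/B split, the three metrics, and the roll-forward rule from the steps above can be sketched as follows. This is a hedged illustration, not a reference harness: `TaskResult`, `assign_arm`, `summarize`, and `should_roll_forward` are hypothetical names, and the sample records are invented. It assumes your logging can tell you, per task, whether the tool call succeeded and the intended side effect landed. Errors are normalized per task so the comparison still holds if the two arms end up with slightly different traffic.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TaskResult:
    completed: bool        # tool call succeeded AND the intended side effect landed
    tool_call_errors: int  # failed tool calls observed during the task
    tokens_used: int       # input + output tokens, including retries


@dataclass
class ArmMetrics:
    completion_rate: float
    errors_per_task: float
    tokens_per_success: float


def assign_arm(task_id: str) -> str:
    """Deterministic 50/50 split: the same task id always lands in the same arm."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return "candidate" if digest[0] % 2 == 0 else "baseline"


def summarize(results: list[TaskResult]) -> ArmMetrics:
    """Compute the three metrics for one arm of the A/B."""
    if not results:
        raise ValueError("no results for this arm")
    successes = sum(r.completed for r in results)
    total_tokens = sum(r.tokens_used for r in results)
    return ArmMetrics(
        completion_rate=successes / len(results),
        errors_per_task=sum(r.tool_call_errors for r in results) / len(results),
        # Tokens spent on failed tasks still count against the successes.
        tokens_per_success=total_tokens / successes if successes else float("inf"),
    )


def should_roll_forward(
    baseline: ArmMetrics, candidate: ArmMetrics, min_error_drop: float = 0.30
) -> bool:
    """Roll forward only if completion improves and errors drop by more than 30%."""
    if candidate.completion_rate <= baseline.completion_rate:
        return False
    if baseline.errors_per_task == 0:
        return False  # no errors to reduce, so the threshold cannot be met
    drop = (baseline.errors_per_task - candidate.errors_per_task) / baseline.errors_per_task
    return drop > min_error_drop


# Invented sample records, for illustration only.
baseline_results = [
    TaskResult(True, 0, 1000), TaskResult(True, 1, 1500),
    TaskResult(False, 2, 2000), TaskResult(False, 1, 1500),
]
candidate_results = [
    TaskResult(True, 0, 1000), TaskResult(True, 0, 1000),
    TaskResult(True, 1, 1500), TaskResult(False, 1, 1500),
]

baseline = summarize(baseline_results)
candidate = summarize(candidate_results)
print(baseline, candidate, should_roll_forward(baseline, candidate))
```

Hashing the task id keeps routing stable across retries, so a retried task is never counted in both arms. The `completed` flag is where the rewritten eval harness plugs in: it should be set by checking the side effect itself, not by grading the text of the response.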
When to wait
- You fine-tuned against 4.6. Weights do not transfer. Plan a retraining run before you swap.
- Your SLAs require version pinning. Regulated industries often contractually require 90 days on a pinned version.
- You run inference through a reseller. If your vendor has not routed 4.7, ask them when. Do not self-route around them.
The one-line takeaway
Opus 4.7 is a swap-the-engine upgrade. The work is mostly organizational: pick one agent, run the A/B, rewrite the eval harness, move on. Do it this week; every week you wait is another week of paying for avoidable retries.