Claude Sonnet 5 Is Live: Real Agentic Gains at a Price Built for Daily Use
The exact benchmark numbers separating Sonnet 5 from Sonnet 4.6 and Opus 4.8, and what they mean for real work
Key points
- What Actually Changed
- The Pricing Detail Most Coverage Missed
- Where It Runs
- Sonnet 5 vs. Opus 4.8: The Actual Decision Rule
- Use Case One: Contract Review, Start to Finish
What You'll Learn
- The exact benchmark numbers separating Sonnet 5 from Sonnet 4.6 and Opus 4.8, and what they mean for real work
- Where Sonnet 5 runs today: claude.ai, Claude Code, AWS Bedrock, Google Vertex AI, and Microsoft Foundry
- A clear rule for choosing Sonnet 5 over Opus 4.8, with the one setting that controls both cost and quality
- A step-by-step contract review workflow you can run this week, plus a second use case for competitive research
- The pricing catch buried in Anthropic's footnotes that changes how much you actually pay per task
Every few months, a new model launches with a press release full of "significant improvements" and no numbers you can act on. Sonnet 5, which Anthropic shipped on June 30, 2026, is not that. It comes with a benchmark table, an effort dial that trades cost for quality in real time, and a price that stays flat instead of climbing.
The decision in front of executives is not whether to use Claude. It is which model to route each job to, and whether paying Opus prices is still necessary for tasks that used to require it. Get that wrong and you either overpay for routine work or undershoot on tasks that need real reasoning depth.
Below: the numbers behind the "agentic" claim, where the model runs, when to reach for Opus 4.8 instead, and a full walkthrough of a contract review workflow built for Sonnet 5's new effort levels.
What Actually Changed
"Agentic" gets thrown around a lot. In practice it means a model that plans a multi-step task, uses tools like a browser or terminal, checks its own work, and keeps going without you re-prompting it after every step. Anthropic's launch post describes testers watching Sonnet 5 finish jobs "where previous Sonnet models would stop short," including a case where the model investigated a bug, wrote a test to reproduce it, fixed it, then stashed the fix to confirm the bug returned without it, all in a single unprompted pass.
The clearest apples-to-apples read sits in SWE-bench Pro, a real-world software engineering benchmark. It is the only benchmark in Anthropic's release notes with a published score for Sonnet 4.6, Sonnet 5, and Opus 4.8 alike, so it is the one comparison below that is not missing a data point for any of the three models.
SWE-bench Pro score, the only benchmark here with a published number for all three models:
- Sonnet 4.6: 58.1%
- Sonnet 5: 63.2%
- Opus 4.8: 69.2%
Sonnet 5 closes roughly half the gap to Opus 4.8 that Sonnet 4.6 carried on this benchmark: a 5.1-point gain, with 6 points still separating it from Opus 4.8. Several other benchmarks show real movement too, though Anthropic has not published a score for all three models on any of them, so they are worth knowing but are not a fair three-way comparison:
- OSWorld-Verified, a test of operating a real computer by clicking, typing, and navigating software: Sonnet 5 hits 81.2 percent, up from Sonnet 4.6's 78.5 percent (no published Opus 4.8 score).
- Terminal-Bench 2.1, a command-line task benchmark: Sonnet 5 jumps to 80.4 percent from Sonnet 4.6's 67.0 percent, its biggest release-over-release gain of the group (again, no published Opus 4.8 score).
- Humanity's Last Exam with tools enabled, a graduate-level reasoning test: Sonnet 5 scores 57.4 percent, nearly matching Opus 4.8's 57.9 percent (no published Sonnet 4.6 score).
- GDPval-AA v2, built from real professional tasks across finance, legal, and other GDP-heavy sectors: Sonnet 5 scores 1,618 against Opus 4.8's 1,615, the one case where the cheaper model wins outright (no published Sonnet 4.6 score here either).
For the document-heavy, judgment-light knowledge work most executives actually pay people to do, that last comparison is the one worth remembering: on Anthropic's own numbers, the cheaper model can come out ahead, if only narrowly.
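The "roughly half the gap" reading of the SWE-bench Pro scores is simple arithmetic, and it is worth being able to rerun when new scores appear. The sketch below uses only the three percentages quoted above; the function name `gap_closed` is an illustrative choice, not anything from Anthropic's release notes.

```python
# SWE-bench Pro scores quoted in the article, in percent.
scores = {"Sonnet 4.6": 58.1, "Sonnet 5": 63.2, "Opus 4.8": 69.2}


def gap_closed(old: float, new: float, target: float) -> float:
    """Fraction of the old-to-target gap that the new score closes."""
    return (new - old) / (target - old)


gain = scores["Sonnet 5"] - scores["Sonnet 4.6"]
remaining = scores["Opus 4.8"] - scores["Sonnet 5"]
fraction = gap_closed(scores["Sonnet 4.6"], scores["Sonnet 5"], scores["Opus 4.8"])

print(f"gain {gain:.1f} pts, remaining {remaining:.1f} pts, closed {fraction:.0%}")
# prints: gain 5.1 pts, remaining 6.0 pts, closed 46%
```

The same function works for any benchmark where all three scores are published, which on the figures above is only this one.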
The Pricing Detail Most Coverage Missed
The headline number is simple: $2 per million input tokens and $10 per million output tokens through August 31, 2026, then $3 and $15 after that, matching Sonnet 4.6 with no price increase. A token is roughly three-quarters of a word, so a million tokens covers about 750,000 words moving in or out of the model.
The detail buried in Anthropic's footnotes matters more than it looks. Sonnet 5 runs on an updated tokenizer (the tool that breaks text into billable tokens), the same one introduced with Opus 4.7, and the same input can now map to between 1.0 and 1.35 times as many tokens as before. Anthropic says the introductory price is set so the transition is "roughly cost-neutral," meaning the lower rate offsets the higher token count rather than delivering a straightforward discount. After August 31, run a before-and-after cost comparison on one sample task instead of trusting the sticker price alone.
$2 / $10 per million input/output tokens through August 31, 2026, then $3 / $15, matching Sonnet 4.6 with no price increase
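To see how the tokenizer change interacts with the two price points, here is a minimal sketch of a per-task cost comparison. The rates and the 1.0 to 1.35 multiplier range come from the figures above. The sample task size is hypothetical, and applying the multiplier to output tokens as well as input tokens is an assumption; measure your own token counts before relying on any of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


INTRO = Rate(2.0, 10.0)     # Sonnet 5 through August 31, 2026
STANDARD = Rate(3.0, 15.0)  # Sonnet 5 afterwards; also Sonnet 4.6's rate


def task_cost(rate: Rate, input_tokens: int, output_tokens: int,
              token_multiplier: float = 1.0) -> float:
    """USD cost of one task.

    token_multiplier models the new tokenizer (1.0 to 1.35 per the article).
    Assumption: it applies equally to input and output tokens.
    """
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * rate.input_per_m + scaled_out * rate.output_per_m) / 1_000_000


# Hypothetical task: 40,000 input and 3,000 output tokens under the old tokenizer.
old = task_cost(STANDARD, 40_000, 3_000)               # Sonnet 4.6 baseline
intro_worst = task_cost(INTRO, 40_000, 3_000, 1.35)    # intro price, worst-case multiplier
std_worst = task_cost(STANDARD, 40_000, 3_000, 1.35)   # standard price, worst-case multiplier

for label, cost in [("Sonnet 4.6", old), ("Sonnet 5 intro", intro_worst),
                    ("Sonnet 5 standard", std_worst)]:
    print(f"{label}: ${cost:.4f}")
```

On this hypothetical task the worst-case introductory cost (about $0.149) lands below the Sonnet 4.6 baseline (about $0.165), while the worst-case standard cost (about $0.223) lands above it, which is why the comparison is worth rerunning after August 31.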
Sonnet 5 also exposes four effort levels, a single setting that trades tokens spent reasoning for both quality and cost:
- Low: cheapest and fastest, with the clearest value versus Sonnet 4.6
- Medium: the recommended starting point for most tasks
- High: more reasoning tokens, still strong value
- Xhigh: can cost more than Opus 4.8, so reserve it for a single hard task rather than using it as a default
Where It Runs
Sonnet 5 is the default model for Free and Pro plans on claude.ai as of June 30, and it is selectable on Max, Team, and Enterprise. It runs in Claude Code, on the Claude Developer Platform (Anthropic's API), on AWS Bedrock, on Google Vertex AI, and, as of June 29, 2026, in general availability on Microsoft Foundry, Microsoft's enterprise AI platform built on Azure. Microsoft bills Claude usage inside Foundry as one consolidated line item on the existing Azure bill, using Azure-native authentication and governance, which matters if procurement approval is the real blocker to adoption at your company, not the model's capability.
One correction worth flagging: the 1 million token context window (roughly 750,000 words the model can hold in working memory at once) is not limited to Google Vertex AI. It is the standard window for Sonnet 5 across the Claude API, AWS Bedrock, Vertex AI, and Microsoft Foundry alike, with no special access request required, enough to load an entire contract portfolio or a quarter's worth of board decks into one conversation with room left for follow-up questions.
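Whether a given document set fits in that window can be estimated from word counts before you upload anything. This is a rough sketch using the three-quarters-of-a-word rule of thumb quoted earlier; the reserve for follow-up questions and the sample portfolio are hypothetical, and real token counts vary with the tokenizer and the text.

```python
WORDS_PER_TOKEN = 0.75      # rule of thumb from the article
CONTEXT_WINDOW = 1_000_000  # tokens


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_counts: list[int], reserve_tokens: int = 50_000) -> bool:
    """True if the documents plus a reserve for follow-ups fit in the window.

    reserve_tokens is an illustrative default, not a documented requirement.
    """
    total = sum(estimate_tokens(w) for w in word_counts)
    return total + reserve_tokens <= CONTEXT_WINDOW


# Hypothetical portfolio: 40 contracts of about 12,000 words each.
portfolio = [12_000] * 40
print(fits_in_context(portfolio))  # True: about 640,000 tokens plus the reserve
```

Because the newer tokenizer can produce more tokens for the same text, treat a result close to the limit as a signal to split the batch rather than as a guarantee.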
Sonnet 5 vs. Opus 4.8: The Actual Decision Rule
Use Sonnet 5 at medium or high effort for anything repetitive, checkable, or based on information you already have:
- Contract first-pass review
- Meeting recaps
- Sales outreach drafts
- Code fixes
- Competitive research
- Multi-step tool use, like updating a CRM and sending a follow-up email in one task
Anthropic cites exactly that kind of workflow in its launch post: one partner handed Sonnet 5 a two-part job, update Salesforce account tiers, then send a launch announcement to enterprise contacts, and it finished both steps end to end without stalling halfway, something the previous model could not reliably do.
Reserve Opus 4.8 for work where a wrong answer is expensive and hard to catch: the hardest technical debugging, board-level financial modeling, or any task needing the last few points of accuracy. It costs $5 per million input tokens and $25 per million output tokens, two and a half times Sonnet 5's introductory rate and about 1.7 times its standard rate. For a task demanding enough that even Opus 4.8 struggles, Anthropic's newest reasoning-focused model, Fable 5, exists for that tier of problem. It gets its own deep dive elsewhere in this issue, but the short version: keep it as the fallback for the hardest reasoning tasks.
One genuine limitation before routing sensitive work to Sonnet 5: Anthropic deliberately did not train it on cybersecurity tasks. Its own testing shows Sonnet 5 could not develop a single working exploit against a set of known Firefox vulnerabilities, while Opus 4.8 and Fable 5 performed meaningfully better on the same test. Any sanctioned penetration testing or security research still belongs on Opus 4.8.
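The routing rule in this section can be written down as a small function, which makes it easier to apply consistently across a team. This is one illustrative encoding, not an Anthropic recommendation: the `Task` fields and their names are hypothetical, the model names are plain labels rather than API identifiers, and the order of the checks is a judgment call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    security_work: bool = False    # sanctioned penetration testing or security research
    costly_if_wrong: bool = False  # a wrong answer is expensive and hard to catch
    beyond_opus: bool = False      # hard enough that Opus 4.8 struggles


def route(task: Task) -> tuple[str, Optional[str]]:
    """Return (model label, starting effort) following the article's rule of thumb.

    The effort value is only set for Sonnet 5, since the article gives
    effort guidance for that model alone.
    """
    if task.security_work:
        return ("Opus 4.8", None)
    if task.beyond_opus:
        return ("Fable 5", None)
    if task.costly_if_wrong:
        return ("Opus 4.8", None)
    return ("Sonnet 5", "medium")


print(route(Task("contract first-pass review")))        # ('Sonnet 5', 'medium')
print(route(Task("board-level model", costly_if_wrong=True)))  # ('Opus 4.8', None)
```

Security work is checked first so that it never falls through to Sonnet 5, reflecting the limitation described above.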
Use Case One: Contract Review, Start to Finish
Contract review is the clearest fit for Sonnet 5: high-volume, pattern-based, and something a human reviews anyway before it goes final. Legal AI platform Harvey added Sonnet 5 to its lineup the same day Anthropic launched it and ran it through BigLaw Bench, a legal-specific benchmark built from real law firm work. Sonnet 5 scored 91.3 percent, the highest score Harvey has recorded across every Sonnet and Opus model it has tested, with its strongest results in risk assessment and compliance, case management, and transactional drafting.
"Sonnet 5 brings a meaningful jump in legal quality over Sonnet 4.6," said Niko Grupen, Harvey's Head of Applied Research, in the company's launch post. "In early testing, it was both more accurate and more precise than its predecessor, delivering stronger answers in fewer words." Fewer words per answer means fewer output tokens per task, which compounds with Sonnet 5's lower introductory per-token price into a real cost reduction, not just a quality bump.
A workflow to run today in Claude, using medium effort as the starting point:
You are reviewing a vendor services agreement on behalf of the buyer. The full contract is attached. Do the following in order:
1. Summarize the agreement in 5 bullet points: parties, term length, total contract value, termination rights, and governing law.
2. Flag every clause that deviates from standard buyer-favorable terms, specifically: auto-renewal without a cancellation window, unlimited liability, one-sided indemnification, unilateral price increases, and IP assignment language that is not mutual.
3. For each flagged clause, quote the exact contract language, explain in plain terms why it favors the vendor, and suggest specific replacement language.
4. Output a final risk rating (Low, Medium, or High) for the contract with a one-sentence justification.
Do not summarize clauses that are standard and unremarkable. Only flag what actually deviates from buyer-favorable norms.
Step one builds a structural map before hunting for problems, so the model does not fixate on one flashy clause and miss a quieter one. Step two names specific risk patterns instead of asking Sonnet 5 to "check for anything concerning," a vague instruction that invites vague output. Step three requires a verbatim quote for every flag, so a reviewer can verify the finding against the source instead of trusting a paraphrase. Step four produces a single rating an executive can act on without reading 4 pages of output.
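The verbatim-quote requirement in step three is also what makes part of the review checkable by machine. As a minimal sketch, assuming you have the contract as plain text and have copied the model's quoted passages into a list, the hypothetical helper below reports any quote that does not appear in the source. It normalizes whitespace and case only, so differences such as curly versus straight quotation marks will still show up as misses.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(contract_text: str, quotes: list[str]) -> list[str]:
    """Return the quotes that cannot be found verbatim in the contract text."""
    haystack = normalize(contract_text)
    return [q for q in quotes if normalize(q) not in haystack]


# Hypothetical contract excerpt and model-reported quotes.
contract = "This Agreement shall renew\nautomatically for successive one-year terms."
quotes = [
    "shall renew automatically for successive one-year terms",
    "Vendor may increase fees at any time",
]
print(unverified_quotes(contract, quotes))
# prints: ['Vendor may increase fees at any time']
```

A quote that fails this check is not necessarily invented, but it is the first thing a human reviewer should look at.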
Run this on a 15-page vendor agreement and expect output in well under 2 minutes at medium effort, versus the 30 to 45 minutes a first-pass human review typically takes on a contract that length. The edge case: Harvey's testing found Sonnet 5 still struggles on dense work like tax and structured finance clauses, a limitation shared across every frontier model, not unique to Sonnet 5. Trust the first pass on a standard vendor agreement or NDA. Route tax structuring or complex derivatives language to a specialist instead.
Use Case Two: Competitive Research Synthesis
Sonnet 5's OSWorld-Verified score of 81.2 percent reflects real gains in computer use: the model can navigate a live website, read the page, and act on it inside one task, not just answer questions about text you paste in. That turns competitive research from a manual afternoon into a task you assign and check later.
Point Claude, with computer use or browsing enabled, at 3 named competitor websites and ask it to extract current pricing tiers, feature lists, and any publicly stated customer counts or funding news from the past 90 days, then assemble a single comparison table with your product as the baseline column. At medium effort, expect a first pass in 5 to 10 minutes, covering ground that would otherwise mean a dozen open browser tabs and manual copy-pasting into a spreadsheet.
Build in a check step every time: ask Sonnet 5 to cite the specific page URL for every pricing figure, since pricing pages change often and a stale listing can produce a confidently wrong number. Anthropic's safety testing found Sonnet 5 has lower hallucination and sycophancy rates than Sonnet 4.6, but lower is not zero, and a wrong competitor price feeding into a board deck is worth a 60-second human spot-check first.
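The citation check can be partly automated before the human spot-check. The sketch below assumes you have the comparison table as a list of rows with a `source_url` field; that field name and the sample rows are hypothetical. It only confirms that each row cites a well-formed web address, not that the figure on that page is current or correct.

```python
from urllib.parse import urlparse


def has_valid_source(row: dict) -> bool:
    """True if the row cites an http(s) URL with a host name."""
    parsed = urlparse(row.get("source_url") or "")
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def rows_needing_review(rows: list[dict]) -> list[dict]:
    """Rows whose pricing figure has no usable citation."""
    return [row for row in rows if not has_valid_source(row)]


# Hypothetical rows from a model-built comparison table.
rows = [
    {"competitor": "A", "price": "$49/mo", "source_url": "https://example.com/pricing"},
    {"competitor": "B", "price": "$59/mo", "source_url": ""},
    {"competitor": "C", "price": "$39/mo"},
]
print([row["competitor"] for row in rows_needing_review(rows)])
# prints: ['B', 'C']
```

Rows that pass still deserve the 60-second look at the cited page; rows that fail should go back to the model with a request for the source.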
Action Steps Summary
1. Switch your default model. On claude.ai Free or Pro, Sonnet 5 is already default; on Max, Team, or Enterprise, select it manually for coding, contract review, and research tasks.
2. Start every new task at medium effort. Move to high or xhigh only if output quality falls short, since xhigh can cost more than Opus 4.8 for similar results.
3. Run the contract review prompt on one live document. Use the template above on a current vendor agreement or NDA and time the first pass against your usual human review time.
4. Assign one competitive research task this week. Pick 3 named competitors and let Sonnet 5 build the comparison table while you work on something else.
5. Reserve Opus 4.8 deliberately. Keep it for board-level analysis, high-stakes debugging, and sanctioned security work, and treat Fable 5 as the fallback for reasoning tasks that outgrow both.
6. Set a calendar reminder for August 31. That is the last day of introductory pricing; the standard rate of $3 and $15 per million tokens applies afterwards, and the new tokenizer means actual per-task cost may shift more than the sticker price suggests.
PromptHacker turns the AI firehose into practical next steps for work, health, family, and everything time keeps trying to steal.