Claude Opus 4.8 Evaluation: Improved Strengths,…

The Claude Opus 4.8 Review: Better At What’s It Good At, Worse At What It’s Not shows that Claude Opus 4.8 arrives with benchmark results pointing to real-world gains in code, workflow, and prototyping tasks, as Thezvi reports. The jump from 64.3 to 69.2 on the SWE-bench Pro benchmark reflects how much more routine coding and minor research the model now handles correctly, especially compared to earlier, more pattern-matching LLMs. And since pricing stays locked at $5 for input and $25 for output per million tokens, Opus 4.8 gives users more value without any extra cost, according to both Thezvi and Lennysnewsletter.

This Claude Opus 4.8 review covers how the model delivers significant improvements in routine coding, workflow automation, and rapid prototyping. However, it also details persistent weaknesses in ambiguous or highly strategic tasks, confirming the theme that Claude Opus 4.8 is better at what it’s already good at, but worse or unchanged where it previously struggled.

Main Community Insights

Lesswrong and Community Analysis

The Lesswrong community logs both measurable step-ups and a familiar qualitative ceiling for Claude Opus 4.8. Participants in public leaderboards and private logs echo one message: Opus 4.8 truly earns its higher SWE-bench Pro scores. But contributors frequently spot “edge failures” when rules blur or instructions change mid-way. Annotated LW transcripts highlight the types of ambiguity that stump Opus 4.8—non-obvious conditional logic, winding chain-of-thought, or semantic curveballs.

Effort controls and subagent features help shrink gaps for basic jobs, but they just can’t cover the planning skill needed for rigorous research or business logic.

Lennysnewsletter Business Perspective

Lennysnewsletter zeros in on Opus 4.8’s real-world impact: it’s set up for rapid business prototyping, fast feature delivery, and parallelized workflows on Claude.ai and Cowork. According to this Claude Opus 4.8 review, launching greenfield prototypes or handling one-off experiments is now easier.

Testing Claude Opus 4.8: Methods and Findings

Creating the tests

Test designers leaned on cross-validation for recent reviews—using logs and case histories from Thezvi, Lennysnewsletter, and Lesswrong to ensure standards for both public and private benchmarks.

Test results

The big jump in SWE-bench Pro scores—from 64.3 to 69.2—captures real, stepwise progress, according to Thezvi’s logs. However, reviewers are clear: routine coding shines, but once the test’s ambiguity rises, Opus 4.8 falters. Code completion and automation reward the upgrade, while strategy-heavy work or deep planning still reveal points of drift, per Lennysnewsletter’s findings. Lesswrong’s crowdsourcing confirms a drop in syntax and off-by-one errors in predictable scenarios, supporting the main findings in this Claude Opus 4.8 Review: Better At What’s It Good At, Worse At What It’s Not.
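To put the score jump in perspective, here is a minimal sketch of the arithmetic behind it. The two scores are the ones quoted in this review; the function name `score_gain` and the rounding to one decimal place are illustrative choices, not anything taken from the sources.

```python
# SWE-bench Pro scores as quoted in this review.
OPUS_4_7_SCORE = 64.3
OPUS_4_8_SCORE = 69.2


def score_gain(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent).

    Hypothetical helper for illustration; both values are rounded
    to one decimal place.
    """
    absolute = new - old
    relative = absolute / old * 100
    return round(absolute, 1), round(relative, 1)


absolute_gain, relative_gain = score_gain(OPUS_4_7_SCORE, OPUS_4_8_SCORE)
print(f"Absolute gain: {absolute_gain} points")
print(f"Relative gain: {relative_gain}%")
```

Reading the gain both ways helps: the absolute difference is what the leaderboard shows, while the relative figure gives a rough sense of how much more often the newer model succeeds compared with its predecessor.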

Andon Labs (@andonlabs), May 28, 2026:

Learnings from testing Claude Opus 4.8:

> Much worse than Opus 4.7 and GPT 5.5 on Vending Bench
> More aligned than previous Claude models (Opus 4.6+ and Mythos)
> Also worse on Blueprint-Bench
> Scared of getting caught
> Max reasoning is not the best reasoning effort

Claude Opus 4.8 Benchmarks and Comparisons

Benchmark Numbers: 4.8 vs 4.7 vs GPT-5.5

Model	SWE-bench Pro Score	Input/Output Cost per 1M Tokens	Fast Mode Cost
Claude Opus 4.8	69.2	$5 / $25	Cheaper than 4.7 ($30/$150)
Claude Opus 4.7	64.3	$5 / $25	$30 / $150
GPT-5.5	Unavailable	Not disclosed	Not disclosed

Thezvi’s documentation clarifies that at $5 for input and $25 for output per million tokens, Opus 4.8 matches 4.7’s sticker price but now undercuts its fast mode—$30/$150 for 4.7—keeping batch jobs affordable and live-use possible for smaller teams.
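As a rough illustration of what these rates mean for a batch job, the sketch below estimates cost from token counts. The per-million rates are the ones quoted in this review; the `Pricing` class, the `job_cost` helper, and the example workload of 2M input and 500K output tokens are hypothetical. The exact Opus 4.8 fast mode price is not given in the sources, so only the 4.7 fast mode rate appears here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Rates as quoted in this review.
OPUS_STANDARD = Pricing(5.0, 25.0)     # Opus 4.8 and 4.7 standard pricing
OPUS_4_7_FAST = Pricing(30.0, 150.0)   # Opus 4.7 fast mode


def job_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a job from its token counts."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Hypothetical workload: 2M input tokens, 500K output tokens.
standard_cost = job_cost(OPUS_STANDARD, 2_000_000, 500_000)
fast_cost = job_cost(OPUS_4_7_FAST, 2_000_000, 500_000)
print(f"Standard pricing: ${standard_cost:.2f}")
print(f"Opus 4.7 fast mode: ${fast_cost:.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, output-heavy jobs such as long code generation will dominate the bill even when the prompt is large.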

What’s Actually New (Beyond Numbers)

Lennysnewsletter breaks down how Claude.ai and Cowork now come with far more practical upgrades for developers. Notably, launching greenfield prototypes or handling one-off experiments is now easier with Claude Opus 4.8.


Claude Opus 4.8 vs. Sonnet 4.6

Should you use Claude Opus 4.8 or Sonnet 4.6?

Lennysnewsletter’s testing puts Opus 4.8 ahead of Sonnet 4.6 for routine code and quick prototyping. For stepwise logic or clarity, many pick Sonnet, while automation efforts now tilt toward Opus. For most workflow tasks, this Claude Opus 4.8 review concludes that Opus is the winner.

Claude Opus 4.8 vs. GPT-5.5

How does Opus 4.8 compare to GPT-5.5?

No published SWE-bench Pro score exists for GPT-5.5, creating a gap for direct comparison, Lesswrong’s documentation reveals. And while Thezvi and Lennysnewsletter clarify Opus pricing and workflows, GPT-5.5’s costs and real coding wins remain in a black box. Testers compare what they can: Opus 4.8’s new automatic workflows, subagent parallelization, and input/output management—features not yet matched in public GPT-5.5 benchmarks. In heavy business planning and complexity, Lennysnewsletter’s logs suggest that neither consistently bests the other for deep strategy. Opus 4.8 wins for speed, Sonnet for clarity, GPT-5.5 for theoretical breadth—but until SWE-bench Pro data arrives, the leaderboard is unsettled.

Best Use Cases for Claude Opus 4.8

All three main sources for Opus 4.8 (Thezvi, Lennysnewsletter, and Lesswrong) converge on the same uses: it’s built to knock out jobs where your workflow and objectives stay clear and measurable. As evidenced in this Claude Opus 4.8 review, it’s best for routine automation, coding tasks, brisk prototype development, and managed workflows.

Future Developments and Anthropic’s Roadmap

According to Thezvi, Opus 4.8’s next cycle will target exactly these edge-case and context issues, which means future reviews could shift the narrative of “Claude Opus 4.8 Review: Better At What’s It Good At, Worse At What It’s Not.”

Primary Takeaways

Records from Thezvi confirm: Opus 4.8’s 69.2 SWE-bench Pro score fortifies its lead for code-driven and structured workflow tasks, while the model holds steady at $5/$25 standard pricing and adds a cheaper fast mode.

Community Reaction and Live Feedback

Live Lesswrong threads and Thezvi’s collaborative logs show mixed reactions: optimism about automation, but clear warnings that the model’s reasoning boundaries haven’t budged. Researchers and power users agree: productivity jumps for coding and rote automation, but edge-case fuzziness and hallucinations fuel caution. The coding is more consistent, yet, as repeated across multiple Claude Opus 4.8 review summaries, notable weaknesses remain for tasks requiring broad reasoning or deep strategic planning.

Comparison with Other AI Models

Direct comparative analysis from Lesswrong and Lennysnewsletter shows Opus 4.8 outpaces Sonnet 4.6 for everyday code and workflow tasks—but the gap flexes depending on job details. GPT-5.5, with key metrics and pricing missing in action, hasn’t even proven its large-team strengths yet. Claude Opus 4.8 wins not because it’s best at everything, but because it outperforms rivals for tightly scoped, repeatable jobs businesses need now. In some niche logic tasks, Sonnet 4.6 still takes gold, while GPT-5.5 may have untapped breadth (though without SWE-bench results, it can’t truly challenge Opus for the coding title).

Opus 4.8 Delivers—But Not Universally

Opus 4.8 signals steady progress, not radical reinvention, by doubling down on what it already does well: process automation, code generation, and fast prototyping for cost-conscious teams. The model easily outpaces earlier versions and main rivals for bulk code, routine ops, and lightning-quick launches, all while staying affordable and scaling cleanly, per Thezvi and Lennysnewsletter.

This Claude Opus 4.8 review illustrates its strengths and capabilities, confirming it is among the best choices for tightly scoped, repeatable coding and workflow jobs. For deeper perspective on current AI model limits, risks, and user experiences, see Best AI Models Still Encourage ‘Harmful Intimacy’ With C.
