Opus 4.8 on the full 99-task BioMysteryBench

This is a reproduction of Anthropic's BioMysteryBench using claude-opus-4-8 on all 99 problems × 3 arms (A: baseline, B: Omnii hint, C: Omnii + workflow), with a single trial per cell. Judging applies a heuristic exact match first, then a Haiku-4.5 LLM rescore for any substantive answer the heuristic couldn't decide.
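The two-stage judging described above can be sketched roughly as follows. This is a minimal illustration, not the actual harness: the normalization rule, the treatment of empty answers as non-substantive, and the `llm_judge` callable (a stand-in for the Haiku-4.5 rescore) are all assumptions, and the method labels are borrowed from the judge method breakdown only for illustration.

```python
from typing import Callable, Optional, Tuple


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def heuristic_exact(answer: str, gold: str) -> Optional[bool]:
    """Return True/False when the heuristic can decide, None when it cannot."""
    a, g = normalize(answer), normalize(gold)
    if not a:
        return False  # assumption: an empty answer is not substantive, so it fails outright
    if a == g:
        return True
    return None  # substantive but undecided: defer to the LLM judge


def judge(answer: str, gold: str,
          llm_judge: Callable[[str, str], bool]) -> Tuple[bool, str]:
    """Heuristic first; fall back to the (hypothetical) LLM judge only when undecided."""
    verdict = heuristic_exact(answer, gold)
    if verdict is not None:
        return verdict, "exact"
    return llm_judge(answer, gold), "llm_rescore"
```

Passing the LLM judge in as a callable keeps the sketch runnable without any API call; a real run would wrap a model request behind the same two-argument interface.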

Headline numbers (after LLM rescore)

| Subset | A: baseline | B: Omnii hint | C: Omnii + workflow |
| --- | --- | --- | --- |
| all 99 | 59/99 (60%) | 63/99 (64%) | 58/99 (59%) |
| human-solvable (76) | 51/76 (67%) | 55/76 (72%) | 50/76 (66%) |
| human-difficult (23) | 8/23 (35%) | 8/23 (35%) | 8/23 (35%) |
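As a quick sanity check on the headline numbers, the percentages can be recomputed from the raw pass counts. The sketch below uses only the counts reported above. It assumes the human-solvable and human-difficult subsets partition the 99 tasks, which the counts are consistent with (76 + 23 = 99); the variable names are illustrative.

```python
# Pass counts per subset and arm, copied from the headline numbers.
counts = {
    "all": {"n": 99, "A": 59, "B": 63, "C": 58},
    "human-solvable": {"n": 76, "A": 51, "B": 55, "C": 50},
    "human-difficult": {"n": 23, "A": 8, "B": 8, "C": 8},
}


def pct(passed: int, total: int) -> int:
    # Pass rate rounded to the nearest whole percent.
    return round(100 * passed / total)


rates = {
    subset: {arm: pct(row[arm], row["n"]) for arm in "ABC"}
    for subset, row in counts.items()
}

# Under the partition assumption, the two subsets must add up to the full set.
for arm in "ABC":
    assert (counts["human-solvable"][arm]
            + counts["human-difficult"][arm]) == counts["all"][arm]

print(rates)
```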

Arm-vs-A movement (full 99)

| Comparison | Flips (A fail → arm pass) | Regressions (A pass → arm fail) | Net |
| --- | --- | --- | --- |
| B vs A | 6 | 2 | +4 |
| C vs A | 4 | 5 | -1 |
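Flips and regressions are counted per task by comparing each arm's outcome with arm A's. The sketch below shows one way to do that on toy data, then checks that the reported movement agrees with the headline totals. The task IDs and pass/fail values in the toy example are hypothetical and are not results from this run.

```python
from typing import Dict


def movement(baseline: Dict[str, bool], arm: Dict[str, bool]) -> Dict[str, int]:
    """Count A-fail -> arm-pass flips and A-pass -> arm-fail regressions."""
    flips = sum(1 for t in baseline if not baseline[t] and arm[t])
    regressions = sum(1 for t in baseline if baseline[t] and not arm[t])
    return {"flips": flips, "regressions": regressions, "net": flips - regressions}


# Toy, hypothetical outcomes (True = pass), only to show the mechanics.
toy_a = {"t1": True, "t2": False, "t3": False, "t4": True}
toy_arm = {"t1": True, "t2": True, "t3": False, "t4": False}
toy_result = movement(toy_a, toy_arm)

# Consistency check on the reported figures: A total + flips - regressions = arm total.
a_total = 59
reported = {"B": (6, 2, 63), "C": (4, 5, 58)}  # (flips, regressions, arm total)
for arm_name, (f, r, total) in reported.items():
    assert a_total + f - r == total
```

The net column is simply flips minus regressions, so it matches the difference between each arm's pass count and arm A's on the full 99.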

How to read this site

Judge method breakdown

exact: 170 · llm_rescore: 93 · llm: 27 · exact_v2: 7 (297 in total, matching the 99 × 3 problem-arm cells)

What we found