Opus 4.8 on the full 99-task BioMysteryBench

Reproduction of Anthropic's BioMysteryBench using claude-opus-4-8 on all 99 problems × 3 arms (A: baseline, B: Omnii hint, C: Omnii + workflow), for 297 cells in total. Single trial per cell. Judging applies a heuristic exact-match check first, then a Haiku-4.5 LLM rescore for any substantive answer the heuristic could not decide.
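The two-stage judging described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual judge: `normalize`, `heuristic_match`, `judge`, and `stub_llm` are hypothetical names, the normalization rule is an assumption, and the Haiku-4.5 call is replaced by an offline stub. The answers are made-up examples, not benchmark data.

```python
from typing import Callable, Optional, Tuple

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())

def heuristic_match(answer: str, rubric: str) -> Optional[bool]:
    """True on exact match, False for an empty answer, None when undecided."""
    a, r = normalize(answer), normalize(rubric)
    if not a:
        return False   # nothing substantive to judge
    if a == r:
        return True
    return None        # substantive but not an exact match: defer to the LLM

def judge(answer: str, rubric: str,
          llm_judge: Callable[[str, str], bool]) -> Tuple[bool, str]:
    """Return (passed, method); call the LLM only when the heuristic cannot decide."""
    verdict = heuristic_match(answer, rubric)
    if verdict is not None:
        return verdict, "exact"
    return llm_judge(answer, rubric), "llm_rescore"

def stub_llm(answer: str, rubric: str) -> bool:
    """Offline stand-in for the LLM judge (hypothetical behaviour)."""
    return normalize(rubric) in normalize(answer)

print(judge("BRCA1", "brca1", stub_llm))
print(judge("The gene is BRCA1", "BRCA1", stub_llm))
print(judge("", "BRCA1", stub_llm))
```

The design point is that the cheap deterministic check settles the easy cases, so the LLM is only consulted for answers that are substantive but not literal matches.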

Headline numbers (after LLM rescore)

Subset	A — baseline	B — Omnii hint	C — Omnii + workflow
all 99	59/99 (60%)	63/99 (64%)	58/99 (59%)
human-solvable (76)	51/76 (67%)	55/76 (72%)	50/76 (66%)
human-difficult (23)	8/23 (35%)	8/23 (35%)	8/23 (35%)
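As a sanity check, the percentages in the table can be recomputed from the raw pass counts. The sketch below copies the counts from the table; the variable and function names are illustrative only.

```python
# Pass counts copied from the table above: subset -> (n_tasks, {arm: passes}).
results = {
    "all 99": (99, {"A": 59, "B": 63, "C": 58}),
    "human-solvable": (76, {"A": 51, "B": 55, "C": 50}),
    "human-difficult": (23, {"A": 8, "B": 8, "C": 8}),
}

def pct(passes: int, total: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * passes / total)

rates = {
    subset: {arm: pct(p, n) for arm, p in arms.items()}
    for subset, (n, arms) in results.items()
}

# The two subsets should partition the full set, arm by arm.
for arm in "ABC":
    solvable = results["human-solvable"][1][arm]
    difficult = results["human-difficult"][1][arm]
    assert solvable + difficult == results["all 99"][1][arm]

for subset, by_arm in rates.items():
    print(subset, by_arm)
```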

Arm-vs-A movement (full 99)

Comparison	Flips (A fail → arm pass)	Regressions (A pass → arm fail)	Net
B vs A	6	2	+4
C vs A	4	5	-1
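The flip and regression counts above come from comparing per-task verdicts between arm A and another arm. A minimal sketch of that comparison follows; the `movement` helper and the five-task toy data are made up for illustration and are not the benchmark's results.

```python
from typing import Dict, Tuple

def movement(base: Dict[str, bool], arm: Dict[str, bool]) -> Tuple[int, int, int]:
    """Count flips (base fail -> arm pass), regressions (base pass -> arm fail), and net."""
    flips = sum(1 for t in base if not base[t] and arm[t])
    regressions = sum(1 for t in base if base[t] and not arm[t])
    return flips, regressions, flips - regressions

# Toy data (made up for illustration): task id -> passed?
arm_a = {"t1": True, "t2": False, "t3": False, "t4": True, "t5": False}
arm_b = {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False}

flips, regressions, net = movement(arm_a, arm_b)
print(flips, regressions, net)

# Net movement always equals the difference in pass counts.
assert net == sum(arm_b.values()) - sum(arm_a.values())
```

The same identity holds in the table: the B vs A net of +4 matches 63 - 59, and the C vs A net of -1 matches 58 - 59.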

How to read this site

Sweep viewer — per-task drill-down. Each task card shows the agent's submitted answer vs the rubric, judge verdict, and a turn-by-turn agent trace (with Omnii calls highlighted).
99 problems — all questions with their data file manifests and (spoiler-folded) rubrics.
Agent prompts — the actual prompt text the subagent receives, including the v1 (mpoli's two-flips) vs v2 (this run) diff.
Two-flips report · Embedding case studies · Project summary

Judge method breakdown

exact: 170 · llm_rescore: 93 · llm: 27 · exact_v2: 7
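Assuming each cell is counted under exactly one judge method, the four counts should add up to the 99 × 3 = 297 cells in the sweep. A short check, with the counts copied from the line above and illustrative variable names:

```python
method_counts = {"exact": 170, "llm_rescore": 93, "llm": 27, "exact_v2": 7}
n_tasks, n_arms = 99, 3

total_cells = sum(method_counts.values())
# Every (task, arm) cell should appear under exactly one method.
assert total_cells == n_tasks * n_arms

# Share of cells decided by each method, in percent.
shares = {m: round(100 * c / total_cells, 1) for m, c in method_counts.items()}
print(total_cells, shares)
```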

What we found

Omnii arm B gives a real lift on the human-solvable set: +4 tasks, roughly +5 pp (55/76 = 72% vs 51/76 = 67% baseline). Arm C is a wash (50/76 = 66%).
No net Omnii effect on the human-difficult set: all three arms tie at 8/23 (35%).
The C-arm advantage from mpoli's 18-task corrected sweep didn't replicate at full scale.
Our baseline on the human-solvable set (67%) is 10 pp below the blog's reported 77%. Two likely contributors: ~17% of cells were infrastructure failures (no answer / killed), and we ran 1 trial per cell versus the blog's 5.