Opus 4.8 on the full 99-task BioMysteryBench
Reproduction of Anthropic's BioMysteryBench using claude-opus-4-8
on all 99 problems × 3 arms (A: baseline, B: Omnii hint, C: Omnii + workflow).
Single trial per cell (297 cells in total). Judging runs a heuristic exact-match
first, then a Haiku-4.5 LLM rescore for any substantive answer the heuristic couldn't decide.
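The two-stage judging order can be sketched as follows. This is a minimal illustration, not the harness's actual code. The names `normalize`, `heuristic_judge`, `judge` and the `llm_rescore` callable are hypothetical. The case/whitespace normalization and the rule that an empty answer fails without an LLM call are assumptions. The real Haiku-4.5 call is replaced by a caller-supplied function.

```python
from typing import Callable, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def heuristic_judge(answer: str, rubric: str) -> Optional[bool]:
    """Return True/False when the heuristic is decisive, None when it is not."""
    a, r = normalize(answer), normalize(rubric)
    if not a:
        return False   # assumption: an empty answer fails without an LLM call
    if a == r:
        return True    # exact match after normalization
    return None        # substantive but undecided: defer to the LLM rescore


def judge(
    answer: str,
    rubric: str,
    llm_rescore: Callable[[str, str], bool],  # hypothetical stand-in for the Haiku-4.5 judge
) -> Tuple[bool, str]:
    """Run the heuristic first; call the LLM only when the heuristic abstains."""
    verdict = heuristic_judge(answer, rubric)
    if verdict is not None:
        return verdict, "exact"
    return llm_rescore(answer, rubric), "llm_rescore"
```

The returned method labels reuse two of the names from the judge method breakdown; how the real harness assigns them is assumed here. Passing the LLM judge in as a function keeps the sketch runnable offline, for example with `lambda answer, rubric: False` as a stub.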
Headline numbers (after LLM rescore)
| Subset | A — baseline | B — Omnii hint | C — Omnii + workflow |
|---|---|---|---|
| all 99 | 59/99 (60%) | 63/99 (64%) | 58/99 (59%) |
| human-solvable (76) | 51/76 (67%) | 55/76 (72%) | 50/76 (66%) |
| human-difficult (23) | 8/23 (35%) | 8/23 (35%) | 8/23 (35%) |
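As a sanity check, the percentages in the table can be recomputed from the raw pass counts. The sketch below uses only the numbers in the table; the variable and function names are illustrative.

```python
# Pass counts per subset and arm, copied from the headline table.
counts = {
    "all 99":               {"n": 99, "A": 59, "B": 63, "C": 58},
    "human-solvable (76)":  {"n": 76, "A": 51, "B": 55, "C": 50},
    "human-difficult (23)": {"n": 23, "A": 8,  "B": 8,  "C": 8},
}


def pct(passed: int, total: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)


rates = {
    subset: {arm: pct(row[arm], row["n"]) for arm in ("A", "B", "C")}
    for subset, row in counts.items()
}

# The two subsets should partition the full set, arm by arm.
for arm in ("A", "B", "C"):
    solvable = counts["human-solvable (76)"][arm]
    difficult = counts["human-difficult (23)"][arm]
    assert solvable + difficult == counts["all 99"][arm]

for subset, row in rates.items():
    print(subset, row)
```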
Arm-vs-A movement (full 99)
| Comparison | Flips (A fail → arm pass) | Regressions (A pass → arm fail) | Net |
|---|---|---|---|
| B vs A | 6 | 2 | +4 |
| C vs A | 4 | 5 | -1 |
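Flips and regressions are paired, per-task comparisons against arm A. A minimal sketch of how they could be counted from per-task verdicts is below. The task IDs and verdicts in the toy example are made up for illustration; only the totals in the final check come from the tables above.

```python
from typing import Dict, Tuple


def movement(baseline: Dict[str, bool], arm: Dict[str, bool]) -> Tuple[int, int, int]:
    """Count flips (baseline fail, arm pass), regressions (baseline pass, arm fail), and net."""
    flips = sum(1 for task, ok in baseline.items() if not ok and arm[task])
    regressions = sum(1 for task, ok in baseline.items() if ok and not arm[task])
    return flips, regressions, flips - regressions


# Toy verdicts (hypothetical task IDs, not real results).
toy_a = {"t1": True, "t2": False, "t3": True, "t4": False}
toy_b = {"t1": True, "t2": True, "t3": False, "t4": True}
print(movement(toy_a, toy_b))

# Consistency check on the reported numbers: net movement must equal
# the difference in pass counts on the full 99 tasks.
passes = {"A": 59, "B": 63, "C": 58}
reported = {"B": (6, 2), "C": (4, 5)}
for arm, (flips, regressions) in reported.items():
    assert flips - regressions == passes[arm] - passes["A"]
```

Counting movement per task rather than comparing totals shows how much churn sits behind a small net change: arm C changes the outcome on 9 tasks to end up one task below baseline.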
How to read this site
- Sweep viewer — per-task drill-down. Each task card shows the agent's submitted answer vs the rubric, judge verdict, and a turn-by-turn agent trace (with Omnii calls highlighted).
- 99 problems — all questions with their data file manifests and (spoiler-folded) rubrics.
- Agent prompts — the actual prompt text the subagent receives, including the v1 (mpoli's two-flips) vs v2 (this run) diff.
- Two-flips report · Embedding case studies · Project summary
Judge method breakdown
exact: 170 · llm_rescore: 93 · llm: 27 · exact_v2: 7
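The four method counts should account for every cell in the sweep (99 tasks × 3 arms). A short check, using only the counts listed above:

```python
# Judge method counts, copied from the breakdown line.
breakdown = {"exact": 170, "llm_rescore": 93, "llm": 27, "exact_v2": 7}

n_tasks, n_arms = 99, 3
total = sum(breakdown.values())

# Every (task, arm) cell should have exactly one judge method.
assert total == n_tasks * n_arms

# Share of cells decided by each method, in percent.
shares = {method: round(100 * count / total, 1) for method, count in breakdown.items()}
print(total, shares)
```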
What we found
- Omnii arm B gives a real lift of 4 tasks on the solvable set (55/76 vs 51/76, i.e. 72% vs 67% baseline, about +5 pp). Arm C is a wash (50/76, one task below baseline).
- Zero Omnii effect on the difficult set — all arms tie at 35%.
- The C-arm advantage from mpoli's 18-task corrected sweep didn't replicate at full scale.
- Our baseline solvable rate (67%) is 10 pp below the blog's reported 77%. Two likely contributors: ~17% of cells were infrastructure failures (no answer / killed), and we run 1 trial per cell vs the blog's 5.