BioMysteryBench viewer

99 problems · 76 human-solvable · 23 human-difficult. Per-task model results are not published; this viewer surfaces our local Omnii triage plus our corrected sweep performance and traces.

Aggregate model results (from the Anthropic blog)

Model | Human-solvable (n=76) | Human-difficult (n=23) | Reliability

5 trials per problem. Per-task breakdowns aren't published (see source).

Corrected Omnii sweeps

Sweep | Model | Cells | Arm performance | Flips vs A | Regressions vs A

Each expanded task card below includes the answer, grade, local transcript path, remote root, and expandable raw JSONL trace for each corrected run cell.
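
For readers who want to inspect a raw JSONL trace outside the viewer, the following is a minimal sketch of loading one. It assumes only the generic JSONL convention (one JSON object per line); the trace schema is not documented here, so no field names are assumed, and the helper names `parse_trace_lines` and `load_trace` are hypothetical.

```python
import json
from pathlib import Path


def parse_trace_lines(lines):
    """Parse an iterable of JSONL lines into a list of decoded objects.

    Hypothetical helper: assumes one JSON value per non-empty line and
    makes no assumption about which keys each trace event carries.
    """
    events = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing on them
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so a truncated trace is easy to spot.
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
    return events


def load_trace(path):
    """Read a JSONL trace file from a local transcript path."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_trace_lines(fh)
```

Calling `load_trace` with the local transcript path shown on a task card would return the trace events in file order; raising on a malformed line (instead of skipping it) is a design choice here, so that a partially written trace is not silently treated as complete.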