BioMysteryBench viewer

99 problems · 76 human-solvable · 23 human-difficult. Per-task model results are not published; this viewer surfaces our local Omnii triage plus our corrected sweep performance and traces.

Aggregate model results (from the Anthropic blog)

Model	Human-solvable (n=76)	Human-difficult (n=23)	Reliability

5 trials per problem. Per-task breakdowns are not published in the source.

Corrected Omnii sweeps

Sweep	Model	Cells	Arm performance	Flips vs A	Regressions vs A

Each expanded task card below includes the answer, grade, local transcript path, remote root, and expandable raw JSONL trace for each corrected run cell.


Built from data/biomystery-full/problems.csv, results/outputs/triage_v3.json, results/outputs/omnii_usefulness_ranking.tsv, results/transcripts/**/grade.json, and results/runs_targeted_omnii_*/**/{answer.txt,grade.json,transcript.jsonl}. Rebuild: python3 viewer/build.py.
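For readers who want to inspect the corrected run cells without the viewer, the sketch below shows one way to walk the results/runs_targeted_omnii_*/ tree and load a raw trace. It is not taken from viewer/build.py, whose implementation is not shown here. The function names find_run_cells and load_trace are hypothetical, and the sketch assumes each run cell is a directory holding all three files named above.

```python
import json
from pathlib import Path

# File names taken from the build note above; treating them as one "cell"
# per directory is an assumption of this sketch.
CELL_FILES = ("answer.txt", "grade.json", "transcript.jsonl")


def find_run_cells(root="."):
    """Yield a {file name: path} dict for each directory with all three cell files."""
    root = Path(root)
    # Python's glob has no {a,b,c} brace expansion, so match grade.json
    # first and then check for its siblings in the same directory.
    pattern = "results/runs_targeted_omnii_*/**/grade.json"
    for grade_path in sorted(root.glob(pattern)):
        cell_dir = grade_path.parent
        if all((cell_dir / name).is_file() for name in CELL_FILES):
            yield {name: cell_dir / name for name in CELL_FILES}


def load_trace(path):
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    for cell in find_run_cells("."):
        events = load_trace(cell["transcript.jsonl"])
        print(cell["grade.json"].parent, len(events), "trace events")
```

Directories that are missing any of the three files are skipped rather than reported, which keeps the sketch short; a real build step may prefer to flag them.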