99 problems · 76 human-solvable · 23 human-difficult. Per-task model results are not published; this viewer surfaces our local Omnii triage plus our corrected sweep performance and traces.
| Model | Human-solvable (n=76) | Human-difficult (n=23) | Reliability |
|---|---|---|---|
5 trials per problem. Per-task breakdowns aren't published (see source).
| Sweep | Model | Cells | Arm performance | Flips vs A | Regressions vs A |
|---|---|---|---|---|---|
Each expanded task card below includes the answer, grade, local transcript path, remote root, and expandable raw JSONL trace for each corrected run cell.
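For readers who want to inspect a run cell outside the viewer, the following is a minimal sketch of how one card's fields and its raw JSONL trace might be loaded. The `RunCell` class, its field names, and the `parse_jsonl` helper are hypothetical and chosen only to mirror the card fields listed above; the actual trace schema is not documented here.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RunCell:
    """Hypothetical container mirroring the fields shown on a task card."""
    answer: str
    grade: str
    transcript_path: str  # local transcript path
    remote_root: str
    trace: list[dict[str, Any]] = field(default_factory=list)


def parse_jsonl(text: str) -> list[dict[str, Any]]:
    """Parse JSONL text (one JSON object per line), skipping blank lines."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        events.append(json.loads(line))
    return events


def load_trace(path: str) -> list[dict[str, Any]]:
    """Read a JSONL trace file from disk and return its events in order."""
    with open(path, encoding="utf-8") as fh:
        return parse_jsonl(fh.read())
```

A cell could then be assembled with `RunCell(answer=..., grade=..., transcript_path=path, remote_root=..., trace=load_trace(path))`, substituting the values shown on the card.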