Project summary — omnii-biomysterybench
End-to-end report on whether giving Claude access to Omnii (a 7B DNA language model from Radical Numerics) as an external tool lifts its scores on Anthropic's BioMysteryBench.
Spun out from radical-megatron/experiments/biomystery/. Sister repos (submoduled): thirdparty/radical-vllm, thirdparty/radical-mcp.
Headline findings
- Omnii vLLM stack works. Both modes hosted as persistent Slurm jobs:
  - gen at port 8800: radicalnumerics/omnii-dna-7b-base (cosine-iter048000-hf), per-position NLL via OpenAI-compatible prompt_logprobs
  - embed at port 8801: same model (pooling-v0 revision), 4096-d pooled embeddings via /v1/embeddings
  - Deployment workarounds are documented in scripts/serve_omnii_*.sh: non-root container, /etc/passwd mount, cv2_stub, exec-mode tmpfs, host-mounted radical/{renderers,tokenizers}, and charlevel-renderer plugin registration.
- Blog ballpark reproduced on a 31-task subsample (31% of the 99-task bench):
| subset | sample-pass-rate | blog (Opus 4.6) |
|---|---|---|
| solvable=YES (n=20 of 76) | 16/20 = 80% | ~77% |
| solvable=NO (n=11 of 23) | 3/11 = 27% | ~23-30% |
Caveats: not the full bench, single trial each, sampled smaller-data tasks first, three rubric-format-mismatch judge overrides. See Blog reproduction below.
- Initial Omnii investigation found real signal but not a clean broad pass-rate lift. The offline triage and Claude-Code case studies identified task families where Omnii-DNA is useful:
- off-distribution read discovery via NLL,
- sample-similarity / swap detection via embeddings,
- multi-sample source-structure decomposition via NLL fingerprints and embeddings.
The strongest mechanistic case studies remain hb024 and reccwgc4buredxvyz; see notes/omnii-embedding-case-studies.md, notes/omnii-win-hb024.md, and notes/biomystery-omnii-embedding-runs.md.
- Purpose-built Claude-Code tool experiments produced two A-fail flips, but those are not the same as a corrected aggregate Claude sweep. Earlier experiments using task-shaped Omnii commands found:
| task | answer | tool | what Omnii contributed |
|---|---|---|---|
| reccwgc4buredxvyz | swapped sample3/sample10 with corrected cell types | omnii_qc_swap_detect.py | 18x18 sample-similarity cosine matrix; KNN-majority-vote inferred true label per flagged sample. |
| hb019 | EBOV | omnii_classify_against_panel.py | focused panel classification surfaced filovirus signal and bypassed a baseline refusal path. |
These are documented in notes/two-flips.md. They show that Omnii can help Claude when packaged as a task-shaped tool, but they should not be reported as the final aggregate lift for the corrected harness.
- Corrected Claude targeted sweep: Arm C improves strict aggregate by two tasks. After fixing the in-container runner so Omnii agents actually see the toolbox, we completed the top-18 targeted A/B/C run with Claude Opus:
| arm | strict pass |
|---|---|
| A baseline | 8/18 |
| C Omnii toolbox + workflow guidance | 10/18 |
C flipped recmiryoehog9bvce and recnquldskiadnpq8 without regressions. Those Claude flips were conventional-analysis trajectory wins, not clearly Omnii-output-decisive. The earlier corrected DeepSeek fallback sweep was flat at A/B/C 5/18, but its hb010 flip is the clearest additional corrected-run case with direct Omnii-DNA NLL involvement; see notes/hb010-omnii-nll-flip.md.
- Current best answer to the research question. Omnii/toolbox guidance can improve Claude's trajectory on this targeted subset, especially in Arm C, but the corrected aggregate lift is not yet cleanly Omnii-decisive: the two Claude flips were solved through conventional fusion/expression analysis after the prompt/tooling changed the agent path. The strongest evidence for genuinely Omnii-shaped help remains the earlier purpose-built-tool Claude-Code experiments plus the corrected DeepSeek hb010 NLL-split case. See notes/updated-findings-2026-05-24.md.
What we built
src/
├── omnii_client.py # HTTP wrapper over both vLLM endpoints
├── runner/ # Per-(task, arm, trial) harness
│ ├── biomystery_runner.py # In-container agent loop (Anthropic or DeepSeek)
│ ├── render_task_md.py # TASK.md (no rubric) renderer
│ ├── judge.py # heuristic grader against rubric
│ ├── runs_index.py # status table over results/runs/
│ ├── build_prompt.py # subagent prompt builder
│ └── run_prompt_template.md
├── triage/ # pre-experiment Omnii signal characterization
│ ├── triage_full_bench.py # per-task NLL stats over all bench tasks
│ ├── solver_split_by_nll.py # bimodal-NLL pre-filter for FASTQ
│ └── explore_hb053{,_vllm}.py # one-off task explorations
└── cv2_stub/cv2/__init__.py # libxcb workaround
scripts/
├── serve_omnii_vllm.sh # Slurm: persistent gen vLLM (port 8800)
├── serve_omnii_embed.sh # Slurm: persistent embed vLLM (port 8801)
├── run_one.sh # Slurm + docker: one (task, arm, trial)
└── download_full_bench.sh # HF download of all 99 task zips
docker/
├── Dockerfile.biomystery-runner # bioconda + Anthropic/OpenAI SDK image
└── docker-compose.biomystery.yml # local-dev wrapper
Two execution paths supported:
- Slurm/Docker in-container runner (docker/Dockerfile.biomystery-runner + src/runner/biomystery_runner.py): containerized tool loop. Supports Anthropic or DeepSeek via MODEL_PROVIDER; this is the corrected path for targeted A/B/C sweeps.
- Claude-Code subagents: src/runner/sweep.py prepares work dirs and prompts for Claude Code's Agent tool. This was used for the initial interactive investigation and purpose-built tool experiments.
Methodology
Arms
- A (baseline): standard Linux + Bash + pip/conda + curl to allowed domains. No Omnii access.
- B (Omnii hint): same + URLs and usage patterns for both Omnii endpoints in the system prompt. Agent decides when/whether to call.
- C (Omnii toolbox + workflow guidance): same as B, with stronger guidance to use NLL split -> inspect bins -> embed/cluster when applicable.
- BNB (historical control): Omnii-only / Omnii-forced arm used in earlier experiments to test whether Omnii alone can solve a task we know A solves.
Per-trial harness
Each trial runs in its own work dir. In the corrected Slurm/Docker path, run_one.sh stages the task data, writes TASK.md, stages _omnii_tools/ for B/C, and launches biomystery_runner.py inside the Docker image. In the Claude-Code path, sweep.py prep creates the same work dir and emits a prompt for a Claude Code subagent.
The agent receives:
- Pointer to work/TASK.md (question only, never the rubric)
- Strict isolation rules (don't read outside work dir)
- Arm-specific capabilities block or runner system prompt
- Required short answer in answer.txt
Grading is a separate step — src/runner/judge.py runs heuristic exact/substring/list match over the rubric. Where the rubric format breaks the heuristic, manual overrides are noted.
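The kind of heuristic described above can be sketched as follows. This is a minimal illustration, not the actual contents of src/runner/judge.py: the function names `normalize` and `heuristic_match` and the normalization rules are assumptions made for the example. It shows one way to keep a period in a rubric string (for example "GII.4") from acting as a regex wildcard, and to ignore a stray "#".

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop '#', and collapse other punctuation/whitespace to single spaces."""
    text = text.lower().replace("#", "")
    return re.sub(r"[^a-z0-9.]+", " ", text).strip()


def heuristic_match(answer: str, accepted: list[str]) -> bool:
    """Return True if any accepted rubric string occurs in the answer as a whole phrase."""
    norm_answer = normalize(answer)
    for truth in accepted:
        norm_truth = normalize(truth)
        if not norm_truth:
            continue
        # re.escape keeps '.' literal, so 'gii.4' cannot match 'giix4'.
        pattern = r"(?<![a-z0-9])" + re.escape(norm_truth) + r"(?![a-z0-9])"
        if re.search(pattern, norm_answer):
            return True
    return False


# Hypothetical usage with answers taken from the per-task table in this report.
print(heuristic_match("Norovirus GII", ["Norovirus", "Norovirus GII.4"]))
print(heuristic_match("H5b", ["L0"]))
```

A substring heuristic like this still cannot judge list-shaped or paraphrased answers, which is why manual overrides remain necessary in those cases.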
vLLM hosting
Two persistent Slurm jobs, each running the radical-vllm image via docker run with a chain of bind mounts and workarounds:
- --user $(id -u):$(id -g) (no root GPU processes)
- /etc/passwd + /etc/group mounted (torch needs getpwuid() to resolve)
- cv2_stub on PYTHONPATH (cv2 is a transitive dep that needs a libxcb chain we can't install as non-root)
- --tmpfs /tmp/biomystery_cache:size=8g,mode=0700,uid=$UID,gid=$GID,exec (writable + executable cache for triton)
- radical/{renderers, tokenizers} from host (image's baked-in copies predate the charlevel renderer)
- A monkey-patched vllm.plugins.load_general_plugins that registers the charlevel renderer at the right point in the import order to avoid the vllm.renderers.__init__ circular import.
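Once both jobs are up, the two endpoints can be exercised with a thin HTTP wrapper. The sketch below is not the repository's src/omnii_client.py. It assumes the servers are reachable on localhost, that the embed endpoint accepts the same model name, and that requesting prompt_logprobs=0 from vLLM's OpenAI-compatible completions route returns exactly one entry (the actual prompt token) per position. All function names are illustrative.

```python
import json
import urllib.request

GEN_URL = "http://localhost:8800/v1/completions"   # assumed host; port from this report
EMBED_URL = "http://localhost:8801/v1/embeddings"  # assumed host; port from this report
MODEL = "radicalnumerics/omnii-dna-7b-base"


def post_json(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def nll_from_response(response: dict) -> list[float]:
    """Convert prompt_logprobs into per-position NLL in nats.

    Assumes one {token_id: {"logprob": ...}} entry per position. The first
    position is None because it has no preceding context, so it is skipped.
    """
    nll = []
    for entry in response["choices"][0]["prompt_logprobs"]:
        if entry is None:
            continue
        (info,) = entry.values()  # raises if more than one candidate is present
        nll.append(-info["logprob"])
    return nll


def per_position_nll(sequence: str) -> list[float]:
    """Score a DNA sequence with the gen endpoint."""
    payload = {"model": MODEL, "prompt": sequence, "max_tokens": 1, "prompt_logprobs": 0}
    return nll_from_response(post_json(GEN_URL, payload))


def embed(sequence: str) -> list[float]:
    """Fetch a pooled embedding from the embed endpoint."""
    payload = {"model": MODEL, "input": sequence}
    return post_json(EMBED_URL, payload)["data"][0]["embedding"]
```

Keeping the response parsing in its own function (`nll_from_response`) makes it possible to check the NLL arithmetic against a saved response without a GPU node.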
Blog reproduction details
Sample composition:
Solvable=YES (20 of 76):
hb002 hb009 hb013 hb016 hb017 hb020 hb023 hb026 hb028 hb029
hb031 hb033 hb040 hb041 hb049 hb050 rec5qx7nedrwk4zog
recmiryoehog9bvce recqgsfxqqodhjens recyomvehwpj8s6t1
Solvable=NO (11 of 23):
hb010 hb014 hb022 hb024 hb025 hb035 hb036 hb053
reccslfjnjcfdpgak recmp75e1chtpzx3c recnquldskiadnpq8
Per-task evidence (best arm-A trial):
| task | solv | answer | judge | manual |
|---|---|---|---|---|
| hb002 | yes | Bacillus licheniformis | PASS | |
| hb009 | yes | Tuberous Sclerosis (truth: Fragile X) | FAIL | |
| hb013 | yes | SLC6A1 | PASS | |
| hb016 | yes | wrong stage assignments | FAIL | |
| hb017 | yes | sample list | PASS | |
| hb020 | yes | Homo sapiens | PASS | |
| hb023 | yes | Seawater (#12): 1-12 / Sediment ... | PASS* | format diff (# vs no-#) |
| hb026 | yes | heart | PASS | |
| hb028 | yes | Barth syndrome (truth: T2D) | FAIL | |
| hb029 | yes | sample1,2,5,6 (= SD samples per truth) | PASS* | format mismatch |
| hb031 | yes | Norovirus GII (truth: Norovirus or Norovirus GII.4) | PASS* | period in truth breaks regex |
| hb033 | yes | Sample_03 | PASS | |
| hb040 | yes | SARS-CoV-2 | PASS | |
| hb041 | yes | tomato | PASS | |
| hb049 | yes | marine | PASS | |
| hb050 | yes | Microplastics... (truth: aquaculture pollutants/N) | FAIL | |
| rec5qx7nedrwk4zog | yes | LINE-1 | PASS | |
| recmiryoehog9bvce | yes | APP-FABP5P7 | PASS | |
| recqgsfxqqodhjens | yes | CTCF | PASS | |
| recyomvehwpj8s6t1 | yes | J | PASS | |
| hb010 | no | 16S, 18S, ITS | PASS | |
| hb014 | no | H5b (truth: L0) | FAIL | |
| hb022 | no | wrong samples | FAIL | |
| hb024 | no | 4 tissues / 2 species (truth: 2/3) | FAIL | |
| hb025 | no | None | FAIL | |
| hb035 | no | hsa-let-7f-5p (truth: -7b-5p) | FAIL | |
| hb036 | no | sample_2 | PASS | |
| hb053 | no | heat stress | PASS | |
| reccslfjnjcfdpgak | no | 12 (truth: 10) | FAIL | |
| recmp75e1chtpzx3c | no | Bacillus subtilis (truth: Phocaeicola vulgatus) | FAIL | |
| recnquldskiadnpq8 | no | wrong gene IDs | FAIL | |
* judge bug accepted by manual review (3 cases, all unambiguously correct on inspection)
Strict-judge totals: solvable=YES 13/20 = 65%; solvable=NO 2/11 = 18%. With manual overrides: solvable=YES 16/20 = 80%; solvable=NO 3/11 = 27%.
Both pairs are within the blog's published range. The honest interpretation is that our harness reproduces the blog's ballpark, but a tighter claim (agreement within 5 percentage points) needs full coverage and multiple trials per task.
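To make the sample-size limitation concrete, the sketch below computes a Wilson score interval for the two override-adjusted pass rates. The choice of the Wilson interval and the 95% level are assumptions made for this illustration; they are not part of the original analysis.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


for label, passes, trials in [("solvable=YES", 16, 20), ("solvable=NO", 3, 11)]:
    low, high = wilson_interval(passes, trials)
    print(f"{label}: {passes}/{trials} = {passes / trials:.0%}, 95% CI {low:.0%} to {high:.0%}")
```

Both intervals are more than 30 percentage points wide, so agreement with the blog at this sample size cannot distinguish a faithful reproduction from a moderately different pass rate.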
Omnii usage analysis
The following A-vs-B table is from the initial Claude-Code investigation, before the corrected Slurm/Docker targeted sweep. Treat it as mechanistic evidence about where Omnii can help, not as the current aggregate result. The corrected top-18 result is summarized in the headline and detailed in notes/updated-findings-2026-05-24.md.
Initial A vs B head-to-heads (7 paired tasks)
| task | A | B | Omnii substantive in B? | clean Omnii flip? |
|---|---|---|---|---|
| hb010 | PASS | PASS | no — agent ignored hint, used BLAST + primer matching | no |
| hb014 | FAIL (H5b) | FAIL (H5a1) | yes (6 calls) | no |
| hb024 | FAIL (4/2) | FAIL strict / structurally correct (2/3) trial 2 | yes (867 calls) | partial — Omnii recovered count structure A guessed wrong |
| hb036 | PASS | PASS | yes (96 NLL + 635 emb) | no — A also passed via different path (E.coli proxy) |
| reccslfjnjcfdpgak | FAIL (12) | PASS (10) | minimal (4 sanity calls) | no — B used bwa+bcftools |
| recmp75e1chtpzx3c | FAIL | FAIL × 2 | yes (~5 calls) | no |
| recnquldskiadnpq8 | FAIL | FAIL | minimal (2 calls) | no |
Three task patterns where Omnii adds genuine signal
Distilled from the offline triage (results/outputs/triage_v3.json) + the in-harness trials:
- Spike-in / contamination detection (NLL bimodality). The high-NLL tail is the off-distribution minority. Validated:
  - hb036 (Agrobacterium-fabrum-contaminated sample): arm B used NLL bimodality to flag sample_2 directly.
  - hb019 (offline triage): 64 reads of "human patient + virus", split by NLL into 19 host (mean 0.32) + 45 viral (mean 1.24), gap 0.23 nats.
  - recmp75e1chtpzx3c (offline solver run): 64 reads, NLL split 60 host + 4 candidate-bacterial; the 4 high-NLL reads share the 16S 515F primer GTGCCAGCAGCCGCGGT.
- Sample-swap / mislabel QC (embedding cosine matrix):
  - reccwgc4buredxvyz: 36 embedding calls + 18×18 cosine matrix in 1m 38s on 26 GB → identified sample3 and sample10 as the swapped pair (their top-5 nearest neighbours are systematically in the wrong cell-type cluster).
- Multi-host microbiome factor decomposition (per-sample NLL fingerprint):
  - hb024: per-sample (mean, std, p10-p99) of per-position NLL across 12 reads/sample → 8-d feature vector → ward k=3 → 3 macro-clusters of 6 samples = 3 host species.
  - Within-macro substructure via embedding cosine + 8-mer Bray-Curtis: triplet-vs-pair test reveals 2 tissues × 3 reps.
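The swap-detection pattern above (cosine similarity between per-sample embeddings, then a majority vote over nearest neighbours) can be sketched in a few lines. This is an illustration on toy 2-d vectors, not the omnii_qc_swap_detect.py implementation. The function names, the toy data, and the choice of k are assumptions.

```python
import math
from collections import Counter


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_label_mismatches(embeddings: dict[str, list[float]],
                          labels: dict[str, str], k: int = 5) -> dict[str, str]:
    """Return {sample: inferred_label} for samples whose k nearest neighbours
    (by cosine similarity) mostly carry a different label than the sample."""
    flagged = {}
    for name, vec in embeddings.items():
        others = [(cosine(vec, other_vec), other)
                  for other, other_vec in embeddings.items() if other != name]
        others.sort(reverse=True)  # most similar first
        votes = Counter(labels[other] for _, other in others[:k])
        inferred, _ = votes.most_common(1)[0]
        if inferred != labels[name]:
            flagged[name] = inferred
    return flagged


# Toy data: two clusters of four samples, with the labels of t4 and b4 swapped.
toy_embeddings = {
    "t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [1.0, 0.1], "t4": [0.9, 0.0],
    "b1": [0.0, 1.0], "b2": [0.1, 0.9], "b3": [0.1, 1.0], "b4": [0.0, 0.9],
}
toy_labels = {"t1": "T", "t2": "T", "t3": "T", "t4": "B",
              "b1": "B", "b2": "B", "b3": "B", "b4": "T"}
print(flag_label_mismatches(toy_embeddings, toy_labels, k=3))
```

On real data, k should stay below the number of samples per label so that a correctly labeled sample is not outvoted by neighbours from another cluster.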
Where Omnii doesn't help
- Within-kingdom 16S decomposition (e.g. hb024 attempted with mean-pool embeddings alone): Omnii embeddings of bacterial 16S V3-V5 amplicons saturate (all-pair cosine > 0.95). The right tool is microbiome-specific (UniFrac, Bray-Curtis on OTU tables).
- Single-sequence species ID against a known organism (most bench tasks): BLAST is mature, faster, more accurate. Don't compete with it; use Omnii to filter first then BLAST a small subset.
- Tabular tasks (~50/99): expression matrices, methylation β-matrices, VCF, microarray, CIF, Excel. Omnii has nothing to offer here.
Why so few clean Omnii wins
The bench is designed for bioinformatics agents with NCBI/Ensembl access. When BLAST/BWA work, agents default to them. When BLAST fails, the question is usually structured so that something else (k-mer, abundance, prior knowledge) carries the answer too. Omnii's value is concentrated in the narrow band where:
- the answer requires finding off-distribution reads or clustering unlabeled samples, and
- standard alignment-based tools either fail or are too slow/expensive on the search space at hand.
MCP tool design recommendations
Two tools would compress the patterns the agents repeatedly invented from scratch into one structured JSON call each. Sketches in notes/biomystery-mcp-tools-design.md:
omnii_decompose_samples(fasta_dir) → {
macro_clusters: [...], # NLL fingerprint clustering
intra_substructure: {...}, # within-cluster pair-vs-triplet
inferred_factor_structure: 'M tissues × N reps × K species'
}
omnii_qc_swap_detect(fastq_dir, label_csv) → {
cosine_matrix: [...], # NxN sample fingerprints
per_sample_diagnostic: [...], # within vs between-cluster cosine + flag
flagged_swaps: [(sample_A, sample_B), ...]
}
Plus the already-implemented solver_split_by_nll (src/triage/solver_split_by_nll.py), which is the right shape to lift into MCP as omnii_split_by_nll(fastq) → {low_nll_fasta, high_nll_fasta, gap}.
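For illustration, a split of this shape can be computed from per-read mean NLLs alone. The sketch below cuts at the largest gap between sorted values. That rule is an assumption made for the example and may differ from what solver_split_by_nll.py actually does, and the function returns read names rather than FASTA paths.

```python
def split_by_nll(read_nll: dict[str, float]) -> dict:
    """Split reads into low/high NLL groups at the largest gap between sorted values."""
    if len(read_nll) < 2:
        raise ValueError("need at least two reads to split")
    ranked = sorted(read_nll.items(), key=lambda item: item[1])
    gaps = [ranked[i + 1][1] - ranked[i][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the widest gap
    return {
        "low_nll": [name for name, _ in ranked[: cut + 1]],
        "high_nll": [name for name, _ in ranked[cut + 1:]],
        "gap": gaps[cut],
    }


# Toy per-read mean NLLs (hypothetical values, in nats).
toy = {"r1": 0.30, "r2": 0.35, "r3": 0.28, "r4": 1.20, "r5": 1.31}
split = split_by_nll(toy)
print(split["low_nll"], split["high_nll"], round(split["gap"], 2))
```

The high-NLL group is the small candidate set that would then go to BLAST, which matches the filter-first workflow recommended earlier in this report.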
Caveats
- Coverage = 31/99. Sample biased toward smaller-data tasks; an extension to full coverage may shift the rates.
- Single-trial. The blog reports multi-trial results (Opus 4.6 is bimodal: of the problems it solves at all, it solves 86% in 4 of 5 trials; on the hard set that reliability is only 44%). We have N=1 per (task, arm).
- 3 manual judge overrides. All three are unambiguously correct on inspection (heuristic regex format mismatch). Without them: solvable=65%, hard=18%.
- 2 AUP refusals (hb019, recnheebqpbdp1nj9) on pathogen-ID phrasing — excluded rather than counted as fails. Counting them as fails: solvable becomes 16/22 = 73%.
- Subagents share host process. Claude Code's Agent tool isn't a hardened sandbox; rubric isolation depends on the agent obeying "don't read outside work dir", which was verified by inspection of proof_of_work.md for every transcript.
Reproducing this work
- Clone with submodules:
  ```bash
  git clone <this repo>
  cd omnii-biomysterybench
  git submodule update --init --recursive
  ```
- cp .env.example .env; fill in HF_TOKEN (and ANTHROPIC_API_KEY only if using the SDK runner path).
- Run ./scripts/download_full_bench.sh to pull the 99 task zips into data/biomystery-full/.
- Run sbatch scripts/serve_omnii_vllm.sh and sbatch scripts/serve_omnii_embed.sh to bring up vLLM (requires GPUs).
- Edit OMNII_*_URL in .env to point to the actual node hostnames once Slurm assigns them.
- Run sbatch scripts/run_one.sh <task_id> <A|B|C> <trial> for each (task, arm, trial) you want.
- Run python3 src/runner/runs_index.py to see the status table.
For the corrected Slurm/Docker path, rebuild biomystery-runner:latest after editing src/runner/biomystery_runner.py, then run scripts/run_one.sh or the targeted array wrapper. For the Claude-Code subagent path, per-trial prompts come from src/runner/build_prompt.py; spawn each via Claude Code's Agent tool with the prompt as the prompt parameter.
Pointers to specific evidence
- Case-study transcripts (full reasoning + tool log + agent's analysis scripts): results/transcripts/hb024/B/trial2/, results/transcripts/reccwgc4buredxvyz/B/trial0/.
- Per-task triage (NLL stats over 35 sequence-bearing tasks): results/outputs/triage_v3.json.
- Omnii-fit ranking of all 36 tasks with parseable DNA: results/outputs/omnii_usefulness_ranking.tsv.
- Blog repro tally: this document, table above.
- Fix-by-fix vLLM hosting saga: scripts/serve_omnii_vllm.sh headers + notes/biomystery-omnii-embedding-runs.md.
What's left to do
If continuing this line of work:
- Full bench coverage for tighter blog repro CIs: ~40 untouched tasks remain. Most are tabular and will fail uniformly; ~8 are DNA-bearing and could surface more Omnii-relevant signal.
- Multi-trial reliability: 5 trials per (task, arm) to match the blog methodology and quantify the bimodality.
- Build the two MCP tools (omnii_decompose_samples, omnii_qc_swap_detect) and re-run hb024 / reccwgc4buredxvyz with the agent calling these directly. Predict ~50% reduction in tool calls per trial and possibly more reliable structural answers.
- LLM judge subagent: replace the heuristic judge.py with a Claude judge that handles rubric format flexibility consistently. Removes the manual-override ambiguity from the headline numbers.
- Pursue the three confirmed patterns more aggressively: spike-in detection, swap detection, multi-factor sample decomposition. These are the niches where Omnii is genuinely the right tool; characterizing the precise lift on each pattern in isolation would make a clean per-pattern claim possible.