Project summary — omnii-biomysterybench

End-to-end report on whether giving Claude access to Omnii (a 7B DNA language model from Radical Numerics) as an external tool lifts its scores on Anthropic's BioMysteryBench.

Spun out from radical-megatron/experiments/biomystery/. Sister repos (submoduled): thirdparty/radical-vllm, thirdparty/radical-mcp.

Headline findings

  1. Omnii vLLM stack works. Both modes are hosted as persistent Slurm jobs:
     - gen at port 8800: radicalnumerics/omnii-dna-7b-base (cosine-iter048000-hf), per-position NLL via OpenAI-compatible prompt_logprobs
     - embed at port 8801: same model (pooling-v0 revision), 4096-d pooled embeddings via /v1/embeddings

     Deployment workarounds are documented in scripts/serve_omnii_*.sh: non-root container, /etc/passwd mount, cv2_stub, exec-mode tmpfs, host-mounted radical/{renderers,tokenizers}, and charlevel-renderer plugin registration.

  2. Blog ballpark reproduced on a 31-task subsample (31% of the 99-task bench):

| subset | sample pass-rate | blog (Opus 4.6) |
|---|---|---|
| solvable=YES (n=20 of 76) | 16/20 = 80% | ~77% |
| solvable=NO (n=11 of 23) | 3/11 = 27% | ~23-30% |

Caveats: this is not the full bench; each task got a single trial; smaller-data tasks were sampled first; and three judge verdicts were manually overridden for rubric-format mismatches. See Blog reproduction details below.

  3. Initial Omnii investigation found real signal but not a clean broad pass-rate lift. The offline triage and Claude-Code case studies identified task families where Omnii-DNA is useful:
     - off-distribution read discovery via NLL,
     - sample-similarity / swap detection via embeddings,
     - multi-sample source-structure decomposition via NLL fingerprints and embeddings.

The strongest mechanistic case studies remain hb024 and reccwgc4buredxvyz; see notes/omnii-embedding-case-studies.md, notes/omnii-win-hb024.md, and notes/biomystery-omnii-embedding-runs.md.

  4. Purpose-built Claude-Code tool experiments produced two A-fail flips, but those are not the same as a corrected aggregate Claude sweep. Earlier experiments using task-shaped Omnii commands found:

| task | answer | tool | what Omnii contributed |
|---|---|---|---|
| reccwgc4buredxvyz | swapped sample3/sample10 with corrected cell types | omnii_qc_swap_detect.py | 18x18 sample-similarity cosine matrix; KNN-majority-vote inferred true label per flagged sample. |
| hb019 | EBOV | omnii_classify_against_panel.py | focused panel classification surfaced filovirus signal and bypassed a baseline refusal path. |

These are documented in notes/two-flips.md. They show that Omnii can help Claude when packaged as a task-shaped tool, but they should not be reported as the final aggregate lift for the corrected harness.
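
The cosine-matrix plus KNN-majority-vote idea from the table above can be sketched in a few lines. This is a minimal illustration, not the code in omnii_qc_swap_detect.py: the function names, the toy 2-d embeddings, and k=3 are all assumptions made for the example (the real run used an 18x18 matrix over 4096-d pooled embeddings).

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_matrix(embeddings):
    """N x N matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def flag_label_mismatches(embeddings, labels, k=5):
    """Return (index, given_label, inferred_label) for every sample whose
    k nearest neighbours (by cosine) mostly carry a different label."""
    sims = cosine_matrix(embeddings)
    flagged = []
    for i, row in enumerate(sims):
        others = [j for j in range(len(row)) if j != i]
        neighbours = sorted(others, key=lambda j: row[j], reverse=True)[:k]
        votes = Counter(labels[j] for j in neighbours)
        inferred = votes.most_common(1)[0][0]
        if inferred != labels[i]:
            flagged.append((i, labels[i], inferred))
    return flagged


# Toy data (hypothetical): two tight clusters, with the labels of samples 3 and 4 swapped.
embeddings = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.95, 0.15],   # cluster near the x-axis
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0], [0.15, 0.95],   # cluster near the y-axis
]
labels = ["A", "A", "A", "B", "A", "B", "B", "B"]

print(flag_label_mismatches(embeddings, labels, k=3))
# -> [(3, 'B', 'A'), (4, 'A', 'B')]
```

A swapped pair shows up as two flagged samples whose inferred labels are each other's given labels. With an even k or more than two labels, ties in the vote would need an explicit rule; the sketch simply takes whatever Counter.most_common returns first.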

  5. Corrected Claude targeted sweep: Arm C improves strict aggregate by two tasks. After fixing the in-container runner so Omnii agents actually see the toolbox, we completed the top-18 targeted A/B/C run with Claude Opus:

| arm | strict pass |
|---|---|
| A baseline | 8/18 |
| C Omnii toolbox + workflow guidance | 10/18 |

C flipped recmiryoehog9bvce and recnquldskiadnpq8 without regressions. Those Claude flips were conventional-analysis trajectory wins, not clearly Omnii-output-decisive. The earlier corrected DeepSeek fallback sweep was flat at A/B/C 5/18, but its hb010 flip is the clearest additional corrected-run case with direct Omnii-DNA NLL involvement; see notes/hb010-omnii-nll-flip.md.

  6. Current best answer to the research question. Omnii/toolbox guidance can improve Claude's trajectory on this targeted subset, especially in Arm C, but the corrected aggregate lift is not yet cleanly Omnii-decisive: the two Claude flips were solved through conventional fusion/expression analysis after the prompt/tooling changed the agent path. The strongest evidence for genuinely Omnii-shaped help remains the earlier purpose-built-tool Claude-Code experiments plus the corrected DeepSeek hb010 NLL-split case. See notes/updated-findings-2026-05-24.md.

What we built

src/
├── omnii_client.py            # HTTP wrapper over both vLLM endpoints
├── runner/                    # Per-(task, arm, trial) harness
│   ├── biomystery_runner.py        # In-container agent loop (Anthropic or DeepSeek)
│   ├── render_task_md.py           # TASK.md (no rubric) renderer
│   ├── judge.py                    # heuristic grader against rubric
│   ├── runs_index.py               # status table over results/runs/
│   ├── build_prompt.py             # subagent prompt builder
│   └── run_prompt_template.md
├── triage/                    # pre-experiment Omnii signal characterization
│   ├── triage_full_bench.py        # per-task NLL stats over all bench tasks
│   ├── solver_split_by_nll.py      # bimodal-NLL pre-filter for FASTQ
│   └── explore_hb053{,_vllm}.py    # one-off task explorations
└── cv2_stub/cv2/__init__.py   # libxcb workaround

scripts/
├── serve_omnii_vllm.sh        # Slurm: persistent gen vLLM (port 8800)
├── serve_omnii_embed.sh       # Slurm: persistent embed vLLM (port 8801)
├── run_one.sh                 # Slurm + docker: one (task, arm, trial)
└── download_full_bench.sh     # HF download of all 99 task zips

docker/
├── Dockerfile.biomystery-runner   # bioconda + Anthropic/OpenAI SDK image
└── docker-compose.biomystery.yml  # local-dev wrapper

Two execution paths are supported:
- Slurm/Docker in-container runner (docker/Dockerfile.biomystery-runner + src/runner/biomystery_runner.py): containerized tool loop. Supports Anthropic or DeepSeek via MODEL_PROVIDER; this is the corrected path for targeted A/B/C sweeps.
- Claude-Code subagents: src/runner/sweep.py prepares work dirs and prompts for Claude Code's Agent tool. This was used for the initial interactive investigation and purpose-built tool experiments.

Methodology

Arms

Per-trial harness

Each trial runs in its own work dir. In the corrected Slurm/Docker path, run_one.sh stages the task data, writes TASK.md, stages _omnii_tools/ for B/C, and launches biomystery_runner.py inside the Docker image. In the Claude-Code path, sweep.py prep creates the same work dir and emits a prompt for a Claude Code subagent.
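
As a rough illustration of the staging step described above, the sketch below creates one work dir per (task, arm, trial), writes TASK.md, and adds _omnii_tools/ only for arms B and C. It is not the logic in run_one.sh or sweep.py: the directory naming scheme and the function names stage_trial and read_answer are hypothetical, and task-data copying is omitted.

```python
import tempfile
from pathlib import Path

ARMS_WITH_TOOLBOX = {"B", "C"}  # per the text, only B/C get _omnii_tools/


def stage_trial(root, task_id, arm, trial, question):
    """Create an isolated work dir for one (task, arm, trial) and return its path.

    The "<task>_<arm>_t<trial>/work" layout is an assumption for this sketch.
    """
    work = Path(root) / f"{task_id}_{arm}_t{trial}" / "work"
    work.mkdir(parents=True, exist_ok=False)  # refuse to reuse an old trial dir
    # TASK.md carries the question only; the rubric is never written here.
    (work / "TASK.md").write_text(question + "\n", encoding="utf-8")
    if arm in ARMS_WITH_TOOLBOX:
        (work / "_omnii_tools").mkdir()
    return work


def read_answer(work):
    """Return the agent's short answer from answer.txt, or None if it is missing."""
    path = Path(work) / "answer.txt"
    if not path.exists():
        return None
    return path.read_text(encoding="utf-8").strip()


with tempfile.TemporaryDirectory() as root:
    work_a = stage_trial(root, "hb024", "A", 1, "How many tissues and species?")
    work_c = stage_trial(root, "hb024", "C", 1, "How many tissues and species?")
    print(sorted(p.name for p in work_a.iterdir()))
    print(sorted(p.name for p in work_c.iterdir()))
    print(read_answer(work_a))  # None until the agent writes answer.txt
```

Refusing to reuse an existing directory keeps trials from contaminating each other, which matters when the same task is run under several arms.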

The agent receives:
- Pointer to work/TASK.md (question only, never the rubric)
- Strict isolation rules (don't read outside work dir)
- Arm-specific capabilities block or runner system prompt
- Required short answer in answer.txt

Grading is a separate step — src/runner/judge.py runs heuristic exact/substring/list match over the rubric. Where the rubric format breaks the heuristic, manual overrides are noted.
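
To make the exact/substring/list matching concrete, here is a minimal sketch of such a heuristic grader. It is not the contents of src/runner/judge.py: the normalization rules and the function names are assumptions, and the real rubric format is not reproduced here.

```python
import re


def normalize(text):
    """Lowercase, collapse whitespace, and strip surrounding spaces and . ; : marks."""
    return re.sub(r"\s+", " ", text.strip().lower()).strip(" .;:")


def split_items(text):
    """Split a comma/semicolon/newline separated answer into a set of normalized items."""
    return {normalize(part) for part in re.split(r"[,;\n]", text) if part.strip()}


def judge(answer, truth):
    """Return "PASS" on exact, substring, or order-insensitive list match, else "FAIL"."""
    a, t = normalize(answer), normalize(truth)
    if a == t:
        return "PASS"                      # exact match after normalization
    if t and t in a:
        return "PASS"                      # truth appears inside a longer answer
    truth_items = split_items(truth)
    if len(truth_items) > 1 and split_items(answer) == truth_items:
        return "PASS"                      # same list, any order
    return "FAIL"


print(judge("Homo sapiens.", "homo sapiens"))              # exact
print(judge("The host is Homo sapiens", "Homo sapiens"))   # substring
print(judge("sample3, sample10", "sample10,sample3"))      # list
print(judge("sample 12", "sample #12"))                    # formatting-only difference
```

The last call returns FAIL even though the two strings name the same thing, which is the kind of formatting-only mismatch that a heuristic like this cannot absorb and that ends up as a manual override. The substring rule also cuts the other way: a long answer that merely mentions the truth string would pass.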

vLLM hosting

Two persistent Slurm jobs, each launching the radical-vllm image via docker run with a chain of bind mounts and related workarounds:
- --user $(id -u):$(id -g) (no root GPU processes)
- /etc/passwd + /etc/group mounted (torch needs getpwuid() to resolve)
- cv2_stub on PYTHONPATH (cv2 is a transitive dep that needs a libxcb chain we can't install as non-root)
- --tmpfs /tmp/biomystery_cache:size=8g,mode=0700,uid=$UID,gid=$GID,exec (writable + executable cache for triton)
- radical/{renderers, tokenizers} from host (image's baked-in copies predate the charlevel renderer)
- A monkey-patched vllm.plugins.load_general_plugins that registers the charlevel renderer at the right point in the import order to avoid the vllm.renderers.__init__ circular import.

Blog reproduction details

Sample composition:

Solvable=YES (20 of 76):
  hb002 hb009 hb013 hb016 hb017 hb020 hb023 hb026 hb028 hb029
  hb031 hb033 hb040 hb041 hb049 hb050 rec5qx7nedrwk4zog
  recmiryoehog9bvce recqgsfxqqodhjens recyomvehwpj8s6t1

Solvable=NO  (11 of 23):
  hb010 hb014 hb022 hb024 hb025 hb035 hb036 hb053
  reccslfjnjcfdpgak recmp75e1chtpzx3c recnquldskiadnpq8

Per-task evidence (best arm-A trial):

| task | solv | answer | judge | manual |
|---|---|---|---|---|
| hb002 | yes | Bacillus licheniformis | PASS | |
| hb009 | yes | Tuberous Sclerosis (truth: Fragile X) | FAIL | |
| hb013 | yes | SLC6A1 | PASS | |
| hb016 | yes | wrong stage assignments | FAIL | |
| hb017 | yes | sample list | PASS | |
| hb020 | yes | Homo sapiens | PASS | |
| hb023 | yes | Seawater (#12): 1-12 / Sediment ... | PASS* | format diff (# vs no-#) |
| hb026 | yes | heart | PASS | |
| hb028 | yes | Barth syndrome (truth: T2D) | FAIL | |
| hb029 | yes | sample1,2,5,6 (= SD samples per truth) | PASS* | format mismatch |
| hb031 | yes | Norovirus GII (truth: Norovirus or Norovirus GII.4) | PASS* | period in truth breaks regex |
| hb033 | yes | Sample_03 | PASS | |
| hb040 | yes | SARS-CoV-2 | PASS | |
| hb041 | yes | tomato | PASS | |
| hb049 | yes | marine | PASS | |
| hb050 | yes | Microplastics... (truth: aquaculture pollutants/N) | FAIL | |
| rec5qx7nedrwk4zog | yes | LINE-1 | PASS | |
| recmiryoehog9bvce | yes | APP-FABP5P7 | PASS | |
| recqgsfxqqodhjens | yes | CTCF | PASS | |
| recyomvehwpj8s6t1 | yes | J | PASS | |
| hb010 | no | 16S, 18S, ITS | PASS | |
| hb014 | no | H5b (truth: L0) | FAIL | |
| hb022 | no | wrong samples | FAIL | |
| hb024 | no | 4 tissues / 2 species (truth: 2/3) | FAIL | |
| hb025 | no | None | FAIL | |
| hb035 | no | hsa-let-7f-5p (truth: -7b-5p) | FAIL | |
| hb036 | no | sample_2 | PASS | |
| hb053 | no | heat stress | PASS | |
| reccslfjnjcfdpgak | no | 12 (truth: 10) | FAIL | |
| recmp75e1chtpzx3c | no | Bacillus subtilis (truth: Phocaeicola vulgatus) | FAIL | |
| recnquldskiadnpq8 | no | wrong gene IDs | FAIL | |

* judge bug accepted by manual review (3 cases, all unambiguously correct on inspection)

Strict-judge totals: solvable=YES 13/20 = 65%; solvable=NO 3/11 = 27%. With manual overrides: solvable=YES 16/20 = 80%; solvable=NO 3/11 = 27% (unchanged, because all three overrides in the table above are solvable=YES tasks).

Both pairs are within the blog's published range. The honest interpretation is that our harness reproduces the blog's numbers at ballpark accuracy; a tighter claim (agreement within 5 percentage points) needs full bench coverage and multiple trials per task.
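
To see why 20 and 11 single-trial tasks cannot support a tight claim, one can put a rough interval around each pass rate. The sketch below uses the Wilson score interval, which is a choice made for this illustration rather than anything the original analysis used, and it treats each task as an independent pass/fail draw, which is only an approximation for a non-random subsample.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Override-adjusted counts from the totals above.
for label, passes, n in [("solvable=YES", 16, 20), ("solvable=NO", 3, 11)]:
    low, high = wilson_interval(passes, n)
    print(f"{label}: {passes}/{n} = {passes / n:.0%}, "
          f"95% Wilson interval {low:.0%} to {high:.0%}")
```

Under these assumptions the 16/20 interval runs from roughly 58% to 92%, far wider than a 5-point band, so the subsample is consistent with the blog figures without being able to confirm them closely.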

Omnii usage analysis

The following A-vs-B table is from the initial Claude-Code investigation, before the corrected Slurm/Docker targeted sweep. Treat it as mechanistic evidence about where Omnii can help, not as the current aggregate result. The corrected top-18 result is summarized in the headline and detailed in notes/updated-findings-2026-05-24.md.

Initial A vs B head-to-heads (7 paired tasks)

| task | A | B | Omnii substantive in B? | clean Omnii flip? |
|---|---|---|---|---|
| hb010 | PASS | PASS | no; agent ignored hint, used BLAST + primer matching | no |
| hb014 | FAIL (H5b) | FAIL (H5a1) | yes (6 calls) | no |
| hb024 | FAIL (4/2) | FAIL strict / structurally correct (2/3) trial 2 | yes (867 calls) | partial; Omnii recovered count structure A guessed wrong |
| hb036 | PASS | PASS | yes (96 NLL + 635 emb) | no; A also passed via different path (E.coli proxy) |
| reccslfjnjcfdpgak | FAIL (12) | PASS (10) | minimal (4 sanity calls) | no; B used bwa+bcftools |
| recmp75e1chtpzx3c | FAIL | FAIL × 2 | yes (~5 calls) | no |
| recnquldskiadnpq8 | FAIL | FAIL | minimal (2 calls) | no |

Three task patterns where Omnii adds genuine signal

Distilled from the offline triage (results/outputs/triage_v3.json) + the in-harness trials:

  1. Spike-in / contamination detection via NLL bimodality. The high-NLL tail is the off-distribution minority. Validated on:
     - hb036 (Agrobacterium-fabrum-contaminated sample): arm B used NLL bimodality to flag sample_2 directly.
     - hb019 (offline triage): 64 reads of "human patient + virus", split by NLL into 19 host (mean 0.32) + 45 viral (mean 1.24), gap 0.23 nats.
     - recmp75e1chtpzx3c (offline solver run): 64 reads, NLL split 60 host + 4 candidate-bacterial; the 4 high-NLL reads share the 16S 515F primer GTGCCAGCAGCCGCGGT.

  2. Sample-swap / mislabel QC via embedding cosine matrix:
     - reccwgc4buredxvyz: 36 embedding calls + 18×18 cosine matrix in 1m 38s on 26 GB → identified sample3 and sample10 as the swapped pair (their top-5 nearest neighbours are systematically in the wrong cell-type cluster).

  3. Multi-host microbiome factor decomposition via per-sample NLL fingerprint:
     - hb024: per-sample (mean, std, p10-p99) of per-position NLL across 12 reads/sample → 8-d feature vector → ward k=3 → 3 macro-clusters of 6 samples = 3 host species.
     - Within-macro substructure via embedding cosine + 8-mer Bray-Curtis: triplet-vs-pair test reveals 2 tissues × 3 reps.
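
The NLL-fingerprint step in pattern 3 can be sketched as follows. This is an illustration under stated assumptions, not the hb024 analysis code: the six percentiles used to fill out the 8-d vector (p10, p25, p50, p75, p90, p99) are a guess at what "p10-p99" covers, the Ward clustering is a small pure-Python version written for this example, features are left unscaled, and the NLL values are synthetic.

```python
import statistics

PERCENTILES = (10, 25, 50, 75, 90, 99)  # assumed; the source only says "p10-p99"


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of an ascending list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def fingerprint(nlls):
    """8-d summary of one sample's per-position NLLs: mean, std, six percentiles."""
    vals = sorted(nlls)
    head = [statistics.fmean(vals), statistics.pstdev(vals)]
    return head + [percentile(vals, q) for q in PERCENTILES]


def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion: repeatedly merge the two
    clusters whose union increases the within-cluster sum of squares the least."""
    clusters = [([i], list(p)) for i, p in enumerate(points)]  # (members, centroid)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        total = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / total for x, y in zip(ca, cb)]
        clusters[a] = (ma + mb, centroid)
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(members) for members, _ in clusters]


# Synthetic stand-in: 9 samples in 3 groups with different baseline NLL.
samples = [
    [base + 0.01 * rep + 0.002 * i for i in range(50)]
    for base in (0.3, 0.8, 1.3)
    for rep in range(3)
]
features = [fingerprint(s) for s in samples]
print(ward_clusters(features, k=3))
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

In a real run each sample's list would hold the per-position NLLs pooled across its reads, and scaling the eight features before clustering may matter if their ranges differ a lot.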

Where Omnii doesn't help

Why so few clean Omnii wins

The bench is designed for bioinformatics agents with NCBI/Ensembl access. When BLAST/BWA work, agents default to them. When BLAST fails, the question is usually structured so that something else (k-mer, abundance, prior knowledge) carries the answer too. Omnii's value is concentrated in the narrow band where:
- the answer requires finding off-distribution reads or clustering unlabeled samples, and
- standard alignment-based tools either fail or are too slow/expensive on the search space at hand.

MCP tool design recommendations

Two tools would compress the patterns the agents repeatedly invented from scratch into one structured JSON call each. Sketches in notes/biomystery-mcp-tools-design.md:

omnii_decompose_samples(fasta_dir) → {
    macro_clusters: [...],          # NLL fingerprint clustering
    intra_substructure: {...},      # within-cluster pair-vs-triplet
    inferred_factor_structure: 'M tissues × N reps × K species'
}

omnii_qc_swap_detect(fastq_dir, label_csv) → {
    cosine_matrix: [...],           # NxN sample fingerprints
    per_sample_diagnostic: [...],   # within vs between-cluster cosine + flag
    flagged_swaps: [(sample_A, sample_B), ...]
}

Plus the already-implemented solver_split_by_nll (src/triage/solver_split_by_nll.py), which is the right shape to lift into MCP as omnii_split_by_nll(fastq) → {low_nll_fasta, high_nll_fasta, gap}.
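
For orientation, the core of an NLL split can be sketched as a one-dimensional cut at the largest gap between sorted per-read scores. This is not the implementation in src/triage/solver_split_by_nll.py: the largest-gap rule, the function name, and the read IDs below are assumptions, and the sketch returns read IDs rather than writing the low/high FASTA files named in the proposed MCP signature.

```python
def split_by_nll(read_nlls):
    """Split reads into low-NLL and high-NLL groups at the largest gap.

    read_nlls maps read ID -> mean per-position NLL for that read.
    Returns the two ID lists (ascending NLL) and the width of the gap.
    """
    if len(read_nlls) < 2:
        raise ValueError("need at least two reads to split")
    ranked = sorted(read_nlls.items(), key=lambda item: item[1])
    gaps = [ranked[i + 1][1] - ranked[i][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the widest gap
    low = [read_id for read_id, _ in ranked[: cut + 1]]
    high = [read_id for read_id, _ in ranked[cut + 1:]]
    return {"low_nll": low, "high_nll": high, "gap": gaps[cut]}


# Hypothetical scores: three in-distribution reads and two off-distribution ones.
scores = {"r1": 0.30, "r2": 0.35, "r3": 0.28, "r4": 1.20, "r5": 1.31}
result = split_by_nll(scores)
print(result["low_nll"], result["high_nll"], round(result["gap"], 2))
# -> ['r3', 'r1', 'r2'] ['r4', 'r5'] 0.85
```

A largest-gap cut always produces a split, even on unimodal data, so a real tool would also want a threshold on the gap (or a bimodality test) before treating the high-NLL group as meaningful.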

Caveats

Reproducing this work

  1. Clone with submodules:
       git clone <this repo>
       cd omnii-biomysterybench
       git submodule update --init --recursive
  2. cp .env.example .env; fill in HF_TOKEN (and ANTHROPIC_API_KEY only if using the SDK runner path).
  3. ./scripts/download_full_bench.sh — pulls the 99 task zips into data/biomystery-full/.
  4. sbatch scripts/serve_omnii_vllm.sh + sbatch scripts/serve_omnii_embed.sh — bring up vLLM (requires GPUs).
  5. Edit OMNII_*_URL in .env to point to the actual node hostnames once Slurm assigns them.
  6. sbatch scripts/run_one.sh <task_id> <A|B|C> <trial> for each (task, arm, trial) you want.
  7. python3 src/runner/runs_index.py — see status table.

For the corrected Slurm/Docker path, rebuild biomystery-runner:latest after editing src/runner/biomystery_runner.py, then run scripts/run_one.sh or the targeted array wrapper. For the Claude-Code subagent path, per-trial prompts come from src/runner/build_prompt.py; spawn each via Claude Code's Agent tool with the prompt as the prompt parameter.

Pointers to specific evidence

What's left to do

If continuing this line of work:

  1. Full bench coverage for tighter blog repro CIs — ~40 untouched tasks remain. Most are tabular and will fail uniformly; ~8 are DNA-bearing and could surface more Omnii-relevant signal.
  2. Multi-trial reliability — 5 trials per (task, arm) to match the blog methodology and quantify the bimodality.
  3. Build the two MCP tools (omnii_decompose_samples, omnii_qc_swap_detect) and re-run hb024 / reccwgc4buredxvyz with the agent calling these directly. We predict a ~50% reduction in tool calls per trial and possibly more reliable structural answers.
  4. LLM judge subagent: replace the heuristic judge.py with a Claude judge that handles rubric format flexibility consistently. This would remove the manual-override ambiguity from the headline numbers.
  5. Pursue the three confirmed patterns more aggressively: spike-in detection, swap detection, multi-factor sample decomposition. These are the niches where Omnii is genuinely the right tool — characterizing the precise lift on each pattern in isolation would make a clean per-pattern claim possible.