Project summary — omnii-biomysterybench

End-to-end report on whether giving Claude access to Omnii (a 7B DNA language model from Radical Numerics) as an external tool lifts its scores on Anthropic's BioMysteryBench.

Spun out from radical-megatron/experiments/biomystery/. Sister repos (submoduled): thirdparty/radical-vllm, thirdparty/radical-mcp.

Headline findings

Omnii vLLM stack works. Both modes hosted as persistent Slurm jobs:
gen at port 8800 — radicalnumerics/omnii-dna-7b-base (cosine-iter048000-hf), per-position NLL via OpenAI-compatible prompt_logprobs
embed at port 8801 — same model (pooling-v0 revision), 4096-d pooled embeddings via /v1/embeddings
Deployment workarounds are documented in scripts/serve_omnii_*.sh: non-root container, /etc/passwd mount, cv2_stub, exec-mode tmpfs, host-mounted radical/{renderers,tokenizers}, and charlevel-renderer plugin registration.
Blog ballpark reproduced on a 31-task subsample (31% of the 99-task bench):

subset	sample-pass-rate	blog (Opus 4.6)
solvable=YES (n=20 of 76)	16/20 = 80%	~77%
solvable=NO (n=11 of 23)	3/11 = 27%	~23-30%

Caveats: not the full bench, single trial each, sampled smaller-data tasks first, three rubric-format-mismatch judge overrides. See Blog reproduction below.

Initial Omnii investigation found real signal but not a clean broad pass-rate lift. The offline triage and Claude-Code case studies identified task families where Omnii-DNA is useful:
off-distribution read discovery via NLL,
sample-similarity / swap detection via embeddings,
multi-sample source-structure decomposition via NLL fingerprints and embeddings.

The strongest mechanistic case studies remain hb024 and reccwgc4buredxvyz; see notes/omnii-embedding-case-studies.md, notes/omnii-win-hb024.md, and notes/biomystery-omnii-embedding-runs.md.

Purpose-built Claude-Code tool experiments produced two A-fail flips, but those are not the same as a corrected aggregate Claude sweep. Earlier experiments using task-shaped Omnii commands found:

task	answer	tool	what Omnii contributed
reccwgc4buredxvyz	swapped sample3/sample10 with corrected cell types	`omnii_qc_swap_detect.py`	18x18 sample-similarity cosine matrix; KNN-majority-vote inferred true label per flagged sample.
hb019	`EBOV`	`omnii_classify_against_panel.py`	focused panel classification surfaced filovirus signal and bypassed a baseline refusal path.

These are documented in notes/two-flips.md. They show that Omnii can help Claude when packaged as a task-shaped tool, but they should not be reported as the final aggregate lift for the corrected harness.

Corrected Claude targeted sweep: Arm C improves strict aggregate by two tasks. After fixing the in-container runner so Omnii agents actually see the toolbox, we completed the top-18 targeted A/B/C run with Claude Opus:

arm	strict pass
A baseline	8/18
C Omnii toolbox + workflow guidance	10/18

C flipped recmiryoehog9bvce and recnquldskiadnpq8 without regressions. Those Claude flips were conventional-analysis trajectory wins, not clearly Omnii-output-decisive. The earlier corrected DeepSeek fallback sweep was flat at A/B/C 5/18, but its hb010 flip is the clearest additional corrected-run case with direct Omnii-DNA NLL involvement; see notes/hb010-omnii-nll-flip.md.

Current best answer to the research question. Omnii/toolbox guidance can improve Claude's trajectory on this targeted subset, especially in Arm C, but the corrected aggregate lift is not yet cleanly Omnii-decisive: the two Claude flips were solved through conventional fusion/expression analysis after the prompt/tooling changed the agent path. The strongest evidence for genuinely Omnii-shaped help remains the earlier purpose-built-tool Claude-Code experiments plus the corrected DeepSeek hb010 NLL-split case. See notes/updated-findings-2026-05-24.md.

What we built

src/
├── omnii_client.py            # HTTP wrapper over both vLLM endpoints
├── runner/                    # Per-(task, arm, trial) harness
│   ├── biomystery_runner.py        # In-container agent loop (Anthropic or DeepSeek)
│   ├── render_task_md.py           # TASK.md (no rubric) renderer
│   ├── judge.py                    # heuristic grader against rubric
│   ├── runs_index.py               # status table over results/runs/
│   ├── build_prompt.py             # subagent prompt builder
│   └── run_prompt_template.md
├── triage/                    # pre-experiment Omnii signal characterization
│   ├── triage_full_bench.py        # per-task NLL stats over all bench tasks
│   ├── solver_split_by_nll.py      # bimodal-NLL pre-filter for FASTQ
│   └── explore_hb053{,_vllm}.py    # one-off task explorations
└── cv2_stub/cv2/__init__.py   # libxcb workaround

scripts/
├── serve_omnii_vllm.sh        # Slurm: persistent gen vLLM (port 8800)
├── serve_omnii_embed.sh       # Slurm: persistent embed vLLM (port 8801)
├── run_one.sh                 # Slurm + docker: one (task, arm, trial)
└── download_full_bench.sh     # HF download of all 99 task zips

docker/
├── Dockerfile.biomystery-runner   # bioconda + Anthropic/OpenAI SDK image
└── docker-compose.biomystery.yml  # local-dev wrapper

Two execution paths are supported:
- Slurm/Docker in-container runner (docker/Dockerfile.biomystery-runner + src/runner/biomystery_runner.py): containerized tool loop. Supports Anthropic or DeepSeek via MODEL_PROVIDER; this is the corrected path for targeted A/B/C sweeps.
- Claude-Code subagents: src/runner/sweep.py prepares work dirs and prompts for Claude Code's Agent tool. This was used for the initial interactive investigation and purpose-built tool experiments.

Methodology

Arms

A (baseline): standard Linux + Bash + pip/conda + curl to allowed domains. No Omnii access.
B (Omnii hint): same + URLs and usage patterns for both Omnii endpoints in the system prompt. Agent decides when/whether to call.
C (Omnii toolbox + workflow guidance): same as B, with stronger guidance to use NLL split -> inspect bins -> embed/cluster when applicable.
BNB (historical control): Omnii-only / Omnii-forced arm used in earlier experiments to test whether Omnii alone can solve a task we know A solves.
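
The exact hint text given to arms B and C is not reproduced in this summary. As a rough illustration of what calls to the two endpoints could look like, here is a minimal sketch. The ports, model name, `prompt_logprobs` field, and `/v1/embeddings` route come from this report; the `localhost` host, the function names, and the assumption that the embed server accepts the same model name are hypothetical, and the real wrapper in src/omnii_client.py may differ.

```python
import json
import urllib.request

# Ports are from this report; the host is an assumption (use the OMNII_*_URL values).
GEN_URL = "http://localhost:8800/v1/completions"
EMBED_URL = "http://localhost:8801/v1/embeddings"
MODEL = "radicalnumerics/omnii-dna-7b-base"  # assumed to be the served model name


def nll_request(sequence: str) -> dict:
    """Body for a scoring-only call: request prompt log-probs, generate one token."""
    return {"model": MODEL, "prompt": sequence, "max_tokens": 1, "prompt_logprobs": 1}


def embed_request(sequences: list[str]) -> dict:
    """Body for a batch embedding call."""
    return {"model": MODEL, "input": sequences}


def parse_embeddings(response: dict) -> list[list[float]]:
    """Pull vectors out of an OpenAI-style embeddings response, in input order."""
    rows = sorted(response["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def post_json(url: str, body: dict, timeout: float = 60.0) -> dict:
    """POST a JSON body and decode the JSON reply (needs a live server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the servers up, `parse_embeddings(post_json(EMBED_URL, embed_request(reads)))` would return one vector per read. Turning the `prompt_logprobs` reply into per-position NLL depends on the tokenizer and response layout, so that step is left out of this sketch.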

Per-trial harness

Each trial runs in its own work dir. In the corrected Slurm/Docker path, run_one.sh stages the task data, writes TASK.md, stages _omnii_tools/ for B/C, and launches biomystery_runner.py inside the Docker image. In the Claude-Code path, sweep.py prep creates the same work dir and emits a prompt for a Claude Code subagent.

The agent receives:
- Pointer to work/TASK.md (question only, never the rubric)
- Strict isolation rules (don't read outside work dir)
- Arm-specific capabilities block or runner system prompt
- Required short answer in answer.txt

Grading is a separate step — src/runner/judge.py runs heuristic exact/substring/list match over the rubric. Where the rubric format breaks the heuristic, manual overrides are noted.
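
To make the format-mismatch failure mode concrete, here is a small sketch of a heuristic matcher in the same spirit. It is not the code in src/runner/judge.py; the function names and normalization rules are assumptions chosen so that the three override cases listed under Blog reproduction details (a `#` in the answer, a period in the truth string, a differently formatted sample list) would not trip it.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, turn punctuation other than '.' and '-' into spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9.\-]+", " ", text.lower())
    return " ".join(text.split())


def heuristic_match(answer: str, accepted: list[str]) -> bool:
    """PASS if any accepted answer appears as a whole phrase in the agent's answer."""
    got = normalize(answer)
    for truth in accepted:
        want = normalize(truth)
        if not want:
            continue
        # re.escape keeps the '.' in strings such as 'GII.4' literal.
        if re.search(rf"(?<![a-z0-9]){re.escape(want)}(?![a-z0-9])", got):
            return True
    return False


def list_match(answer: str, accepted: list[str]) -> bool:
    """PASS if the comma-separated answer names exactly the accepted set, in any order."""
    items = {normalize(part) for part in answer.split(",") if part.strip()}
    return items == {normalize(truth) for truth in accepted}
```

For example, `heuristic_match("Norovirus GII", ["Norovirus", "Norovirus GII.4"])` passes because the first accepted string matches as a whole word, while `heuristic_match("H5b", ["L0"])` fails.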

vLLM hosting

Two persistent Slurm jobs, each docker run-ing the radical-vllm image with a chain of bind-mounts:
- --user $(id -u):$(id -g) (no root GPU processes)
- /etc/passwd + /etc/group mounted (torch needs getpwuid() to resolve)
- cv2_stub on PYTHONPATH (cv2 is a transitive dep that needs a libxcb chain we can't install as non-root)
- --tmpfs /tmp/biomystery_cache:size=8g,mode=0700,uid=$UID,gid=$GID,exec (writable + executable cache for triton)
- radical/{renderers, tokenizers} from host (image's baked-in copies predate the charlevel renderer)
- A monkey-patched vllm.plugins.load_general_plugins that registers the charlevel renderer at the right point in the import order to avoid the vllm.renderers.__init__ circular import.

Blog reproduction details

Sample composition:

Solvable=YES (20 of 76):
  hb002 hb009 hb013 hb016 hb017 hb020 hb023 hb026 hb028 hb029
  hb031 hb033 hb040 hb041 hb049 hb050 rec5qx7nedrwk4zog
  recmiryoehog9bvce recqgsfxqqodhjens recyomvehwpj8s6t1

Solvable=NO  (11 of 23):
  hb010 hb014 hb022 hb024 hb025 hb035 hb036 hb053
  reccslfjnjcfdpgak recmp75e1chtpzx3c recnquldskiadnpq8

Per-task evidence (best arm-A trial):

task	solv	answer	judge	manual
hb002	yes	Bacillus licheniformis	PASS
hb009	yes	Tuberous Sclerosis (truth: Fragile X)	FAIL
hb013	yes	SLC6A1	PASS
hb016	yes	wrong stage assignments	FAIL
hb017	yes	sample list	PASS
hb020	yes	Homo sapiens	PASS
hb023	yes	Seawater (#12): 1-12 / Sediment ...	PASS*	format diff (`#` vs no-`#`)
hb026	yes	heart	PASS
hb028	yes	Barth syndrome (truth: T2D)	FAIL
hb029	yes	sample1,2,5,6 (= SD samples per truth)	PASS*	format mismatch
hb031	yes	Norovirus GII (truth: Norovirus or Norovirus GII.4)	PASS*	period in truth breaks regex
hb033	yes	Sample_03	PASS
hb040	yes	SARS-CoV-2	PASS
hb041	yes	tomato	PASS
hb049	yes	marine	PASS
hb050	yes	Microplastics... (truth: aquaculture pollutants/N)	FAIL
rec5qx7nedrwk4zog	yes	LINE-1	PASS
recmiryoehog9bvce	yes	APP-FABP5P7	PASS
recqgsfxqqodhjens	yes	CTCF	PASS
recyomvehwpj8s6t1	yes	J	PASS
hb010	no	16S, 18S, ITS	PASS
hb014	no	H5b (truth: L0)	FAIL
hb022	no	wrong samples	FAIL
hb024	no	4 tissues / 2 species (truth: 2/3)	FAIL
hb025	no	None	FAIL
hb035	no	hsa-let-7f-5p (truth: -7b-5p)	FAIL
hb036	no	sample_2	PASS
hb053	no	heat stress	PASS
reccslfjnjcfdpgak	no	12 (truth: 10)	FAIL
recmp75e1chtpzx3c	no	Bacillus subtilis (truth: Phocaeicola vulgatus)	FAIL
recnquldskiadnpq8	no	wrong gene IDs	FAIL

* judge bug accepted by manual review (3 cases, all unambiguously correct on inspection)

Strict-judge totals: solvable=YES 13/20 = 65%; solvable=NO 2/11 = 18%. With manual overrides: solvable=YES 16/20 = 80%; solvable=NO 3/11 = 27%.

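With the manual overrides, both rates are close to the blog's published figures (80% vs ~77%; 27% vs ~23-30%); the strict-judge rates fall below them. The honest interpretation is that our harness reproduces the blog's ballpark, but a tighter claim (agreement within 5 percentage points) needs full coverage and multiple trials per task.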
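
To see why a subsample of this size cannot support a tight claim, one can put a confidence interval on each pass rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not something the blog or this harness prescribes; the pass counts are the override-adjusted totals above.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for label, passes, n in [("solvable=YES", 16, 20), ("solvable=NO", 3, 11)]:
    lo, hi = wilson_interval(passes, n)
    print(f"{label}: {passes}/{n} = {passes / n:.0%}, 95% CI {lo:.0%} to {hi:.0%}")
```

At n=20 and n=11 both intervals are tens of percentage points wide, so agreement with the blog at this sample size is consistent with, but not strong evidence of, a matching underlying pass rate.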

Omnii usage analysis

The following A-vs-B table is from the initial Claude-Code investigation, before the corrected Slurm/Docker targeted sweep. Treat it as mechanistic evidence about where Omnii can help, not as the current aggregate result. The corrected top-18 result is summarized in the headline and detailed in notes/updated-findings-2026-05-24.md.

Initial A vs B head-to-heads (7 paired tasks)

task	A	B	Omnii substantive in B?	clean Omnii flip?
hb010	PASS	PASS	no — agent ignored hint, used BLAST + primer matching	no
hb014	FAIL (H5b)	FAIL (H5a1)	yes (6 calls)	no
hb024	FAIL (4/2)	FAIL strict / structurally correct (2/3) trial 2	yes (867 calls)	partial — Omnii recovered count structure A guessed wrong
hb036	PASS	PASS	yes (96 NLL + 635 emb)	no — A also passed via different path (E.coli proxy)
reccslfjnjcfdpgak	FAIL (12)	PASS (10)	minimal (4 sanity calls)	no — B used bwa+bcftools
recmp75e1chtpzx3c	FAIL	FAIL × 2	yes (~5 calls)	no
recnquldskiadnpq8	FAIL	FAIL	minimal (2 calls)	no

Three task patterns where Omnii adds genuine signal

Distilled from the offline triage (results/outputs/triage_v3.json) + the in-harness trials:

Spike-in / contamination detection — NLL bimodality. The high-NLL tail is the off-distribution minority. Validated:
hb036 (Agrobacterium-fabrum-contaminated sample): arm B used NLL bimodality to flag sample_2 directly.
hb019 (offline triage): 64 reads of "human patient + virus", split by NLL into 19 host (mean 0.32) + 45 viral (mean 1.24), gap 0.23 nats.
recmp75e1chtpzx3c (offline solver run): 64 reads, NLL split 60 host + 4 candidate-bacterial; the 4 high-NLL reads share the 16S 515F primer GTGCCAGCAGCCGCGGT.
Sample-swap / mislabel QC — embedding cosine matrix:
reccwgc4buredxvyz: 36 embedding calls + 18×18 cosine matrix in 1m 38s on 26 GB → identified sample3 and sample10 as the swapped pair (their top-5 nearest neighbours are systematically in the wrong cell-type cluster).
Multi-host microbiome factor decomposition — per-sample NLL fingerprint:
hb024: per-sample (mean, std, p10-p99) of per-position NLL across 12 reads/sample → 8-d feature vector → ward k=3 → 3 macro-clusters of 6 samples = 3 host species.
Within-macro substructure via embedding cosine + 8-mer Bray-Curtis: triplet-vs-pair test reveals 2 tissues × 3 reps.
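
The hb024 fingerprint step can be sketched in a few lines. This is a stdlib-only illustration, not the agent's actual script: the report does not list which six percentiles make up "p10-p99", so the tuple below is an assumption, and the clustering is a naive implementation of Ward's merge criterion (a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the usual choice).

```python
import math
import statistics

# Assumption: six illustrative percentiles; with mean and std they give the 8-d vector.
PERCENTILES = (10, 25, 50, 75, 90, 99)


def percentile(sorted_vals: list[float], q: float) -> float:
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def fingerprint(per_position_nll: list[float]) -> list[float]:
    """(mean, std, six percentiles) of one sample's pooled per-position NLL values."""
    vals = sorted(per_position_nll)
    head = [statistics.fmean(vals), statistics.pstdev(vals)]
    return head + [percentile(vals, q) for q in PERCENTILES]


def ward_clusters(points: list[list[float]], k: int) -> list[list[int]]:
    """Naive agglomerative clustering with Ward's merge cost, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]

    def centroid(members: list[int]) -> list[float]:
        return [statistics.fmean(col) for col in zip(*(points[i] for i in members))]

    def cost(a: list[int], b: list[int]) -> float:
        # Ward: increase in within-cluster sum of squares caused by merging a and b.
        d2 = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
        return len(a) * len(b) / (len(a) + len(b)) * d2

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Calling `ward_clusters([fingerprint(nll) for nll in per_sample_nll], k=3)` on the 18 samples would return three lists of sample indices, where `per_sample_nll` is a hypothetical list holding each sample's per-position NLL values.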

Where Omnii doesn't help

Within-kingdom 16S decomposition (e.g. hb024 attempted with mean-pool embeddings alone): Omnii embeddings of bacterial 16S V3-V5 amplicons saturate (all-pair cosine > 0.95). The right tool is microbiome-specific (UniFrac, Bray-Curtis on OTU tables).
Single-sequence species ID against a known organism (most bench tasks): BLAST is mature, faster, more accurate. Don't compete with it; use Omnii to filter first then BLAST a small subset.
Tabular tasks (~50/99): expression matrices, methylation β-matrices, VCF, microarray, CIF, Excel. Omnii has nothing to offer here.

Why so few clean Omnii wins

The bench is designed for bioinformatics agents with NCBI/Ensembl access. When BLAST/BWA work, agents default to them. When BLAST fails, the question is usually structured so that something else (k-mer, abundance, prior knowledge) carries the answer too. Omnii's value is concentrated in the narrow band where:
- the answer requires finding off-distribution reads or clustering unlabeled samples, and
- standard alignment-based tools either fail or are too slow/expensive on the search space at hand.

MCP tool design recommendations

Two tools would compress the patterns the agents repeatedly invented from scratch into one structured JSON call each. Sketches in notes/biomystery-mcp-tools-design.md:

omnii_decompose_samples(fasta_dir) → {
    macro_clusters: [...],          # NLL fingerprint clustering
    intra_substructure: {...},      # within-cluster pair-vs-triplet
    inferred_factor_structure: 'M tissues × N reps × K species'
}

omnii_qc_swap_detect(fastq_dir, label_csv) → {
    cosine_matrix: [...],           # NxN sample fingerprints
    per_sample_diagnostic: [...],   # within vs between-cluster cosine + flag
    flagged_swaps: [(sample_A, sample_B), ...]
}
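
The core of the swap-detection pattern (cosine matrix plus a nearest-neighbour label vote, as used on reccwgc4buredxvyz) might look like the following. It is a stdlib-only sketch under assumptions: each sample is already reduced to one embedding vector, k=5 mirrors the "top-5 nearest neighbours" mentioned above, and the function names are hypothetical; they do not correspond to `omnii_qc_swap_detect.py`.

```python
import math
from collections import Counter


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_matrix(embeddings: list[list[float]]) -> list[list[float]]:
    """NxN sample-similarity matrix."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def knn_label_votes(matrix: list[list[float]], labels: list[str], k: int = 5) -> list[str]:
    """For each sample, the majority label among its k most similar other samples."""
    votes = []
    for i, row in enumerate(matrix):
        others = (j for j in range(len(row)) if j != i)
        neighbours = sorted(others, key=lambda j: -row[j])[:k]
        votes.append(Counter(labels[j] for j in neighbours).most_common(1)[0][0])
    return votes


def flag_mislabeled(matrix: list[list[float]], labels: list[str], k: int = 5):
    """(index, recorded label, inferred label) for samples whose neighbours disagree."""
    votes = knn_label_votes(matrix, labels, k)
    return [(i, labels[i], votes[i]) for i in range(len(labels)) if votes[i] != labels[i]]
```

Two flagged samples whose recorded and inferred labels mirror each other are the natural candidates for a swapped pair.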

Plus the already-implemented solver_split_by_nll (src/triage/solver_split_by_nll.py), which is the right shape to lift into MCP as omnii_split_by_nll(fastq) → {low_nll_fasta, high_nll_fasta, gap}.
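
As an illustration of the split itself, the sketch below cuts the sorted per-read mean NLL values at their largest gap. It is an assumption about the shape of the logic, not a copy of src/triage/solver_split_by_nll.py: the function name, the `min_group` guard, and the definition of `gap` as the distance between the two groups at the cut are all hypothetical, and writing the two FASTA files is omitted.

```python
import statistics


def split_by_nll(read_nll: dict[str, float], min_group: int = 2) -> dict:
    """Split reads into low/high NLL groups at the largest gap between sorted values."""
    ranked = sorted(read_nll.items(), key=lambda kv: kv[1])
    if len(ranked) < 2 * min_group:
        raise ValueError("not enough reads to split")
    # A cut after position i keeps at least min_group reads on each side.
    cuts = range(min_group - 1, len(ranked) - min_group)
    best = max(cuts, key=lambda i: ranked[i + 1][1] - ranked[i][1])
    low, high = ranked[: best + 1], ranked[best + 1 :]
    return {
        "low_nll_reads": [name for name, _ in low],
        "high_nll_reads": [name for name, _ in high],
        "low_mean": statistics.fmean(v for _, v in low),
        "high_mean": statistics.fmean(v for _, v in high),
        "gap": ranked[best + 1][1] - ranked[best][1],
    }
```

A small `gap` relative to the spread inside each group suggests the reads are not clearly bimodal, in which case the split should not be trusted as a host/non-host partition.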

Caveats

Coverage = 31/99. Sample biased toward smaller-data tasks; an extension to full coverage may shift the rates.
Single-trial. The blog reports multi-trial results (Opus 4.6 is bimodal: for 86% of the problems it solves at all, it solves them in 4 of 5 trials; on the hard set that reliability is only 44%). We have N=1 per (task, arm).
3 manual judge overrides. All three are unambiguously correct on inspection (heuristic regex format mismatch). Without them: solvable=YES 65%, solvable=NO 18%.
2 AUP refusals (hb019, recnheebqpbdp1nj9) on pathogen-ID phrasing — excluded rather than counted as fails. Counting them as fails: solvable becomes 16/22 = 73%.
Subagents share host process. Claude Code's Agent tool isn't a hardened sandbox; rubric isolation depends on the agent obeying "don't read outside work dir" — verified by inspection of proof_of_work.md for every transcript.

Reproducing this work

Clone with submodules:
    git clone <this repo>
    cd omnii-biomysterybench
    git submodule update --init --recursive
cp .env.example .env; fill in HF_TOKEN (and ANTHROPIC_API_KEY only if using the SDK runner path).
./scripts/download_full_bench.sh — pulls the 99 task zips into data/biomystery-full/.
sbatch scripts/serve_omnii_vllm.sh + sbatch scripts/serve_omnii_embed.sh — bring up vLLM (requires GPUs).
Edit OMNII_*_URL in .env to point to the actual node hostnames once Slurm assigns them.
sbatch scripts/run_one.sh <task_id> <A|B|C> <trial> for each (task, arm, trial) you want.
python3 src/runner/runs_index.py — see status table.

For the corrected Slurm/Docker path, rebuild biomystery-runner:latest after editing src/runner/biomystery_runner.py, then run scripts/run_one.sh or the targeted array wrapper. For the Claude-Code subagent path, per-trial prompts come from src/runner/build_prompt.py; spawn each via Claude Code's Agent tool with the prompt as the prompt parameter.

Pointers to specific evidence

Case-study transcripts (full reasoning + tool log + agent's analysis scripts): results/transcripts/hb024/B/trial2/, results/transcripts/reccwgc4buredxvyz/B/trial0/.
Per-task triage (NLL stats over 35 sequence-bearing tasks): results/outputs/triage_v3.json.
Omnii-fit ranking of all 36 tasks with parseable DNA: results/outputs/omnii_usefulness_ranking.tsv.
Blog repro tally: this document, table above.
Fix-by-fix vLLM hosting saga: scripts/serve_omnii_vllm.sh headers + notes/biomystery-omnii-embedding-runs.md.

What's left to do

If continuing this line of work:

Full bench coverage for tighter blog repro CIs — ~40 untouched tasks remain. Most are tabular and will fail uniformly; ~8 are DNA-bearing and could surface more Omnii-relevant signal.
Multi-trial reliability — 5 trials per (task, arm) to match the blog methodology and quantify the bimodality.
Build the two MCP tools (omnii_decompose_samples, omnii_qc_swap_detect) and re-run hb024 / reccwgc4buredxvyz with the agent calling these directly. We expect roughly a 50% reduction in tool calls per trial and possibly more reliable structural answers; this is a prediction, not a measured result.
LLM judge subagent — replace the heuristic judge.py with a Claude judge that handles rubric format flexibility consistently. Removes the manual-override ambiguity from the headline numbers.
Pursue the three confirmed patterns more aggressively: spike-in detection, swap detection, multi-factor sample decomposition. These are the niches where Omnii is genuinely the right tool — characterizing the precise lift on each pattern in isolation would make a clean per-pattern claim possible.