Omnii embedding case studies on BioMysteryBench

Abstract

Three case studies in which Omnii (radicalnumerics/omnii-dna-7b-base, a 7B DNA language model) supplied the decisive signal for BioMysteryBench tasks where standard alignment-based tools failed (case 1), required orders of magnitude more compute (case 2), or were led astray by assembly artifacts (case 3). In cases 1 and 2 the substantive answer is correct, but the strict-rubric grade is FAIL for reasons orthogonal to model capability: case 1 requires species naming that is information-theoretically absent from the input, and case 2 requires the agent to additionally recite per-sample cell-type assignments. In case 3, arm B reached the correct genus (Wolbachia) while arms A and C did not.

Setup

Bench. BioMysteryBench (Anthropic, 2025).
Model access. vLLM with two endpoints — /v1/completions (prompt_logprobs=1, max_tokens=1) for per-position negative log-likelihood (NLL); /v1/embeddings for 4096-d pooled representations.
Arms. A = baseline (no Omnii), B = Omnii endpoints + brief usage hint, C = Omnii + suggested NLL → embed → cluster workflow. Up to 3 trials per cell.
Logging. Every tool action is appended to proof_of_work.md by the subagent; Omnii calls are counted by grep for :8800 / :8801 / omnii.
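
The NLL endpoint described above can be sketched as a request body plus a small parser. This is a minimal sketch, not the harness code: the response layout (a `prompt_logprobs` list with `None` at the first position and one dict of `token_id -> {"logprob", "rank", ...}` per later position) is an assumption about vLLM's OpenAI-compatible server, and the function names are hypothetical.

```python
def completion_payload(model: str, sequence: str) -> dict:
    """Request body for /v1/completions that scores the prompt instead of generating."""
    return {
        "model": model,
        "prompt": sequence,
        "max_tokens": 1,        # generate as little as possible
        "prompt_logprobs": 1,   # vLLM extension: return log-probs for prompt tokens
    }


def per_position_nll(response: dict) -> list[float]:
    """Extract per-position NLL (nats) from a completions response.

    ASSUMPTION: with prompt_logprobs=1 each position's dict holds the top-1
    token plus the actual prompt token when the two differ, so the actual
    token is the entry with the largest rank.
    """
    entries = response["choices"][0]["prompt_logprobs"]
    nll = []
    for entry in entries:
        if entry is None:  # the first token has no conditional probability
            continue
        actual = max(entry.values(), key=lambda e: e.get("rank") or 0)
        nll.append(-actual["logprob"])
    return nll
```

The payload would be POSTed to the completions endpoint with any HTTP client; the parser is kept separate so it can be checked against a saved response without a running server.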

Case 1 — `hb024`: multi-host microbiome decomposition

Task

18 FASTA files, ~250 16S V3-V5 amplicon reads each. Question: how many tissue groups, how many host species. Truth: 18 = 2 tissues × 3 species × 3 reps (snake oral and gut microbiomes from Boiga, Trimeresurus, Laticauda). human_solvable=no.

Methods

Three trials in arm B, each with a different decomposition strategy:

trial	strategy
0	Mean-pooled embeddings, all-pairs cosine
1	Per-sample NLL fingerprint `(mean, std, p10, p50, p90, p99)` → ward k=3
2	NLL macro-cluster → embedding cosine + 8-mer Bray-Curtis substructure

Trial 2 cost: 218 NLL completion calls + 649 embedding calls = 867 Omnii calls.
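
The trial 1 strategy (six-number NLL fingerprint per sample, then Ward clustering with k=3) can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the agent's script: the population standard deviation, the linear-interpolation percentile, and the lack of feature scaling are all choices made here, and the function names are hypothetical.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def nll_fingerprint(nll_values):
    """(mean, std, p10, p50, p90, p99) of one sample's pooled per-position NLL."""
    vals = sorted(nll_values)
    return (
        statistics.fmean(vals),
        statistics.pstdev(vals),  # assumption: population std
        percentile(vals, 10),
        percentile(vals, 50),
        percentile(vals, 90),
        percentile(vals, 99),
    )


def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion; returns k lists of indices."""
    clusters = [([i], list(p)) for i, p in enumerate(points)]  # (members, centroid)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * d2  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]
```

Calling `ward_clusters([nll_fingerprint(v) for v in per_sample_nll], k=3)` on the 18 samples would give the macro-cluster assignment; for larger inputs a library implementation of Ward linkage is the more practical choice.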

Results

arm/trial	answer (tissues / species)	tissues	species	rubric
A	4 / 2	✗	✗	FAIL
B trial 0	4 / 2	✗	✗	FAIL
B trial 1	3 / 3	✗	✓	FAIL
B trial 2	2 / 3	✓	✓	FAIL†

† Strict rubric demands the three snake genera by name.

NLL-fingerprint macro-clusters (ward, k=3) cleanly partition the 18 samples into three groups of six, separable by mean NLL alone:

macro	members	mean NLL
A	{1, 8, 9, 10, 11, 12}	~0.30
B	{2, 3, 5, 6, 16, 18}	~0.20
C	{4, 7, 13, 14, 15, 17}	~0.12

Within-macro substructure (macro C, decisive):

partition	within Bray-Curtis	between Bray-Curtis	gap
best pair (3+3)	—	—	0.064
triplet (2+2+2)	0.233	0.355	0.122

Triplet partition wins 2 of 3 macros and is unambiguous in the cleanest macro ⇒ 2 tissues × 3 reps × 3 species.
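
The gap column in the table above is the mean between-group distance minus the mean within-group distance (0.355 − 0.233 = 0.122 for the triplet partition). A minimal sketch of that comparison on 8-mer profiles follows; it assumes relative k-mer frequencies and skips k-mers containing ambiguous bases, neither of which is confirmed by the transcripts, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_profile(reads, k=8):
    """Relative frequency of each k-mer across a sample's reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with N or other codes
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    shared = sum(min(p[key], q[key]) for key in p.keys() & q.keys())
    total = sum(p.values()) + sum(q.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def partition_gap(profiles, groups):
    """Mean between-group minus mean within-group Bray-Curtis for one partition.

    `groups` is a list of lists of sample indices, e.g. [[0, 1], [2, 3], [4, 5]].
    """
    group_of = {i: g for g, members in enumerate(groups) for i in members}
    within, between = [], []
    for i, j in combinations(sorted(group_of), 2):
        d = bray_curtis(profiles[i], profiles[j])
        (within if group_of[i] == group_of[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)
```

Scoring every candidate 3+3 and 2+2+2 partition of a six-sample macro with `partition_gap` and keeping the largest gap reproduces the kind of comparison shown in the table.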

Interpretation

Mean-pooled embeddings (trial 0) are uninformative here: pairwise cosines saturate above 0.95 because every sample is bacterial 16S amplicons. The NLL distribution preserves species-level differences in read "unusualness" that the pooled-embedding average washes out. Standard alignment tools (BLAST, kraken2) report "16S V3-V5 amplicon" for every sample without further differentiation, which is precisely arm A's failure mode (it guessed mammal + lizard from gut/oral microbe overlap).

The inability to recover species names from bacterial 16S alone is an information-theoretic limit, not a model limitation: naming the hosts requires either a snake-microbiome reference DB or host-DNA contamination, and neither is present in the input.

Case 2 — `reccwgc4buredxvyz`: sample-swap QC

Task

18 paired-end WGS samples (~26 GB). sample_info.csv labels each as cell_type_1 or cell_type_2; two are switched. Identify them. Truth: sample3 ↔ sample10. human_solvable=yes.

Methods

Single trial, arm B, 13 tool calls, 1m 38s wall-clock:

zcat | head — subsample 64 R1 reads per sample (1152 reads total).
36 batched /v1/embeddings calls → 18 sample-level mean-pooled fingerprints (4096-d).
L2-normalise; compute the 18×18 cosine matrix.
For each sample, compare mean within-label cosine vs mean between-label cosine; flag samples where between > within.
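
The last three steps above (pool, normalise, compare within-label against between-label cosine) can be sketched as below. The input is assumed to be a dict from sample name to its list of per-call embedding vectors; the function names are hypothetical, and the agent's actual script may differ in detail.

```python
import math


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (one sample-level fingerprint)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def l2_normalise(v):
    """Scale a vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def flag_label_swaps(fingerprints, labels):
    """Samples whose mean cosine to the other label exceeds that to their own label."""
    unit = {s: l2_normalise(v) for s, v in fingerprints.items()}
    flagged = []
    for s, u in unit.items():
        own, other = [], []
        for t, w in unit.items():
            if t == s:
                continue  # exclude self-similarity from the within-label mean
            cos = sum(a * b for a, b in zip(u, w))
            (own if labels[t] == labels[s] else other).append(cos)
        if sum(other) / len(other) > sum(own) / len(own):
            flagged.append(s)
    return flagged
```

With 18 samples the full comparison is 18×17 dot products of 4096-d vectors, so the cost is dominated by the embedding calls rather than by this step.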

Results

16 of 18 samples have within > between. Two outliers, both with all top-5 nearest neighbours in the opposite group:

sample	label given	mean cos to own label	mean cos to other	top-5 NN labels
sample3	cell_type_1	0.9954	0.9977	all cell_type_2 (sim ≥ 0.998)
sample10	cell_type_2	0.9955	0.9980	all cell_type_1 (sim ≥ 0.998)

Verdict: sample3 and sample10 have swapped labels. Strict rubric requires the agent additionally to restate each affected sample's true cell type — the agent identified the swap but did not include this restatement, hence rubric FAIL despite correct identification.
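
The "top-5 NN labels" column in the table above is a second, independent check on the two outliers. A short sketch, assuming the cosine matrix is held as a dict of dicts keyed by sample name (a hypothetical layout):

```python
def nearest_neighbour_labels(cosine, labels, sample, k=5):
    """Given labels of the k samples most similar to `sample`, most similar first."""
    others = [(sim, name) for name, sim in cosine[sample].items() if name != sample]
    others.sort(reverse=True)
    return [labels[name] for _, name in others[:k]]
```

A sample whose returned labels all differ from its own given label, as reported for sample3 and sample10, is a swap candidate.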

Interpretation

The pattern is a textbook embedding-similarity QC: no reference genome, no alignment, no domain heuristic — the embedding cosine of <100 raw reads per sample is sufficient. Total Omnii compute is ~36 GPU-seconds for embeddings; the natural alignment-based pipeline (BWA + variant-call + PCA per sample) would take hours on the same hardware.

Case 3 — `recoyp6qrymldcjle`: symbiont detection in mixed-host FASTQ

Task

One FASTQ from a Drosophila melanogaster sample contaminated with a microbial symbiont. Identify the bacterial genus of the symbiont. Truth: Wolbachia. human_solvable=yes. New in the Opus 4.8 full-99 sweep (runs_full99_20260606_opus48_r1) — not in mpoli's original two-flips set.

Methods

Single trial, arm B, ~11 tool calls before answer:

Turn 3: one python3 ./_omnii_tools/omnii_split_by_nll.py --input <fastq> --max-reads 64 call. Result: 57/64 reads classified high-NLL, gap of 0.13 nats between low- and high-NLL bins — immediately establishes that the host signature is not human (the Omnii-DNA reference distribution).
Turn 5–9: web-BLAST via curl --data-urlencode CMD=Put to the NCBI URL API, using a low-NLL read as query → returns matches to D. melanogaster, identifying the host. Submitted Wolbachia as a host-anchored prior at turn 11 (Wolbachia is the canonical D. melanogaster endosymbiont) and then verified with two further BLAST hits at 100% / 150 bp to wMel (NC_002978.6).
The same query also returned the universal 16S V3 primer CTCCTACGGGAGGCAGCAGT — its BLAST hits resolve to Wolbachia endosymbiont (group A), confirming the answer at strain-level.
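
The internals of omnii_split_by_nll.py are not shown in this note. As an illustration of the idea only, the sketch below splits reads into low-NLL and high-NLL bins at the widest gap between sorted per-read mean NLL values. The splitting rule and the function name are assumptions, and the script may define its bins and its reported gap differently.

```python
def split_by_nll(read_nll):
    """Split reads at the widest gap in sorted mean NLL.

    `read_nll` maps read name -> mean per-position NLL (nats).
    Returns (low_bin_names, high_bin_names, gap_in_nats).
    """
    if len(read_nll) < 2:
        raise ValueError("need at least two reads to split")
    ordered = sorted(read_nll.items(), key=lambda kv: kv[1])
    # gap between each pair of adjacent reads, with the index of the lower one
    gaps = [(ordered[i + 1][1] - ordered[i][1], i) for i in range(len(ordered) - 1)]
    gap, cut = max(gaps)
    low = [name for name, _ in ordered[:cut + 1]]
    high = [name for name, _ in ordered[cut + 1:]]
    return low, high, gap
```

Under this rule, a result like the one reported above would appear as 7 names in the low bin, 57 in the high bin, and a gap of about 0.13.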

Wall-clock: ~3 minutes for the full B trial. Arms A and C both failed:

Arm A (no Omnii): ran a megahit assembly + a coverage scan, found Wolbachia at 1459 reads/Mb vs Acetobacter persici at 177 reads/Mb (an 8× advantage for Wolbachia), then dismissed Wolbachia at turn 49 because its reads clustered on 2 of 156 scaffolds — misreading a fragmented assembly as a cross-mapping artifact. Submitted "Acetobacter."
Arm C (Omnii + workflow): Omnii calls returned cleanly at turn 3 and turn 6, but the agent chose blastn -remote -db nt for bacterial ID, which hung inside the sandbox. Spent turns 11–17 polling a dead BLAST job and never submitted (<no answer>).

Interpretation

This is the cleanest host/non-host NLL split in the full-99 sweep. Two mechanisms compose:

omnii_split_by_nll cleanly partitions reads by training-distribution distance. 57/64 high-NLL at gap 0.13 nats is a strong off-distribution signal; without it, arm A burned >40 turns on an assembly that buried the genuine Wolbachia signal under coverage-spread artifacts.
Once the high-NLL bin is the working set, conventional BLAST gives a genus-level answer cheaply — the NLL split is the triage, BLAST is the identification. The combination is faster and more robust than either alone.

Note the cautionary contrast with arm C: the Omnii tools worked, but the agent's choice of blastn -remote (instead of arm B's curl to the NCBI URL API) blocked progress. The lesson is operational, not capability: the Omnii primitive does its job; downstream tooling choice still matters.
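
The operational point can be made concrete with a submit-and-poll client that gives up at a deadline instead of hanging. Arm B used curl; the sketch below is a Python rendering of the same NCBI BLAST URL API flow (CMD=Put, then CMD=Get). The deadline and timeout values are arbitrary choices, the function names are hypothetical, and the 60-second poll interval assumes NCBI's guidance of polling a single RID no more than once a minute.

```python
import re
import time
import urllib.parse
import urllib.request

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def put_params(query, program="blastn", database="nt"):
    """Form fields for submitting one query sequence."""
    return {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}


def parse_rid(text):
    """Request ID from the QBlastInfo block of a Put response, or None."""
    m = re.search(r"RID = (\S+)", text)
    return m.group(1) if m else None


def parse_status(text):
    """Status (WAITING, READY, FAILED, UNKNOWN) from a SearchInfo response, or None."""
    m = re.search(r"Status=(\w+)", text)
    return m.group(1) if m else None


def post(params, timeout=30):
    """POST form fields with a socket timeout so a dead connection cannot block."""
    data = urllib.parse.urlencode(params).encode()
    with urllib.request.urlopen(BLAST_URL, data=data, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def blast_with_deadline(query, deadline_s=600, poll_s=60):
    """Submit a query and poll until READY; raise TimeoutError at the deadline."""
    rid = parse_rid(post(put_params(query)))
    if rid is None:
        raise RuntimeError("no RID in Put response")
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        time.sleep(poll_s)
        status = parse_status(post({"CMD": "Get", "RID": rid, "FORMAT_OBJECT": "SearchInfo"}))
        if status == "READY":
            return post({"CMD": "Get", "RID": rid, "FORMAT_TYPE": "Text"})
        if status in ("FAILED", "UNKNOWN"):
            raise RuntimeError(f"BLAST search {rid} ended with status {status}")
    raise TimeoutError(f"BLAST search {rid} not ready after {deadline_s}s")
```

The explicit deadline is the design choice that matters here: a bounded failure lets the agent fall back to another route or submit its best current answer, which is what arm C was unable to do.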

This case mirrors mpoli's original Flip 2 (hb019, EBOV virus ID via NLL-host-split) but is on a symbiont rather than a viral pathogen — and it isn't blocked by AUP, so future runs can use it as a clean demonstration of the host/non-host workflow without the pathogen-ID sensitivity.

Discussion

Three patterns generalize beyond these specific tasks:

Per-sample NLL distributional fingerprint (case 1). When mean-pooled embeddings saturate due to homogeneous library type, the distribution of per-position NLL across a small read sample (≈10–20 reads suffices) still discriminates source. Cheap and orthogonal to alignment.
Subsampled-reads embedding cosine (case 2). Sample-level embedding fingerprints from <100 reads detect label swaps that would otherwise require a full alignment + variant-call pipeline.
Host/non-host NLL split as triage + standard BLAST as identification (case 3). A single omnii_split_by_nll call with --max-reads 64 is enough to establish host identity; the high-NLL bin then feeds a conventional BLAST/curl-to-NCBI pipeline for genus-level naming. Arm B in recoyp6qrymldcjle solved this in ~3 minutes while arm A's assembly-based approach got lost in coverage artifacts after 40+ turns.

Case 2 scored FAIL only because the rubric demands restating each swapped sample's cell-type assignment, which follows directly from the labels already in the input, rather than because the model missed signal. Case 1's residual gap (species naming) is genuinely absent from the input modality. Neither failure indicates a limit of Omnii on the discriminative task itself.

A natural next step is to wrap each pattern in a deterministic JSON-returning MCP tool (omnii_decompose_samples, omnii_qc_swap_detect), so future agents call the validated pipeline once instead of re-deriving it; sketches in notes/biomystery-mcp-tools-design.md.

Reproducibility

Curated transcripts (final response, proof of work, grade):

Case 1: results/transcripts/hb024/B/trial2/
Case 2: results/transcripts/reccwgc4buredxvyz/B/trial0/
Case 3: results/runs_full99_20260606_opus48_r1/recoyp6qrymldcjle/B/trial0/ (forensic analysis at notes/analysis/recoyp6qrymldcjle.md)

Working scratch (more verbose, contains the full pipeline scripts the agent wrote):

notes/omnii-win-hb024.md
notes/biomystery-omnii-embedding-runs.md
