Omnii embedding case studies on BioMysteryBench
Abstract
Two case studies in which Omnii (radicalnumerics/omnii-dna-7b-base, a 7B DNA
language model) supplied the decisive signal for BioMysteryBench tasks where
standard alignment-based tools either failed (case 1) or required orders of
magnitude more compute (case 2). In both cases the substantive answer is
correct; the strict-rubric grade is FAIL for reasons orthogonal to model
capability — case 1 requires species naming that is information-theoretically
absent from the input, case 2 requires the agent to additionally recite
per-sample cell-type assignments.
Setup
- Bench. BioMysteryBench (Anthropic, 2025).
- Model access. vLLM with two endpoints: /v1/completions (prompt_logprobs=1, max_tokens=1) for per-position negative log-likelihood (NLL); /v1/embeddings for 4096-d pooled representations.
- Arms. A = baseline (no Omnii), B = Omnii endpoints + brief usage hint, C = Omnii + suggested NLL → embed → cluster workflow. Up to 3 trials per cell.
- Logging. Every tool action is appended to proof_of_work.md by the subagent; Omnii calls are counted by grep for :8800/:8801/omnii.
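The two endpoint calls described under "Model access" can be sketched as request bodies plus a small parser. This is a hedged illustration, not the harness code: the helper names are hypothetical, and the response shape assumed in per_position_nll (a prompt_logprobs list whose first entry is null and whose other entries map token ids to objects with logprob and rank fields) is an assumption about the server's output that should be checked against the deployed vLLM version.

```python
def nll_request(model: str, sequence: str) -> dict:
    """JSON body for POST /v1/completions that scores the prompt itself."""
    # max_tokens=1 keeps generation minimal; only the prompt log-probs are used.
    return {"model": model, "prompt": sequence, "max_tokens": 1, "prompt_logprobs": 1}


def embedding_request(model: str, sequences: list) -> dict:
    """JSON body for POST /v1/embeddings with a batch of sequences."""
    return {"model": model, "input": sequences}


def per_position_nll(response: dict) -> list:
    """Per-token NLL (nats) from a completions response.

    ASSUMPTION: with prompt_logprobs=1 each position holds the top-1 token
    and, if different, the actual prompt token. The actual token is then the
    candidate with the largest rank.
    """
    nll = []
    for entry in response["choices"][0]["prompt_logprobs"]:
        if entry is None:  # the first prompt token has no conditional log-prob
            continue
        actual = max(entry.values(), key=lambda cand: cand["rank"])
        nll.append(-actual["logprob"])
    return nll
```

The bodies would be sent with any HTTP client; which port serves which endpoint is not stated here, so the URL is left to the caller.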
Case 1 — hb024: multi-host microbiome decomposition
Task
18 FASTA files, ~250 16S V3-V5 amplicon reads each. Question: how many tissue
groups, how many host species. Truth: 18 = 2 tissues × 3 species × 3 reps
(snake oral and gut microbiomes from Boiga, Trimeresurus, Laticauda).
human_solvable=no.
Methods
Three trials in arm B, each with a different decomposition strategy:
| trial | strategy |
|---|---|
| 0 | Mean-pooled embeddings, all-pairs cosine |
| 1 | Per-sample NLL fingerprint (mean, std, p10, p50, p90, p99) → ward k=3 |
| 2 | NLL macro-cluster → embedding cosine + 8-mer Bray-Curtis substructure |
Trial 2 cost: 218 NLL completion calls + 649 embedding calls = 867 Omnii calls.
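The trial 1 strategy (a six-number NLL fingerprint per sample, then Ward clustering with k=3) can be sketched in plain Python. This is an illustrative sketch, not the agent's script: the function names are hypothetical, the percentile uses linear interpolation as an assumed convention, and the naive Ward loop is only sensible for small inputs such as 18 samples.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def nll_fingerprint(nlls):
    """(mean, std, p10, p50, p90, p99) of one sample's per-position NLLs."""
    s = sorted(nlls)
    return (statistics.fmean(s), statistics.pstdev(s),
            percentile(s, 10), percentile(s, 50),
            percentile(s, 90), percentile(s, 99))


def ward_clusters(points, k):
    """Naive agglomerative clustering using Ward's merge cost.

    points: list of equal-length numeric tuples; returns k lists of indices.
    """
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Typical use would be ward_clusters([nll_fingerprint(x) for x in per_sample_nlls], 3). The features are left unscaled here; whether the original run standardised them is not recorded.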
Results
| arm/trial | answer | tissues | species | rubric |
|---|---|---|---|---|
| A | 4 / 2 | ✗ | ✗ | FAIL |
| B trial 0 | 4 / 2 | ✗ | ✗ | FAIL |
| B trial 1 | 3 / 3 | ✗ | ✓ | FAIL |
| B trial 2 | 2 / 3 | ✓ | ✓ | FAIL† |
† Strict rubric demands the three snake genera by name.
NLL-fingerprint macro-clusters (ward, k=3) cleanly partition the 18 samples into three groups of six, separable by mean NLL alone:
| macro | members | mean NLL |
|---|---|---|
| A | {1, 8, 9, 10, 11, 12} | ~0.30 |
| B | {2, 3, 5, 6, 16, 18} | ~0.20 |
| C | {4, 7, 13, 14, 15, 17} | ~0.12 |
Within-macro substructure (macro C, decisive):
| partition | within Bray-Curtis | between Bray-Curtis | gap |
|---|---|---|---|
| best pair (3+3) | — | — | 0.064 |
| triplet (2+2+2) | 0.233 | 0.355 | 0.122 |
Triplet partition wins 2 of 3 macros and is unambiguous in the cleanest macro ⇒ 2 tissues × 3 reps × 3 species.
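The within/between comparison in the table can be reproduced with a few lines of Python. The sketch below assumes the gap is the mean between-group Bray-Curtis dissimilarity minus the mean within-group value, which is consistent with the triplet row (0.355 − 0.233 = 0.122). The function names are hypothetical, and how the original run enumerated candidate partitions is not recorded here.

```python
from collections import Counter
from itertools import combinations


def kmer_profile(reads, k=8):
    """Relative k-mer frequencies pooled over all reads of one sample."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den


def partition_gap(profiles, groups):
    """Mean between-group minus mean within-group dissimilarity.

    profiles: {sample_id: profile}; groups: list of lists of sample ids.
    A larger gap means the partition separates the samples more cleanly.
    """
    label = {s: g for g, members in enumerate(groups) for s in members}
    within, between = [], []
    for a, b in combinations(sorted(label), 2):
        d = bray_curtis(profiles[a], profiles[b])
        (within if label[a] == label[b] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)
```

Scoring a 3+3 split and a 2+2+2 split of the same six samples with partition_gap and keeping the larger value mirrors the comparison made in the table.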
Interpretation
Mean-pooled embeddings (trial 0) are uninformative here: pairwise cosines saturate above 0.95 because every sample is bacterial 16S amplicons. The NLL distribution preserves species-level differences in read "unusualness" that the pooled-embedding average washes out. Standard alignment tools (BLAST, kraken2) report "16S V3-V5 amplicon" for every sample without further differentiation, which is precisely arm A's failure mode (it guessed mammal + lizard from gut/oral microbe overlap).
The failure to recover species names from bacterial 16S alone is an information-theoretic limit, not a model limitation: naming the hosts requires either a snake-microbiome reference DB or host-DNA contamination, and neither is present in the input.
Case 2 — reccwgc4buredxvyz: sample-swap QC
Task
18 paired-end WGS samples (~26 GB). sample_info.csv labels each as
cell_type_1 or cell_type_2; two are switched. Identify them.
Truth: sample3 ↔ sample10. human_solvable=yes.
Methods
Single trial, arm B, 13 tool calls, 1m 38s wall-clock:
- zcat | head: subsample 64 R1 reads per sample (1152 reads total).
- 36 batched /v1/embeddings calls → 18 sample-level mean-pooled fingerprints (4096-d).
- L2-normalise; compute the 18×18 cosine matrix.
- For each sample, compare mean within-label cosine vs mean between-label cosine; flag samples where between > within.
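The last three steps above (pool, normalise, compare within-label and between-label cosine) can be sketched as follows. This is a minimal stdlib illustration under stated assumptions: the function names are hypothetical, fingerprints are plain lists of floats, and each label is assumed to have at least two samples so that both means are defined.

```python
import math


def mean_pool(read_embeddings):
    """Average per-read embedding vectors into one sample-level fingerprint."""
    n = len(read_embeddings)
    return [sum(col) / n for col in zip(*read_embeddings)]


def l2_normalise(v):
    """Scale a vector to unit length so that dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def flag_label_outliers(fingerprints, labels):
    """Samples whose mean cosine to the other label exceeds that to their own.

    fingerprints: {sample: vector}; labels: {sample: label}.
    Returns {sample: (mean_within, mean_between)} for flagged samples only.
    """
    unit = {s: l2_normalise(v) for s, v in fingerprints.items()}
    flagged = {}
    for s in unit:
        own, other = [], []
        for t in unit:
            if t == s:
                continue
            cos = sum(a * b for a, b in zip(unit[s], unit[t]))
            (own if labels[t] == labels[s] else other).append(cos)
        within, between = sum(own) / len(own), sum(other) / len(other)
        if between > within:
            flagged[s] = (within, between)
    return flagged
```

A pair of flagged samples carrying opposite labels is the signature of a swap; a single flagged sample would instead suggest a one-off mislabel or a contaminated library.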
Results
16 of 18 samples have within > between. Two outliers, both with all top-5 nearest neighbours in the opposite group:
| sample | label given | mean cos to own label | mean cos to other | top-5 NN labels |
|---|---|---|---|---|
| sample3 | cell_type_1 | 0.9954 | 0.9977 | all cell_type_2 (sim ≥ 0.998) |
| sample10 | cell_type_2 | 0.9955 | 0.9980 | all cell_type_1 (sim ≥ 0.998) |
Verdict: sample3 and sample10 have swapped labels. Strict rubric requires the agent additionally to restate each affected sample's true cell type — the agent identified the swap but did not include this restatement, hence rubric FAIL despite correct identification.
Interpretation
The pattern is a textbook embedding-similarity QC: no reference genome, no alignment, no domain heuristic — the embedding cosine of <100 raw reads per sample is sufficient. Total Omnii compute is ~36 GPU-seconds for embeddings; the natural alignment-based pipeline (BWA + variant-call + PCA per sample) would take hours on the same hardware.
Case 3 — recoyp6qrymldcjle: symbiont detection in mixed-host FASTQ
Task
One FASTQ from a Drosophila melanogaster sample contaminated with a
microbial symbiont. Identify the bacterial genus of the symbiont.
Truth: Wolbachia. human_solvable=yes. New in the Opus 4.8 full-99 sweep
(runs_full99_20260606_opus48_r1) — not in mpoli's original two-flips set.
Methods
Single trial, arm B, ~11 tool calls before answer:
- Turn 3: one python3 ./_omnii_tools/omnii_split_by_nll.py --input <fastq> --max-reads 64 call. Result: 57/64 reads classified high-NLL, gap of 0.13 nats between low- and high-NLL bins. This immediately establishes that the host signature is not human (the Omnii-DNA reference distribution).
- Turns 5–9: web-BLAST via curl --data-urlencode CMD=Put to the NCBI URL API, using a low-NLL read as query → returns matches to D. melanogaster, identifying the host. Submitted Wolbachia as a host-anchored prior at turn 11 (Wolbachia is the canonical D. melanogaster endosymbiont) and then verified with two further BLAST hits at 100% / 150 bp to wMel (NC_002978.6).
- The same query also returned the universal 16S V3 primer CTCCTACGGGAGGCAGCAGT; its BLAST hits resolve to Wolbachia endosymbiont (group A), confirming the answer at strain level.
Wall-clock: ~3 minutes for the full B trial. Arms A and C both failed:
- Arm A (no Omnii): ran a megahit assembly + a coverage scan, found Wolbachia at 1459 reads/Mb vs Acetobacter persici at 177 reads/Mb (an 8× advantage for Wolbachia), then dismissed Wolbachia at turn 49 because its reads clustered on 2 of 156 scaffolds — misreading a fragmented assembly as a cross-mapping artifact. Submitted "Acetobacter."
- Arm C (Omnii + workflow): Omnii calls returned cleanly at turn 3 and turn 6, but the agent chose blastn -remote -db nt for bacterial ID, which hung inside the sandbox. Spent turns 11–17 polling a dead BLAST job and never submitted (<no answer>).
Interpretation
This is the cleanest host/non-host NLL split in the full-99 sweep. Two mechanisms compose:
- omnii_split_by_nll cleanly partitions reads by training-distribution distance. 57/64 high-NLL at gap 0.13 nats is a strong off-distribution signal; without it, arm A burned >40 turns on an assembly that buried the genuine Wolbachia signal under coverage-spread artifacts.
- Once the high-NLL bin is the working set, conventional BLAST gives a genus-level answer cheaply: the NLL split is the triage, BLAST is the identification. The combination is faster and more robust than either alone.
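The internals of omnii_split_by_nll.py are not described in this write-up, so the following is only one plausible way to obtain a low/high partition and a gap in nats: sort reads by mean NLL and cut at the widest jump between neighbours. The function name and the cut rule are assumptions made for illustration.

```python
def split_by_nll(read_nll):
    """Split reads into low/high-NLL bins at the widest gap in sorted mean NLL.

    read_nll: {read_id: mean per-position NLL in nats}; needs at least 2 reads.
    Returns (low_ids, high_ids, gap). ASSUMPTION: the real tool may use a
    different rule, for example a fixed threshold or a two-component fit.
    """
    ranked = sorted(read_nll, key=read_nll.get)
    values = [read_nll[r] for r in ranked]
    # Index of the first read in the high bin: the position with the largest jump.
    cut = max(range(1, len(values)), key=lambda i: values[i] - values[i - 1])
    return ranked[:cut], ranked[cut:], values[cut] - values[cut - 1]
```

Under this rule, reads from either bin can then be passed to BLAST as queries, matching the triage-then-identify division of labour described above.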
Note the cautionary contrast with arm C: the Omnii tools worked, but the
agent's choice of blastn -remote (instead of arm B's curl to the
NCBI URL API) blocked progress. The lesson is operational, not capability:
the Omnii primitive does its job; downstream tooling choice still matters.
This case mirrors mpoli's original Flip 2 (hb019, EBOV virus ID via
NLL-host-split) but is on a symbiont rather than a viral pathogen — and
it isn't blocked by AUP, so future runs can use it as a clean
demonstration of the host/non-host workflow without the pathogen-ID
sensitivity.
Discussion
Three patterns generalize beyond these specific tasks:
- Per-sample NLL distributional fingerprint (case 1). When mean-pooled embeddings saturate due to homogeneous library type, the distribution of per-position NLL across a small read sample (≈10–20 reads suffices) still discriminates source. Cheap and orthogonal to alignment.
- Subsampled-reads embedding cosine (case 2). Sample-level embedding fingerprints from <100 reads detect label swaps that would otherwise require a full alignment + variant-call pipeline.
- Host/non-host NLL split as triage + standard BLAST as identification (case 3). A single omnii_split call with max_reads=64 is enough to establish host identity; the high-NLL bin then feeds a conventional BLAST/curl-to-NCBI pipeline for genus-level naming. Arm B in recoyp6qrymldcjle solved this in ~3 minutes while arm A's assembly-based approach got lost in coverage artifacts after 40+ turns.
Case 2 scored FAIL only because the rubric demands restating each affected sample's cell type, which follows directly from the input labels once the swap is identified, not because the model missed signal. Case 1's residual gap (species naming) reflects information that is genuinely absent from the input modality. Neither failure indicates a limit of Omnii on the discriminative task itself.
A natural next step is to wrap each pattern in a deterministic
JSON-returning MCP tool (omnii_decompose_samples, omnii_qc_swap_detect),
so future agents call the validated pipeline once instead of re-deriving it;
sketches in notes/biomystery-mcp-tools-design.md.
Reproducibility
Curated transcripts (final response, proof of work, grade):
- Case 1: results/transcripts/hb024/B/trial2/
- Case 2: results/transcripts/reccwgc4buredxvyz/B/trial0/
- Case 3: results/runs_full99_20260606_opus48_r1/recoyp6qrymldcjle/B/trial0/ (forensic analysis at notes/analysis/recoyp6qrymldcjle.md)
Working scratch (more verbose, contains the full pipeline scripts the agent wrote):
- notes/omnii-win-hb024.md
- notes/biomystery-omnii-embedding-runs.md