BioMysteryBench — agent prompts
The subagent for each (task, arm, trial) cell is run with the prompt below, which src/runner/build_prompt.py builds from a shared base template plus an arm-specific capability block.
Substitution at render time: {task_id} / {arm} / {trial} / {work_dir} are filled per cell.
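The substitution step can be sketched as below. This is a minimal illustration, not the contents of src/runner/build_prompt.py, which is not shown here; the function name `render_prompt` and the `cell` dictionary are hypothetical. It uses targeted `str.replace` calls instead of `str.format`, on the assumption that the arm blocks contain literal JSON braces (as the request bodies in the arm blocks below do), which `str.format` would try to interpret as fields.

```python
def render_prompt(template: str, arm_block: str, cell: dict) -> str:
    """Fill the arm block and the per-cell placeholders in a prompt template."""
    # Insert the arm block first so any placeholders inside it are filled too.
    out = template.replace("{arm_capabilities_block}", arm_block)
    # Replace only the four known keys; other braces (e.g. JSON) are left alone.
    for key in ("task_id", "arm", "trial", "work_dir"):
        out = out.replace("{" + key + "}", str(cell[key]))
    return out


# Toy inputs, for illustration only.
template = "**Working directory:** `{work_dir}`\n{arm_capabilities_block}"
arm_block = 'Send `{"model": "m", "prompt": "<DNA seq>"}` to the endpoint.'
cell = {"task_id": "hb010", "arm": "B", "trial": 0,
        "work_dir": "/tmp/hb010/B/trial0/work"}
prompt = render_prompt(template, arm_block, cell)
```

The JSON snippet in the toy arm block passes through unchanged, while `{work_dir}` is replaced by the cell's path.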
Two prompt versions
There are two distinct prompt designs in the project history, attached to different published results:
| Version | Commit | Used by | What changed |
|---|---|---|---|
| v1 — raw endpoints (mpoli) | 42f6c51 (2026-05-05) | mpoli's two-flips report (notes/two-flips.md) | Hardcoded primeintellect-freyr3-gpu-002 URLs. Arms B/C describe the raw vLLM endpoints only — agent must roll its own curl/python. No Arm D. |
| v2 — staged tools (current) | bcaff0f (2026-05-05, later) and 6461503 (2026-05-24) | corrected top-18 sweep (runs_targeted_omnii_20260525_claude_opus_r1) and full-99 Opus 4.8 sweep (runs_full99_20260606_opus48_r1) | Env-parameterized {gen_url}/{embed_url}. Arms B/C/D point the agent at 7 pre-staged Python helpers in ./_omnii_tools/ (omnii_split_by_nll, omnii_per_position_anomaly, omnii_embed_records, omnii_cluster_samples, …). Adds Arm D (= B prompt, alt orchestrator). |
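The base template is essentially identical across both versions; only the arm blocks differ. Because the v2 arm blocks add staged helpers on top of the raw endpoints, comparing the same arm across versions also compares two tooling interfaces, so read cross-version arm comparisons with that confound in mind.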
Base template (src/runner/run_prompt_template.md)
You are running a single trial of the BioMysteryBench evaluation harness — a published Anthropic benchmark for testing bioinformatics agent capabilities. We are reproducing baseline scores; this is purely an evaluation reproducibility experiment.
**Working directory:** `{work_dir}`
**Task:** Read `TASK.md` in that directory for the question. Input data is in the same directory. Solve the task and write your final answer to `./answer.txt` (short exact string only — match the format the question asks for).
**Strict rules:**
1. Stay in the working directory. Do NOT inspect any other path under the bench root — `data/biomystery-full/problems.csv` contains the answer rubric and must not be read.
2. You may use Bash, install packages (pip/conda to /tmp), curl from allowed domains in TASK.md.
3. Time budget: ~25 min. Don't loop indefinitely.
4. Do NOT search the web for "BioMysteryBench" or task descriptions. Solve from the data + your knowledge.
{arm_capabilities_block}
**Proof-of-work logging — REQUIRED:**
As you work, append one line per tool action to `./proof_of_work.md` in this format:
[YYYY-MM-DDTHH:MM:SSZ]
You can either `echo` the line via Bash after each tool call, or batch-write at natural checkpoints. The file should let a reviewer reconstruct your exact path to the answer.
**Final response — REQUIRED format:**
End your run by writing to `./final_response.md` in the working directory:
- Section 1: **Final answer** (the exact string)
- Section 2: **Reasoning** (3–6 sentences: what you found and why it implies that answer)
- Section 3: **Tools used** (bullet list of the unique tools/commands invoked, with counts)
Then in your reply to me, output only:
ANSWER:
The reviewer will read `./final_response.md` and `./proof_of_work.md` for the rationale.
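A proof-of-work line of the kind the template asks for could be appended with a helper like the one below. This is a sketch only: the template as reproduced here shows just the UTC timestamp prefix, so the free-text description after the timestamp is an assumption, and `log_action` is a hypothetical name.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_action(description: str, log_path: str = "./proof_of_work.md") -> str:
    """Append one timestamped line to the proof-of-work log and return it."""
    # UTC timestamp in the [YYYY-MM-DDTHH:MM:SSZ] shape shown in the template.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Assumption: the rest of the line is a short free-text description.
    line = f"[{stamp}] {description}"
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

Opening the file in append mode keeps earlier entries intact, which matches the "append one line per tool action" wording.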
v1 — mpoli's arm blocks (commit 42f6c51)
v1 Arm A
**Arm A — baseline.** Standard tooling only.
v1 Arm B
**Arm B — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:
- `http://primeintellect-freyr3-gpu-002:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://primeintellect-freyr3-gpu-002:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.
Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.
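For reference, the raw call that v1 Arm B leaves to the agent might look like the sketch below. The request body and the `choices[0].prompt_logprobs` field come from the block above; everything else is an assumption, in particular the response shape (a list with `None` for the first token, then one dict per position mapping a token id to an object with a `logprob` field) and the function names, which are hypothetical.

```python
import json
import urllib.request

MODEL = "radicalnumerics/omnii-dna-7b-base"


def nll_payload(seq: str) -> dict:
    """Request body from the Arm B block: score the prompt, generate one token."""
    return {"model": MODEL, "prompt": seq, "max_tokens": 1,
            "prompt_logprobs": 1, "temperature": 0}


def mean_nll(prompt_logprobs: list) -> float:
    """Mean negative log-likelihood over the scored prompt positions.

    Assumed: with prompt_logprobs=1 each position holds the actual token plus
    the top-1 token when they differ, so the actual token is the entry with
    the lowest logprob. The first position is None and is skipped.
    """
    nlls = [-min(entry["logprob"] for entry in pos.values())
            for pos in prompt_logprobs if pos]
    if not nlls:
        raise ValueError("no scored positions")
    return sum(nlls) / len(nlls)


def score(seq: str, url: str) -> float:
    """POST one sequence to the completions endpoint and return its mean NLL."""
    req = urllib.request.Request(
        url, data=json.dumps(nll_payload(seq)).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return mean_nll(body["choices"][0]["prompt_logprobs"])
```

Keeping `mean_nll` separate from the HTTP call makes the parsing assumption easy to check against one real response before scoring a whole read set.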
v1 Arm C
**Arm C — Omnii available + workflow guidance.** Two HTTP endpoints expose a 7B DNA language model:
- `http://primeintellect-freyr3-gpu-002:8800/v1/completions` (generative, prompt_logprobs)
- `http://primeintellect-freyr3-gpu-002:8801/v1/embeddings` (4096-d pooled)
Suggested workflow when applicable: (1) score reads' NLL → split into low/high bins → (2) embed the high-NLL bin → (3) cluster. The high-NLL set is your "interesting" reads (off-distribution = candidate non-host). Use the low-NLL bin as your host-control reference. **You should genuinely try Omnii as part of solving this task.**
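Step (1) of the suggested workflow, splitting reads into low and high NLL bins, can be sketched as follows. The median threshold is an assumption chosen for illustration; the prompt does not say how the cut-off should be picked, and `split_by_nll` here is a hypothetical function, not the staged v2 helper of the same name.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple


def split_by_nll(nll_by_read: Dict[str, float],
                 threshold: Optional[float] = None
                 ) -> Tuple[List[str], List[str], float]:
    """Split read ids into (low, high) bins around a mean-NLL threshold."""
    if threshold is None:
        # Assumption: fall back to the median when no threshold is given.
        threshold = median(nll_by_read.values())
    low = [rid for rid, nll in nll_by_read.items() if nll <= threshold]
    high = [rid for rid, nll in nll_by_read.items() if nll > threshold]
    return low, high, threshold


# Toy scores, for illustration only.
scores = {"r1": 1.10, "r2": 1.15, "r3": 1.90, "r4": 1.95}
low, high, cut = split_by_nll(scores)
```

A median cut always yields a non-empty high bin when scores differ, even if every read is host-derived, so the size of the gap between the bins is worth checking before the high bin is treated as non-host.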
v2 — current arm blocks
The Omnii URLs shown in the blocks below are the current production values (http://g273:8800 for completions, http://g273:8801 for embeddings).
v2 Arm A
**Arm A — baseline.** Standard tooling only.
v2 Arm B
**Arm B — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:
- `http://g273:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://g273:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.
Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.
General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.
NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
`python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
`python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`
Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
`python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
`python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
`python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`
Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.
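One way an agent might compose the staged helpers from Python is to build each argument list and run the steps in order, as in the sketch below. Only script names and flags that appear in the block above are used; the wrapper function names are hypothetical, and nothing here is taken from the helpers' source.

```python
import subprocess

TOOLS = "./_omnii_tools"  # staging directory named in the prompt


def split_cmd(reads, gen_url, out_dir="_omnii_split", max_reads=64):
    """argv for the NLL split helper."""
    return ["python3", f"{TOOLS}/omnii_split_by_nll.py", "--input", reads,
            "--out-dir", out_dir, "--max-reads", str(max_reads),
            "--gen-url", gen_url]


def embed_cmd(reads, out_prefix="_omnii_emb"):
    """argv for per-record embedding."""
    return ["python3", f"{TOOLS}/omnii_embed_records.py",
            "--input", reads, "--out-prefix", out_prefix]


def cluster_cmd(embed="_omnii_emb", out="_omnii_clusters.json", k="auto"):
    """argv for clustering the rows of an embedding file."""
    return ["python3", f"{TOOLS}/omnii_cluster_samples.py",
            "--embed", embed, "--out", out, "--k", str(k)]


def run_step(cmd):
    """Run one helper and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

The file names that the split helper writes into its output directory are not documented in the prompt, so the path passed to `embed_cmd` for the high-NLL bin has to be read from that directory after the first step finishes.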
v2 Arm C
**Arm C — Omnii available + workflow guidance.** Two HTTP endpoints expose a 7B DNA language model:
- `http://g273:8800/v1/completions` (generative, prompt_logprobs)
- `http://g273:8801/v1/embeddings` (4096-d pooled)
Suggested workflow when applicable: (1) score reads' NLL → split into low/high bins → (2) embed the high-NLL bin → (3) cluster. The high-NLL set is your "interesting" reads (off-distribution = candidate non-host). Use the low-NLL bin as your host-control reference. **You should genuinely try Omnii as part of solving this task.**
General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.
NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
`python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
`python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`
Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
`python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
`python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
`python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`
Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.
v2 Arm D
**Arm D — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:
- `http://g273:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://g273:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.
Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.
General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.
NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
`python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
`python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`
Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
`python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
`python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
`python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`
Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.
Diff: v1 → v2 Arm B (most consequential change)
| | v1 (mpoli) | v2 (current) |
|---|---|---|
| Endpoints | Hardcoded primeintellect-freyr3-gpu-002:8800/8801 | Parameterized {gen_url}/{embed_url} from .env |
| Tooling interface | Raw HTTP only — agent writes its own curl / requests code | 7 pre-staged Python helpers in ./_omnii_tools/ + raw endpoints |
| Pattern hints | "mean per-position NLL stratifies by source kingdom" + "pooled embeddings + k-means" | Same hints + concrete invocation examples for each helper |
| Length | ~600 chars | ~2200 chars |
| Arm D | absent | added (= B prompt, alt orchestrator) |
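The endpoint parameterization in the first row might be resolved along these lines. This is a sketch under stated assumptions: the variable names `OMNII_GEN_URL` and `OMNII_EMBED_URL` are hypothetical (the actual keys in .env are not shown in this document), and the .env file is assumed to have been exported into the process environment already.

```python
import os


def omnii_urls(env=None) -> dict:
    """Resolve base URLs for the {gen_url} and {embed_url} placeholders."""
    env = os.environ if env is None else env
    # Hypothetical variable names; adjust to whatever .env actually defines.
    urls = {"gen_url": env.get("OMNII_GEN_URL", ""),
            "embed_url": env.get("OMNII_EMBED_URL", "")}
    missing = [name for name, value in urls.items() if not value]
    if missing:
        # Fail loudly instead of falling back to a hardcoded host.
        raise KeyError(f"missing endpoint settings: {', '.join(missing)}")
    # Strip a trailing slash so paths like /v1/completions join cleanly.
    return {name: value.rstrip("/") for name, value in urls.items()}
```

Raising on a missing value, instead of defaulting to a fixed host, avoids silently rendering a prompt that points at a machine from an earlier run.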
What the subagent actually sees (v2, Arm B, hb010, trial 0)
You are running a single trial of the BioMysteryBench evaluation harness — a published Anthropic benchmark for testing bioinformatics agent capabilities. We are reproducing baseline scores; this is purely an evaluation reproducibility experiment.
**Working directory:** `/home/melissa/omnii-biomysterybench/results/runs_full99_20260606_opus48_r1/hb010/B/trial0/work`
**Task:** Read `TASK.md` in that directory for the question. Input data is in the same directory. Solve the task and write your final answer to `./answer.txt` (short exact string only — match the format the question asks for).
**Strict rules:**
1. Stay in the working directory. Do NOT inspect any other path under the bench root — `data/biomystery-full/problems.csv` contains the answer rubric and must not be read.
2. You may use Bash, install packages (pip/conda to /tmp), curl from allowed domains in TASK.md.
3. Time budget: ~25 min. Don't loop indefinitely.
4. Do NOT search the web for "BioMysteryBench" or task descriptions. Solve from the data + your knowledge.
**Arm B — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:
- `http://g273:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://g273:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.
Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.
General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.
NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
`python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
`python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`
Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
`python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
`python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
`python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`
Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.
**Proof-of-work logging — REQUIRED:**
As you work, append one line per tool action to `./proof_of_work.md` in this format:
[YYYY-MM-DDTHH:MM:SSZ]
You can either `echo` the line via Bash after each tool call, or batch-write at natural checkpoints. The file should let a reviewer reconstruct your exact path to the answer.
**Final response — REQUIRED format:**
End your run by writing to `./final_response.md` in the working directory:
- Section 1: **Final answer** (the exact string)
- Section 2: **Reasoning** (3–6 sentences: what you found and why it implies that answer)
- Section 3: **Tools used** (bullet list of the unique tools/commands invoked, with counts)
Then in your reply to me, output only:
ANSWER:
The reviewer will read `./final_response.md` and `./proof_of_work.md` for the rationale.