BioMysteryBench — agent prompts

The subagent for one (task, arm, trial) cell is run with the prompt below. The prompt is built by src/runner/build_prompt.py from a shared base template plus an arm-specific capability block.

Substitution at render time: {task_id} / {arm} / {trial} / {work_dir} are filled per cell.
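As a rough illustration of the render step, the sketch below fills the arm block and the per-cell placeholders with plain string replacement. It is not the code in src/runner/build_prompt.py; the function name `render_prompt` and the sample strings are hypothetical. It uses `str.replace` rather than `str.format` on the assumption that arm blocks carry literal JSON braces, which `str.format` would try to interpret.

```python
def render_prompt(base_template: str, arm_block: str, cell: dict) -> str:
    """Insert the arm block, then fill each {name} placeholder from `cell`.

    Hypothetical helper: only the placeholder names come from the document.
    """
    prompt = base_template.replace("{arm_capabilities_block}", arm_block)
    for key, value in cell.items():
        prompt = prompt.replace("{" + key + "}", str(value))
    return prompt


# Toy inputs (not the real template) showing that literal braces survive.
base = "**Working directory:** `{work_dir}`\n\n{arm_capabilities_block}\n"
arm_block = 'Send `{"model": "m"}` to the endpoint.'
cell = {"task_id": "hb010", "arm": "B", "trial": 0,
        "work_dir": "/tmp/hb010/B/trial0/work"}

rendered = render_prompt(base, arm_block, cell)
print(rendered)
```

Any placeholder missing from `cell` is left in the output unchanged, so an unrendered `{name}` is easy to spot in a saved prompt.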

Two prompt versions

There are two distinct prompt designs in the project history, attached to different published results:

| Version | Commit | Used by | What changed |
| --- | --- | --- | --- |
| v1: raw endpoints (mpoli) | 42f6c51 (2026-05-05) | mpoli's two-flips report (notes/two-flips.md) | Hardcoded primeintellect-freyr3-gpu-002 URLs. Arms B/C describe the raw vLLM endpoints only, so the agent must roll its own curl/python. No Arm D. |
| v2: staged tools (current) | bcaff0f (2026-05-05, later) and 6461503 (2026-05-24) | corrected top-18 sweep (runs_targeted_omnii_20260525_claude_opus_r1) and full-99 Opus 4.8 sweep (runs_full99_20260606_opus48_r1) | Env-parameterized {gen_url}/{embed_url}. Arms B/C/D point the agent at 7 pre-staged Python helpers in ./_omnii_tools/ (omnii_split_by_nll, omnii_per_position_anomaly, omnii_embed_records, omnii_cluster_samples, …). Adds Arm D (= B prompt, alt orchestrator). |

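The base template is essentially identical across both versions; only the arm blocks differ. Because the v2 arm blocks add the staged helpers and their invocation examples, a score difference between a v1 arm and the same arm in v2 reflects the prompt change as well as anything else that changed between runs, so compare across versions with care.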

Base template (src/runner/run_prompt_template.md)

You are running a single trial of the BioMysteryBench evaluation harness — a published Anthropic benchmark for testing bioinformatics agent capabilities. We are reproducing baseline scores; this is purely an evaluation reproducibility experiment.

**Working directory:** `{work_dir}`

**Task:** Read `TASK.md` in that directory for the question. Input data is in the same directory. Solve the task and write your final answer to `./answer.txt` (short exact string only — match the format the question asks for).

**Strict rules:**
1. Stay in the working directory. Do NOT inspect any other path under the bench root — `data/biomystery-full/problems.csv` contains the answer rubric and must not be read.
2. You may use Bash, install packages (pip/conda to /tmp), curl from allowed domains in TASK.md.
3. Time budget: ~25 min. Don't loop indefinitely.
4. Do NOT search the web for "BioMysteryBench" or task descriptions. Solve from the data + your knowledge.

{arm_capabilities_block}

**Proof-of-work logging — REQUIRED:**

As you work, append one line per tool action to `./proof_of_work.md` in this format:

[YYYY-MM-DDTHH:MM:SSZ] :

You can either `echo` the line via Bash after each tool call, or batch-write at natural checkpoints. The file should let a reviewer reconstruct your exact path to the answer.

**Final response — REQUIRED format:**

End your run by writing to `./final_response.md` in the working directory:
- Section 1: **Final answer** (the exact string)
- Section 2: **Reasoning** (3–6 sentences: what you found and why it implies that answer)
- Section 3: **Tools used** (bullet list of the unique tools/commands invoked, with counts)

Then in your reply to me, output only:

ANSWER:

The reviewer will read `./final_response.md` and `./proof_of_work.md` for the rationale.
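The proof-of-work requirement in the template above can be met with a small logging helper. The sketch below writes one UTC-timestamped line per action. The fields after the timestamp are not visible in the format line as shown here, so the `action: detail` layout and the function names are assumptions rather than the harness's actual format.

```python
from datetime import datetime, timezone
from typing import Optional


def format_log_line(action: str, detail: str,
                    now: Optional[datetime] = None) -> str:
    """Build one log line with a [YYYY-MM-DDTHH:MM:SSZ] UTC prefix.

    The "action: detail" part is an assumed layout, not taken from the source.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"[{stamp}] {action}: {detail}"


def log_action(action: str, detail: str,
               path: str = "./proof_of_work.md") -> None:
    """Append one line to the proof-of-work file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(format_log_line(action, detail) + "\n")


example = format_log_line(
    "bash", "head TASK.md",
    now=datetime(2026, 6, 6, 12, 0, 0, tzinfo=timezone.utc),
)
print(example)
```

Appending after every tool call, rather than batch-writing at the end, keeps the log usable if the run hits the time budget.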

v1 — mpoli's arm blocks (commit 42f6c51)

v1 Arm A

**Arm A — baseline.** Standard tooling only.

v1 Arm B

**Arm B — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:

- `http://primeintellect-freyr3-gpu-002:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://primeintellect-freyr3-gpu-002:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.

Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.
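For reference (this is not part of the prompt), here is a minimal sketch of the kind of client a v1 agent has to write itself to get a mean NLL from the completions endpoint. The request body is the one quoted above. The response handling is an assumption: it expects `choices[0].prompt_logprobs` to be a list whose first entry is null and whose other entries map candidate tokens to a log-probability, or to an object with a `logprob` field, holding the actual prompt token plus the top-1 alternative. Under that assumption the actual token is never more likely than the top-1 candidate, so its log-probability is the minimum of the entry. The sketch also treats each token as one position.

```python
import json
import urllib.request

GEN_URL = "http://primeintellect-freyr3-gpu-002:8800/v1/completions"
MODEL = "radicalnumerics/omnii-dna-7b-base"


def build_nll_payload(seq: str) -> dict:
    """Request body quoted in the Arm B prompt."""
    return {"model": MODEL, "prompt": seq, "max_tokens": 1,
            "prompt_logprobs": 1, "temperature": 0}


def token_logprob(entry: dict) -> float:
    """Log-probability of the actual prompt token (assumed: entry minimum)."""
    values = [v["logprob"] if isinstance(v, dict) else v
              for v in entry.values()]
    return min(values)


def mean_nll(prompt_logprobs: list) -> float:
    """Mean negative log-likelihood over scored positions (null entries skipped)."""
    scored = [token_logprob(e) for e in prompt_logprobs if e]
    if not scored:
        raise ValueError("no scored positions in prompt_logprobs")
    return -sum(scored) / len(scored)


def score_sequence(seq: str, url: str = GEN_URL, timeout: float = 60.0) -> float:
    """POST one sequence and return its mean NLL (needs network access)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_nll_payload(seq)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return mean_nll(body["choices"][0]["prompt_logprobs"])


# Offline check on a hand-made response fragment.
sample = [None,
          {"12": {"logprob": -0.5}},
          {"7": {"logprob": -2.0}, "9": {"logprob": -0.1}}]
print(mean_nll(sample))
```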

v1 Arm C

**Arm C — Omnii available + workflow guidance.** Two HTTP endpoints expose a 7B DNA language model:

- `http://primeintellect-freyr3-gpu-002:8800/v1/completions` (generative, prompt_logprobs)
- `http://primeintellect-freyr3-gpu-002:8801/v1/embeddings` (4096-d pooled)

Suggested workflow when applicable: (1) score reads' NLL → split into low/high bins → (2) embed the high-NLL bin → (3) cluster. The high-NLL set is your "interesting" reads (off-distribution = candidate non-host). Use the low-NLL bin as your host-control reference. **You should genuinely try Omnii as part of solving this task.**
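The suggested workflow can be sketched offline, with the NLL scores and embeddings supplied as plain Python data (this is not part of the prompt). The prompt does not say where to cut between low and high NLL, so the median threshold is an assumption, and the small k-means is a stand-in written for illustration, not the clustering any project helper uses.

```python
from statistics import median


def split_by_nll(nll_by_read: dict, threshold=None):
    """Return (low_ids, high_ids); the default cut is the median (assumption)."""
    cut = median(nll_by_read.values()) if threshold is None else threshold
    low = [r for r, v in nll_by_read.items() if v <= cut]
    high = [r for r, v in nll_by_read.items() if v > cut]
    return low, high


def _sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(list(max(
            vectors, key=lambda v: min(_sqdist(v, c) for c in centers))))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: _sqdist(v, centers[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data: step (1) scores, then step (2) embeddings for the high bin only.
nll = {"r1": 0.9, "r2": 1.0, "r3": 2.4, "r4": 2.6, "r5": 2.5, "r6": 1.1}
low, high = split_by_nll(nll)
toy_embeddings = {"r3": [0.0, 0.1], "r4": [5.0, 5.1], "r5": [0.1, 0.0]}
labels = kmeans([toy_embeddings[r] for r in high], k=2)  # step (3)
print(low, high, labels)
```

Here the low bin plays the host-control role described above, and only the high bin is clustered.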

v2 — current arm blocks

The v2 blocks below are shown with {gen_url} and {embed_url} already filled in with the current production Omnii values (http://g273:8800 and http://g273:8801).

v2 Arm A

**Arm A — baseline.** Standard tooling only.

v2 Arm B

**Arm B — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:

- `http://g273:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://g273:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.

Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.

General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.

NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
  `python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
  `python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`

Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
  `python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
  `python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
  `python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`

Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.
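The block above asks the agent to compose the staged helpers rather than rely on a single command. A minimal sketch of such a composition follows (it is not part of the prompt); it only builds argument lists from the flags quoted in the block. The environment variable name `OMNII_GEN_URL` and the function names are hypothetical. The file names the split helper writes under `_omnii_split` are not documented here, so the sketch embeds the original reads file instead of guessing a high-NLL output path.

```python
import os
import subprocess

TOOLS = "./_omnii_tools"


def helper_commands(reads: str, gen_url: str, emb_prefix: str = "_omnii_emb"):
    """Argument lists for split -> embed -> cluster, using only documented flags."""
    split = ["python3", f"{TOOLS}/omnii_split_by_nll.py",
             "--input", reads, "--out-dir", "_omnii_split",
             "--max-reads", "64", "--gen-url", gen_url]
    embed = ["python3", f"{TOOLS}/omnii_embed_records.py",
             "--input", reads, "--out-prefix", emb_prefix]
    cluster = ["python3", f"{TOOLS}/omnii_cluster_samples.py",
               "--embed", emb_prefix, "--out", "_omnii_clusters.json",
               "--k", "auto"]
    return [split, embed, cluster]


def run_all(commands) -> None:
    """Run each helper in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# OMNII_GEN_URL is a hypothetical variable name; the default is the URL shown above.
gen_url = os.environ.get("OMNII_GEN_URL", "http://g273:8800")
cmds = helper_commands("reads.fastq.gz", "http://g273:8800")
for cmd in cmds:
    print(" ".join(cmd))
```

Passing argument lists to `subprocess.run` avoids shell quoting problems with read paths, and `check=True` stops the chain before clustering runs on a missing embedding file.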

v2 Arm C

**Arm C — Omnii available + workflow guidance.** Two HTTP endpoints expose a 7B DNA language model:

- `http://g273:8800/v1/completions` (generative, prompt_logprobs)
- `http://g273:8801/v1/embeddings` (4096-d pooled)

Suggested workflow when applicable: (1) score reads' NLL → split into low/high bins → (2) embed the high-NLL bin → (3) cluster. The high-NLL set is your "interesting" reads (off-distribution = candidate non-host). Use the low-NLL bin as your host-control reference. **You should genuinely try Omnii as part of solving this task.**

General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.

NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
  `python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
  `python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`

Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
  `python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
  `python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
  `python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`

Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.

v2 Arm D

**Arm D — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:

- `http://g273:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://g273:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.

Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.

General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.

NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
  `python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
  `python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`

Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
  `python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
  `python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
  `python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`

Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.
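The raw embeddings endpoint described at the top of this block can also be called without the helpers. The sketch below (not part of the prompt) sends the quoted request body and compares two returned vectors. The response shape is an assumption: it expects an OpenAI-style body with a `data` list of objects carrying `index` and `embedding`, which the block itself does not confirm for this endpoint.

```python
import json
import math
import urllib.request

EMBED_URL = "http://g273:8801/v1/embeddings"
MODEL = "radicalnumerics/omnii-dna-7b-base"


def build_embed_payload(seqs) -> dict:
    """Request body quoted in the arm block."""
    return {"model": MODEL, "input": list(seqs)}


def parse_embeddings(body: dict) -> list:
    """Return vectors in input order (assumed OpenAI-style response shape)."""
    rows = sorted(body["data"], key=lambda r: r.get("index", 0))
    return [row["embedding"] for row in rows]


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(seqs, url: str = EMBED_URL, timeout: float = 120.0) -> list:
    """POST sequences and return their pooled embeddings (needs network access)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_embed_payload(seqs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))


# Offline check on a hand-made response with out-of-order rows.
fake_body = {"data": [{"index": 1, "embedding": [0.0, 1.0]},
                      {"index": 0, "embedding": [1.0, 0.0]}]}
vectors = parse_embeddings(fake_body)
print(vectors, cosine(vectors[0], vectors[1]))
```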

Diff: v1 → v2 Arm B (most consequential change)

| | v1 (mpoli) | v2 (current) |
| --- | --- | --- |
| Endpoints | Hardcoded primeintellect-freyr3-gpu-002:8800/8801 | Parameterized {gen_url}/{embed_url} from .env |
| Tooling interface | Raw HTTP only; agent writes its own curl / requests code | 7 pre-staged Python helpers in ./_omnii_tools/ + raw endpoints |
| Pattern hints | "mean per-position NLL stratifies by source kingdom" + "pooled embeddings + k-means" | Same hints + concrete invocation examples for each helper |
| Length | ~600 chars | ~2200 chars |
| Arm D | absent | added (= B prompt, alt orchestrator) |

What the subagent actually sees (v2, Arm B, hb010, trial 0)

You are running a single trial of the BioMysteryBench evaluation harness — a published Anthropic benchmark for testing bioinformatics agent capabilities. We are reproducing baseline scores; this is purely an evaluation reproducibility experiment.

**Working directory:** `/home/melissa/omnii-biomysterybench/results/runs_full99_20260606_opus48_r1/hb010/B/trial0/work`

**Task:** Read `TASK.md` in that directory for the question. Input data is in the same directory. Solve the task and write your final answer to `./answer.txt` (short exact string only — match the format the question asks for).

**Strict rules:**
1. Stay in the working directory. Do NOT inspect any other path under the bench root — `data/biomystery-full/problems.csv` contains the answer rubric and must not be read.
2. You may use Bash, install packages (pip/conda to /tmp), curl from allowed domains in TASK.md.
3. Time budget: ~25 min. Don't loop indefinitely.
4. Do NOT search the web for "BioMysteryBench" or task descriptions. Solve from the data + your knowledge.

**Arm B — Omnii hint available.** Two HTTP endpoints expose a 7B DNA language model:

- `http://g273:8800/v1/completions` — generative; OpenAI-compatible. For per-position negative log-likelihood, send `{"model": "radicalnumerics/omnii-dna-7b-base", "prompt": "<DNA seq>", "max_tokens": 1, "prompt_logprobs": 1, "temperature": 0}` and read `choices[0].prompt_logprobs`.
- `http://g273:8801/v1/embeddings` — 4096-d pooled DNA embeddings. Send `{"model": "radicalnumerics/omnii-dna-7b-base", "input": ["<seq1>", ...]}`.

Useful patterns: **mean per-position NLL** stratifies by source kingdom (low NLL ≈ human-like; high NLL ≈ off-distribution like bacteria/virus/plant); **pooled embeddings + k-means** can decompose a mixed read set into source groups. **You should genuinely try at least one Omnii call as part of solving this task** — its hits/misses are part of what we're measuring.

General Omnii primitives are staged in `./_omnii_tools/`. Each one returns structured data describing the sequence or model behavior — interpretation is your job. Compose them rather than expecting a single command to produce the rubric answer.

NLL primitives (model = surprise per base):
- Split reads into low/high NLL bins (host vs off-distribution):
  `python3 ./_omnii_tools/omnii_split_by_nll.py --input <reads.fastq[.gz]> --out-dir _omnii_split --max-reads 64 --gen-url http://g273:8800`
- Find positions where one sequence is locally surprising (mutation/fusion/integration hotspots):
  `python3 ./_omnii_tools/omnii_per_position_anomaly.py --input <query.fasta> --out _omnii_anomaly.json --gen-url http://g273:8800`

Embedding primitives (compose against an .npy):
- Embed records or per-sample read sets to an .npy + manifest:
  `python3 ./_omnii_tools/omnii_embed_records.py --input <reads.fasta|fastq[.gz]> --out-prefix _omnii_emb` (per-record)
  `python3 ./_omnii_tools/omnii_embed_records.py --fastq-dir <sample_dir> --out-prefix _omnii_emb [--n-reads 128] [--use-r2]` (per-sample)
- Cluster the rows of an embedding .npy and report silhouette/within-vs-between gap:
  `python3 ./_omnii_tools/omnii_cluster_samples.py --embed _omnii_emb --out _omnii_clusters.json [--k auto|<int>]`

Endpoints used (default model `radicalnumerics/omnii-dna-7b-base`): http://g273:8800 (completions), http://g273:8801 (embeddings). Log each call as an `omnii_*` action in `proof_of_work.md`.

**Proof-of-work logging — REQUIRED:**

As you work, append one line per tool action to `./proof_of_work.md` in this format:

[YYYY-MM-DDTHH:MM:SSZ] :

You can either `echo` the line via Bash after each tool call, or batch-write at natural checkpoints. The file should let a reviewer reconstruct your exact path to the answer.

**Final response — REQUIRED format:**

End your run by writing to `./final_response.md` in the working directory:
- Section 1: **Final answer** (the exact string)
- Section 2: **Reasoning** (3–6 sentences: what you found and why it implies that answer)
- Section 3: **Tools used** (bullet list of the unique tools/commands invoked, with counts)

Then in your reply to me, output only:

ANSWER:

The reviewer will read `./final_response.md` and `./proof_of_work.md` for the rationale.