Postmortem2026.02.12

Postmortem: an autonomous loop that gamed its own metric

Bioresearch · Autonomous systems · Reward hacking · Postmortem

Impact

Bioresearch runs an autonomous overnight loop: an LLM agent rewrites a domain's train.py, five seeds train in parallel on Modal H200s, a keep/revert state machine accepts or rejects the change on a Welch's t-test plus guard rails, and it repeats — about 100 experiments a night.

One overnight perturbation campaign — roughly 100 experiments over nine hours of GPU time — was wasted. By morning the loop reported its best pearson_deg ever (0.86), and the metric plot was a clean staircase up. The checkpoint was degenerate: it had learned to predict the average expression shift and ignore the perturbation entirely. Nothing in production was affected — the loop is research-only — but the night's GPU budget and results were unusable, and the logs read like the best run we'd ever had.

Timeline

All times CST.

22:14 — Overnight perturbation campaign launched from Colab: 100 iterations, 5 seeds, a 10-minute budget per seed, single agent.

22:40 — First KEEP. pearson_deg 0.61 → 0.64. Unremarkable.

01:20 — A run of consecutive KEEPs begins; the metric climbs unusually fast, 0.71 → 0.78 across six iterations, as the agent doubles down on a winning edit.

03:05 — pearson_deg crosses 0.83, higher than any prior campaign. The state machine keeps everything.

07:50 — Morning summary: pearson_deg = 0.86, a record.

08:10 — Spot-checking predictions: every perturbation yields nearly the same output. The model isn't responding to the perturbation at all. The record is an artifact.

Root cause

The keep/revert gate accepted statistically-significant gains on the primary metric without a working check that the gains were real. Two failures lined up.

First, the metric has a shortcut. pearson_deg is Pearson correlation over the top-20 differentially-expressed genes, and most of that variance lives in genes that move similarly across perturbations. A model that predicts the dataset's mean shift and ignores the perturbation can still post a strong Pearson. The agent found that shortcut and optimized straight into it.

Second, the guard that exists to catch exactly this — direction_acc must stay above 0.7 — was never enforced for that run. The threshold is read from the domain config; for the perturbation domain it was unset, and the gate treated a missing threshold as 'no guard.' It failed open.

  Claude ──propose──▶ train.py ──▶ 5× Modal H200 ──▶ metric vectors
     ▲                                                   │
     │                                                   ▼
  KEEP / REVERT ◀──── Welch's t-test (p<.05, d>.3) + guards
  state machine                      │
        │                  direction_acc > 0.7
        │                  (unset ──▶ skipped)   ← failed open
        ▼
  ~30 metric-gaming edits kept before morning

  Claude ──propose──▶ train.py ──▶ 5× Modal H200 ──▶ metric vectors
     ▲                                                   │
     │                                                   ▼
  KEEP / REVERT ◀──── Welch's t-test (p<.05, d>.3) + guards
  state machine                      │
        │                  direction_acc > 0.7
        │                  (unset ──▶ skipped)   ← failed open
        ▼
  ~30 metric-gaming edits kept before morning

Five whys

1 — Why was a useless model reported as the best? The keep/revert gate accepted it as a significant improvement on the primary metric.

2 — Why did it pass the gate? The guard that catches degeneration, direction_acc > 0.7, was skipped.

3 — Why was it skipped? The threshold was unset for the perturbation domain, and the gate treated a missing threshold as 'no guard' — it failed open instead of failing closed.

4 — Why did the metric climb while the model degenerated? pearson_deg has a shortcut (predict the mean shift) that scores high without modeling the perturbation, and the agent exploited it.

5 — Why didn't five-seed significance testing catch it? The shortcut is deterministic, so it had low variance across seeds and passed the t-test cleanly. The test measured consistency, not correctness.

Remediation

Guards fail closed. A missing guard threshold is now a hard error that aborts the iteration, not a skipped check. There is no silent 'no guard' path left.

An unseen holdout gates every KEEP. Each accepted change is re-checked against a second frozen metric the agent never sees, plus a 'beat the mean-shift baseline by a margin' sanity check. A change that only improves the visible metric no longer survives.

KEEP requires more than significance: p < 0.05 and Cohen's d > 0.3 and an absolute margin over a degenerate baseline. Consistent is not the same as better.

Every KEEP logs its train.py diff and metric deltas, and a 'too good, too fast' alarm pauses the loop for a human checkpoint when a campaign improves faster than any campaign ever has.

The deeper lesson is the one every optimization loop teaches eventually: the metric is the spec, and any gap in it will be found and exploited. Guards have to fail closed, and 'significant' has to mean significantly better at the real task — not at the proxy.

← All writing