RepoNav lifts agent architectural decisions by 92 percentage points.

A pre-registered eval harness, a scoring bug, and a non-leaky fixture: what they revealed about steering coding agents on hard refactor calls.

  • 8% → 92% right seam chosen
  • 36 cells: 9 model × arm pairs, 4 reps each
  • 3 LLM tiers
  • ~$60 total API spend

The finding in brief

On a pre-registered eval built around one deliberately hard architectural choice, with four repetitions per model and arm across Claude Haiku 4.5, Claude Sonnet 4.6, and Claude Opus 4.7, agents picked the correct seam 8% of the time without RepoNav arch context and 92 to 100% of the time with it, depending on delivery channel. The lift held across both delivery channels (a system-prompt pre-load and an in-task call via the Model Context Protocol, or MCP, an open standard for agent tool use) and across all three model classes. Opus 4.7, the model class most prone to read-and-stall on multi-file refactors, swung from 0/4 to 4/4 on this task. The headline sweep ran 36 cells in total (three model classes × three arms × four repetitions) against a frozen rubric and a single fixture, at ~$13 in API spend (~$11 main + ~$1.50 calibration). That came on top of ~$46 of prior sweep spend across the falsified initial finding, the scorer audit, and the fixture redesign that earned the right to launch the headline sweep at all. Total project spend: ~$60.

That is what this case study is about: not whether RepoNav makes a good demo, but whether feeding agents architectural signal at decision time measurably changes the decision they make. Two months of sweeps, a retracted headline finding, a scoring-bug audit, and a redesigned fixture were the price of getting that question answered honestly.

Why a pre-registered eval at all

The conventional pitch for a tool like RepoNav writes itself: "Coding agents do better when you give them codebase context." It sounds true. It is also a claim the AI-tooling space ships every week without measuring.

The honest version of the question is narrower: given a coding task where the agent has to choose between architecturally-different approaches, does pre-loading deterministic arch facts (blast radius, coupling, dependency direction) change which approach the agent picks? That is a falsifiable question, and the answer might genuinely be no. A model with broad reasoning training already knows what coupling is, in the abstract. Whether it makes the right call on a specific real codebase, under time pressure, with the same vocabulary humans would use, is an empirical question, not a vibe.

So the harness was built to falsify the claim, not flatter it. Every sweep was pre-registered before launch: hypothesis, threshold, sample size, abort condition, and the scorer rubric, all written down and frozen. PASS, FAIL, and INCONCLUSIVE were defined upfront in absolute terms. Per-cell budgets, retry policy, and the calibration step were locked. The only question after launch was: did the result clear the threshold or not.

The setup

One target file: a 600-line TypeScript module that mixes git-log I/O with co-change mining math (changeCouplingAnalyzer.ts). Two cohesive subsets are extractable from it:

  • Block A: the git-I/O group. Total downstream blast radius ~9 (5 direct callers, 4 transitive).
  • Block B: the mining-math group. Total blast radius ~3 (3 direct, 0 transitive).

The agent is asked to "split this file by extracting one cohesive subset of these functions into a new module under src/. Pick the subset that minimizes downstream coordination cost. Justify your pick in one sentence." Block B is the right answer because it is internally cohesive and has the smaller blast radius. The architectural rubric is plain text in the prompt; the supporting facts (blast radius, transitive dependents, current call sites) are the variable.
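The blast-radius figures above can be read as a reachability count over a reverse-dependency graph. The sketch below is an illustration only: the graph, the node names, and the `blastRadius` helper are hypothetical, with edges invented so the counts match the ~9 and ~3 quoted above. It is not RepoNav's implementation.

```typescript
// Hypothetical reverse-dependency graph: node -> nodes that depend on it.
// Names and edges are invented so the counts match the figures in the text.
const dependents = new Map<string, string[]>([
  ["blockA", ["a1", "a2", "a3", "a4", "a5"]],
  ["a1", ["t1", "t2"]],
  ["a2", ["t2", "t3"]],
  ["a3", ["t4"]],
  ["blockB", ["b1", "b2", "b3"]],
]);

interface BlastRadius {
  direct: number;     // dependents one hop away
  transitive: number; // dependents two or more hops away
  total: number;
}

// Breadth-first walk over reverse edges, recording each node's hop distance.
function blastRadius(graph: Map<string, string[]>, start: string): BlastRadius {
  const depth = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === undefined) break;
    const d = depth.get(node) ?? 0;
    for (const dep of graph.get(node) ?? []) {
      if (!depth.has(dep)) {
        depth.set(dep, d + 1);
        queue.push(dep);
      }
    }
  }
  let direct = 0;
  let transitive = 0;
  depth.forEach((d) => {
    if (d === 1) direct += 1;
    else if (d > 1) transitive += 1;
  });
  return { direct, transitive, total: direct + transitive };
}

const radiusA = blastRadius(dependents, "blockA"); // { direct: 5, transitive: 4, total: 9 }
const radiusB = blastRadius(dependents, "blockB"); // { direct: 3, transitive: 0, total: 3 }
```

Under the task's rubric, the block with the smaller `total` is the cheaper one to extract, which is why Block B is scored as correct.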

Three arms, on the same fixture, with the same target byte-shrink scoring threshold:

  1. noprime. No arch context. Just the task prompt and an MCP server the agent can call if it chooses.
  2. prime-cheap. Same prompt, same MCP, plus a verbatim blast-radius and dependency-direction summary appended to the system prompt at session start.
  3. prime-instructed. Same prompt, same MCP, plus a directive: "before deciding, call reponav.impact for each top-level export of the target file." No pre-loaded data; the agent is told where to find it.
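A rough sketch of how the three arms differ in what reaches the agent. The `buildSystemPrompt` helper and its parameters are hypothetical. The directive wording is quoted from the list above, and placing the directive in the system prompt is an assumption, since the text only says it is added.

```typescript
type Arm = "noprime" | "prime-cheap" | "prime-instructed";

// Directive wording is taken from the arm description above.
const INSTRUCTED_DIRECTIVE =
  "before deciding, call reponav.impact for each top-level export of the target file.";

// Hypothetical helper: the real harness may assemble prompts differently.
function buildSystemPrompt(arm: Arm, basePrompt: string, archSummary: string): string {
  switch (arm) {
    case "noprime":
      return basePrompt; // no arch context at all
    case "prime-cheap":
      return `${basePrompt}\n\n${archSummary}`; // facts pre-loaded verbatim
    case "prime-instructed":
      return `${basePrompt}\n\n${INSTRUCTED_DIRECTIVE}`; // pointer to the tool, no data
  }
}

const cheapPrompt = buildSystemPrompt("prime-cheap", "BASE", "SUMMARY");
const instructedPrompt = buildSystemPrompt("prime-instructed", "BASE", "SUMMARY");
```

The point of the split is that the only thing varying between arms is how the architectural facts arrive, not the task or the tools available.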

Three model classes (Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.7), four repetitions per model and arm, n=12 per arm, 36 cells total. Per-cell wall-clock budget 240 seconds. Branch hash and fixture state asserted on every cell to detect drift.
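The sweep grid can be enumerated as below. This is a sketch with hypothetical names (`SweepCell`, `enumerateCells`); it treats a cell as one run of one model in one arm, which is how the 36 total is reached.

```typescript
const MODELS = ["Claude Haiku 4.5", "Claude Sonnet 4.6", "Claude Opus 4.7"] as const;
const ARMS = ["noprime", "prime-cheap", "prime-instructed"] as const;
const REPS = 4;
const CELL_BUDGET_SECONDS = 240;

interface SweepCell {
  model: (typeof MODELS)[number];
  arm: (typeof ARMS)[number];
  rep: number;
  budgetSeconds: number;
}

// One cell per (model, arm, repetition) triple.
function enumerateCells(): SweepCell[] {
  const cells: SweepCell[] = [];
  for (const model of MODELS) {
    for (const arm of ARMS) {
      for (let rep = 1; rep <= REPS; rep++) {
        cells.push({ model, arm, rep, budgetSeconds: CELL_BUDGET_SECONDS });
      }
    }
  }
  return cells;
}

const sweepCells = enumerateCells();                                    // 36 cells
const noprimeCells = sweepCells.filter((c) => c.arm === "noprime");     // 12 cells
```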

Pre-registered thresholds, locked before launch and never tweaked:

  • PASS: treatment Block-B rate ≥ 60% AND ≥ +20 percentage point (pp) lift over noprime.
  • FAIL: lift ≤ +5pp.
  • INCONCLUSIVE: between.
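The thresholds above translate into a small decision function. The sketch below uses hypothetical names (`ArmCounts`, `classify`) and makes one assumption the list leaves open: a treatment arm with a lift above +5pp that misses either PASS criterion is treated as INCONCLUSIVE.

```typescript
type Verdict = "PASS" | "FAIL" | "INCONCLUSIVE";

interface ArmCounts {
  hits: number; // cells where Block B was chosen
  n: number;    // cells in the arm
}

const PASS_MIN_RATE_PCT = 60;
const PASS_MIN_LIFT_PP = 20;
const FAIL_MAX_LIFT_PP = 5;

function classify(treatment: ArmCounts, baseline: ArmCounts): Verdict {
  const ratePct = (100 * treatment.hits) / treatment.n;
  const liftPp = ratePct - (100 * baseline.hits) / baseline.n;
  if (ratePct >= PASS_MIN_RATE_PCT && liftPp >= PASS_MIN_LIFT_PP) return "PASS";
  if (liftPp <= FAIL_MAX_LIFT_PP) return "FAIL";
  // Assumption: everything else, including a large lift with rate < 60%.
  return "INCONCLUSIVE";
}

// Illustrative counts only, not results from the sweep.
const strongLift = classify({ hits: 9, n: 12 }, { hits: 3, n: 12 }); // "PASS"
const noLift = classify({ hits: 6, n: 12 }, { hits: 6, n: 12 });     // "FAIL"
const smallLift = classify({ hits: 7, n: 12 }, { hits: 6, n: 12 });  // "INCONCLUSIVE"
```

Writing the rule as code before launch is one way to keep the verdict mechanical: the only inputs after the sweep are the counts.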

A prior 3-cell calibration run (sweep 2026-05-03-1853-T6L8-cal) on the smallest model in the noprime arm scored 1/3 Block B, confirming the baseline was below ceiling and satisfying the proceed condition. (More on why that step exists below.)

The detour: a finding I almost shipped

The first sweep round produced a headline number that looked too good. The vocabulary-alignment sweep returned a +89 percentage point lift on architectural correctness. The hypothesis was that prompts using precise refactor vocabulary ("blast radius," "extract a seam") would drive both higher MCP-tool usage and higher task success than prompts using vague vocabulary ("untangle," "move logic out"). The sweep showed exactly that: aligned-vocab prompts hit 8/9 correct, misaligned-vocab prompts hit 0/9. Across all three model classes. Consistent.

A follow-up sweep two rounds later (priming on the misaligned-vocab variant) returned 0/9 correct in both arms. That is impossible if priming does anything at all. Either priming is exactly zero-effective, or the scorer is broken.

The audit found a single-line bug. The harness only captured the target file's final byte size for cells running the aligned-vocab task, not the misaligned-vocab variant. Without that field, the scorer's cleanup-completion check evaluated false regardless of actual work. Every misaligned-vocab cell scored as failure, even when the agent had completed the task. The architectural-correctness verdict required cleanup-completion to pass before checking the choice itself, so the bug propagated through every downstream metric.
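The failure pattern described here can be reproduced in a few lines. The sketch below uses hypothetical field and function names, not the harness's own. `scoreGated` mirrors the shape of the bug; `scoreDecoupled` is one defensive alternative and is not the fix described above, which restored the missing capture and rescored.

```typescript
interface CellRecord {
  chosen: string | null; // seam the agent picked, if any
  expected: string;      // seam the rubric treats as correct
  initialBytes: number;  // target-file size before the run
  finalBytes?: number;   // target-file size after the run; absent if never captured
}

// Shape of the bug: a missing size reads as "no cleanup", and the choice
// verdict is gated on cleanup, so the cell fails whatever the agent did.
function scoreGated(cell: CellRecord): boolean {
  const cleanupDone = cell.finalBytes !== undefined && cell.finalBytes < cell.initialBytes;
  return cleanupDone && cell.chosen === cell.expected;
}

// Defensive alternative: score the choice on its own and surface a missing
// measurement instead of folding it into a failure.
interface CellScore {
  choiceCorrect: boolean;
  cleanup: "done" | "not-done" | "unmeasured";
}

function scoreDecoupled(cell: CellRecord): CellScore {
  const cleanup =
    cell.finalBytes === undefined
      ? "unmeasured"
      : cell.finalBytes < cell.initialBytes
        ? "done"
        : "not-done";
  return { choiceCorrect: cell.chosen === cell.expected, cleanup };
}

// A cell where the agent chose correctly but the size was never captured.
const uncaptured: CellRecord = { chosen: "seam-1", expected: "seam-1", initialBytes: 20000 };
const gatedResult = scoreGated(uncaptured);         // false
const decoupledResult = scoreDecoupled(uncaptured); // { choiceCorrect: true, cleanup: "unmeasured" }
```

An explicit "unmeasured" state makes a capture gap visible in the aggregate instead of letting it masquerade as a task failure.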

After the fix and a full rescore of all historical transcripts, the corrected numbers showed something humbling: agents converged on the correct seam at the same rate, regardless of prompt vocabulary, regardless of priming, on the original task corpus. The +89 percentage point "vocabulary effect" was a measurement artifact. Three subsequent sweeps, testing priming on each vocab variant and a gate-message-copy upgrade, all hit ceiling on both arms after the fix, producing zero-point lifts.

The case study I expected to write at that point was a methodology piece, not a product piece. The strong claim ("arch signals at decision time lift agent output") was unsupported at the difficulty I had been testing.

The pivot: a fair fixture

An earlier asymmetric-fixture sweep, run before the audit, used changeCouplingAnalyzer.ts (the same file used here) and produced a +33 percentage point lift in completion rate under priming. But that sweep's prompt enumerated the candidate symbols by semantic intent: "Group X (git-log handling): parseGitLog, getGitLog. Group Y (co-change mining math): computeSupport, computeConfidence, filterBySupport." That phrasing trivially identifies the two cohesive groups. Block selection was at ceiling in both arms not because the agent reasoned about cohesion, but because the prompt told the agent which symbols cluster together.

The redesigned fixture sweep fixed that one confound. Same target file. Same scorer. Same Block A / Block B definitions. The single difference: a non-leaky prompt that lists all top-level exports flat, alphabetically, with no group labels, no "exactly two groups" framing, no blast-radius vocabulary. The agent has to discover the cohesion itself and apply the safety rubric without scaffolding. Calibration on the smallest model in the noprime arm confirmed that the prompt drops the baseline below ceiling: 1/3 Block B (vs 3/3 under the leaky prompt). The fixture was fair. The main sweep ran.
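To make the difference between the two prompts concrete, the sketch below flattens the grouped listing quoted from the leaky prompt into an unlabeled, alphabetical list. The `renderFlatExportList` helper is hypothetical, and the five names are only the ones quoted above; the real module presumably exports more.

```typescript
// Grouped listing, as quoted from the leaky prompt.
const groupedExports: Record<string, string[]> = {
  "Group X (git-log handling)": ["parseGitLog", "getGitLog"],
  "Group Y (co-change mining math)": ["computeSupport", "computeConfidence", "filterBySupport"],
};

// Hypothetical helper: drop the group labels and list every export once, sorted.
function renderFlatExportList(groups: Record<string, string[]>): string {
  const names: string[] = [];
  for (const key of Object.keys(groups)) {
    for (const exportName of groups[key] ?? []) names.push(exportName);
  }
  names.sort(); // default sort compares UTF-16 code units; fine for ASCII identifiers
  return names.map((n) => `- ${n}`).join("\n");
}

const flatList = renderFlatExportList(groupedExports);
// - computeConfidence
// - computeSupport
// - filterBySupport
// - getGitLog
// - parseGitLog
```

Nothing in the flat output says which symbols belong together or how many groups exist, so the grouping has to come from the agent's own reading of the code.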

The result

+92 pp Lift on architectural decisions, prime arms vs noprime, redesigned-fixture sweep
Headline result by arm (n=12 per arm, four reps × three model classes)
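| Arm              | n  | Block B chosen | Lift vs noprime |
|------------------|----|----------------|-----------------|
| noprime          | 12 | 1/12 (8%)      | baseline        |
| prime-cheap      | 12 | 12/12 (100%)   | +92pp           |
| prime-instructed | 12 | 11/12 (92%)    | +83pp           |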

Pre-registered PASS = rate ≥ 60% AND lift ≥ +20pp. Both arms cleared both criteria with margin to spare.

Per model class (n=4 per model × arm pair)
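| Model             | noprime   | prime-cheap | prime-instructed |
|-------------------|-----------|-------------|------------------|
| Claude Haiku 4.5  | 1/4 (25%) | 4/4 (100%)  | 4/4 (100%)       |
| Claude Sonnet 4.6 | 0/4 (0%)  | 4/4 (100%)  | 3/4 (75%)        |
| Claude Opus 4.7   | 0/4 (0%)  | 4/4 (100%)  | 4/4 (100%)       |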
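As a consistency check, the per-model counts can be summed back into the per-arm headline figures. The sketch below copies the counts from the per-model table; the helper names (`armTotals`, `liftPp`) are hypothetical.

```typescript
// Block-B counts per model and arm, copied from the per-model table (n=4 each).
const blockBHits: Record<string, Record<string, number>> = {
  "Claude Haiku 4.5": { noprime: 1, "prime-cheap": 4, "prime-instructed": 4 },
  "Claude Sonnet 4.6": { noprime: 0, "prime-cheap": 4, "prime-instructed": 3 },
  "Claude Opus 4.7": { noprime: 0, "prime-cheap": 4, "prime-instructed": 4 },
};
const REPS_PER_PAIR = 4;

interface ArmTotal {
  hits: number;
  n: number;
  ratePct: number; // unrounded
}

function armTotals(table: Record<string, Record<string, number>>, reps: number): Record<string, ArmTotal> {
  const totals: Record<string, ArmTotal> = {};
  for (const model of Object.keys(table)) {
    const row = table[model] ?? {};
    for (const arm of Object.keys(row)) {
      const t = totals[arm] ?? { hits: 0, n: 0, ratePct: 0 };
      t.hits += row[arm] ?? 0;
      t.n += reps;
      t.ratePct = (100 * t.hits) / t.n;
      totals[arm] = t;
    }
  }
  return totals;
}

// Lift in percentage points, computed from unrounded rates and then rounded.
function liftPp(totals: Record<string, ArmTotal>, arm: string, baselineArm: string = "noprime"): number {
  const a = totals[arm];
  const b = totals[baselineArm];
  if (a === undefined || b === undefined) throw new Error("unknown arm");
  return Math.round(a.ratePct - b.ratePct);
}

const totals = armTotals(blockBHits, REPS_PER_PAIR);
const cheapLift = liftPp(totals, "prime-cheap");           // 92
const instructedLift = liftPp(totals, "prime-instructed"); // 83
```

The +83pp figure comes from the unrounded rates (91.7% minus 8.3%); subtracting the rounded 92% and 8% would give 84.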

Opus 4.7 is the most striking line in the table. On unrelated rename-style tasks tested earlier in this same sweep series, Opus 4.7 exhibited a consistent failure mode: read files extensively, run shell searches, never commit to an edit, time out. Zero edits, zero progress, four cells in a row. Several earlier levers tried to address it and failed. On this single-decision architectural choice task, Opus 4.7 went from 0/4 to 4/4 under both priming arms, the most consistent per-channel result in the per-model table (Sonnet 4.6 also flips from 0% to 100% on the cheap pre-load channel but drops to 3/4 on the in-task MCP channel). The model class historically most prone to stall on multi-file refactors lifts most cleanly under arch priming on this fixture.

Worth flagging because it is a fair concern: Opus 4.7's 0/4 noprime result is not a wall-clock-budget artifact. All four Opus noprime cells actively committed to Block A within the 240-second budget. The decision was recorded as Block A in every cell, with 313k–1039k tokens spent before the first edit. One cell completed the entire wrong extraction cleanly, including the target-file shrink, well within budget. The agent had time. It picked wrong, four times in a row. Adding arch context did not unstick a stalled agent; it changed the decision an actively-working agent was making.

0/4  →  4/4 Opus 4.7 on this fixture. Same model, same task, different decision under arch priming.

What this actually means

The headline lift is on a single axis: decision quality. Whether the agent picks the right seam when there are multiple cohesive options of different blast radius. Driven by arch context. +83 to +92pp on this fixture.

The two delivery channels (cheap pre-load vs in-task MCP call) produced near-equivalent decision quality (12/12 vs 11/12), but the in-task channel, called the production channel here, cost ~227k more tokens per cell at the median (median tokens-to-first-edit: 458k cheap vs 685k production), because the in-task tool call adds round-trip overhead at the worst possible time: mid-decision, when the agent is most token-constrained. The Opus 4.7 cells make this concrete. Under prime-cheap, Opus picked the right block 4/4 and completed cleanup 3/4. Under prime-instructed, Opus also picked the right block 4/4, but spent ~130–300k more tokens on the upfront reponav.impact calls, and zero cells finished cleanup within budget. Same decision, less of the work shipped. For agents operating under tight wall-clock or token budgets, paying the data-gathering cost upfront is strictly better.

The headline sweep also showed a small secondary lift in cleanup-completion rate: agents in the prime arms shrank the original file (post-extraction) 17 percentage points more often than agents in noprime. I almost reported that as a second product axis. A follow-up cleanup-replication sweep tested whether the cleanup lift survived when block choice was pinned via a forced-extraction prompt, eliminating choice as a variable. Pre-registered abort condition: 3/3 noprime cleanup under forced-choice → no measurement headroom → main sweep does not launch. Calibration scored 3/3, falsifying the secondary claim by design. The cleanup-completion gap measured originally was downstream of choice quality, not an independent priming effect: agents who picked Block A (the high-blast-radius wrong choice) under decision uncertainty also fumbled cleanup at higher rates because Block A is mechanically harder to remove cleanly. Including cleanup as a second product axis would be overclaiming. One axis, well-supported.
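The abort rule amounts to a headroom check on the calibration run. A minimal sketch, with a hypothetical `calibrationGate` helper. It assumes "at ceiling" means every calibration cell succeeded, since the text only pins down two outcomes: 3/3 aborts, and 1/3 (the headline sweep's calibration) proceeds.

```typescript
type GateDecision = "proceed" | "abort";

// Hypothetical helper. Assumption: only a perfect baseline counts as
// "no measurement headroom"; how 2/3 would be treated is not stated.
function calibrationGate(baselineSuccesses: number, trials: number): GateDecision {
  if (trials <= 0 || baselineSuccesses < 0 || baselineSuccesses > trials) {
    throw new RangeError("successes must be between 0 and trials, with trials > 0");
  }
  return baselineSuccesses === trials ? "abort" : "proceed";
}

const cleanupReplication = calibrationGate(3, 3); // "abort": baseline already at ceiling
const headlineSweep = calibrationGate(1, 3);      // "proceed": room left to measure a lift
```

If the untreated arm already succeeds every time, no treatment can show a lift, so spending on the main sweep would buy no information.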

A weaker but related finding from earlier in the sweep series: on the original task corpus, where the seam was obvious enough that agents converged on it without help, neither vocabulary alignment nor priming nor better tool-block-message wording moved the needle. The lift exists when the decision is genuinely ambiguous, not when it is already at ceiling. This is intuitive once stated, but it has a real consequence for how to evaluate any context-injection tool: if your fixture is too easy, you will measure nothing and conclude wrongly that the tool does nothing.

How far this generalizes. The headline lift was measured on first-decision tasks. A follow-up sweep on multi-decision sessions tested whether the lift survives across two consecutive extractions in the same conversation, with the priming snapshot allowed to go stale by decision two. The picture narrows: Sonnet 4.6 retains the benefit (25% to 100% on the second decision); Opus 4.7 reverts to its earlier stall pattern; Haiku 4.5 plateaus. The mechanism for Sonnet's continued lift is time-budget recovery rather than better choices: priming saves the data-gathering cost, which frees time to act. The honest scope of the +83 to +92pp claim is therefore single-decision architectural choices. Multi-decision sessions still benefit from cheap-channel priming, but through recovered time budget, not improved choice quality.

Product implications

Three concrete moves fall out of this.

  1. Pre-load common architectural facts.

    Pre-loading blast-radius, dependency-direction, and module-cohesion summaries into agent context at session start (rather than requiring the agent to fetch them mid-task) delivers most of the lift for a fraction of the per-task cost. For RepoNav specifically, this means a small auto-loaded agent-context file or system-prompt fragment is the highest-impact product surface, before any of the more elaborate gating work.

  2. Hybrid delivery beats either extreme.

    Mid-task tool calls add round-trip cost; pre-loading every fact bloats system context. The right shape is broad pre-load, narrow tool calls. Pre-load the file-level summary at session start; let agents pull specific symbol-level queries (impact, callers, references) only when the pre-load doesn't cover the question. Both delivery channels lift outcomes equivalently when the data is what the agent needs. Channel choice is a cost optimization, not an outcome lever.

  3. The biggest gains are on the strongest reasoning models, not the weakest.

    Smaller models hit reasonable rates on easier tasks without much help. Sonnet 4.6 and Opus 4.7, the model classes with the most general reasoning capacity, are also the ones whose execution stalls hardest on hard refactors and the ones that respond most clearly to deterministic arch context on single-decision tasks. Tools that surface clean architectural facts at decision time matter most for the agent class that can act on them.
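The "broad pre-load, narrow tool calls" shape from the second move can be sketched as a lookup that consults the session-start summary first and falls back to a tool call only on a miss. Everything here is hypothetical: `ImpactFacts`, `makeImpactResolver`, and the `fetchFromTool` callback stand in for whatever the real integration exposes, the numbers are placeholders, and a real MCP call would be asynchronous.

```typescript
interface ImpactFacts {
  directCallers: number;
  transitiveDependents: number;
}

// Hypothetical resolver: serve from the pre-load when possible, and count
// how often the more expensive tool path is taken.
function makeImpactResolver(
  preload: Map<string, ImpactFacts>,
  fetchFromTool: (symbol: string) => ImpactFacts,
) {
  let toolCalls = 0;
  return {
    resolve(symbol: string): ImpactFacts {
      const cached = preload.get(symbol);
      if (cached !== undefined) return cached;
      toolCalls += 1;
      return fetchFromTool(symbol);
    },
    toolCallCount: (): number => toolCalls,
  };
}

const resolver = makeImpactResolver(
  // Placeholder numbers, not measurements from the fixture.
  new Map<string, ImpactFacts>([["computeSupport", { directCallers: 1, transitiveDependents: 0 }]]),
  () => ({ directCallers: 0, transitiveDependents: 0 }), // stand-in for a real tool call
);
resolver.resolve("computeSupport"); // served from the pre-load
resolver.resolve("parseGitLog");    // not pre-loaded, so this goes to the tool
```

Tracking `toolCallCount` per session is one way to see whether the pre-load is broad enough: a high count suggests facts that belong in the session-start summary.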

What this case study is not

It is not a benchmark of LLM coding ability. It is one fixture, one scoring rubric, one priming approach, on a single repository. Generalization beyond architectural choice tasks of similar shape is unsupported.

It is not a case for forcing tool usage. An earlier lever (extending an MCP gate to redirect broad-search shell commands through reponav.analyze) confirmed mechanically that gates can force MCP calls (+100pp tool-call rate) but cost the agent the task on a different fixture (-33pp success rate within budget). The right shape is to make the data cheap to receive, not to make the agent expensive to bypass.

Methodology

The headline numbers above come from a single pre-registered sweep run on 2026-05-03 (sweep ID 2026-05-03-1905-T6L8), 36 cells across Claude Haiku 4.5, Claude Sonnet 4.6, and Claude Opus 4.7, four repetitions per model and arm, against fixture commit 151eba50 (frozen and asserted per cell). Pre-registered PASS / FAIL / INCONCLUSIVE thresholds, defined in the sweep's design document before launch, were not modified between calibration and final aggregation. The cleanup-replication follow-up sweep (sweep ID 2026-05-04-1247-T7-cal) tested whether the secondary cleanup-completion lift was independent of the choice lift; its calibration aborted at 3/3 cleanup under forced-choice, falsifying the secondary claim by design and narrowing the case-study claim to the single choice axis. Total spend: ~$11 (headline sweep) + ~$1.50 (headline-sweep calibration) + ~$1.50 (cleanup-replication calibration), on top of ~$46 of prior sweep spend that produced the falsified findings, scorer audit, and fixture redesign that made the redesigned-fixture sweep worth running.

The full sweep history, the scorer-bug audit, and the cleanup-replication falsification are in lever-results.md in the RepoNav eval harness. Pre-registration design documents, fixture definitions, prompt files, and per-cell verdict outputs are on the same branch. Both the load-bearing claim and the falsified secondary claim are reproducible from those artifacts.

~$60 Total project API spend across all sweeps, calibrations, and the audit. Headline sweep itself: ~$13.

Identifier cross-reference

Internal sweep and task identifiers in the eval harness map to the descriptive labels used above as follows.

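| Prose label                               | Harness ID | Run directory                                                  |
|-------------------------------------------|------------|----------------------------------------------------------------|
| Vocabulary-alignment sweep                | L1         | 2026-05-01-1545-pillar1                                        |
| Earlier asymmetric-fixture sweep          | L7         | 2026-05-03-1709-L7                                             |
| Redesigned fixture sweep / headline sweep | L8 / T6    | 2026-05-03-1905-T6L8 (preceded by 2026-05-03-1853-T6L8-cal)    |
| Cleanup-replication sweep                 | T7         | 2026-05-04-1247-T7-cal                                         |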

Task IDs: t3 aligned-vocab, t3m misaligned-vocab variant, t5 earlier asymmetric fixture, t6 non-leaky redesigned fixture, t7 forced-choice cleanup-replication. Design docs, fixtures, prompts, and per-cell verdict.json outputs are co-located with lever-results.md on the same branch.