Open benchmark

Same data. Same model.
Two interfaces.

Two agent-facing interfaces over one fixed corpus: 875 Yanex ML-experiment runs carrying 806k metric rows. The same gpt-5.4-mini model answered the same five questions through raw SQLite (live schema discovery, read-only SQL, grep_blob) and through the NoKV namespace (seven verbs) — 10 repeats per arm and task, 100 fully stateless episodes, judged against deterministic gold facts neither arm can see.

corpus: 875 Yanex runs · 806k metric rows
model: gpt-5.4-mini
arms: sqlite_raw_v1 · nokv_native_v1
repeats: 10 per arm × task
episodes: 100 · fully stateless
judge: deterministic gold facts, invisible to both arms
1.83× fewer prompt tokens (set mean, raw SQLite → NoKV: 151,572 → 82,827)
1.63× lower cost (set mean, raw SQLite → NoKV: $0.0708 → $0.0433)
2.39× fewer prompt tokens on compound exploration (T1 + T3 + T5)
4.5 / 5 tasks solved correctly (raw SQLite: 4.4 / 5)

By task

Where the bill actually comes from

10-repeat means per task and arm, from the published table. Switching the measure from prompt tokens to all-in cost barely changes the ordering.

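Prompt tokens, 10-repeat mean per task and arm (chart axis ticks: 21k, 42k, 63k, 85k)

task                          NoKV      SQL
T1 Sweep report               7,914     23,626
T2 Fidelity leaderboard       9,314     4,787
T3 Checkpoint provenance      35,778    84,607
T4 Detection method audit     20,213    19,334
T5 Interrupt triage           9,608     19,217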
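The headline ratios can be checked against the per-task means published on this page. The sketch below is a minimal recomputation, assuming the "set mean" figures are the sums of the five per-task means; the dictionary layout and helper names are illustrative, not part of the harness.

```python
# Per-task 10-repeat means copied from the published per-task tables.
# Tuple layout is an assumption of this sketch:
# (correct_rate, prompt_tokens, cost_usd)
MEANS = {
    "T1": {"nokv": (1.0, 7914, 0.0058), "sql": (0.5, 23626, 0.0132)},
    "T2": {"nokv": (1.0, 9314, 0.0066), "sql": (1.0, 4787, 0.0062)},
    "T3": {"nokv": (0.6, 35778, 0.0178), "sql": (1.0, 84607, 0.0322)},
    "T4": {"nokv": (0.9, 20213, 0.0081), "sql": (0.9, 19334, 0.0087)},
    "T5": {"nokv": (1.0, 9608, 0.0050), "sql": (1.0, 19217, 0.0104)},
}

CORRECT, TOKENS, COST = 0, 1, 2  # column indexes into the tuples above


def total(arm, column, tasks=None):
    """Sum one column of the per-task means for one arm."""
    tasks = tasks or list(MEANS)
    return sum(MEANS[task][arm][column] for task in tasks)


def ratio(column, tasks=None):
    """SQL total divided by NoKV total for one column."""
    return total("sql", column, tasks) / total("nokv", column, tasks)


token_ratio = ratio(TOKENS)
cost_ratio = ratio(COST)
compound_ratio = ratio(TOKENS, ["T1", "T3", "T5"])
nokv_solved = total("nokv", CORRECT)
sql_solved = total("sql", CORRECT)

print(f"prompt tokens: {token_ratio:.2f}x")
print(f"cost: {cost_ratio:.2f}x")
print(f"compound exploration (T1 + T3 + T5): {compound_ratio:.2f}x")
print(f"tasks solved: NoKV {nokv_solved:.1f} / 5, SQL {sql_solved:.1f} / 5")
```

Because the inputs are already rounded, the recomputed totals (151,571 tokens and $0.0707 for SQL) can differ from the published set means in the last digit; the ratios still round to 1.83, 1.63 and 2.39.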

Tool-call replay

Same task, two interfaces — call by call

One representative episode per arm, replayed straight from the committed telemetry. The interface shape is the cost: count the calls each arm needs before it can answer.

T1: Find the 5 best completed training runs by minimum val_loss; report learning rate, batch size, stdout size, and git state for each. 10-repeat verdict: Win

Raw SQLite · 6 calls
calls: 6/6 · est. result tok: 10,437 · cost share: 76%
  1. show_schema · 1 ms · 0 B · ≈870 tok

    arguments — 280-char telemetry extract

    {}

    status: success · bytes read: 0 B · est. result tokens: ≈870

  2. query_sql · 46 ms · 0 B · ≈261 tok

    arguments — 280-char telemetry extract

    {"sql":"WITH completed AS (\n  SELECT experiment_id\n  FROM experiments\n  WHERE script_path = 'train.py' AND status = 'completed'\n), min_losses AS (\n  SELECT m.experiment_id, MIN(m.value) AS min_val_loss\n  FROM metrics m\n  JOIN completed c ON c.experiment_id = m.experiment_…

    status: success · bytes read: 0 B · est. result tokens: ≈261

  3. query_sql · 1 ms · 0 B · ≈16 tok

    arguments — 280-char telemetry extract

    {"sql":"SELECT experiment_id, param_path, value_json FROM params WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') AND param_path IN ('learning_rate','batch_size') ORDER BY experiment_id, param_path;"}

    status: success · bytes read: 0 B · est. result tokens: ≈16

  4. query_sql · 1 ms · 0 B · ≈229 tok

    arguments — 280-char telemetry extract

    {"sql":"SELECT experiment_id, artifact_path, size_bytes FROM artifacts WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') AND artifact_path='stdout.txt';"}

    status: success · bytes read: 0 B · est. result tokens: ≈229

  5. query_sql · 10 ms · 0 B · ≈8,796 tok

    arguments — 280-char telemetry extract

    {"sql":"SELECT experiment_id, param_path, value_json, value_type FROM params WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') ORDER BY experiment_id, param_path LIMIT 200;"}

    status: success · bytes read: 0 B · est. result tokens: ≈8,796

  6. query_sql · 40 ms · 0 B · ≈265 tok

    arguments — 280-char telemetry extract

    {"sql":"WITH completed AS (\n  SELECT experiment_id\n  FROM experiments\n  WHERE script_path = 'train.py' AND status = 'completed'\n), min_losses AS (\n  SELECT m.experiment_id, MIN(m.value) AS min_val_loss\n  FROM metrics m\n  JOIN completed c ON c.experiment_id = m.experiment_…

    status: success · bytes read: 0 B · est. result tokens: ≈265

Final answer
{"runs":[{"batch_size":"64","experiment_id":"f9f8eed7","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.3683866411447525,"stdout_size_bytes":90675},{"batch_size":"64","experiment_id":"9b0fcb92","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.33762938901782036,"stdout_size_bytes":90848},{"batch_size":"64","experiment_id":"34716ceb","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.2749519720673561,"stdout_size_bytes":90992},{"batch_size":"64","experiment_id":"86d5cf02","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.2574565075337887,"stdout_size_bytes":…
judge: correct ✓
prompt tokens: 33,679 · all-in cost: $0.0180 · wall time: 12.6s

NoKV namespace · 4 calls
calls: 4/4 · est. result tok: 1,241 · cost share: 24%
  1. catalog · 32 ms · 0 B · ≈629 tok

    arguments — 280-char telemetry extract

    {"field_prefix":null,"include_facets":false,"path":"/yanex/runs"}

    status: success · bytes read: 0 B · est. result tokens: ≈629

  2. aggregate · 76 ms · 0 B · ≈26 tok

    arguments — 280-char telemetry extract

    {"group_by":["experiment_id"],"limit":5,"measures":[{"field":"val_loss","name":"min_val_loss","op":"min"}],"path":"/yanex/runs","predicates":[{"field":"script","op":"eq","value":"train.py"},{"field":"status","op":"eq","value":"completed"},{"field":"val_loss","op":"exists","value…

    status: success · bytes read: 0 B · est. result tokens: ≈26

  3. find · 77 ms · 0 B · ≈464 tok

    arguments — 280-char telemetry extract

    {"cursor":null,"facets":[],"fields":["run.id","run.name","run.status","run.script","param.origami.training.learning_rate","param.origami.training.batch_size","metric.val_loss.min","artifact.stdout_size_bytes","git.dirty"],"limit":5,"path":"/yanex/runs","predicates":[{"field":"ru…

    status: success · bytes read: 0 B · est. result tokens: ≈464

  4. aggregate · 70 ms · 0 B · ≈122 tok

    arguments — 280-char telemetry extract

    {"group_by":["run.id"],"limit":5,"measures":[{"field":"metric.val_loss.min","name":"min_val_loss","op":"min"}],"path":"/yanex/runs","predicates":[{"field":"run.script","op":"eq","value":"train.py"},{"field":"run.status","op":"eq","value":"completed"},{"field":"metric.val_loss.mi…

    status: success · bytes read: 0 B · est. result tokens: ≈122

Final answer
{"runs":[{"batch_size":"64","experiment_id":"f9f8eed7","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.3683866411447525,"stdout_size_bytes":90675},{"batch_size":"64","experiment_id":"9b0fcb92","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.33762938901782036,"stdout_size_bytes":90848},{"batch_size":"64","experiment_id":"34716ceb","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.2749519720673561,"stdout_size_bytes":90992},{"batch_size":"64","experiment_id":"86d5cf02","git_dirty":true,"learning_rate":"0.001","min_val_loss":-0.2574565075337887,"stdout_size_bytes":…
judge: correct ✓
prompt tokens: 7,140 · all-in cost: $0.0055 · wall time: 9.1s

Telemetry records tool inputs and the final answer — tool result bodies and model reasoning are not logged, so this is a call replay, not a transcript. Each column is one representative episode (the median-cost correct run of 10 repeats), so its totals differ from the 10-repeat means above.

Five tasks

Win, loss, mixed, tie — all of it published

The namespace wins compound exploration, not every query shape. T2 went to SQL on single-shot analytics, and T3 is genuinely mixed.

T1

Sweep report

train_top_configs_report
Win

“Find the 5 best completed training runs by minimum val_loss; report learning rate, batch size, stdout size, and git state for each.”

arm     ok      prompt tokens    cost
NoKV    100%    7,914            $0.0058
SQL     50%     23,626           $0.0132

The 806k-row min-per-run join goes wrong half the time on SQL — silently. On NoKV it is one catalog call plus one find.

T2

Fidelity leaderboard

eval_fidelity_leaderboard
Loss

“Among completed eval.py runs, find the 5 runs with the highest latest fidelity; report latest utility/detection/privacy metrics plus stderr size.”

arm     ok      prompt tokens    cost
NoKV    100%    9,314            $0.0066
SQL     100%    4,787            $0.0062

Single-shot analytics: a schema dump plus one SELECT, 4.8k tokens. SQL won this task, and we say so.

T3

Checkpoint provenance

tabdiff_ddxplus_dcr_checkpoint_provenance
Mixed

“For every TabDiff sampling run of the ddxplus_dcr dataset, report which checkpoint file the sampler loaded and the loaded model's parameter count (both only in stdout logs).”

arm     ok      prompt tokens    cost
NoKV    60%     35,778           $0.0178
SQL     100%    84,607           $0.0322

SQL got it right every time, at 2.4× the prompt tokens and 1.8× the cost. NoKV failed 4 of 10 repeats.

Costed on correct answers only, NoKV is still cheaper ($0.0297 vs $0.0322).
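The page does not spell out how "costed on correct answers only" is computed. One reading that reproduces the published figures is mean cost per episode divided by the correct rate, that is, the expected spend per accepted answer. The sketch below assumes that reading; the function name is illustrative.

```python
def cost_per_correct(mean_cost_usd: float, correct_rate: float) -> float:
    """Expected spend per judge-accepted answer.

    Assumption of this sketch: failed episodes still cost money, so the
    mean cost per episode is divided by the share of correct episodes.
    """
    if not 0 < correct_rate <= 1:
        raise ValueError("correct_rate must be in (0, 1]")
    return mean_cost_usd / correct_rate


# T3 figures from the table above.
nokv_t3 = cost_per_correct(0.0178, 0.60)
sql_t3 = cost_per_correct(0.0322, 1.00)

print(f"NoKV: ${nokv_t3:.4f} per correct answer")
print(f"SQL:  ${sql_t3:.4f} per correct answer")
```

Under this reading NoKV's 60% correct rate raises its effective T3 cost from $0.0178 to $0.0297, which is still below SQL's $0.0322.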

T4

Detection method audit

best_detection_eval_method_audit
Tie

“Find the completed eval.py run with the highest latest detection_roc_auc and report the detection method name exactly as its log states it.”

arm     ok      prompt tokens    cost
NoKV    90%     20,213           $0.0081
SQL     90%     19,334           $0.0087

Statistical tie: both arms at 90%, near-identical bills.

T5

Interrupt triage

cancelled_train_interrupt_triage
Win

“For every non-completed run: status, stderr size, whether stderr contains a KeyboardInterrupt, and the line number of its last occurrence.”

arm     ok      prompt tokens    cost
NoKV    100%    9,608            $0.0050
SQL     100%    19,217           $0.0104

grep returns line numbers as native citations; SQL pays roughly double to recover them.

Per run

All 100 runs, no averaging

Every episode from the committed telemetry: position is what the run cost, color is whether the judge accepted the answer.

[Figure: one dot per episode, one row per task (T1 Sweep report, T2 Fidelity leaderboard, T3 Checkpoint provenance, T4 Detection method audit, T5 Interrupt triage), split by arm (NoKV, SQL). x = all-in cost per run (USD, log scale; ticks at $0.005, $0.01, $0.02, $0.05, $0.1).]

Caveats

Read this before quoting the numbers

  • One model. Everything was measured on gpt-5.4-mini. A stronger model may shrink SQL's correctness gap on T1.
  • One corpus, five tasks, 10 repeats per arm×task. The task set is deliberately skewed toward compound exploration, because that is the workload NoKV targets.
  • NoKV is not perfect here. On T3 its correctness is 60% — 4 of 10 repeats failed: one cohort over-inclusion, one incomplete cohort, and two missed parameter extractions.
  • Dollars are dated; ratios are not. Pricing uses publication-time list prices (input $0.75/M, cached $0.075/M, output $4.50/M). The ratios are the durable conclusion — the dollar figures are not.
  • Cost-weighted tokens = uncached + 0.1×cached + 6×completion — the list-price ratios expressed in input-token units.
  • Fairness posture. Fully stateless episodes (harness-enforced and test-asserted); both arms have line-level body search (grep vs grep_blob); both arms see logically equivalent indexed facts; the judge's gold is invisible to both.
  • The SQLite arm is not naive. It includes ETL'd index tables and grep_blob — the residual difference is the interface shape itself.
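The pricing and cost-weighted-token bullets above describe the same arithmetic in two units. A minimal sketch, using the list prices quoted above; the function names and the example token counts are made up for illustration and are not taken from the telemetry.

```python
# Publication-time list prices quoted in the caveats, in USD per million tokens.
INPUT_PER_M = 0.75
CACHED_PER_M = 0.075
OUTPUT_PER_M = 4.50


def cost_weighted_tokens(uncached: int, cached: int, completion: int) -> float:
    """Token usage expressed in uncached-input-token units."""
    return uncached + 0.1 * cached + 6 * completion


def cost_usd(uncached: int, cached: int, completion: int) -> float:
    """All-in dollar cost of one episode at the list prices above."""
    return (
        uncached * INPUT_PER_M
        + cached * CACHED_PER_M
        + completion * OUTPUT_PER_M
    ) / 1_000_000


# Illustrative token counts only (hypothetical, not a real episode).
example = dict(uncached=10_000, cached=20_000, completion=500)

weighted = cost_weighted_tokens(**example)
dollars = cost_usd(**example)

# The two views agree: dollars = weighted tokens x input price per token.
print(weighted, dollars, weighted * INPUT_PER_M / 1_000_000)
```

The weights 0.1 and 6 are the cached and output prices divided by the input price, so a change in list prices that keeps those ratios rescales the dollar figures without changing the cost-weighted token counts.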

Check our work

There is no rerun button here — the harness needs the 2.3 GB corpus and an API key — but every published number is recomputable from the committed telemetry.
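As a rough idea of what recomputing a table row could look like, the sketch below groups per-episode records by task and arm and averages them. The telemetry format is not described on this page, so the JSON Lines layout and the field names (task, arm, prompt_tokens, cost_usd, correct) are assumptions of this sketch. The two sample records carry the figures of the two representative T1 episodes from the replay above.

```python
import json
from collections import defaultdict
from statistics import mean


def per_task_means(jsonl_lines):
    """Average episodes per (task, arm). Field names are hypothetical."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        episode = json.loads(line)
        groups[(episode["task"], episode["arm"])].append(episode)

    return {
        key: {
            "episodes": len(episodes),
            "ok": mean(1.0 if ep["correct"] else 0.0 for ep in episodes),
            "prompt_tokens": mean(ep["prompt_tokens"] for ep in episodes),
            "cost_usd": mean(ep["cost_usd"] for ep in episodes),
        }
        for key, episodes in groups.items()
    }


# Two records in the assumed layout, using the replayed T1 episodes' totals.
SAMPLE = [
    '{"task": "T1", "arm": "sqlite_raw_v1", "prompt_tokens": 33679, '
    '"cost_usd": 0.0180, "correct": true}',
    '{"task": "T1", "arm": "nokv_native_v1", "prompt_tokens": 7140, '
    '"cost_usd": 0.0055, "correct": true}',
]

summary = per_task_means(SAMPLE)
for (task, arm), row in sorted(summary.items()):
    print(task, arm, row)
```

With all 100 committed episodes as input, the same grouping would yield one row per task and arm, comparable to the per-task tables above.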