Open benchmark

Same data. Same model.
Two interfaces.

Name: NoKV Agent Interface Benchmark — telemetry
Creator: NoKV
License: https://www.apache.org/licenses/LICENSE-2.0

Two agent-facing interfaces over one fixed corpus: 875 Yanex ML-experiment runs carrying 806k metric rows. The same gpt-5.4-mini model answered the same five questions through raw SQLite (live schema discovery, read-only SQL, grep_blob) and through the NoKV namespace (seven verbs) — 10 repeats per arm and task, 100 fully stateless episodes, judged against deterministic gold facts neither arm can see.

corpus: 875 Yanex runs · 806k metric rows
model: gpt-5.4-mini
arms: sqlite_raw_v1 · nokv_native_v1
repeats: 10 per arm × task
episodes: 100 · fully stateless
judge: deterministic gold facts, invisible to both arms

1.83 ×

fewer prompt tokens

set mean: 151,572 → 82,827

1.63 ×

lower cost

set mean: $0.0708 → $0.0433

2.39 ×

on compound exploration

prompt tokens, T1 + T3 + T5

4.5 / 5

tasks solved correctly

raw SQLite: 4.4 / 5

By task

Where the bill actually comes from

10-repeat means per task and arm, from the published table. Toggle between prompt tokens and all-in cost — the ordering barely moves.

Tool-call replay

Same task, two interfaces — call by call

One representative episode per arm, replayed straight from the committed telemetry. The interface shape is the cost: count the calls each arm needs before it can answer.

T1Find the 5 best completed training runs by minimum val_loss; report learning rate, batch size, stdout size, and git state for each.10-repeat verdict: Win

Raw SQLite6 calls

calls: 6/6
est. result tok: ≈10,437
cost share: 76%

01show_schema{}1 ms0 B · ≈870 tok
arguments — 280-char telemetry extract
```
{}
```
status: success · bytes read: 0 B · est. result tokens: ≈870
02query_sql{"sql":"WITH completed AS (\n SELECT experiment_id\n FROM experiments\n WHERE script_path = 'train.py' AND status = 'completed'\n), min_losses AS (\n SELECT m.experiment_id, MIN(m.value) AS min_val_loss\n FROM metrics m\n JOIN completed c ON c.experiment_id = m.experiment_…46 ms0 B · ≈261 tok
arguments — 280-char telemetry extract
```
{"sql":"WITH completed AS (\n  SELECT experiment_id\n  FROM experiments\n  WHERE script_path = 'train.py' AND status = 'completed'\n), min_losses AS (\n  SELECT m.experiment_id, MIN(m.value) AS min_val_loss\n  FROM metrics m\n  JOIN completed c ON c.experiment_id = m.experiment_…
```
status: success · bytes read: 0 B · est. result tokens: ≈261
03query_sql{"sql":"SELECT experiment_id, param_path, value_json FROM params WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') AND param_path IN ('learning_rate','batch_size') ORDER BY experiment_id, param_path;"}1 ms0 B · ≈16 tok
arguments — 280-char telemetry extract
```
{"sql":"SELECT experiment_id, param_path, value_json FROM params WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') AND param_path IN ('learning_rate','batch_size') ORDER BY experiment_id, param_path;"}
```
status: success · bytes read: 0 B · est. result tokens: ≈16
04query_sql{"sql":"SELECT experiment_id, artifact_path, size_bytes FROM artifacts WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') AND artifact_path='stdout.txt';"}1 ms0 B · ≈229 tok
arguments — 280-char telemetry extract
```
{"sql":"SELECT experiment_id, artifact_path, size_bytes FROM artifacts WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') AND artifact_path='stdout.txt';"}
```
status: success · bytes read: 0 B · est. result tokens: ≈229
05query_sql{"sql":"SELECT experiment_id, param_path, value_json, value_type FROM params WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') ORDER BY experiment_id, param_path LIMIT 200;"}10 ms0 B · ≈8,796 tok
arguments — 280-char telemetry extract
```
{"sql":"SELECT experiment_id, param_path, value_json, value_type FROM params WHERE experiment_id IN ('f9f8eed7','9b0fcb92','34716ceb','86d5cf02','36b0a393') ORDER BY experiment_id, param_path LIMIT 200;"}
```
status: success · bytes read: 0 B · est. result tokens: ≈8,796
06query_sql{"sql":"WITH completed AS (\n SELECT experiment_id\n FROM experiments\n WHERE script_path = 'train.py' AND status = 'completed'\n), min_losses AS (\n SELECT m.experiment_id, MIN(m.value) AS min_val_loss\n FROM metrics m\n JOIN completed c ON c.experiment_id = m.experiment_…40 ms0 B · ≈265 tok
arguments — 280-char telemetry extract
```
{"sql":"WITH completed AS (\n  SELECT experiment_id\n  FROM experiments\n  WHERE script_path = 'train.py' AND status = 'completed'\n), min_losses AS (\n  SELECT m.experiment_id, MIN(m.value) AS min_val_loss\n  FROM metrics m\n  JOIN completed c ON c.experiment_id = m.experiment_…
```
status: success · bytes read: 0 B · est. result tokens: ≈265

NoKV namespace4 calls

calls: 4/4
est. result tok: ≈1,241
cost share: 24%

01catalog{"field_prefix":null,"include_facets":false,"path":"/yanex/runs"}32 ms0 B · ≈629 tok
arguments — 280-char telemetry extract
```
{"field_prefix":null,"include_facets":false,"path":"/yanex/runs"}
```
status: success · bytes read: 0 B · est. result tokens: ≈629
02aggregate{"group_by":["experiment_id"],"limit":5,"measures":[{"field":"val_loss","name":"min_val_loss","op":"min"}],"path":"/yanex/runs","predicates":[{"field":"script","op":"eq","value":"train.py"},{"field":"status","op":"eq","value":"completed"},{"field":"val_loss","op":"exists","value…76 ms0 B · ≈26 tok
arguments — 280-char telemetry extract
```
{"group_by":["experiment_id"],"limit":5,"measures":[{"field":"val_loss","name":"min_val_loss","op":"min"}],"path":"/yanex/runs","predicates":[{"field":"script","op":"eq","value":"train.py"},{"field":"status","op":"eq","value":"completed"},{"field":"val_loss","op":"exists","value…
```
status: success · bytes read: 0 B · est. result tokens: ≈26
03find{"cursor":null,"facets":[],"fields":["run.id","run.name","run.status","run.script","param.origami.training.learning_rate","param.origami.training.batch_size","metric.val_loss.min","artifact.stdout_size_bytes","git.dirty"],"limit":5,"path":"/yanex/runs","predicates":[{"field":"ru…77 ms0 B · ≈464 tok
arguments — 280-char telemetry extract
```
{"cursor":null,"facets":[],"fields":["run.id","run.name","run.status","run.script","param.origami.training.learning_rate","param.origami.training.batch_size","metric.val_loss.min","artifact.stdout_size_bytes","git.dirty"],"limit":5,"path":"/yanex/runs","predicates":[{"field":"ru…
```
status: success · bytes read: 0 B · est. result tokens: ≈464
04aggregate{"group_by":["run.id"],"limit":5,"measures":[{"field":"metric.val_loss.min","name":"min_val_loss","op":"min"}],"path":"/yanex/runs","predicates":[{"field":"run.script","op":"eq","value":"train.py"},{"field":"run.status","op":"eq","value":"completed"},{"field":"metric.val_loss.mi…70 ms0 B · ≈122 tok
arguments — 280-char telemetry extract
```
{"group_by":["run.id"],"limit":5,"measures":[{"field":"metric.val_loss.min","name":"min_val_loss","op":"min"}],"path":"/yanex/runs","predicates":[{"field":"run.script","op":"eq","value":"train.py"},{"field":"run.status","op":"eq","value":"completed"},{"field":"metric.val_loss.mi…
```
status: success · bytes read: 0 B · est. result tokens: ≈122

Telemetry records tool inputs and the final answer — tool result bodies and model reasoning are not logged, so this is a call replay, not a transcript. Each column is one representative episode (the median-cost correct run of 10 repeats), so its totals differ from the 10-repeat means above.

Five tasks

Win, loss, mixed, tie — all of it published

The namespace wins compound exploration, not every query shape. T2 went to SQL on single-shot analytics, and T3 is genuinely mixed.

Sweep report

train_top_configs_report

Win

“Find the 5 best completed training runs by minimum val_loss; report learning rate, batch size, stdout size, and git state for each.”

arm ok prompt tokens cost

NoKV 100% 7,914 $0.0058

SQL 50% 23,626 $0.0132

The 806k-row min-per-run join goes wrong half the time on SQL — silently. On NoKV it is one catalog call plus one find.

Fidelity leaderboard

eval_fidelity_leaderboard

Loss

“Among completed eval.py runs, find the 5 runs with the highest latest fidelity; report latest utility/detection/privacy metrics plus stderr size.”

arm ok prompt tokens cost

NoKV 100% 9,314 $0.0066

SQL 100% 4,787 $0.0062

Single-shot analytics: a schema dump plus one SELECT, 4.8k tokens. SQL won this task, and we say so.

Checkpoint provenance

tabdiff_ddxplus_dcr_checkpoint_provenance

Mixed

“For every TabDiff sampling run of the ddxplus_dcr dataset, report which checkpoint file the sampler loaded and the loaded model's parameter count (both only in stdout logs).”

arm ok prompt tokens cost

NoKV 60% 35,778 $0.0178

SQL 100% 84,607 $0.0322

SQL got it right every time — at 2.4× the price. NoKV missed twice across 10 repeats.

Costed on correct answers only, NoKV is still cheaper ($0.0297 vs $0.0322).

Detection method audit

best_detection_eval_method_audit

Tie

“Find the completed eval.py run with the highest latest detection_roc_auc and report the detection method name exactly as its log states it.”

arm ok prompt tokens cost

NoKV 90% 20,213 $0.0081

SQL 90% 19,334 $0.0087

Statistical tie: both arms at 90%, near-identical bills.

Interrupt triage

cancelled_train_interrupt_triage

Win

“For every non-completed run: status, stderr size, whether stderr contains a KeyboardInterrupt, and the line number of its last occurrence.”

arm ok prompt tokens cost

NoKV 100% 9,608 $0.0050

SQL 100% 19,217 $0.0104

grep returns line numbers as native citations; SQL pays roughly double to recover them.

Per run

All 100 runs, no averaging

Every episode from the committed telemetry: position is what the run cost, color is whether the judge accepted the answer. Hover any dot.

x = all-in cost per run (USD, log scale)

Caveats

Read this before quoting the numbers

One model. Everything was measured on gpt-5.4-mini. A stronger model may shrink SQL's correctness gap on T1.
One corpus, five tasks, 10 repeats per arm×task. The task set is deliberately skewed toward compound exploration, because that is the workload NoKV targets.
NoKV is not perfect here. On T3 its correctness is 60% — 4 of 10 repeats failed: one cohort over-inclusion, one incomplete cohort, and two missed parameter extractions.
Dollars are dated; ratios are not. Pricing uses publication-time list prices (input $0.75/M, cached $0.075/M, output $4.50/M). The ratios are the durable conclusion — the dollar figures are not.
Cost-weighted tokens = uncached + 0.1×cached + 6×completion — the list-price ratios expressed in input-token units.
Fairness posture. Fully stateless episodes (harness-enforced and test-asserted); both arms have line-level body search (grep vs grep_blob); both arms see logically equivalent indexed facts; the judge's gold is invisible to both.
The SQLite arm is not naive. It includes ETL'd index tables and grep_blob — the residual difference is the interface shape itself.

Check our work

There is no rerun button here — the harness needs the 2.3 GB corpus and an API key — but every published number is recomputable from the committed telemetry.

Harness source ↗ Full report ↗ Raw telemetry (100 runs) ↗

Same data. Same model. Two interfaces.

Where the bill actually comes from

Same task, two interfaces — call by call

Win, loss, mixed, tie — all of it published

All 100 runs, no averaging

Read this before quoting the numbers

Check our work

Same data. Same model.
Two interfaces.