We ran the same ML-research tasks against two interfaces over identical data. The filesystem-shaped one used 45% fewer tokens, cost 39% less, and got more answers right. Here is the argument, the evidence, and what it means for your stack.
The week token economics stopped being a footnote
Two things happened in the same week of June 2026.
On June 9, Anthropic shipped Claude Fable 5. In Anthropic’s own words, “Apps that took a hundred prompts a year ago, it now one-shots.” The market reacted within a day: Amplitude, Atlassian, and Guidewire slid as analysts rehearsed the “SaaSpocalypse” thesis. If a $20/month agent can do long-horizon, multi-step knowledge work, per-seat subscriptions get hard to defend. The capability curve is doing to software what it already did to demos: when models this strong are an API call away, every serious enterprise becomes a potential builder of internal agents rather than a buyer of packaged workflows. (Futurum’s Rolf Bulk, back in February: “There’s likely to be cannibalization of SaaS by AI-driven workflows.”)
On June 10, the Linux Foundation announced Tokenomicon, an entire conference dedicated to the economics of AI, citing Goldman Sachs research that projects global token usage to multiply roughly 24x between 2026 and 2030. The economics of tokens now has its own conference circuit. Jensen Huang has been saying this for over a year: datacenters are “AI factories” with one job, “generating these incredible tokens”, and by GTC 2026 the framing had hardened: “Tokens are the new commodity… your tokens are your commodity, and that compute is your revenue.”
Put the two together and you get the thesis of this post. If everyone can build agents, having agents is not a differentiator. What differentiates is the unit economics of every task your agents run. In our experience, the single biggest lever on that bill is not the model or the prompt. It is the interface your agent operates on. Frontier tokens are not cheap (Fable 5 lists at $10/M input, $50/M output), volume is about to multiply, and every wasted exploration turn re-bills the entire conversation.
An agent-friendly interface is a token-efficiency strategy. We benchmarked it. But first, the argument.
Part 1 — Do agents prefer filesystems?
In August 2025, Letta published a benchmark with a provocative title: “Benchmarking AI Agent Memory: Is a Filesystem All You Need?” A Letta agent on gpt-4o-mini that simply stored conversation history in files scored 74.0% on the LoCoMo long-conversation-memory benchmark, beating Mem0’s reported 68.5% for its top-performing graph variant, a tool purpose-built for agent memory. Their conclusion: “With a well-designed agent, even simple filesystem tools are sufficient to perform well on retrieval benchmarks such as LoCoMo.”
The intuition has been circulating among systems people too. Pekka Enberg, sketching a disaggregated agent filesystem on object storage, put it bluntly:
“Give an agent access to grep, sed, awk, cat, and git, and it becomes unreasonably capable and effective, requiring no custom tools.”
There is nothing mystical here. Filesystem and shell are among the most common computing interfaces in LLM training data, and the past two years of post-training have specifically optimized frontier models for agentic coding tasks. That is why coding agents consistently feel like the strongest agents anyone has shipped. The skills transfer: navigate a tree, grep for a needle, read what you found, cite the line.
A first, rough conclusion: agents want filesystems.
Part 2 — Why: the shape of an agent-friendly surface
The affinity is not just familiarity. A filesystem is a progressive-disclosure interface with stable handles: an agent first locates the thing by directory, by name, or by grep, and only then pays to read its content. Cheap discovery, lazy loading, composable steps. SQL is excellent at relational queries and aggregation, but in the locate-the-handle phase it front-loads cognitive cost: schema comprehension, join semantics, field naming, and query composition. The agent pays for all of that in tokens and in error probability before it has found anything.
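A minimal sketch of the locate-then-read pattern described above. The directory layout (`runs/<id>/stdout.log`), the log contents, and the helper names `locate` and `read` are assumptions made for illustration, not part of the benchmark corpus or of any NoKV API. Discovery returns only handles, and content enters the agent's context only once a handle has been chosen.

```python
import tempfile
from pathlib import Path


def build_demo_workspace(root: Path) -> None:
    """Create a tiny, made-up workspace: one directory per run, one log per run."""
    logs = {
        "run-001": "epoch 1\nval_loss=0.91\n",
        "run-002": "epoch 1\nval_loss=0.42\nCheckpoint: ckpt/best.pt\n",
    }
    for run_id, body in logs.items():
        run_dir = root / "runs" / run_id
        run_dir.mkdir(parents=True)
        (run_dir / "stdout.log").write_text(body)


def locate(root: Path, needle: str) -> list[Path]:
    """Cheap discovery: the tool scans the files, the agent sees only the handles."""
    return sorted(p for p in root.rglob("*.log") if needle in p.read_text())


def read(handle: Path) -> str:
    """Lazy loading: the body is paid for only after a handle is chosen."""
    return handle.read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_workspace(root)
    handles = locate(root, "Checkpoint:")
    relative = [h.relative_to(root).as_posix() for h in handles]
    body = read(handles[0])
```

Here `relative` holds a single path, and only that one log body is ever read into the caller's hands.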
The two biggest model labs have both, independently, converged on this shape for their own surfaces:
- Anthropic showed in Code execution with MCP that presenting MCP tools as a TypeScript file tree (servers/google-drive/getDocument.ts, …) instead of a flat tool list cut a representative workload “from 150,000 tokens to 2,000 tokens — a time and cost saving of 98.7%.” Their explanation points at the same interface shape: “presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front.” The same progressive-disclosure principle drives Agent Skills: a skill costs ~100 tokens of metadata until the agent actually opens it.
- OpenAI’s tool search guide recommends organizing deferred tools into namespaces or MCP servers rather than flat lists, and is unusually explicit about why: “Our models have primarily been trained to search those surfaces, and token savings are usually more material there.”
Neither lab routes its agents through a SQL schema as the primary surface. SQL can represent the data, but the agent systems that work today (and the ones being trained for tomorrow) are code-executing, lazily-loading, search-first systems. Lazy loading over a named, hierarchical, searchable namespace is the consistently observed token-efficiency win.
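To make "on-demand rather than up-front" concrete, here is a toy sketch. The tool tree, its stub definitions, and the helpers `list_tools` and `load_tool` are hypothetical, and character counts stand in for tokens. It is not how either lab implements tool loading.

```python
# Hypothetical tool tree: path -> definition text. Paths and bodies are illustrative.
TOOL_TREE = {
    "servers/google-drive/getDocument.ts": "export async function getDocument(id: string) { /* ... */ }",
    "servers/google-drive/listFiles.ts": "export async function listFiles(folder: string) { /* ... */ }",
    "servers/salesforce/updateRecord.ts": "export async function updateRecord(id: string, data: object) { /* ... */ }",
}


def list_tools(prefix: str = "") -> list[str]:
    """Discovery step: names only, no definitions."""
    return sorted(path for path in TOOL_TREE if path.startswith(prefix))


def load_tool(path: str) -> str:
    """On-demand step: read a single definition."""
    return TOOL_TREE[path]


# Up-front: every definition lands in context before the agent does anything.
upfront_chars = sum(len(body) for body in TOOL_TREE.values())

# On-demand: list one namespace, then open only the definition that is needed.
names = list_tools("servers/google-drive/")
on_demand_chars = sum(len(name) for name in names) + len(load_tool(names[0]))
```

With three stub tools the difference is small. The gap grows with the number and size of definitions, because the up-front cost scales with the whole tree while the on-demand cost scales with what the task touches.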
So let’s refine Part 1’s conclusion: agents want filesystem-shaped surfaces.
Part 3 — Does that mean you should go all-in on the filesystem?
When this debate reached Hacker News in January 2026, the 200+ point thread around “FUSE is All You Need – Giving agents access to anything via filesystems” split predictably. Skeptics: a FUSE layer is “an extra layer of indirection for indirection’s sake”; LLMs can call APIs and write SQL directly; permissions belong in the underlying systems. Supporters: filesystem interfaces match the training data and Unix philosophy; one commenter reported running exactly this pattern in production, saying “It opens up absolutely bonkers capabilities.”
We think the framing of that fight is the actual mistake. It conflates the filesystem-shaped interface with the filesystem as storage substrate. Those are two independent design decisions, and the discussion that matters for agent systems is about the interface. Mikiko Bazeley, Staff Developer Advocate for Agentic Systems at MongoDB, put it precisely:
“The debate was never ‘filesystem or database;’ it was always both, in the right layers.”
Her article also carries the honest counterpoint: in Vercel’s testing, database queries beat filesystem operations on structured data, with 100% accuracy and lower token usage. Both facts are true at once, and they decompose cleanly: the interface question (what surface does the agent operate on?) and the substrate question (where does state actually live and persist?) have different answers.
Our answer: agents operate best over simple, inspectable, workspace-shaped interfaces, while the real data stays in databases, object stores, and APIs underneath. The namespace is the agent’s view, not the storage engine.
That is a falsifiable claim about interfaces, holding data constant. So we tested exactly that.
Part 4 — We benchmarked it
We built an open benchmark harness that gives an agent the same fixed ML-experiment corpus through two different surfaces. The corpus contains 875 training/sampling/eval runs from a real experiment tracker: metadata, params, 806k metric rows, artifacts, git state, and raw stdout/stderr logs.
- sqlite_raw_v1: a raw SQL surface with live schema discovery, bounded read-only SQL over fully materialized tables (including the log bodies as blobs), byte-range blob reads, and a line-oriented grep_blob.
- nokv_native_v1: NoKV’s namespace surface with ls, stat, catalog, find, aggregate, read, and a namespace-recursive grep. Runs are directories; logs are files; indexed facts are queryable with predicates, sort, and projection pushed down into one call.
Five tasks are written the way an ML researcher actually talks to an agent: “find my best configs,” “which checkpoint did the privacy-study samplers load,” “write the incident note for the cancelled jobs.” The agent is gpt-5.4-mini; 10 repeats per arm/task (100 runs total); every run is a fully stateless episode. The harness refuses to run otherwise: fresh runner process, context rebuilt from only the system message, interface card, and task prompt, no response chaining across runs. Correctness is judged against deterministic gold facts that neither arm can see. And the comparison is deliberately hard on ourselves: both arms get a case-insensitive, line-oriented body search with line numbers; both arms see logically identical index facts; SQL holds the entire corpus, logs included, one query away.
USD costs use gpt-5.4-mini list rates ($0.75/M input, $0.075/M cached input, $4.50/M output), recorded per run in the telemetry.
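The per-run cost follows directly from those three rates. A minimal sketch, assuming telemetry exposes prompt, cached, and completion token counts; the `Usage` field names and the example numbers are made up for illustration and are not the harness's actual schema.

```python
from dataclasses import dataclass

# gpt-5.4-mini list rates quoted in the text, in USD per million tokens.
RATE_INPUT = 0.75
RATE_CACHED_INPUT = 0.075
RATE_OUTPUT = 4.50


@dataclass
class Usage:
    prompt_tokens: int      # all prompt tokens, cached ones included
    cached_tokens: int      # subset of prompt_tokens served from the prompt cache
    completion_tokens: int


def run_cost_usd(u: Usage) -> float:
    """Bill uncached input, cached input, and output at their separate rates."""
    uncached = u.prompt_tokens - u.cached_tokens
    return (
        uncached * RATE_INPUT
        + u.cached_tokens * RATE_CACHED_INPUT
        + u.completion_tokens * RATE_OUTPUT
    ) / 1_000_000


# Made-up run: 10k prompt tokens, 6k of them cached, 500 completion tokens.
example_cost = run_cost_usd(Usage(prompt_tokens=10_000, cached_tokens=6_000, completion_tokens=500))
```

In this example the 500 output tokens cost almost as much as the 4,000 uncached input tokens, which is why prompt-token ratios and cost ratios in the tables below differ.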
The headline
| Set mean (per 5-task pass) | Raw SQLite | NoKV namespace | SQLite / NoKV |
|---|---|---|---|
| Tasks solved correctly | 4.40 / 5 | 4.50 / 5 | — |
| Prompt tokens (incl. cached) | 151,572 | 82,827 | 1.83x |
| Total tokens (incl. completion) | 156,098 | 87,418 | 1.79x |
| Cost (USD) | $0.0708 | $0.0433 | 1.63x |
Same data, same model, same questions: the namespace surface answered slightly more accurately (4.50 vs 4.40 tasks per pass) on 45% fewer prompt tokens and a 39% smaller bill.
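The percentages can be recomputed from the table rows. A short sketch using only the rounded figures printed above; recomputing the cost ratio this way gives about 1.64x rather than the table's 1.63x, presumably because the table was computed from unrounded costs.

```python
# Set means copied from the headline table.
SQLITE = {"prompt": 151_572, "total": 156_098, "cost": 0.0708}
NOKV = {"prompt": 82_827, "total": 87_418, "cost": 0.0433}


def ratio(key: str) -> float:
    """How many times larger the SQLite figure is than the NoKV figure."""
    return SQLITE[key] / NOKV[key]


def saving_pct(key: str) -> float:
    """Percentage saved by the namespace arm relative to the SQLite arm."""
    return 100 * (1 - NOKV[key] / SQLITE[key])


summary = {key: (round(ratio(key), 2), round(saving_pct(key))) for key in SQLITE}
```

The 45% figure matches the prompt-token row; on total tokens the saving rounds to 44%.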
On the three compound exploration tasks, where the agent has to locate a cohort, extract facts from log bodies, and cite the line, the gap widens. This is where the thesis lives:
| Compound tasks (T1+T3+T5) | Raw SQLite | NoKV namespace | SQLite / NoKV |
|---|---|---|---|
| Prompt tokens | 127,450 | 53,300 | 2.39x |
| Cost (USD) | $0.0558 | $0.0286 | 1.95x |
| Mean correctness | 83.3% | 86.7% | — |
What compound exploration looks like
T1 — the sweep report. “Find the 5 best completed training runs by minimum val_loss; report learning rate, batch size, stdout size, and git state.” On the namespace: one catalog call discovers the fields, one find call pushes predicates, sort, limit, and a six-field projection into the engine, producing a 100% correct answer at 7.9k tokens. On SQL the same answer is a min-per-run aggregation over an 806k-row metric table joined against params, artifacts, and git state. The model wrote that query wrong half the time, silently, at 23.6k tokens either way. Failed runs still bill.
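For a sense of what the SQL arm has to compose, here is a sketch of a min-per-run aggregation against a hypothetical three-table schema. The table names, columns, and rows are assumptions for illustration; the benchmark's real schema is larger and also involves artifacts and git state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (run_id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE metrics (run_id TEXT, key TEXT, step INTEGER, value REAL);
CREATE TABLE params (run_id TEXT, key TEXT, value TEXT);
INSERT INTO runs VALUES ('a', 'completed'), ('b', 'completed'), ('c', 'failed');
INSERT INTO metrics VALUES
  ('a', 'val_loss', 1, 0.9), ('a', 'val_loss', 2, 0.5),
  ('b', 'val_loss', 1, 0.7), ('b', 'val_loss', 2, 0.8),
  ('c', 'val_loss', 1, 0.1);
INSERT INTO params VALUES ('a', 'lr', '0.001'), ('b', 'lr', '0.01'), ('c', 'lr', '0.1');
""")

# Best completed runs by minimum val_loss, with one param joined in.
BEST_RUNS = """
SELECT r.run_id, MIN(m.value) AS best_val_loss, p.value AS lr
FROM runs AS r
JOIN metrics AS m ON m.run_id = r.run_id AND m.key = 'val_loss'
JOIN params  AS p ON p.run_id = r.run_id AND p.key = 'lr'
WHERE r.status = 'completed'
GROUP BY r.run_id, p.value
ORDER BY best_val_loss ASC
LIMIT ?
"""

rows = conn.execute(BEST_RUNS, (5,)).fetchall()
```

In this toy data, dropping the status filter would rank the failed run `c` first without raising any error. That is the kind of silent mistake a multi-join aggregation invites, though the sketch makes no claim about which mistakes the model actually made.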
T3 — checkpoint provenance. “For every TabDiff sampling run of the ddxplus_dcr dataset, report which checkpoint file the sampler loaded and the loaded model’s parameter count.” Those facts exist only inside stdout logs. On the namespace, one recursive grep for the dataset line identifies the ten run directories. The path is the cohort handle, and scoped greps return the Checkpoint: and Model parameters: lines directly (35.8k tokens). On SQL, even with grep_blob available, the model must first resolve params → artifacts → blob_ref indirection and then orchestrate one call per blob handle; in most repeats it gave up and dragged whole stdout blobs through query results. It still got the right answer, but at 84.6k tokens. (Honesty note: the namespace arm slipped on this task in 4 of 10 repeats, 60% vs SQL’s 100%, yet remains cheaper per correct answer, $0.0297 vs $0.0322.)
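The two-stage pattern on the namespace side can be sketched with a plain directory tree. This is an analogy on a local filesystem, not NoKV's grep. The paths, the `dataset:` log line, and the checkpoint values are made up; only the `Checkpoint:` and `Model parameters:` prefixes come from the task description.

```python
import re
import tempfile
from pathlib import Path


def grep(root: Path, pattern: str) -> list[tuple[str, int, str]]:
    """Case-insensitive, line-oriented search under root.

    Returns (path relative to root, 1-based line number, line text).
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    demo = {
        "runs/s1/stdout.log": "dataset: ddxplus_dcr\nCheckpoint: ckpt/a.pt\nModel parameters: 1000\n",
        "runs/s2/stdout.log": "dataset: other\nCheckpoint: ckpt/b.pt\n",
    }
    for rel, body in demo.items():
        target = root / rel
        target.parent.mkdir(parents=True)
        target.write_text(body)

    # Stage 1: one recursive grep finds the cohort; the directory is the handle.
    cohort = sorted({Path(rel).parent.as_posix() for rel, _, _ in grep(root, r"ddxplus_dcr")})
    # Stage 2: scoped greps pull the facts, with line numbers, from each run directory.
    facts = {d: grep(root / d, r"^(Checkpoint|Model parameters):") for d in cohort}
```

The same function serves both stages; only the root it is pointed at changes.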
T5 — incident triage. “For every non-completed run: status, stderr size, whether stderr contains a KeyboardInterrupt, and the line number of its last occurrence.” The namespace gets the cohort plus stderr sizes from one find, and scoped greps return matching lines with line numbers, so audit citations fall out of the surface for free. 100% correct at 9.6k tokens, versus SQL’s 100% at 19.2k. SQL only stopped failing this task after we gave it a line-oriented search tool, because line numbers exist nowhere in a relational projection.
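The difference between a line-oriented view and a blob view shows up in how "line number of the last occurrence" is computed. A small sketch with a made-up stderr body; both helper names are hypothetical.

```python
from typing import Optional


def last_line_line_oriented(text: str, needle: str) -> Optional[int]:
    """Line-oriented view: line numbers come for free while scanning."""
    last = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if needle in line:
            last = lineno
    return last


def last_line_from_offset(blob: str, needle: str) -> Optional[int]:
    """Blob view: find a byte offset, then count the newlines before it."""
    offset = blob.rfind(needle)
    if offset < 0:
        return None
    return blob.count("\n", 0, offset) + 1


stderr = "Traceback (most recent call last):\nKeyboardInterrupt\nretrying\nKeyboardInterrupt\n"
line_a = last_line_line_oriented(stderr, "KeyboardInterrupt")
line_b = last_line_from_offset(stderr, "KeyboardInterrupt")
```

Both return the same line here. The second form is the newline arithmetic a relational projection would have to reproduce, since a blob column carries offsets, not lines.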
Why the namespace wins compound work
- Paths are cohort handles. Finding a run and reading its logs happen in the same address space; SQL interposes blob-handle indirection between “which runs” and “what their logs say.”
- Recursive, scoped search. The same grep sweeps the corpus for discovery and a single directory for extraction; grep_blob sees one blob at a time.
- Line numbers are native citations. “Which line says so” is the answer format auditors want, and SQL needs newline arithmetic to produce it.
- Push-down keeps turns short. Predicates + sort + limit + projection in one call. Every avoided turn avoids re-billing the whole conversation; under prompt-cache economics, turn count is the bill.
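The re-billing effect in the last bullet is simple arithmetic. A sketch under a deliberately crude assumption: every turn re-sends the whole conversation so far, and each turn adds a fixed number of tokens. The base and per-turn sizes are made up.

```python
def cumulative_prompt_tokens(base: int, per_turn: int, turns: int) -> int:
    """Total prompt tokens billed across a conversation of `turns` model calls."""
    total = 0
    context = base
    for _ in range(turns):
        total += context      # the whole conversation so far is sent again
        context += per_turn   # this turn's tool call and result join the context
    return total


# Made-up sizes: 2,000-token starting context, 1,500 tokens added per turn.
two_turns = cumulative_prompt_tokens(base=2_000, per_turn=1_500, turns=2)
six_turns = cumulative_prompt_tokens(base=2_000, per_turn=1_500, turns=6)
```

Under this model, tripling the turn count from two to six multiplies billed prompt tokens by more than six, because the total grows roughly with the square of the number of turns. Prompt caching lowers the rate on the repeated prefix but does not remove the term.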
Where SQL holds its ground
We publish the losses too, because they confirm Part 3 rather than embarrass it. On the single-shot leaderboard task, SQL was hard to beat: schema dump plus one SELECT, 4.8k tokens, while the namespace took 9.3k. Another task was a statistical tie. That is exactly Vercel’s structured-data finding reproduced in our own data, and exactly Bazeley’s “both, in the right layers”: relational surfaces excel at single-shot analytics; namespace surfaces win compound exploration, where agent workloads are heading.
Everything here is reproducible: the harness, tasks, and judge are in the repo, the full report documents the methodology and fairness posture, and the raw telemetry of all 100 runs is committed, so every number above can be recomputed from source.
Part 5 — Which interfaces should the filesystem shape actually carry?
“Give the agent a filesystem” is not one decision; it is a small set of surfaces. From the benchmark and from building NoKV, three carry most of the value:
- Artifact & metadata control. Typed cards (stat/catalog) that tell the agent what fields exist before it queries; find/aggregate with full push-down so ranking and grouping cost one call; grep that returns line-numbered evidence. This is the surface that turned T1 from a 50% coin-flip into a deterministic two-call answer.
- Workspace management. A run is a directory; an experiment is a namespace; publishing results is an atomic publish into the tree rather than a row insert the agent can’t inspect. Fused listing keeps discovery one call wide; external body references let logs and checkpoints live in object storage while remaining one read away.
- Snapshots and watches. Agents are long-running and concurrent. Snapshot reads give an agent a consistent view of the workspace while training jobs keep writing; watchable updates let an observer agent react to new runs without polling; quotas keep a misbehaving agent from flooding the namespace.
Note what is not on this list: replacing your database. The namespace is the agent-facing view; the substrate behind it should stay whatever your data already trusts.
Part 6 — Does your project need an agent interface layer?
When an agent workspace becomes a shared namespace spanning runtimes and object stores, with watchable updates, snapshot reads, rollback-able state, and atomic publish, that semantic layer should not be re-implemented in every FUSE adapter, MCP server, and SDK as bespoke SQL. It deserves to be a metadata plane of its own. The practical test is simple: if your project involves (a) multi-agent workflows, (b) long-running sessions, or (c) metadata that must persist and be re-discovered across sessions, you need a filesystem-shaped query dictionary in front of your data.
The first domain where this is already obvious is AI experiment tracking. Experiment and observability data is becoming agent-readable and agent-operable: the first productized use case is natural-language analysis over historical runs, traces, evals, and artifacts; autonomous experiment management is emerging behind it, gated on permissioning, auditability, and human-in-the-loop control. Our benchmark tasks are that workload. They showed that when an observer agent works this surface all day, the interface choice alone is worth ~2x on the token bill.
That is what NoKV is: a metadata control plane for agent workspaces. It provides atomic publish, fused listing, snapshot reads, watchable updates, external body references, and quotas. It happens to present the namespace surface agents were trained to be good at. The token efficiency isn’t a feature we bolted on. It is what falls out when the interface matches the model.
Tokens are the new commodity. Spend yours on answers, not on navigation.
Sources
- Letta — Benchmarking AI Agent Memory: Is a Filesystem All You Need? (Aug 2025)
- Pekka Enberg — Towards a Disaggregated Agent Filesystem on Object Storage (Jan 2026)
- Anthropic — Code execution with MCP: Building more efficient agents (Nov 2025); Agent Skills overview; Introducing Claude Fable 5 and Mythos 5 (Jun 2026)
- OpenAI — Tool search guide
- Hacker News — FUSE is All You Need – Giving agents access to anything via filesystems (Jan 2026)
- Mikiko Bazeley, The New Stack — The “files are all you need” debate misses what’s actually happening in agent memory architecture (Mar 2026)
- Linux Foundation — Tokenomicon announcement (Jun 2026)
- Jensen Huang — GTC 2025 keynote transcript; GTC 2026 token-economics coverage, RCR Wireless; GTC Taipei 2026 coverage, SiliconANGLE
- SaaS market reaction — Yahoo Finance / StockStory (Jun 2026); CNBC on AI and SaaS selloff (Feb 2026)
- NoKV — benchmark harness & README · full benchmark report · raw run telemetry (100 runs)