Agents Want Filesystems: Agent-Friendly Interfaces Are a Token-Efficiency Strategy

If tokens are a new unit of computation, then interface shape is part of the cost structure.

We ran the same ML-research tasks against two interfaces over identical data. The filesystem-shaped one used 45% fewer prompt tokens, cost 39% less, and got slightly more answers right. Here is the argument, the evidence, and what it means for your stack.

The week token economics stopped being a footnote

Two things happened in the same week of June 2026.

On June 9, Anthropic shipped Claude Fable 5. In Anthropic’s own words, “Apps that took a hundred prompts a year ago, it now one-shots.” The market reacted within a day: Amplitude, Atlassian, and Guidewire slid as analysts rehearsed the “SaaSpocalypse” thesis. If a $20/month agent can do long-horizon, multi-step knowledge work, per-seat subscriptions get hard to defend. The capability curve is doing to software what it already did to demos: when models this strong are an API call away, every serious enterprise becomes a potential builder of internal agents rather than a buyer of packaged workflows. (Futurum’s Rolf Bulk, back in February: “There’s likely to be cannibalization of SaaS by AI-driven workflows.”)

On June 10, the Linux Foundation announced Tokenomicon, an entire conference dedicated to the economics of AI, citing Goldman Sachs research that projects global token usage to multiply roughly 24x between 2026 and 2030. The economics of tokens now has its own conference circuit. Jensen Huang has been saying this for over a year: datacenters are “AI factories” with one job, “generating these incredible tokens.” By GTC 2026 the framing had hardened: “Tokens are the new commodity… your tokens are your commodity, and that compute is your revenue.”

Put the two together and you get the thesis of this post. If everyone can build agents, having agents is not a differentiator. What differentiates is the unit economics of every task your agents run. In our experience, the single biggest lever on that bill is not the model or the prompt. It is the interface your agent operates on. Frontier tokens are not cheap (Fable 5 lists at $10/M input, $50/M output), volume is about to multiply, and every wasted exploration turn re-bills the entire conversation.
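To make the re-billing point concrete, here is a minimal sketch of how prompt tokens accumulate when every turn resends the whole conversation. The system-prompt size, the tokens added per turn, and the turn counts are illustrative assumptions, not measurements; only the $10/M input price comes from the paragraph above. The sketch also ignores prompt caching, which discounts a repeated prefix without removing it.

```python
def billed_prompt_tokens(system_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total prompt tokens billed when every turn resends the full history.

    Turn k (1-indexed) sends the system prompt plus the k-1 earlier turns,
    so the total grows quadratically with the number of turns.
    """
    total = 0
    for turn in range(1, turns + 1):
        history = tokens_per_turn * (turn - 1)
        total += system_tokens + history
    return total


INPUT_PRICE_PER_MILLION = 10.0  # USD per million input tokens, as quoted above

# Hypothetical workload: 2,000-token system prompt, 1,500 tokens added per turn.
short_run = billed_prompt_tokens(2_000, 1_500, turns=6)
long_run = billed_prompt_tokens(2_000, 1_500, turns=12)

print(short_run, long_run)
print(f"${long_run / 1e6 * INPUT_PRICE_PER_MILLION:.4f} for the 12-turn episode")
```

Under these assumptions, doubling the number of turns more than triples the billed prompt tokens, which is why avoided exploration turns matter more than their individual size suggests.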

An agent-friendly interface is a token-efficiency strategy. We benchmarked it. But first, the argument.

Part 1 — Do agents prefer filesystems?

In August 2025, Letta published a benchmark with a provocative title: “Benchmarking AI Agent Memory: Is a Filesystem All You Need?” A Letta agent on gpt-4o-mini that simply stored conversation history in files scored 74.0% on the LoCoMo long-conversation-memory benchmark, beating the 68.5% that Mem0, a tool purpose-built for agent memory, reported for its top-performing graph variant. Their conclusion: “With a well-designed agent, even simple filesystem tools are sufficient to perform well on retrieval benchmarks such as LoCoMo.”

The intuition has been circulating among systems people too. Pekka Enberg, sketching a disaggregated agent filesystem on object storage, put it bluntly:

“Give an agent access to grep, sed, awk, cat, and git, and it becomes unreasonably capable and effective, requiring no custom tools.”

There is nothing mystical here. Filesystem and shell are among the most common computing interfaces in LLM training data, and the past two years of post-training have specifically optimized frontier models for agentic coding tasks. That is why coding agents consistently feel like the strongest agents anyone has shipped. The skills transfer: navigate a tree, grep for a needle, read what you found, cite the line.

A first, rough conclusion: agents want filesystems.

Part 2 — Why: the shape of an agent-friendly surface

The affinity is not just familiarity. A filesystem is a progressive-disclosure interface with stable handles: an agent first locates the thing by directory, by name, or by grep, and only then pays to read its content. Cheap discovery, lazy loading, composable steps. SQL is excellent at relational queries and aggregation, but in the locate-the-handle phase it front-loads cognitive cost: schema comprehension, join semantics, field naming, and query composition. The agent pays for all of that in tokens and in error probability before it has found anything.
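The locate-then-read pattern can be sketched in a few lines. This is an illustration under stated assumptions, not any product's API: the workspace layout, the log contents, and the `locate` and `read` helpers are all hypothetical. The point is that the discovery step returns only handles (paths), and the agent pays for content only on the file it chooses to open.

```python
import tempfile
from pathlib import Path


def build_workspace(root: Path) -> None:
    """Create a tiny hypothetical workspace: one directory per run, one log each."""
    logs = {
        "run-001": "epoch 1 ok\nepoch 2 ok\n",
        "run-002": "epoch 1 ok\nTraceback: CUDA out of memory\n",
        "run-003": "epoch 1 ok\nepoch 2 ok\nepoch 3 ok\n",
    }
    for name, body in logs.items():
        run_dir = root / name
        run_dir.mkdir()
        (run_dir / "stdout.log").write_text(body)


def locate(root: Path, needle: str) -> list[str]:
    """Discovery: scan on the tool side, return only matching paths to the agent."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.log")
        if needle in p.read_text()
    )


def read(root: Path, handle: str) -> str:
    """Lazy load: fetch the body of one file the agent has already located."""
    return (root / handle).read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_workspace(root)
    handles = locate(root, "out of memory")
    body = read(root, handles[0])
    everything = sum(len(p.read_text()) for p in root.rglob("*.log"))

print(handles)
print(len(body), "characters read out of", everything, "in the workspace")
```

The handle returned by `locate` is stable, so a later step can reuse it as a scope for further searches instead of rediscovering the file.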

The two biggest model labs have both, independently, converged on this shape for their own surfaces:

Anthropic showed in Code execution with MCP that presenting MCP tools as a TypeScript file tree (servers/google-drive/getDocument.ts, …) instead of a flat tool list cut a representative workload “from 150,000 tokens to 2,000 tokens — a time and cost saving of 98.7%.” Their explanation points at the same interface shape: “presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front.” The same progressive-disclosure principle drives Agent Skills: a skill costs ~100 tokens of metadata until the agent actually opens it.
OpenAI’s tool search guide recommends organizing deferred tools into namespaces or MCP servers rather than flat lists, and is unusually explicit about why: “Our models have primarily been trained to search those surfaces, and token savings are usually more material there.”

Neither lab routes its agents through a SQL schema as the primary surface. SQL can represent the data, but the agent systems that work today (and the ones being trained for tomorrow) are code-executing, lazily-loading, search-first systems. Lazy loading over a named, hierarchical, searchable namespace is the consistently observed token-efficiency win.

So let’s refine Part 1’s conclusion: agents want filesystem-shaped surfaces.

Part 3 — Does that mean you should go all-in on the filesystem?

When this debate reached Hacker News in January 2026, the 200+ point thread around “FUSE is All You Need – Giving agents access to anything via filesystems” split predictably. Skeptics: a FUSE layer is “an extra layer of indirection for indirection’s sake”; LLMs can call APIs and write SQL directly; permissions belong in the underlying systems. Supporters: filesystem interfaces match the training data and Unix philosophy; one commenter reported running exactly this pattern in production, saying “It opens up absolutely bonkers capabilities.”

We think the framing of that fight is the actual mistake. It conflates the filesystem-shaped interface with the filesystem as storage substrate. Those are two independent design decisions, and the discussion that matters for agent systems is about the interface. Mikiko Bazeley, Staff Developer Advocate for Agentic Systems at MongoDB, put it precisely:

“The debate was never ‘filesystem or database;’ it was always both, in the right layers.”

Her article also carries the honest counterpoint: in Vercel’s testing, database queries beat filesystem operations on structured data, with 100% accuracy and lower token usage. Both facts are true at once, and they decompose cleanly: the interface question (what surface does the agent operate on?) and the substrate question (where does state actually live and persist?) have different answers.

Our answer: agents operate best over simple, inspectable, workspace-shaped interfaces, while the real data stays in databases, object stores, and APIs underneath. The namespace is the agent’s view, not the storage engine.

That is a falsifiable claim about interfaces, holding data constant. So we tested exactly that.

Part 4 — We benchmarked it

We tested the interface claim by holding the corpus, model, and tasks constant, and changing only the agent-facing surface.

The corpus contains 875 Yanex ML-experiment runs with metadata, params, 806k metric rows, artifacts, git state, and raw stdout/stderr logs. The model was gpt-5.4-mini. Each of the five tasks ran 10 times on each arm, for 100 fully stateless episodes judged against deterministic gold facts neither arm could see.

The two arms were deliberately close in power:

sqlite_raw_v1: live schema discovery, bounded read-only SQL over materialized tables, byte-range blob reads, and a line-oriented grep_blob.
nokv_native_v1: a namespace surface with ls, stat, catalog, find, aggregate, read, and recursive grep. Runs are directories; logs are files; indexed facts can be filtered, sorted, limited, and projected in one call.

The headline result:

Set mean, per 5-task pass	Raw SQLite	NoKV namespace	SQLite / NoKV
Tasks solved correctly	4.40 / 5	4.50 / 5	—
Prompt tokens, including cached	151,572	82,827	1.83x
Total tokens, including completion	156,098	87,418	1.79x
Cost, USD	$0.0708	$0.0433	1.63x

Same data, same model, same questions: the namespace surface answered slightly more accurately on 45% fewer prompt tokens and a 39% smaller bill.
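The percentages follow directly from the table. The short check below recomputes them from the published, rounded figures; the dictionary and function names are ours and purely illustrative.

```python
# Set means per 5-task pass, copied from the headline table above.
sqlite = {"prompt_tokens": 151_572, "total_tokens": 156_098, "cost_usd": 0.0708}
nokv = {"prompt_tokens": 82_827, "total_tokens": 87_418, "cost_usd": 0.0433}


def ratio(metric: str) -> float:
    """How many times more the SQLite arm spent than the NoKV arm."""
    return sqlite[metric] / nokv[metric]


def saving_pct(metric: str) -> float:
    """Percentage saved by the NoKV arm relative to the SQLite arm."""
    return 100 * (1 - nokv[metric] / sqlite[metric])


for metric in sqlite:
    print(f"{metric}: {ratio(metric):.3f}x, {saving_pct(metric):.1f}% saved")
```

Because the dollar figures in the table are rounded to four decimals, the cost ratio recomputed this way lands within rounding distance of the 1.63x shown rather than matching it digit for digit.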

There is also a less visible signal. Outside the public benchmark headline table, our internal run records show that the NoKV-interface arm consumed substantially fewer reasoning tokens while executing the tasks. We do not treat that as the public benchmark’s primary metric, because the published replay records tool inputs and final answers rather than model reasoning. But it points to the same mechanism: if the interface spends fewer turns forcing the model to reconstruct schema, join paths, blob handles, and temporary context, more of the model’s attention remains available for judgment, cross-checking, and synthesis. Token efficiency is not only a billing story; it is also a cognitive-load story.

The average says the gap exists. The task breakdown explains where it comes from.

SQL still held its ground on simple structured lookup. On the leaderboard-style task, the agent could inspect the schema and write one SELECT; SQLite used about 4.8k prompt tokens, while the namespace used about 9.3k. That is not surprising. Relational interfaces are very good at clear aggregation and sorting.

The real gap opened on compound exploration tasks: find a cohort, extract facts from log bodies, and cite the evidence.

Compound tasks, T1 + T3 + T5	Raw SQLite	NoKV namespace	SQLite / NoKV
Prompt tokens	127,450	53,300	2.39x
Cost, USD	$0.0558	$0.0286	1.95x
Mean correctness	83.3%	86.7%	—

T1 was a sweep report: find the five completed training runs with the lowest val_loss, then report learning rate, batch size, stdout size, and git state. On the namespace surface, the agent used catalog to discover fields and find to push filtering, sorting, limit, and projection into the system. It was 100% correct at about 7.9k prompt tokens. On SQL, the agent had to compute min-per-run over the 806k-row metrics table and join params, artifacts, and git state. It got that query wrong half the time, and failed episodes bill like successful ones: the SQL arm averaged about 23.6k prompt tokens on this task.
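What “push filtering, sorting, limit, and projection into the system” means can be shown with a toy version. The `find` signature, the field names, and the run records below are hypothetical stand-ins chosen for illustration; they are not NoKV's actual API or data.

```python
from typing import Any

Run = dict[str, Any]

# Hypothetical indexed facts for a handful of runs; all values are made up.
RUNS: list[Run] = [
    {"path": "runs/a1", "status": "completed", "val_loss": 0.42, "lr": 1e-3, "batch_size": 64},
    {"path": "runs/b2", "status": "failed", "val_loss": 0.10, "lr": 1e-2, "batch_size": 32},
    {"path": "runs/c3", "status": "completed", "val_loss": 0.31, "lr": 3e-4, "batch_size": 128},
    {"path": "runs/d4", "status": "completed", "val_loss": 0.55, "lr": 1e-3, "batch_size": 64},
]


def find(runs: list[Run], where: Run, sort_by: str, limit: int, fields: list[str]) -> list[Run]:
    """Filter, sort, limit, and project in one call, so the agent spends one turn."""
    matched = [r for r in runs if all(r.get(k) == v for k, v in where.items())]
    matched.sort(key=lambda r: r[sort_by])
    return [{f: r[f] for f in fields} for r in matched[:limit]]


top = find(
    RUNS,
    where={"status": "completed"},
    sort_by="val_loss",
    limit=2,
    fields=["path", "val_loss", "lr", "batch_size"],
)
for row in top:
    print(row)
```

The failed run has the lowest val_loss in this toy data, so the filter has to run before the sort. Doing both inside one call removes a place where a multi-step agent can go wrong.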

T3 was checkpoint provenance: find which checkpoint files a set of TabDiff sampling runs loaded, plus the model parameter count. Those facts existed only inside stdout. On the namespace surface, recursive grep first found the relevant run directories; those paths then became the cohort handle for scoped searches. On SQL, even with grep_blob, the agent had to resolve params, artifacts, and blob references before searching log bodies. SQL was more reliable here: 100% correct versus NoKV’s 60%. But it paid heavily for that reliability: 84.6k prompt tokens versus 35.8k, and NoKV was still cheaper per correct answer.
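The “cheaper per correct answer” claim is simple arithmetic on the T3 figures above. The sketch uses prompt tokens as the cost proxy, which is our simplifying assumption; the helper name is illustrative.

```python
def tokens_per_correct(prompt_tokens: float, accuracy: float) -> float:
    """Expected prompt tokens spent per correct answer at a given accuracy."""
    return prompt_tokens / accuracy


# T3 figures from the paragraph above: mean prompt tokens and correctness per arm.
sql_t3 = tokens_per_correct(84_600, 1.00)
nokv_t3 = tokens_per_correct(35_800, 0.60)

print(f"SQL:  {sql_t3:,.0f} prompt tokens per correct answer")
print(f"NoKV: {nokv_t3:,.0f} prompt tokens per correct answer")
```

Even after paying for its misses, the namespace arm spends roughly 30% fewer prompt tokens per correct T3 answer. Whether 60% accuracy is acceptable is a separate question that depends on the task.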

T5 was incident triage: for every non-completed run, report status, stderr size, whether stderr contains KeyboardInterrupt, and the line number of the last occurrence. The namespace got the cohort and stderr sizes from one find, then used scoped grep for line-numbered evidence. SQL could answer too, but line numbers are not a native relational projection; recovering them costs extra string work or another tool. Both arms were 100% correct, but NoKV used about 9.6k prompt tokens versus SQL’s 19.2k.
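Line-numbered evidence is cheap to produce when the surface is line-oriented to begin with. The sketch below mimics the T5 report over two made-up stderr bodies; the paths, log contents, and function names are hypothetical and only illustrate the shape of the answer.

```python
from typing import Optional


def last_occurrence(text: str, needle: str) -> Optional[int]:
    """Return the 1-based line number of the last line containing needle, if any."""
    last = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if needle in line:
            last = lineno
    return last


# Hypothetical stderr bodies for two non-completed runs.
STDERR = {
    "runs/b2/stderr.log": (
        "warn: slow dataloader\n"
        "KeyboardInterrupt\n"
        "Traceback (most recent call last):\n"
        "KeyboardInterrupt\n"
    ),
    "runs/e5/stderr.log": "RuntimeError: CUDA out of memory\n",
}


def triage(stderr_by_path: dict[str, str], needle: str) -> dict[str, dict]:
    """Build a per-run report: stderr size, whether needle appears, last line number."""
    report = {}
    for path, text in sorted(stderr_by_path.items()):
        line = last_occurrence(text, needle)
        report[path] = {
            "stderr_bytes": len(text.encode("utf-8")),
            "contains": line is not None,
            "last_line": line,
        }
    return report


report = triage(STDERR, "KeyboardInterrupt")
for path, facts in report.items():
    print(path, facts)
```

A path plus a line number is a citation a human can verify with one command, which is the property the relational projection lacks.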

Across these tasks, the namespace wins compound work for four reasons: paths are cohort handles; recursive search can become scoped search; line numbers are native citations; and push-down keeps tool-call turns short. Every avoided turn avoids another context refill and another billable reasoning step.

Where SQL holds its ground

The point is not that SQL loses. The benchmark shows where it wins: SQL is excellent for single-shot structured analytics, and on the leaderboard-style task it was clearly cheaper, at about 4.8k prompt tokens versus 9.3k. Another task was a statistical tie. That confirms the architecture argument rather than weakening it.

The more useful architecture is two layers:

Bottom layer: keep using databases, object stores, and APIs as the systems of record.
Top layer: maintain metadata, paths, versions, permissions, indexes, and references; expose them to agents as a POSIX-like workspace with path addressing, directory listing, lazy reads, scoped search, permission boundaries, atomic publish, and auditable citations.

The database remains where durable truth lives. The namespace is the agent-facing operating surface in front of it.
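A minimal sketch of the two layers, assuming an in-memory SQLite database as the system of record and a thin path-addressed view on top. The schema, the path layout, and the `ls` and `read` functions are invented for illustration and say nothing about how NoKV is implemented.

```python
import sqlite3

# Bottom layer: the database stays the system of record.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE runs (id TEXT PRIMARY KEY, experiment TEXT, status TEXT);
    CREATE TABLE logs (run_id TEXT, name TEXT, body TEXT);
    INSERT INTO runs VALUES ('a1', 'tabdiff', 'completed'), ('b2', 'tabdiff', 'failed');
    INSERT INTO logs VALUES ('a1', 'stdout.log', 'epoch 1 ok'),
                            ('b2', 'stderr.log', 'KeyboardInterrupt');
""")


def ls(path: str) -> list[str]:
    """Top layer: list children of a path. Paths are derived from rows, not stored."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:  # "/" lists experiments
        rows = db.execute("SELECT DISTINCT experiment FROM runs ORDER BY experiment")
    elif len(parts) == 1:  # "/<experiment>" lists runs
        rows = db.execute("SELECT id FROM runs WHERE experiment = ? ORDER BY id", parts)
    elif len(parts) == 2:  # "/<experiment>/<run>" lists log files
        rows = db.execute("SELECT name FROM logs WHERE run_id = ? ORDER BY name", parts[1:])
    else:
        return []
    return [r[0] for r in rows]


def read(path: str) -> str:
    """Lazily fetch one log body by its path."""
    experiment, run_id, name = path.strip("/").split("/")
    row = db.execute(
        "SELECT l.body FROM logs l JOIN runs r ON r.id = l.run_id "
        "WHERE r.experiment = ? AND l.run_id = ? AND l.name = ?",
        (experiment, run_id, name),
    ).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    return row[0]


print(ls("/"), ls("/tabdiff"), ls("/tabdiff/b2"))
print(read("/tabdiff/b2/stderr.log"))
```

The joins still exist, but they are written once inside the view instead of being recomposed by the agent on every episode.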

Everything here is reproducible: the harness, tasks, judge, full report, and raw telemetry of all 100 episodes are in the repo, so every number above can be recomputed from source.

Part 5 — Which surfaces should the filesystem actually carry?

“Give the agent a filesystem” is not one decision; it is a small set of surfaces. From the benchmark and from building NoKV, three carry most of the value:

Artifact & metadata control. Typed cards (stat/catalog) that tell the agent what fields exist before it queries; find/aggregate with full push-down so ranking and grouping cost one call; grep that returns line-numbered evidence. This is the surface that turned T1 from a 50% coin-flip into a deterministic two-call answer.
Workspace management. A run is a directory; an experiment is a namespace; publishing results is an atomic publish into the tree rather than a row insert the agent can’t inspect. Fused listing keeps discovery one call wide; external body references let logs and checkpoints live in object storage while remaining one read away.
Snapshots and watches. Agents are long-running and concurrent. Snapshot reads give an agent a consistent view of the workspace while training jobs keep writing; watchable updates let an observer agent react to new runs without polling; quotas keep a misbehaving agent from flooding the namespace.
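The atomic publish mentioned under workspace management has a well-known filesystem analogue: stage the output somewhere readers do not look, then rename it into place. The sketch below shows that pattern on a local POSIX filesystem. The function name and directory layout are hypothetical, and a namespace backed by a database or object store would need its own mechanism to get the same guarantee.

```python
import os
import tempfile
from pathlib import Path


def atomic_publish(workspace: Path, name: str, files: dict[str, str]) -> Path:
    """Stage files in a hidden temp directory, then rename it into place.

    On POSIX filesystems a rename within one filesystem is atomic, so readers
    see either no directory or the complete one, never a half-written result.
    """
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=workspace))
    for filename, body in files.items():
        (staging / filename).write_text(body)
    target = workspace / name
    # On POSIX this raises OSError if a non-empty target already exists.
    os.rename(staging, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    published = atomic_publish(workspace, "run-004", {"metrics.json": '{"val_loss": 0.31}'})
    listing = sorted(p.name for p in workspace.iterdir())
    contents = (published / "metrics.json").read_text()

print(listing)
print(contents)
```

Staging inside the workspace directory keeps the rename on a single filesystem, which is what makes it atomic. The dot prefix is a convention for hiding the staging directory from listings that skip hidden entries.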

Note what is not on this list: replacing your database. The namespace is the agent-facing view; the substrate behind it should stay whatever your data already trusts.

Part 6 — Does your system need an agent interface?

Experiment tracking and observability are the first obvious use case, but they are not the only one.

More generally, any agentic system that depends heavily on external artifacts will run into the same interface problem: files, logs, reports, models, contracts, images, videos, checkpoints, datasets, traces, and intermediate outputs. Call these artifact-heavy agentic systems.

Modern ML systems already look like this. They generate runs, traces, logs, model checkpoints, metrics, configs, and artifacts. The questions humans ask — “Which config performed best?”, “Why did this batch fail?”, “Which model did this sampler load?” — usually require exploration across structured metrics, unstructured logs, artifact metadata, and provenance.

Legal, data-analysis, multimedia, engineering, and multi-agent systems have the same shape. A legal agent needs contracts, case law, diligence files, citations, parse status, permissions, and audit trails. An engineering agent needs issues, PRs, CI logs, build artifacts, and deployment records. A multi-agent workflow needs shared outputs, version locks, watches, and rollback.

If these objects are scattered across object-store keys, vector databases, application tables, temporary caches, and workflow state, the agent has to rediscover the workspace every time it acts. It spends tokens asking where the file is, which version is current, which summary points back to which source paragraph, and which derived artifact came from which original.

This is where the three filesystem-shaped surfaces from Part 5 become practical. Artifact and metadata control tells the agent what exists before it reads everything. Workspace management gives runs, documents, outputs, and reports stable addresses. Snapshots and watches let long-running or multi-agent workflows observe change without losing consistency.

That is why filesystem semantics are becoming important again. LLMs have been trained to work well in spaces made of named things, hierarchical scope, local inspection, search, paths, citations, and incremental disclosure. If tokens are a new unit of computation, then interface shape is part of the cost structure. In production agent systems, the model’s limited tokens and attention should be spent on reasoning and action, not on re-locating files, guessing versions, and stitching context back together.

Sources

Letta — Benchmarking AI Agent Memory: Is a Filesystem All You Need? (Aug 2025)
Pekka Enberg — Towards a Disaggregated Agent Filesystem on Object Storage (Jan 2026)
Anthropic — Code execution with MCP: Building more efficient agents (Nov 2025); Agent Skills overview; Introducing Claude Fable 5 and Mythos 5 (Jun 2026)
OpenAI — Tool search guide
Hacker News — FUSE is All You Need – Giving agents access to anything via filesystems (Jan 2026)
Mikiko Bazeley, The New Stack — The “files are all you need” debate misses what’s actually happening in agent memory architecture (Mar 2026)
Linux Foundation — Tokenomicon announcement (Jun 2026)
Jensen Huang — GTC 2025 keynote transcript; GTC 2026 token-economics coverage, RCR Wireless; GTC Taipei 2026 coverage, SiliconANGLE
SaaS market reaction — Yahoo Finance / StockStory (Jun 2026); CNBC on AI and SaaS selloff (Feb 2026)
NoKV — benchmark harness & README · full benchmark report · raw run telemetry (100 runs)