TL;DR — nokv’s metadata layer runs on holt, an embedded engine purpose-built for path-shaped keys (object-store names, filesystem paths). On a fair NVMe benchmark against RocksDB, SQLite, and LMDB at matched memory, holt does an S3-style delimiter rollup in ~131 ns — 13–16× over LMDB and up to ~140× over RocksDB/SQLite, serves hot point reads at ~1 µs (top tier, alongside LMDB), and overwrites at ~0.86 µs (2–13× faster). It is honest about where it doesn’t win, too. Here’s the why.
The workload nobody optimizes for
Object stores and filesystems don’t have random keys. They have paths:
bucket-042/tenant-07/path/sub/file-00831245.bin
/usr/local/share/category-118/file-00002231
Long shared prefixes. Hierarchies. And the single most common control-plane question
isn’t “get key K” — it’s “list everything under this prefix, rolled up at the next
/” (the S3 ListObjectsV2 delimiter query, every ls, every dashboard).
General-purpose KV engines treat that as a range scan plus de-duplication: walk every key under the prefix, strip the suffix, dedupe. nokv’s metadata layer doesn’t, because holt turns the path structure itself into the index.
holt in one paragraph
holt slices an Adaptive Radix Tree across self-describing 512 KB blob frames. Shared path prefixes collapse into the tree (that’s what an ART does); each frame is a walkable mini-index; and a delimiter rollup becomes a fast-forward over sub-prefixes instead of a scan-and-dedupe. Writes go through a lock-free, byte-tiled write-ahead log; reads are lock-free end to end. It’s single-node, embedded, Unix-only — it lives inside nokv, not over a socket.
By the numbers
All figures are single-process on a real NVMe box (AMD Ryzen 7 5825U, 16 threads, 27 GiB RAM, ext4, kernel 6.17), versus RocksDB, SQLite (WAL), and LMDB at matched memory. Path-shaped keys (~50 B), JSON values (~60 B).
The headline: directory rollup (S3 delimiter listing)
| engine | list_dir rollup, 10M keys | vs holt |
|---|---|---|
| holt | 131 ns | — |
| LMDB | 2,040 ns | 16× |
| RocksDB | 17,466 ns | 133× |
| SQLite | 18,589 ns | 142× |
holt rolls up a prefix in ~131 nanoseconds — ~16× over LMDB and 130–142× over RocksDB and SQLite. It fast-forwards over whole sub-trees; the others scan keys and de-duplicate.
Hot point reads — memory speed, top tier
| engine | point read @10M | point read @1M |
|---|---|---|
| holt | 1,267 ns | 550 ns |
| LMDB | 1,001 ns | 581 ns |
| RocksDB | 3,909 ns | 1,446 ns |
| SQLite | 5,217 ns | 3,153 ns |
holt and LMDB are the top tier — both ~3–5× faster than RocksDB and ~4–8× faster than SQLite. At scale this is a pure memory-hierarchy game (more below).
Writes — in-place overwrites fly
| engine | overwrite @1M | overwrite @10M |
|---|---|---|
| holt | 857 ns | 1,548 ns |
| RocksDB | 3,690 ns | 3,309 ns |
| LMDB | 7,322 ns | 10,080 ns |
| SQLite | 10,941 ns | 12,857 ns |
An in-place tree update plus a lock-free WAL append — 2–13× faster than the field.
Concurrency is free queue depth
holt’s read path takes no locks. Point N threads at cold data and throughput rides the device’s queue depth for free:
| threads | cold-read throughput |
|---|---|
| 1 | 1.0× |
| 4 | 3.9× |
| 8 | 6.4× |
| 16 | 11.6× |
Same story on the write side: the byte-tiled WAL ring lets many writers reserve gap-free
ranges with a single fetch_add and group-commit one fdatasync — concurrent durable
writes turn into device queue depth instead of lock contention.
Three ideas that produce those numbers
1. An in-blob routing region. On compaction holt clusters a frame’s internal nodes into a small contiguous region and page-aligns the leaves after it. A cold point read then fetches header + routing region + one leaf page instead of pinning the whole 512 KB frame — a measured ~27× cut in cold read amplification (870 KB → 32 KB).
2. A per-blob Bloom filter. Riding for free at the tail of the routing region, it answers a negative lookup with zero leaf reads — and because it’s built from the stored keys, it can never be a false negative. In our tests negatives are cheaper than positives.
3. Lock-free reads and a byte-tiled WAL. Reads validate a version and never block; the loom-verified WAL ring serializes only on a single atomic. Both convert concurrency into device parallelism rather than CPU contention.
The honest part
We measured where holt doesn’t win, because a metadata engine you trust is one that tells you its edges:
- Single-threaded cold point reads are not holt’s game. holt uses
O_DIRECTwith a self-managed buffer pool (like InnoDB) — deliberately, for predictable caching and no double-buffering — so it never opportunistically rides the OS page cache. When the working set fits RAM, a page-cache engine like LMDB serves cold reads from memory and wins. We checked this is the page cache, not holt: under a strict matched-memory cap (cgroup), LMDB’s cold read slowed 5.5× — most of its apparent lead was free RAM, not structure. - Space: 512 KB frames at ~0.38 fill cost ~3× the bytes-per-key of LMDB and ~5× a compact LSM like RocksDB. A real trade for the read/rollup/write wins.
- Hot reads at scale are DRAM-bound, not algorithm-bound (IPC 0.64, ~38% LLC-miss): the descent is short, but chasing pointers through a multi-GB working set misses cache. That’s the wall, and it’s the same wall every in-memory index hits.
holt’s wins are hot reads, rollups, and writes for path-shaped metadata — exactly nokv’s metadata workload. We didn’t pick the benchmark to flatter it; we picked the workload nokv actually runs.
What this means for nokv
- Listings that feel instant. Prefix rollups at ~100 ns scale to dashboards and
ListObjectsfan-out without a caching tier in front. - Predictable reads under load. Lock-free reads mean p99s that hold as concurrency climbs, not a lock-contention cliff.
- Write headroom. Group-commit durability that amortizes
fsyncacross writers. - One process, no sidecar. holt is embedded in nokv — no extra service, no RPC hop on the metadata path.
Methodology: holt 0.7.0; RocksDB (direct-I/O, matched block cache), SQLite (WAL,
matched cache), LMDB (mmap) at matched memory; cold runs drop the page cache before
timing; read amplification measured via /proc/self/io. Full tables, the matched-memory
fairness control, and the negative results are in the holt evaluation write-up.