← All posts
Benchmarks & Comparisons

List a directory in 131 nanoseconds: holt, the metadata engine inside nokv

TL;DR — nokv’s metadata layer runs on holt, an embedded engine purpose-built for path-shaped keys (object-store names, filesystem paths). On a fair NVMe benchmark against RocksDB, SQLite, and LMDB at matched memory, holt does an S3-style delimiter rollup in ~131 ns — 13–16× over LMDB and up to ~140× over RocksDB/SQLite, serves hot point reads at ~1 µs (top tier, alongside LMDB), and overwrites at ~0.86 µs (2–13× faster). It is honest about where it doesn’t win, too. Here’s the why.


The workload nobody optimizes for

Object stores and filesystems don’t have random keys. They have paths:

bucket-042/tenant-07/path/sub/file-00831245.bin
/usr/local/share/category-118/file-00002231

Long shared prefixes. Hierarchies. And the single most common control-plane question isn’t “get key K” — it’s “list everything under this prefix, rolled up at the next / (the S3 ListObjectsV2 delimiter query, every ls, every dashboard).

General-purpose KV engines treat that as a range scan plus de-duplication: walk every key under the prefix, strip the suffix, dedupe. nokv’s metadata layer doesn’t, because holt turns the path structure itself into the index.

holt in one paragraph

holt slices an Adaptive Radix Tree across self-describing 512 KB blob frames. Shared path prefixes collapse into the tree (that’s what an ART does); each frame is a walkable mini-index; and a delimiter rollup becomes a fast-forward over sub-prefixes instead of a scan-and-dedupe. Writes go through a lock-free, byte-tiled write-ahead log; reads are lock-free end to end. It’s single-node, embedded, Unix-only — it lives inside nokv, not over a socket.


By the numbers

All figures are single-process on a real NVMe box (AMD Ryzen 7 5825U, 16 threads, 27 GiB RAM, ext4, kernel 6.17), versus RocksDB, SQLite (WAL), and LMDB at matched memory. Path-shaped keys (~50 B), JSON values (~60 B).

The headline: directory rollup (S3 delimiter listing)

enginelist_dir rollup, 10M keysvs holt
holt131 ns
LMDB2,040 ns16×
RocksDB17,466 ns133×
SQLite18,589 ns142×

holt rolls up a prefix in ~131 nanoseconds — ~16× over LMDB and 130–142× over RocksDB and SQLite. It fast-forwards over whole sub-trees; the others scan keys and de-duplicate.

Hot point reads — memory speed, top tier

enginepoint read @10Mpoint read @1M
holt1,267 ns550 ns
LMDB1,001 ns581 ns
RocksDB3,909 ns1,446 ns
SQLite5,217 ns3,153 ns

holt and LMDB are the top tier — both ~3–5× faster than RocksDB and ~4–8× faster than SQLite. At scale this is a pure memory-hierarchy game (more below).

Writes — in-place overwrites fly

engineoverwrite @1Moverwrite @10M
holt857 ns1,548 ns
RocksDB3,690 ns3,309 ns
LMDB7,322 ns10,080 ns
SQLite10,941 ns12,857 ns

An in-place tree update plus a lock-free WAL append — 2–13× faster than the field.

Concurrency is free queue depth

holt’s read path takes no locks. Point N threads at cold data and throughput rides the device’s queue depth for free:

threadscold-read throughput
11.0×
43.9×
86.4×
1611.6×

Same story on the write side: the byte-tiled WAL ring lets many writers reserve gap-free ranges with a single fetch_add and group-commit one fdatasync — concurrent durable writes turn into device queue depth instead of lock contention.


Three ideas that produce those numbers

1. An in-blob routing region. On compaction holt clusters a frame’s internal nodes into a small contiguous region and page-aligns the leaves after it. A cold point read then fetches header + routing region + one leaf page instead of pinning the whole 512 KB frame — a measured ~27× cut in cold read amplification (870 KB → 32 KB).

2. A per-blob Bloom filter. Riding for free at the tail of the routing region, it answers a negative lookup with zero leaf reads — and because it’s built from the stored keys, it can never be a false negative. In our tests negatives are cheaper than positives.

3. Lock-free reads and a byte-tiled WAL. Reads validate a version and never block; the loom-verified WAL ring serializes only on a single atomic. Both convert concurrency into device parallelism rather than CPU contention.


The honest part

We measured where holt doesn’t win, because a metadata engine you trust is one that tells you its edges:

  • Single-threaded cold point reads are not holt’s game. holt uses O_DIRECT with a self-managed buffer pool (like InnoDB) — deliberately, for predictable caching and no double-buffering — so it never opportunistically rides the OS page cache. When the working set fits RAM, a page-cache engine like LMDB serves cold reads from memory and wins. We checked this is the page cache, not holt: under a strict matched-memory cap (cgroup), LMDB’s cold read slowed 5.5× — most of its apparent lead was free RAM, not structure.
  • Space: 512 KB frames at ~0.38 fill cost ~3× the bytes-per-key of LMDB and ~5× a compact LSM like RocksDB. A real trade for the read/rollup/write wins.
  • Hot reads at scale are DRAM-bound, not algorithm-bound (IPC 0.64, ~38% LLC-miss): the descent is short, but chasing pointers through a multi-GB working set misses cache. That’s the wall, and it’s the same wall every in-memory index hits.

holt’s wins are hot reads, rollups, and writes for path-shaped metadata — exactly nokv’s metadata workload. We didn’t pick the benchmark to flatter it; we picked the workload nokv actually runs.


What this means for nokv

  • Listings that feel instant. Prefix rollups at ~100 ns scale to dashboards and ListObjects fan-out without a caching tier in front.
  • Predictable reads under load. Lock-free reads mean p99s that hold as concurrency climbs, not a lock-contention cliff.
  • Write headroom. Group-commit durability that amortizes fsync across writers.
  • One process, no sidecar. holt is embedded in nokv — no extra service, no RPC hop on the metadata path.

Methodology: holt 0.7.0; RocksDB (direct-I/O, matched block cache), SQLite (WAL, matched cache), LMDB (mmap) at matched memory; cold runs drop the page cache before timing; read amplification measured via /proc/self/io. Full tables, the matched-memory fairness control, and the negative results are in the holt evaluation write-up.