Architecture — NoKV Docs

NoKV is a Rust-first filesystem for AI training and agent workspaces. The repository is intentionally product-shaped: metadata semantics, object body storage, clients, FUSE, docs, and examples live at the repository root instead of behind a nested workspace.

The implemented tree is the Rust client/server filesystem slice: FUSE and the SDK talk to nokv-server, which commits semantic metadata commands into an embedded Holt MVCC engine on a single node. Distributed metadata is not implemented — the planned direction (subtree sharding + owner-lease + epoch fencing, not consensus-replicated metadata) is described under Distributed Direction. The broader product target — production metadata HA, CSI, Python/fsspec, and node-local cache — is recorded in Product Design.

Layers

Application surface
  nokv-client    Rust SDK
  nokv           CLI
  nokv-fuse      low-level FUSE frontend
  nokv-python    planned Python/fsspec bindings
  nokv-csi       planned Kubernetes CSI integration

Metadata layer
  nokv-types     mount, inode, dentry, body descriptor, watch event types
  nokv-protocol  metadata RPC wire DTOs
  nokv-meta      schema, MetadataCommand, Holt store, service core
  nokv-server    long-running metad process, RPC, health, and control plane

Body storage layer
  nokv-object    S3-compatible object storage, including RustFS

Write Path

Write path — object bytes upload first; one metadata command publishes the namespace entry.

For artifact publication, object bytes are uploaded first. The metadata commit then publishes the dentry, inode projection, and body descriptor atomically. Failed metadata publish leaves staged objects for later garbage collection.

nokv-server runs the same local nokv-meta service in a long-lived process. It owns health, readiness, stats, manual GC endpoints, and the first metadata RPC. The SDK hot path uses a length-prefixed framed RPC on the same port; HTTP stays limited to health, stats, and manual GC control. The RPC supports both inode/name operations and path-oriented SDK operations, so server-side path resolution can avoid multi-round-trip nested creates. It also supports ordered non-atomic batches: each subrequest has its own result/error, but the batch removes per-operation network round trips for SDK workloads. The Rust SDK has a metadata client for namespace operations and an object-backed file client that uploads object blocks directly, asks metad to atomically publish the body manifest, opens a native layout plan through OpenPathReadPlan, and reads object ranges directly from the configured object store. ReadBodyPlan remains available for generation-scoped follow-up reads and prefetch. Read plans carry immutable block keys, range offsets, and digest_uri, so the data path can choose local hot-tier reads or S3-compatible object reads without adding placement truth to metadata. The server stats endpoint reports metadata-store write attribution counters so benchmark runs can distinguish current writes, history writes, watch writes, and dedupe writes. The FUSE frontend uses the same metadata client/server boundary as the SDK.

FUSE Path

The current FUSE frontend is inode-first. It maps kernel lookup, getattr, readdir, open, and read calls to metad inode APIs and object-store range reads. It does not resolve paths through the Rust SDK and does not own metadata semantics. Live mounts register observed directory scopes with the metadata watch log and translate typed watch events into FUSE inval_entry and inval_inode notifications. Snapshot mounts are read-only and do not start the invalidation worker.

Metadata Layout

The canonical model is inode/dentry, described in Metadata Schema:

inode_current:
  mount_id | inode_id -> inode attributes

dentry_current:
  mount_id | parent_inode | name -> dentry + inode projection

chunk_manifest_current:
  mount_id | inode_id | generation | u64::MAX -> body summary
  mount_id | inode_id | generation | chunk_index -> block manifest

history:
  family | user_key_len | user_key | inverted_commit_version -> old value

Path indexes are derived accelerators for artifact and checkpoint fast paths; they are not namespace truth.

Data Fabric And Object Storage

NoKV stores file bodies outside the metadata service. File bytes are split into immutable object blocks and published through metadata manifests. The first production body backend is S3-compatible storage. RustFS, MinIO, Ceph RGW, and AWS S3 all use the same object-store boundary. See Object Layout.

The metadata manifest is the durable truth for block identity and cold storage: inode, generation, logical offsets, block digest, and S3-compatible object key. It must not record node-local NVMe paths or cache slots. Those belong to the data path as soft placement state.

The planned hot path is layered behind the same immutable block contract:

layout lease -> block descriptors -> data fabric
  -> local NVMe hot tier
  -> S3-compatible cold durable tier

That boundary keeps local placement out of metadata semantics. A hot-tier read can miss or fail; the S3-compatible object key remains the durable fallback.

The current nokv-object data-fabric skeleton provides LocalObjectStore for a node-local hot tier, TieredObjectStore for hot-first/cold-fallback reads, ObjectStore::get_many for batched block fetches, and resolve_block_placements for soft local-vs-object placement decisions. The local hot tier rebuilds its residency index from disk on open, can enforce a configured byte cap with LRU eviction, and reports resident bytes, evictions, and admission rejections. Cold-read hot fills can run inline or in the background; background fills coalesce duplicate in-flight object keys. LayoutReadExecutor consumes the metadata layout-open plan through the existing read pipeline, records transport counters, and preserves the batch/coalescing contract. Its batch layout path combines blocks from multiple read plans before calling the object store, so adjacent ranges and multi-sample reads can share one get_many call. The SDK range path sits above this layer and only coalesces logical offsets within one immutable file generation; metadata still sees normal layout opens, and the object layer still sees immutable block descriptors and durable object keys.

Metadata Disaster Recovery

File bodies are durable in the object store, but the namespace that gives them meaning — inodes, dentries, versions, and CoW relationships — lives in the local Holt engine. Losing that node would lose the namespace even though every object survives. To close that single point of total loss, the metadata engine is periodically archived to the same object store.

A background worker exports a Holt checkpoint image and publishes it under a configurable object-key prefix (--metadata-checkpoint-archive-prefix, on by default; disable with --no-metadata-checkpoint-archive). Publication mirrors the body write path — object-first, pointer-second:

1. checkpoint image  -> {prefix}/ckpt/{seq}.image     (object-first)
2. CURRENT manifest  -> {prefix}/CURRENT              (atomic pointer swap)
3. prune checkpoints older than the retained window   (after the swap)

The single CURRENT object names the live checkpoint and the retained-checkpoint window, so retention works without an object list. A crash between steps 1 and 2 leaves an orphan checkpoint object (reclaimed on a later backup), never a manifest that points at a missing checkpoint.

Recovery runs on a replacement node with an empty metadata directory:

nokv restore        # GET CURRENT -> GET checkpoint -> install into a fresh store
nokv serve          # resume serving the recovered namespace

restore installs the checkpoint into a fresh Holt store (which must be empty — a checkpoint install cannot merge into a populated store) and rehydrates the allocator, so the recovered node both serves the prior namespace and accepts new writes. nokv backup triggers an out-of-band archive on a running server, and /stats reports the worker’s metadata_backup state. The recovery-point objective is the worker interval; the bodies were always safe in the object store.

Consistency Checking

nokv fsck verifies the live namespace against the object store: it walks every live file at its current body generation and confirms each referenced block still exists (head). This is the read-side complement to the object-first write ordering — the ordering guarantees metadata never references a missing object, and fsck detects any drift after the fact (an out-of-band deletion, an eventual-consistency anomaly in external storage, or a latent bug), reporting each dangling reference as (inode, generation, object_key). Superseded and snapshot-pinned generations are not mistaken for drift (the scan uses each inode’s current body generation), and a clone’s borrowed block keys resolve against the source objects that still exist. Reclaiming the opposite drift — orphan objects written but never referenced — is a planned extension that needs an object-store list.

Distributed Direction

Status: not implemented. Today NoKV is a single-node metadata service — one embedded Holt MVCC engine owns the entire namespace. There is no replication, no consensus group, and no nokv-cluster crate. This section records the planned direction, not shipped behavior.

The planned distributed layer is deliberately not consensus-replicated metadata (that would double-log against Holt’s own MVCC and erase the embedded-engine advantage) and not a mandatory external transactional KV. The direction is subtree sharding + owner-lease + epoch fencing:

Shard by subtree. Every key is already mount-prefixed and dentries are parent-clustered, so a subtree maps to contiguous key ranges with no key-format change. One single-owner Holt engine serves each shard. Because all N shards of one checkpoint live under one subtree, the common atomic publish stays a single-shard, single-engine transaction — no cross-shard commit on the hot path.
A small control group grants leases and holds only the shard map. A 3–5 node consensus group (the only consensus in the system) replicates a kilobyte-scale routing table {range → owner, epoch, lease, image_pointer} and grants / renews / revokes owner-leases. It never replicates the metadata log — the metadata truth stays single-owner in each shard’s Holt.
Epoch fencing. Each shard carries a monotonic ownership epoch (the allocator record already persists one, recovered with fetch_max). On owner change the control group bumps the epoch and a deposed owner’s commits are rejected at the durability boundary. Wiring this epoch into an actual commit predicate is a prerequisite tracked separately — it is stored today but not yet enforced.
Failover reuses the DR path. The metadata backup/restore mechanism (see Metadata Disaster Recovery, above) — export a Holt checkpoint image to object storage, install it on a fresh node — is also the shard-handoff primitive. A new owner restores the shard image, replays the WAL tail, and takes the new epoch. Zero-loss failover additionally requires per-epoch WAL-tail streaming, allocation-independent request IDs for dedupe, and an atomic install-into-live primitive in Holt — none of which exist yet.

Cross-shard atomic operations (a rename that straddles shards) are out of the v1 contract; the hot path never needs them.

The data fabric is separate from this ownership protocol. NVMe residency is cache state keyed by immutable block identity; it does not decide namespace visibility and does not participate in MetadataCommand atomicity.