Object Layout

NoKV stores file bodies outside the metadata service. Metadata stores compact body descriptors; durability of bytes is delegated to the configured object store.

Chunk Layout

Files are split into chunks, and each chunk is stored as immutable object blocks:

file inode
  -> body descriptor
  -> chunk manifests
  -> object blocks

Default sizes:

chunk_size = 64 MiB
block_size = 4 MiB
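
With the defaults above, each chunk holds 16 blocks. The sketch below maps a byte offset to a chunk index, a block index, and an offset inside the block. It assumes fixed-size chunks and blocks, zero-based indices, and block indices that restart in every chunk; the `locate` helper is hypothetical and not part of NoKV.

```python
MIB = 1024 * 1024
CHUNK_SIZE = 64 * MIB  # default chunk_size from this page
BLOCK_SIZE = 4 * MIB   # default block_size from this page


def locate(offset, chunk_size=CHUNK_SIZE, block_size=BLOCK_SIZE):
    """Return (chunk_index, block_index, offset_in_block) for a byte offset.

    Hypothetical helper: assumes zero-based indices and that block
    indices restart at 0 inside every chunk.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    chunk_index, offset_in_chunk = divmod(offset, chunk_size)
    block_index, offset_in_block = divmod(offset_in_chunk, block_size)
    return chunk_index, block_index, offset_in_block


blocks_per_chunk = CHUNK_SIZE // BLOCK_SIZE
print(blocks_per_chunk)      # 16
print(locate(70 * MIB + 5))  # (1, 1, 2097157)
```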

Object block keys are generated by the metadata service:

blocks/<mount>/<inode>/<generation>/<chunk>/<block>

Blocks are never modified in place. A replace or overwrite creates a new inode generation and atomically publishes a new manifest in metadata.
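
As an illustration of the key template, the sketch below builds a block key from its five components and shows that an overwrite, which bumps the generation, yields a different key rather than reusing the old one. The `block_key` helper and the plain decimal rendering of each component are assumptions; the actual encoding used by the metadata service is not specified here.

```python
def block_key(mount, inode, generation, chunk, block):
    """Build a key following blocks/<mount>/<inode>/<generation>/<chunk>/<block>.

    Hypothetical helper: the component encoding (plain decimal, no
    padding) is an assumption made for this example.
    """
    if "/" in str(mount):
        raise ValueError("mount must not contain '/'")
    return f"blocks/{mount}/{inode}/{generation}/{chunk}/{block}"


old_key = block_key("vol0", inode=42, generation=3, chunk=1, block=7)
new_key = block_key("vol0", inode=42, generation=4, chunk=1, block=7)
print(old_key)  # blocks/vol0/42/3/1/7
print(new_key)  # blocks/vol0/42/4/1/7
```

Because the generation is part of the key, the old block stays readable under its original key until it is garbage collected.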

Body Descriptor

producer
digest_uri
size
content_type
generation
manifest_id
chunk_size
block_size

manifest_id is provider-neutral and stable for the artifact publish request. It is not the physical object key. Physical object keys are derived from mount, inode, generation, chunk, and block.

digest_uri is a compact integrity summary. SDK artifact uploads normally use sha256:<content-digest>. Chunk block entries use xxh3-64:<block-checksum> so the write hot path avoids a cryptographic digest per block. FUSE write sessions use manifest-sha256:<manifest-digest> so publish remains proportional to the changed chunk/slice metadata instead of rereading or hashing the whole file body.
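
The three digest_uri forms share a `<scheme>:<value>` shape. The sketch below builds the sha256 form with the Python standard library and splits any of the three forms into scheme and value. The helper names and the lowercase hex encoding of the value are assumptions; xxh3-64 is only parsed here, not computed, because it is not in the standard library.

```python
import hashlib

# Schemes named on this page.
KNOWN_SCHEMES = ("sha256", "xxh3-64", "manifest-sha256")


def sha256_digest_uri(data: bytes) -> str:
    """Return 'sha256:<hex digest>' (hex encoding is an assumption)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def parse_digest_uri(uri: str):
    """Split a digest_uri into (scheme, value), rejecting unknown schemes."""
    scheme, sep, value = uri.partition(":")
    if not sep or not value or scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported digest_uri: {uri!r}")
    return scheme, value


uri = sha256_digest_uri(b"hello")
scheme, value = parse_digest_uri(uri)
print(scheme, len(value))  # sha256 64
```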

The same object boundary works for AWS S3, RustFS, MinIO, and Ceph RGW.

Use --object-backend rustfs for a local RustFS deployment, or --object-backend s3 for any other S3-compatible provider. See RustFS Backend for the local RustFS setup.

Publish Rule

Artifact publish is staged:

upload object bytes
  -> split into blocks and PUT immutable objects
  -> commit inode + dentry projection + body summary + chunk manifests
  -> expose namespace entry

If the object upload succeeds but the metadata publish fails, the uploaded blocks remain staged in the object store and are not reachable from the namespace. The caller can pass the staged object set to the explicit cleanup helper to delete them.
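
The staged publish and its failure path can be sketched with two in-memory dictionaries standing in for the object store and the namespace. Everything here is hypothetical (the `publish` and `cleanup_staged` functions, the `staged/...` key format, and the `commit_ok` switch that simulates a metadata failure); it only illustrates that blocks are written first and become reachable only after the metadata commit.

```python
class PublishError(Exception):
    """Raised when the simulated metadata commit fails."""

    def __init__(self, staged_keys):
        super().__init__("metadata publish failed")
        self.staged_keys = staged_keys


def publish(objects, namespace, path, data, block_size, commit_ok=True):
    """Stage blocks first; only the metadata commit exposes the path."""
    staged = []
    for start in range(0, len(data), block_size):
        key = f"staged/{path}/{start // block_size}"  # illustrative key format
        objects[key] = data[start:start + block_size]  # PUT immutable object
        staged.append(key)
    if not commit_ok:  # simulated metadata failure
        raise PublishError(staged)
    namespace[path] = staged  # commit: the entry becomes visible
    return staged


def cleanup_staged(objects, staged_keys):
    """Delete staged objects that never became reachable."""
    for key in staged_keys:
        objects.pop(key, None)


objects, namespace = {}, {}
try:
    publish(objects, namespace, "a.bin", b"x" * 10, block_size=4, commit_ok=False)
except PublishError as err:
    # Three blocks are staged, but no namespace entry points at them.
    cleanup_staged(objects, err.staged_keys)
print(len(objects), namespace)  # 0 {}
```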

If a metadata remove or replace succeeds, the old body objects are written into a durable metadata GC queue in the same metadata commit that removes namespace reachability. The current local service exposes an explicit cleanup API and a background object GC worker; live FUSE mounts start the worker by default.

Active snapshot pins conservatively block object cleanup so that snapshot-version artifact reads can still fetch the old blocks. Retiring the snapshot lets later cleanup consume the queued records.

Each queued object record also stores its enqueue time. The background worker applies a read-lease grace window before it deletes a queued object, so recently returned read plans have time to finish their object-range reads. The explicit cleanup API can still run with a zero grace window for tests and manual recovery.
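
A minimal sketch of one GC pass over the queue, under simplifying assumptions: any active snapshot pin blocks all cleanup, the grace window is measured from the record's enqueue time, and time is passed in explicitly. The `QueuedObject` record and the `collect` function are hypothetical names, not the service's API.

```python
from dataclasses import dataclass


@dataclass
class QueuedObject:
    object_key: str
    enqueued_at: float  # seconds; set when the record enters the GC queue


def collect(queue, objects, now, grace_seconds, active_snapshot_pins):
    """Run one GC pass and return the records that stay queued.

    Assumption: any active snapshot pin blocks all deletions.
    """
    if active_snapshot_pins:
        return list(queue)
    remaining = []
    for record in queue:
        if now - record.enqueued_at >= grace_seconds:
            objects.pop(record.object_key, None)  # grace window elapsed
        else:
            remaining.append(record)  # recent readers may still be fetching
    return remaining


objects = {"a": b"1", "b": b"2"}
queue = [QueuedObject("a", 100.0), QueuedObject("b", 125.0)]

queue = collect(queue, objects, now=130.0, grace_seconds=10.0, active_snapshot_pins={7})
print(sorted(objects))  # ['a', 'b']  (a pin blocks cleanup)

queue = collect(queue, objects, now=130.0, grace_seconds=10.0, active_snapshot_pins=set())
print(sorted(objects))  # ['b']  ('b' is still inside the grace window)

queue = collect(queue, objects, now=130.0, grace_seconds=0.0, active_snapshot_pins=set())
print(sorted(objects))  # []  (zero grace, as in tests or manual recovery)
```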

Metadata history uses the same retention boundary. Active snapshot pins define the oldest read version that must remain reconstructible. History cleanup keeps the per-key anchor needed by that oldest snapshot and removes older versions; when no snapshot pins remain, history cleanup may remove all historical records. Live FUSE mounts start the history GC worker alongside object GC.
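
The anchor rule can be illustrated for a single key. In this sketch the historical records of a key are modelled as an ascending list of commit versions, and the anchor is the newest version at or below the oldest pinned read version. The `prune_history` function and this data model are assumptions made for the example.

```python
def prune_history(versions, oldest_pinned):
    """Return the historical versions of one key that must be kept.

    versions: ascending commit versions of the key's historical records.
    oldest_pinned: oldest active snapshot read version, or None if no
    snapshot pins remain. Hypothetical model for illustration only.
    """
    if oldest_pinned is None:
        return []  # nothing pinned: all history may go
    visible = [v for v in versions if v <= oldest_pinned]
    if not visible:
        return list(versions)  # key did not exist yet at the pinned version
    anchor = max(visible)  # the version the oldest snapshot actually reads
    return [v for v in versions if v >= anchor]


print(prune_history([1, 4, 7, 9], oldest_pinned=5))     # [4, 7, 9]
print(prune_history([1, 4, 7, 9], oldest_pinned=None))  # []
```

In the first call, version 4 is the anchor: a snapshot reading at version 5 sees it, so it stays, while version 1 can be removed.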

Chunk Manifest

Each chunk_manifest record stores the slice stack for one logical chunk. Newer slices overlay older slices, which lets partial writes publish only dirty blocks while reusing the previous generation’s unchanged blocks:

chunk_index
logical_offset
len
slices:
  slice_id
  logical_offset
  len
  blocks:
    object_key
    logical_offset
    object_offset
    len
    digest_uri
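
To illustrate how newer slices overlay older ones, the sketch below resolves a slice stack into the byte ranges each slice contributes to the chunk. It assumes slices are listed oldest first and are described only by `(slice_id, logical_offset, len)`; the `visible_segments` function is a hypothetical helper, not the reader's actual algorithm.

```python
def visible_segments(slices, chunk_len):
    """Resolve a slice stack into (start, end, slice_id) segments.

    slices: list of (slice_id, logical_offset, length), oldest first
    (the ordering is an assumption). Newer slices win on overlap.
    """
    uncovered = [(0, chunk_len)]  # ranges not yet claimed by a newer slice
    segments = []
    for slice_id, offset, length in reversed(slices):  # newest first
        start, end = offset, offset + length
        still_uncovered = []
        for lo, hi in uncovered:
            s, e = max(lo, start), min(hi, end)
            if s < e:
                segments.append((s, e, slice_id))
                if lo < s:
                    still_uncovered.append((lo, s))
                if e < hi:
                    still_uncovered.append((e, hi))
            else:
                still_uncovered.append((lo, hi))
        uncovered = still_uncovered
    return sorted(segments)


# A full-chunk slice followed by a partial overwrite of bytes 20..50.
stack = [("s1", 0, 100), ("s2", 20, 30)]
print(visible_segments(stack, chunk_len=100))
# [(0, 20, 's1'), (20, 50, 's2'), (50, 100, 's1')]
```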

Readers construct a range read plan from the manifests, fetch object ranges, and assemble the requested file range. The first cache layer is a read-through block cache keyed by object range.
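
A read-through cache keyed by object range can be sketched as follows. The plan is modelled as an ordered list of `(object_key, object_offset, len)` entries, and the object store is a dictionary with a slicing fetch function; `RangeCache` and `read_range` are hypothetical names, and the sketch has no eviction or size limit.

```python
class RangeCache:
    """Read-through cache keyed by (object_key, offset, length)."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable(object_key, offset, length) -> bytes
        self._entries = {}
        self.misses = 0

    def read(self, object_key, offset, length):
        key = (object_key, offset, length)
        if key not in self._entries:
            self.misses += 1  # miss: fetch from the object store
            self._entries[key] = self._fetch(object_key, offset, length)
        return self._entries[key]


def read_range(plan, cache):
    """Assemble bytes from an ordered plan of object ranges."""
    return b"".join(cache.read(key, off, n) for key, off, n in plan)


store = {"k0": b"abcdefgh", "k1": b"ijklmnop"}
cache = RangeCache(lambda key, off, n: store[key][off:off + n])

plan = [("k0", 4, 4), ("k1", 0, 2)]  # tail of one block, head of the next
print(read_range(plan, cache))  # b'efghij'
print(read_range(plan, cache))  # b'efghij' (served from the cache)
print(cache.misses)             # 2
```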