Object Layout
How file bodies become immutable object blocks: chunk layout, body descriptors, the staged publish rule, and snapshot-aware GC.
NoKV stores file bodies outside the metadata service. Metadata stores compact body descriptors; durability of bytes is delegated to the configured object store.
Chunk Layout
Files are split into immutable object blocks:
file inode
  -> body descriptor
    -> chunk manifests
      -> object blocks
Default sizes:
chunk_size = 64 MiB
block_size = 4 MiB
Object block keys are generated by the metadata service:
blocks/<mount>/<inode>/<generation>/<chunk>/<block>
Blocks are never modified in place. A replace or overwrite creates a new inode generation and atomically publishes a new manifest in metadata.
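As a rough illustration of the layout above, the sketch below maps a byte offset to a chunk and block index using the default sizes and formats a key in the documented pattern. It assumes blocks are aligned to block_size within each chunk (as in a freshly written file) and that key fields are plain decimal values; the function names `locate` and `block_key` are hypothetical and not part of NoKV.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # default chunk_size from this page
BLOCK_SIZE = 4 * 1024 * 1024   # default block_size from this page


def locate(offset: int, chunk_size: int = CHUNK_SIZE, block_size: int = BLOCK_SIZE):
    """Return (chunk_index, block_index_within_chunk, offset_within_block).

    Assumes fixed, aligned blocks; real manifests may describe other shapes.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    chunk_index, in_chunk = divmod(offset, chunk_size)
    block_index, in_block = divmod(in_chunk, block_size)
    return chunk_index, block_index, in_block


def block_key(mount: str, inode: int, generation: int, chunk: int, block: int) -> str:
    """Format a key in the documented pattern (decimal encoding is an assumption)."""
    return f"blocks/{mount}/{inode}/{generation}/{chunk}/{block}"
```

With the defaults, one chunk holds 16 blocks, so an offset of 68 MiB + 5 bytes falls in chunk 1, block 1, at byte 5 of that block.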
Body Descriptor
producer
digest_uri
size
content_type
generation
manifest_id
chunk_size
block_size
manifest_id is provider-neutral and stable for the artifact publish request.
It is not the physical object key. Physical object keys are derived from mount,
inode, generation, chunk, and block.
digest_uri is a compact integrity summary. SDK artifact uploads normally use
sha256:<content-digest>. Chunk block entries use xxh3-64:<block-checksum> so
the write hot path avoids a cryptographic digest per block. FUSE write sessions
use manifest-sha256:<manifest-digest> so publish remains proportional to the
changed chunk/slice metadata instead of rereading or hashing the whole file body.
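The three digest_uri forms share a `<scheme>:<value>` shape. A minimal sketch of building and splitting such values follows; it assumes the digest value is hex-encoded, which this page does not state, and the helper names are hypothetical. Only the SHA-256 form is computed here because xxh3 is not in the Python standard library.

```python
import hashlib

# Schemes named on this page.
KNOWN_SCHEMES = ("sha256", "xxh3-64", "manifest-sha256")


def sha256_digest_uri(body: bytes) -> str:
    """Build a sha256:<content-digest> value (hex encoding is an assumption)."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


def split_digest_uri(uri: str):
    """Split a digest_uri into (scheme, value), rejecting unknown schemes."""
    scheme, sep, value = uri.partition(":")
    if not sep or scheme not in KNOWN_SCHEMES or not value:
        raise ValueError(f"unrecognized digest_uri: {uri!r}")
    return scheme, value
```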
The same object boundary works for AWS S3, RustFS, MinIO, and Ceph RGW.
Use --object-backend rustfs for a local RustFS deployment or
--object-backend s3 for another S3-compatible provider. See
RustFS Backend for the local RustFS shape.
Publish Rule
Artifact publish is staged:
upload object bytes
-> split into blocks and PUT immutable objects
-> commit inode + dentry projection + body summary + chunk manifests
-> expose namespace entry
If object upload succeeds and metadata publish fails, the object is staged but not reachable from the namespace. The caller can pass the staged object set to the explicit cleanup helper.
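The staged ordering can be sketched with two in-memory dictionaries standing in for the object store and the metadata service. Everything here is hypothetical (`publish`, `cleanup_staged`, the `fail_commit` switch); it only illustrates that objects are written first, the namespace entry appears in one metadata commit, and a failed commit leaves unreachable staged objects the caller can clean up.

```python
class MetadataCommitError(Exception):
    """Stand-in for a failed metadata publish."""


def publish(object_store: dict, metadata: dict, path: str, body: bytes,
            block_size: int, key_prefix: str, fail_commit: bool = False):
    """Return (entry, staged_keys). staged_keys is non-empty only on failure."""
    staged = []
    # Split into blocks and PUT immutable objects before touching metadata.
    for i in range(0, len(body), block_size):
        key = f"{key_prefix}/{i // block_size}"
        object_store[key] = body[i:i + block_size]
        staged.append(key)
    try:
        if fail_commit:
            raise MetadataCommitError("simulated commit failure")
        # One metadata commit exposes the namespace entry.
        metadata[path] = {"size": len(body), "blocks": list(staged)}
    except MetadataCommitError:
        # Objects exist, but nothing in the namespace references them.
        return None, staged
    return metadata[path], []


def cleanup_staged(object_store: dict, staged: list) -> int:
    """Delete staged objects; return how many were actually removed."""
    removed = 0
    for key in staged:
        if object_store.pop(key, None) is not None:
            removed += 1
    return removed
```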
If metadata remove or replace succeeds, the old body objects are written into a durable metadata GC queue in the same metadata commit that removes namespace reachability. The current local service exposes an explicit cleanup API and a background object GC worker; live FUSE mounts start the worker by default. Active snapshot pins conservatively block object cleanup so snapshot-version artifact reads can still fetch the old blocks. Retiring the snapshot lets later cleanup consume the queued records.
Each queued object record also stores its enqueue time. The background worker applies a read-lease grace window before it deletes a queued object, so recently returned read plans have time to finish their object-range reads. The explicit cleanup API can still run with a zero grace window for tests and manual recovery.
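One way to picture a GC pass over the queue is sketched below. It assumes the conservative rule means that any active snapshot pin blocks all object deletion, and that the grace window is compared against each record's enqueue time; `QueuedObject` and `collect` are hypothetical names, not the service's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedObject:
    key: str
    enqueued_at: float  # seconds; recorded when the metadata commit queued it


def collect(queue, object_store, now, grace_seconds, active_snapshot_pins):
    """Delete eligible queued objects; return the records still queued."""
    if active_snapshot_pins:
        # Assumed conservative rule: any active pin blocks all cleanup.
        return list(queue)
    remaining = []
    for rec in queue:
        if now - rec.enqueued_at >= grace_seconds:
            # Old enough that in-flight read plans should have finished.
            object_store.pop(rec.key, None)
        else:
            remaining.append(rec)
    return remaining
```

Passing `grace_seconds=0` models the explicit cleanup path used for tests and manual recovery.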
Metadata history uses the same retention boundary. Active snapshot pins define the oldest read version that must remain reconstructible. History cleanup keeps the per-key anchor needed by that oldest snapshot and removes older versions; when no snapshot pins remain, history cleanup may remove all historical records. Live FUSE mounts start the history GC worker alongside object GC.
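The retention rule for one key can be sketched as follows, assuming versions are comparable integers and the anchor is the newest historical version at or below the oldest pinned read version. The function name `prune_history` and this reading of "anchor" are assumptions, not the service's implementation.

```python
def prune_history(versions, oldest_pinned):
    """Return the historical versions of one key that must be kept.

    versions: historical commit versions for the key (live value excluded).
    oldest_pinned: oldest pinned read version, or None when no pins remain.
    """
    if oldest_pinned is None:
        # No snapshot pins: all historical records may be removed.
        return []
    visible = [v for v in versions if v <= oldest_pinned]
    if not visible:
        # Nothing predates the oldest snapshot, so nothing is removable.
        return list(versions)
    anchor = max(visible)  # the version the oldest snapshot would read
    return [v for v in versions if v >= anchor]
```

For versions 1, 3, 5, 8 and an oldest pin at 6, the anchor is 5, so versions 1 and 3 can be dropped while 5 and 8 are retained.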
Chunk Manifest
Each chunk_manifest record stores the slice stack for one logical chunk.
Newer slices overlay older slices, which lets partial writes publish only dirty
blocks while reusing the previous generation’s unchanged blocks:
chunk_index
logical_offset
len
slices:
  slice_id
  logical_offset
  len
  blocks:
    object_key
    logical_offset
    object_offset
    len
    digest_uri
Readers construct a range read plan from the manifests, fetch object ranges, and assemble the requested file range. The first cache layer is a read-through block cache keyed by object range.
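A minimal sketch of building a range read plan from a slice stack is shown below. It assumes slices are ordered oldest first, that blocks within one slice do not overlap, and that logical offsets share one coordinate space; `Block` mirrors the block fields listed above, while `plan_read` is a hypothetical helper. Ranges that no slice covers are simply left out of the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    object_key: str
    logical_offset: int
    object_offset: int
    len: int


def plan_read(slices, start, length):
    """slices: list of block lists, oldest first.

    Returns sorted (logical_offset, object_key, object_offset, len) tuples.
    """
    uncovered = [(start, start + length)]
    plan = []
    for blocks in reversed(slices):  # newest slice wins
        for blk in blocks:
            b_lo, b_hi = blk.logical_offset, blk.logical_offset + blk.len
            next_uncovered = []
            for lo, hi in uncovered:
                o_lo, o_hi = max(lo, b_lo), min(hi, b_hi)
                if o_lo >= o_hi:
                    next_uncovered.append((lo, hi))  # no overlap
                    continue
                plan.append((o_lo, blk.object_key,
                             blk.object_offset + (o_lo - b_lo), o_hi - o_lo))
                if lo < o_lo:
                    next_uncovered.append((lo, o_lo))
                if o_hi < hi:
                    next_uncovered.append((o_hi, hi))
            uncovered = next_uncovered
    return sorted(plan)
```

Each tuple in the result corresponds to one object-range fetch, which is also a natural key for a read-through block cache.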