AI Training
NoKV targets training clusters where file bodies live in object storage and metadata needs to be fast, typed, and easy to mount.
Target Workloads
- immutable training datasets;
- manifest-heavy dataset directories;
- checkpoint publish and resume;
- experiment artifacts;
- agent workspace inputs and outputs.
Access Paths
FUSE gives immediate compatibility with tools that expect paths. The Rust SDK is the lower-overhead path for native jobs and future Python bindings.
PyTorch / training process
-> FUSE or SDK
-> nokvfs-meta
-> Holt metadata
-> S3-compatible object storeThe current FUSE frontend maps inode operations to metadata lookups plus object range reads, and uses buffered write-on-close publishing for writes. Read-only snapshot mounts expose pinned input subtrees without allowing mutation.
FUSE should not be the only high-performance path. Training frameworks that can use a native client should bypass kernel/FUSE overhead and call the Rust or Python API directly:
compatibility path:
existing tool -> FUSE -> metadata/object service
performance path:
dataloader/checkpoint writer -> SDK/fsspec -> metadata/object serviceCache Direction
Training jobs should cache attributes, dentries, negative lookups, and object range reads locally. Cache invalidation should come from typed watch events rather than raw key notifications.
The target cluster deployment should run a node-local cache agent on GPU or training nodes. It can prefetch dataset shards, keep hot object ranges on local NVMe, and subscribe to metadata watch events for invalidation.