How do they leverage the characteristics of NVM, e.g. non-volatility and byte-addressability?
- For write operations, it allocates storage at the granularity of the requested operation instead of in whole blocks.
- I don’t have another one for this question…
- It combines RDMA with NVM to speed up I/O.
🤗 Assise tries to leverage the fast reads and writes of NVM by integrating it into a distributed file system, which improves I/O performance.
Assise allocates NVM storage at a dynamic granularity instead of in fixed-size blocks, as a traditional block storage device would.
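A small sketch of what request-granularity allocation saves. The 4 KiB block size and both helper names are my assumptions for illustration, and metadata overhead is ignored.

```python
import math

BLOCK_SIZE = 4096  # assumed block size in bytes; not a number from the paper


def block_allocated(nbytes: int) -> int:
    """Bytes consumed when every write is rounded up to whole blocks."""
    return math.ceil(nbytes / BLOCK_SIZE) * BLOCK_SIZE


def byte_allocated(nbytes: int) -> int:
    """Bytes consumed when storage is allocated at request granularity."""
    return nbytes


if __name__ == "__main__":
    for size in (100, 4096, 5000):
        print(size, block_allocated(size), byte_allocated(size))
```

Under these assumptions a 100-byte write costs a full 4096-byte block with block allocation, but only 100 bytes at request granularity.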
A write operation has 2 stages:
- libfs writes directly to a process-local cache in NVM.
- To outlive node failures, the update log is replicated to the reserved NVM of the next cache replica along the replication chain.
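The two stages above can be sketched roughly as follows. This is a toy model, not Assise's code: `Node`, `LogEntry`, `write` and `replicate` are hypothetical names, and a Python list stands in for the NVM-resident update log.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    seq: int      # position in the local update log
    path: str
    offset: int
    data: bytes   # exactly the bytes written, no block padding


@dataclass
class Node:
    name: str
    log: list = field(default_factory=list)  # stand-in for an NVM update log


def write(local: Node, path: str, offset: int, data: bytes) -> LogEntry:
    """Stage 1: append the update to the process-local log."""
    entry = LogEntry(len(local.log), path, offset, data)
    local.log.append(entry)
    return entry


def replicate(local: Node, chain: list) -> None:
    """Stage 2: forward missing entries hop by hop along the chain, in order."""
    prev = local
    for replica in chain:
        replica.log.extend(prev.log[len(replica.log):])
        prev = replica


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    write(a, "/f", 0, b"hello")
    write(a, "/f", 5, b"!")
    replicate(a, [b, c])
    print([e.seq for e in c.log])
```

Until `replicate` runs, the update only exists on the local node, so only the second stage protects against a node failure.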
For a read operation:
- libfs first checks the local cache.
- If not found, it checks the closest cache replica of the corresponding subtree.
- If not found, it checks the reserve replica and, in parallel, cold storage.
Reads from remote nodes are cached in local DRAM.
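The lookup order above, as a minimal sketch. Each tier is just a dict here, the function and tier names are hypothetical, and the reserve replica and cold storage are checked one after the other rather than in parallel to keep the example short. Treating the local DRAM cache as part of the "local cache" check is also my assumption.

```python
def read(path, local_nvm, local_dram, cache_replica, reserve_replica, cold_storage):
    """Return (data, tier_name) for the first tier that holds `path`."""
    # Local tiers: no network involved.
    for name, tier in (("local-nvm", local_nvm), ("local-dram", local_dram)):
        if path in tier:
            return tier[path], name
    # Remote tiers, closest first. A remote hit is cached in local DRAM.
    remote = (
        ("cache-replica", cache_replica),
        ("reserve-replica", reserve_replica),
        ("cold-storage", cold_storage),
    )
    for name, tier in remote:
        if path in tier:
            local_dram[path] = tier[path]
            return tier[path], name
    raise FileNotFoundError(path)


if __name__ == "__main__":
    dram = {}
    tiers = ({"/a": b"1"}, dram, {"/b": b"2"}, {"/c": b"3"}, {"/d": b"4"})
    print(read("/b", *tiers))  # first read is served remotely
    print(read("/b", *tiers))  # second read is served from local DRAM
```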
How to maintain cache coherence?
CC-NVM tracks write order via an update log in process-local NVM. Each POSIX call that updates state is recorded. CC-NVM leverages the ordering guarantee of RDMA to write the log to the replicas in order.
CC-NVM serializes concurrent access to shared state by untrusted libfs instances via leases, and recovers the same serialization after a crash. (Leases work much like a reader-writer lock.)
To localize coherence enforcement, leases are delegated hierarchically. A libfs requests leases from its local sharedfs, and the sharedfs may in turn need to forward the request to the root cluster manager. The cluster manager recycles or expires leases every 5 seconds.
This allows CC-NVM to migrate lease management to the sharedfs that is local to the libfs instances requesting the leases.
The hierarchical structure lets CC-NVM minimize network communication and lease delegation overhead.
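A toy model of why the hierarchy saves network traffic. All class and method names are hypothetical, only exclusive (writer) leases are modelled, and the 5-second expiry is left out. The point is that once a sharedfs holds a lease, further requests from its local libfs instances never reach the cluster manager.

```python
class ClusterManager:
    """Root of the hierarchy: delegates each path to one sharedfs."""

    def __init__(self):
        self.holder = {}    # path -> name of the sharedfs holding the lease
        self.requests = 0   # counts requests that crossed the network

    def acquire(self, sharedfs_name, path):
        self.requests += 1
        return self.holder.setdefault(path, sharedfs_name) == sharedfs_name


class SharedFS:
    """Per-node daemon: hands delegated leases to local libfs instances."""

    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.delegated = set()  # paths this sharedfs already holds
        self.granted = {}       # path -> libfs id currently holding it

    def acquire(self, libfs_id, path):
        if path not in self.delegated:
            # Only the first request for a path goes to the cluster manager.
            if not self.manager.acquire(self.name, path):
                return False
            self.delegated.add(path)
        return self.granted.setdefault(path, libfs_id) == libfs_id

    def release(self, libfs_id, path):
        if self.granted.get(path) == libfs_id:
            del self.granted[path]


if __name__ == "__main__":
    mgr = ClusterManager()
    s1 = SharedFS("s1", mgr)
    s1.acquire("p1", "/a")
    s1.release("p1", "/a")
    s1.acquire("p2", "/a")
    print(mgr.requests)  # the second acquire was handled locally
```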
crash recovery and fail-over
libfs crash: the local sharedfs evicts the dead libfs’s update log, recovering all completed writes, and then expires its leases.
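A rough sketch of that recovery step. The `complete` flag is my assumption about how a fully persisted log record might be told apart from a torn one, and the names are hypothetical. Dicts stand in for the shared file system state and the lease table.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    inode: int
    data: bytes
    complete: bool  # assumed marker for a fully persisted record


def recover_libfs(log, shared_state, leases, libfs_id):
    """Evict a dead libfs's log into shared state, then expire its leases."""
    for rec in log:
        if not rec.complete:
            break  # stop at the first incomplete record to keep log order
        shared_state[rec.inode] = rec.data
    log.clear()
    for path in [p for p, owner in leases.items() if owner == libfs_id]:
        del leases[path]


if __name__ == "__main__":
    log = [LogRecord(1, b"x", True), LogRecord(2, b"y", True), LogRecord(3, b"z", False)]
    state, leases = {}, {"/a": "dead", "/b": "alive"}
    recover_libfs(log, state, leases, "dead")
    print(state, leases)
```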
OS crash: Assise uses NVM to dramatically accelerate the OS reboot by storing a checkpoint of a freshly booted OS. By examining the sharedfs log stored in NVM, it can then initiate recovery for all previously running libfs instances.
cache replica fail-over
To avoid waiting for node recovery after a power failure, Assise immediately fails over to a cache replica. Writes to the file system can invalidate cached data of the failed node. To track these writes, the cluster manager maintains an epoch number, which it increments on node failure and recovery. All sharedfs instances share a per-epoch bitmap in a sparse file that indicates which inodes have been written during each epoch.
When a node crashes, the cluster manager makes sure that all of the node’s leases expire before the node can rejoin. A recovering sharedfs contacts an online sharedfs to collect the relevant epoch bitmaps. It then invalidates every block of every file that has been written since its crash.
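The epoch tracking can be sketched like this. `EpochTracker` and its methods are hypothetical names, and a Python set per epoch stands in for the per-epoch bitmap in the sparse file.

```python
class EpochTracker:
    """Tracks which inodes were written during each epoch."""

    def __init__(self):
        self.epoch = 0
        self.written = {0: set()}  # epoch -> inodes written in that epoch

    def bump(self):
        """Called on node failure and on node recovery."""
        self.epoch += 1
        self.written[self.epoch] = set()

    def record_write(self, inode):
        self.written[self.epoch].add(inode)

    def stale_inodes(self, since_epoch):
        """Inodes a recovering node must invalidate: written in any epoch >= since_epoch."""
        stale = set()
        for epoch, inodes in self.written.items():
            if epoch >= since_epoch:
                stale |= inodes
        return stale


if __name__ == "__main__":
    t = EpochTracker()
    t.record_write(1)      # written while every node was up
    t.bump()               # a node crashes: epoch 0 -> 1
    crash_epoch = t.epoch
    t.record_write(2)
    t.record_write(3)
    t.bump()               # the node recovers: epoch 1 -> 2
    print(sorted(t.stale_inodes(crash_epoch)))
```

Inode 1 was written before the crash, so the recovering node's cached copy of it is still valid and only the later writes are invalidated.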
- Hierarchical lease management localizes lease acquisition, which reduces network overhead.
- It fails over quickly to a reserve replica.
- It allocates storage at the granularity of each request.
- The cluster manager may crash, which makes it a single point of failure.
- I don’t think this paper fully leverages the characteristics of NVM, because little of the system design is specific to NVM, especially in the crash consistency part of CC-NVM.