Persistent Memory (PM) and Remote Direct Memory Access (RDMA) technologies have significantly improved storage and network performance in data centers and spawned a slew of distributed file system (DFS) designs. Existing DFSs often treat remote storage as a performance constraint, assuming it delivers lower bandwidth and higher latency than local storage. However, advances in RDMA technology provide an opportunity to bridge the performance gap between local and remote access, enabling DFSs to leverage both local and remote PM bandwidth and achieve higher overall throughput.
We propose Conflux, a new DFS architecture that leverages the aggregated bandwidth of PM and RDMA networks. Conflux dynamically steers I/O requests to local and remote PM to fully utilize PM and RDMA bandwidth under heavy workloads. To decide the I/O destination adaptively, we propose SEIOD, a learning-based policy engine that predicts Conflux I/O latency and makes decisions in real time. Furthermore, Conflux adopts a lightweight concurrency control approach to improve its scalability.
Here we present the installation process of Conflux.
We need a cluster of at least two machines, each with real persistent memory and an RDMA network interface card available.
To our knowledge, the only commercially available PM device is the Intel(R) Optane NVDIMM.
We use ipmctl (see documentation here) to provision Optane PM in App Direct mode, with the command:
ipmctl create -goal PersistentMemoryType=AppDirect
We use ndctl (see documentation here) to create partitioned PM namespaces.
Specifically, we need devdax mode for the Conflux shared pool and client cache. For example:
ndctl create-namespace -t pmem -m devdax -s 128G
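After provisioning, the result can be verified; the commands below are a sketch, and exact output and device names (e.g. /dev/dax0.0) depend on the platform. Note that the ipmctl goal above only takes effect after a reboot.

```shell
# Show the interleaved App Direct regions created by the goal above
ipmctl show -region
# List namespaces; the one created above should report mode "devdax"
ndctl list -N
# The resulting character device (e.g. /dev/dax0.0) is what Conflux consumes
ls /dev/dax*
```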
For those who want to run Conflux without real PM devices, we recommend QEMU (documentation here) for emulation.
Alternatively, as a bare-bones experimental environment, a shared-memory file can be used as the memory pool (with no guarantee of meaningful results).
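As a sketch of that last option, the pool can be backed by a file on the tmpfs mount at /dev/shm; the path and size below are illustrative, not part of Conflux:

```shell
# Hypothetical pool path; Conflux would be pointed at this file instead of
# a devdax device.
POOL=/dev/shm/conflux_pool
# Create a sparse 256 MiB backing file (real deployments use a 128 GB
# devdax namespace; the size here is deliberately small).
truncate -s 256M "$POOL"
stat -c '%n %s' "$POOL"
```

Because tmpfs is volatile DRAM, nothing here survives a reboot, hence the caveat about meaningless results for any persistence-related measurement.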
For the RDMA NIC, we use a Mellanox ConnectX-6 InfiniBand NIC.
The InfiniBand driver installation should follow the MLNX_OFED documentation (see here).
Note that if NFS over RDMA is to be tested, MLNX_OFED should be installed with the --with-nfsrdma option.
Successful installation of the RDMA NIC driver can be verified with the ibstatus command.
To set up the IP addresses of the InfiniBand NICs, use ifconfig with root privileges.
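For example (the interface name ib0 and the address are illustrative; check `ip link` for the actual IPoIB interface name on each machine):

```shell
sudo ifconfig ib0 10.0.0.1 netmask 255.255.255.0 up
```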
To check inter-node RDMA latency and bandwidth, use the ib_write_lat and ib_write_bw tools.
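Both tools follow the usual perftest client/server pattern; the address below is the server's IPoIB address and is illustrative:

```shell
# On the server node, start the listener:
ib_write_lat
# On the client node, connect to the server:
ib_write_lat 10.0.0.1
# The same pattern applies to bandwidth measurement:
ib_write_bw            # server side
ib_write_bw 10.0.0.1   # client side
```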
There are three software utilities needed by Conflux:
- PMDK, for low-level PM operations
- syscall_intercept, for user-level system call interception
- TensorFlow Lite, for a lightweight neural network model API in C code
For PMDK installation, see the official documentation here.
For syscall_intercept, there is a GitHub project within the Intel Persistent Memory Programming group (see GitHub link here).
For TensorFlow Lite, we follow the build and installation guides in the official TensorFlow documentation here. Moreover, we recommend that researchers with little TensorFlow experience look at the examples on the homepage and in the GitHub repo.
With all the hardware and software requirements satisfied, the build process of Conflux is straightforward.
For Conflux server, go to ./Conflux-Center and run:
make backgr && make owner
For Conflux client, go to ./Conflux-User and run:
make aiagent && make conflux_call
Moreover, we need a background process running neural network training. See ./Conflux-User/PyTfcode for details.
For Octopus, see the github repo here.
For Assise, see the github repo here.
For NFS over RDMA, see Chapter 6 of this NFS white paper (here) for details.
Here we introduce the experiments we have conducted on Conflux, including information on the Fio and Filebench benchmarks.
The current implementation of Conflux needs a static cluster configuration, i.e., a fixed number of server/client nodes and their IP address list. See the ./Conflux-Scripts/.conflux_config file for an example; corresponding changes may be needed in that file to test on other clusters.
The first benchmark we use is Fio, a flexible I/O tester. See the documentation page for details about this benchmark and the GitHub repo for installation guides.
Use fio --version to check the installation of Fio.
The version we use for experiments is 3.28-47.
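A sketch of a Fio invocation against a Conflux mount point (the /mnt/conflux path, job name, and sizes are illustrative, not Conflux defaults):

```shell
fio --name=randwrite --directory=/mnt/conflux \
    --rw=randwrite --bs=4k --size=1G --numjobs=4 \
    --ioengine=sync --group_reporting
```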
The second benchmark tool we use is Filebench, a flexible framework for file system benchmarking. See the paper for details about Filebench and the GitHub repo for installation guides.
Use filebench -h to check the installation of Filebench.
The version we use for experiments is 1.5-alpha3.
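A sketch of a Filebench run (the workload path below is the typical `make install` location; the workload's target directory should first be edited to point at the Conflux mount). Filebench itself warns that address-space layout randomization should be disabled before a run:

```shell
# Disable ASLR, as recommended by Filebench
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# Run a stock workload shipped with Filebench
filebench -f /usr/local/share/filebench/workloads/fileserver.f
```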
We have provided an example bash script for Conflux test in ./Conflux-Scripts/AlloverTest.sh.
It acts as a controller from outside the Conflux cluster (i.e., it runs on neither the server nor the client nodes).
Note that the username and IP addresses should be correctly set in that script.
Moreover, the CPU mask (for taskset) used in the script should be consistent with the NUMA node of the PM and RDMA devices.
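A sketch of how the mask can be chosen: query the NUMA node of the RDMA device via sysfs, then pin the Conflux processes to CPUs on that node with taskset (device names and the CPU list are machine-specific; the final line pins a trivial command as a stand-in for the Conflux server binary):

```shell
# NUMA node of each InfiniBand device (prints -1 on single-node machines)
cat /sys/class/infiniband/*/device/numa_node 2>/dev/null || true
# Pin a command to CPU 0; in the real script the Conflux binaries go here
taskset -c 0 sh -c 'echo pinned'
```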