Can we test sequential reads and writes using elbencho?
karanveersingh5623 opened this issue · 9 comments
Is there any parameter for sequential IO, like --rand for random IO?
Hi @karanveersingh5623,
when you don't provide a parameter that explicitly requests non-sequential IO (such as --rand or --backward), elbencho always does sequential IO. Is this what you mean?
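For example, the difference between the default sequential access and --rand might look like this (hypothetical file path and illustrative sizes, not a recommendation):
elbencho -w -b 1M -t 16 -s 10g /mnt/test/file1    # sequential write to create the test file
elbencho -r -b 1M -t 16 -s 10g /mnt/test/file1    # sequential read (the default)
elbencho -r -b 4k -t 16 --rand -s 10g /mnt/test/file1    # random read, for comparison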
@breuner, thanks for the reply. I have one more query.
How can I run the Storage Sweep benchmark? Is Zettar required?
The intent is that we are building a POC of Gen5 NVMe SSDs for HPC clusters, which can reach up to 13 GB/s reads per NVMe SSD.
Also, I want to know which elbencho parameters can help increase read/write throughput.
Hi @karanveersingh5623 , you can run the Storage Sweep benchmark on any file system, independent of Zettar. But Zettar kindly contributed the Storage Sweep to elbencho. You can find the readme doc here:
https://github.com/breuner/elbencho/tree/master/contrib/storage_sweep#readme
( @fangchin in case you want to add something)
Generally regarding elbencho parameters in such a case, you will likely want to bypass caching by using elbencho's --direct parameter. You will also want a reasonably high number of threads (-t) and I/O depth (--iodepth) to get to full throughput and IOPS. I assume you will want to start by testing performance on the raw NVMe device first (which also works with elbencho, e.g. elbencho -w --direct -t 64 -b 4k --iodepth 16 /dev/nvme0n1, or the corresponding actual device path as the elbencho parameter). And when you're seeing the expected IOPS and throughput values there, you likely want to move on to installing a file system and seeing how that translates the performance numbers.
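The corresponding read test on the raw device could then look like this (same hypothetical device path and parameter values as the write example above):
elbencho -r --direct -t 64 -b 4k --iodepth 16 /dev/nvme0n1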
Hi @karanveersingh5623, generally speaking, when testing NVMe devices, you can test the raw devices (i.e. without a file system) or the devices layered with a file system. The storage sweep that Zettar contributed to elbencho only handles the 2nd scenario. In addition, it handles only writes, as this is a far more important use case for what we do. The README says more about how we were motivated to create the tool (due to a project with U.S. DOE ESnet). If we have a project in the future that demands a read test, we may enhance the tool. You are of course welcome to contribute too.
@breuner, thanks for explaining clearly. I tested elbencho with a raw PCIe Gen4 device on a Gen4 Intel server and I am able to get the required throughput.
But our main focus is on scenario 2, as we will later also test with a real-time workload like MLPerf CosmoFlow.
I have BeeGFS built on a single server with 2 NVMe PCIe Gen4 devices, each giving 7 GB/s reads and 4 GB/s writes.
But when I try the below command with different block sizes, I/O depths, or thread counts, read throughput is not able to go beyond 2 GB/s per device.
elbencho -r -b 4096K -t 88 --direct -s 20g /mnt/beegfs/file{1..4}
I am trying various block size combinations like 128K, 4M, 64M, or 2G...
So, your elbencho benchmark of raw devices generated the numbers you had expected, right?
Any file system carries overhead, and thus the benchmark numbers will be lower than those for the raw device. Typically, based on what I have seen, a local file system takes away ~15%, depending on your file system configuration. XFS, EXT4, ZFS, etc. are all in this category.
Distributed file systems (BeeGFS, Lustre, IBM Spectrum Scale, Quobyte, etc.) have even more overhead; I have seen overheads of as much as 70% or so. Again, it depends on your setup and tuning. There is no hard and fast rule of thumb other than that distributed file systems impose more overhead.
This is the reason why you may wish to automate such testing, emulating the elbencho storage sweep tools. Covering the many parameters manually, or even semi-manually, is counterproductive.
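As a rough illustration of what such automation could look like (a minimal sketch only, with a hypothetical file path and illustrative parameter values; the contributed storage sweep scripts are far more complete):
elbencho -w --direct -t 16 -b 4M -s 20g /mnt/beegfs/sweepfile    # create the test file once
for bs in 128K 1M 4M 16M 64M; do                                 # sweep read block sizes
    echo "=== read test, block size $bs ==="
    elbencho -r --direct -t 16 --iodepth 16 -b "$bs" -s 20g /mnt/beegfs/sweepfile
done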
@fangchin, thanks for the reply. I was watching your YouTube video; very informative.
I am also working on how storage impacts CNNs/DNNs and how NVMe devices can help with faster data pre-processing and model training.
I have a query: taking the below command as an example, how can I get "/var/local/zettar/sweep"?
I don't have access to the Zettar documentation either.
# graph_sweep.sh -s /var/local/zettar/sweep -b 16 -o /var/tmp/full/1 -p -v
You can change the path to something different. The command is just an example, not meant to be used literally.
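For instance, you could point it at a directory on your BeeGFS mount instead (hypothetical directories; the flags are simply the ones from the command above):
mkdir -p /mnt/beegfs/sweep /var/tmp/full/1
graph_sweep.sh -s /mnt/beegfs/sweep -b 16 -o /var/tmp/full/1 -p -v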
Hi @karanveersingh5623,
I'm closing this issue under the assumption that your question has been answered. If not, then of course please feel free to reopen this issue or to open a new one.