acozzette/BUSE

Not Good I/O Performance of /dev/nbdx

jgetfun opened this issue · 7 comments

Great job, guys! I have used BUSE to export a userspace block device (an SPDK-defined block device) into kernel space, and it was easy to set up. But the I/O performance of the new block device (/dev/nbd0) is not good.

fio -> /dev/nbd0 <--BUSE --> a SPDK block device
...... (kernel space).................(user space)
...... (reading 200MB/s) ........ (reading 1GB/s)

As the diagram above shows, I get about 1GB/s of bandwidth when reading the userspace block device directly, but only about 200MB/s from /dev/nbd0. Could you give me some advice on how to preserve bandwidth when exporting a userspace block device into kernel space (as a /dev/xxx)? Thanks a lot!

Be careful when comparing disk I/O measured in user space and in kernel space. The kernel does not have disk caching or readahead, whereas userspace is more likely to have it. When testing I/O performance, make sure to use DIRECT access. That way you are assessing the performance of the block device itself.

@bandi13 Thanks for your comment. I already used direct access to measure the I/O performance of these block devices, whether they were in user space or kernel space. I think the low performance of /dev/nbd0 comes from the BUSE mechanism, because it has to copy data (via buffers) between user and kernel space too many times when servicing a system call (read or write).
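To make the copies described above concrete, here is a minimal sketch of how a userspace NBD server might service one read request over its socket. This is not BUSE's actual code: the function name `serve_one_read`, the `SKETCH_` constants, and the in-memory `backing` array are assumptions for illustration. The header sizes and magic numbers follow the classic NBD protocol (28-byte request, 16-byte simple reply).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Classic NBD wire constants; names are local to this sketch. */
#define SKETCH_NBD_REQUEST_MAGIC 0x25609513u
#define SKETCH_NBD_REPLY_MAGIC   0x67446698u
#define SKETCH_NBD_CMD_READ      0u

static uint32_t load_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p) {
    return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

static void store_be32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Receive exactly n bytes; returns 0 on success, -1 on error or EOF. */
static int recv_full(int sk, void *buf, size_t n) {
    unsigned char *p = buf;
    while (n > 0) {
        ssize_t r = recv(sk, p, n, 0);
        if (r <= 0) return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Send exactly n bytes; returns 0 on success, -1 on error. */
static int send_full(int sk, const void *buf, size_t n) {
    const unsigned char *p = buf;
    while (n > 0) {
        ssize_t r = send(sk, p, n, 0);
        if (r <= 0) return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Service a single NBD read request arriving on socket sk. */
int serve_one_read(int sk, const unsigned char *backing, size_t backing_len) {
    unsigned char req[28], rep[16];
    assert(backing != NULL);

    /* Syscall 1: the request header is copied from kernel to user space. */
    if (recv_full(sk, req, sizeof req) != 0) return -1;
    if (load_be32(req) != SKETCH_NBD_REQUEST_MAGIC) return -1;
    if ((load_be32(req + 4) & 0xffffu) != SKETCH_NBD_CMD_READ) return -1;

    uint64_t from = load_be64(req + 16);
    uint32_t len = load_be32(req + 24);
    if (from > backing_len || len > backing_len - from) return -1;

    store_be32(rep, SKETCH_NBD_REPLY_MAGIC);
    store_be32(rep + 4, 0);          /* error = 0 */
    memcpy(rep + 8, req + 8, 8);     /* echo the request handle */

    /* Syscall 2: the reply header is copied from user to kernel space. */
    if (send_full(sk, rep, sizeof rep) != 0) return -1;
    /* Syscall 3: the payload is copied into the socket buffer; the kernel
     * NBD driver then copies it again into the pages of the block request. */
    return send_full(sk, backing + from, len);
}
```

Even in this stripped-down form, every 4k read costs at least three system calls and two copies of the payload on top of whatever the backing device does, which is consistent with the overhead suspected above.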

What I would do is simplify your setup and just use a block of memory (like tmpfs), and do your fio test on it, then attach the nbd layer and redo the test to see the performance difference. You may be right that the malloc and free of all the buffers may not be efficient (though it shouldn't cause a 5x performance loss), but how would you do it otherwise? Have your own local memory pool?
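As a rough illustration of the "local memory pool" idea, the sketch below keeps a small stack of preallocated buffers and only falls back to `malloc` when the pool is empty or the request is larger than the pooled size. The names (`buf_pool`, `buf_pool_get`, `buf_pool_put`) are hypothetical and not part of BUSE, and the pool is not thread-safe as written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct buf_pool {
    size_t buf_size;     /* minimum size of every pooled buffer */
    size_t capacity;     /* maximum number of cached buffers */
    size_t free_count;   /* buffers currently available */
    void **free_list;    /* stack of available buffers */
};

/* Free all cached buffers. Buffers still held by callers are not tracked. */
void buf_pool_destroy(struct buf_pool *p) {
    while (p->free_count > 0)
        free(p->free_list[--p->free_count]);
    free(p->free_list);
    p->free_list = NULL;
}

/* Preallocate `capacity` buffers of `buf_size` bytes; 0 on success. */
int buf_pool_init(struct buf_pool *p, size_t buf_size, size_t capacity) {
    assert(buf_size > 0 && capacity > 0);
    p->buf_size = buf_size;
    p->capacity = capacity;
    p->free_count = 0;
    p->free_list = malloc(capacity * sizeof *p->free_list);
    if (p->free_list == NULL) return -1;
    for (size_t i = 0; i < capacity; i++) {
        void *b = malloc(buf_size);
        if (b == NULL) {
            buf_pool_destroy(p);
            return -1;
        }
        p->free_list[p->free_count++] = b;
    }
    return 0;
}

/* Get a buffer of at least `len` bytes, reusing a pooled one if possible. */
void *buf_pool_get(struct buf_pool *p, size_t len) {
    if (len <= p->buf_size && p->free_count > 0)
        return p->free_list[--p->free_count];
    /* Pool empty or request too large: allocate at least buf_size bytes
     * so the buffer can safely be cached by buf_pool_put later. */
    return malloc(len > p->buf_size ? len : p->buf_size);
}

/* Return a buffer; it is cached if there is room, otherwise freed. */
void buf_pool_put(struct buf_pool *p, void *buf) {
    if (buf == NULL) return;
    if (p->free_count < p->capacity)
        p->free_list[p->free_count++] = buf;
    else
        free(buf);
}
```

A request loop would call `buf_pool_get` where it currently calls `malloc` and `buf_pool_put` where it calls `free`. Whether this recovers a meaningful share of the lost bandwidth would have to be measured; allocation is only one part of the per-request cost.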

@bandi13 Thanks for your comment.
To measure the performance loss of nbd, I created a zero device in kernel space with dmsetup (/dev/dmzero), and exported an SPDK-defined userspace zero device (spdk-zero) to kernel space through BUSE (/dev/nbd0). Then I used fio (with the SPDK fio plugin) to measure the read bandwidth of these three devices. The results are as follows:

Device name                       Random read bandwidth
/dev/dmzero                       882MB/s
spdk-zero                         49947.70 MB/s
/dev/nbd0 (based on spdk-zero)    343MB/s

The fio configuration file template is as follows:

[global]
ioengine=libaio
group_reporting
time_based
thread=1

[task1]
filename=/dev/dmzero
numjobs=1
iodepth=128
runtime=30
rw=randread
bs=4k
direct=1

I think the performance loss of BUSE/nbd for high-performance devices is worth noting. However, for now I don't have a good idea of how to export a high-performance device into kernel space (as a /dev/xxx). BUSE may not be a good choice for me, and I may have to write my own kernel module to do this.

Could you try your experiment with 'ioengine=sync'? With your test, you're issuing 128 concurrent reads, which the kernel can reorder based on how they are laid out on disk to read them more efficiently, whereas with NBD the requests are served in the order the asynchronous calls are handled.

Also, it looks like SPDK can use NBD directly to expose the disk (see here). What are you trying to test with BUSE? SPDK is essentially doing the same thing as BUSE.

Hi @bandi13, thanks very much for your comment. I had nearly forgotten that SPDK supports NBD directly. SPDK NBD provides about 600MB/s of random read bandwidth (based on an SPDK zero device, tested with fio + libaio), which is better than the result of my BUSE implementation.

The sync engine is also useful: it helps the dm-zero device achieve more than 1GB/s of random read bandwidth with a single thread, and this result is scalable. But the sync engine also reduces the I/O performance of NBD (based on spdk-zero) from 600MB/s (SPDK NBD) or 300MB/s (my BUSE implementation) to a poor 150MB/s.

Besides, NBD doesn't scale: I can't get better performance by increasing the number of worker threads, whether I use SPDK NBD or BUSE. That doesn't meet my needs.

So thanks again for your comments. I think there is no way to export a high-performance device from SPDK into the kernel except by writing my own special kernel driver.

I mean that the bandwidth of dm-zero (or spdk-zero) scales linearly as threads are added, but the bandwidth of NBD does not.