ikwzm/udmabuf

Read()/Write() performance

Hello-FPGA opened this issue · 8 comments

Hi, I am using this driver on the Xilinx ZCU102 evaluation board and expect high-speed DMA transfers from PL to PS memory. As a first step, I am testing the performance of copying memory from the PS reserved DMA space to user space.

The following is the normal DMA data read/write flow in my user application; a simplified code sketch follows the list.

0. Open udmabuf0: int hDma = open("/dev/udmabuf0", O_RDWR); // | O_SYNC
1. Allocate a page-aligned buffer in user space: int *pUserBuf = (int *)_aligned_malloc(0x100000, 4096);
2. Set the offset with lseek(): lseek(hDma, offset, SEEK_SET);
3. Start the DMA operation and check its status (I skip this step here);
4. Read()/write() the hDma device: ssize_t actualBytesRead = read(hDma, (void *)pUserBuf, 0x100000);
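
Put together, the test path looks roughly like this (a simplified sketch: error handling and the DMA start/status check are omitted, and aligned_alloc() stands in for _aligned_malloc() on Linux):

	/* Simplified sketch of the read() test path described above. */
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define BUF_SIZE 0x100000  /* 1 MiB per transfer */

	int main(void)
	{
	    int hDma = open("/dev/udmabuf0", O_RDWR);       /* optionally | O_SYNC */
	    int *pUserBuf = aligned_alloc(4096, BUF_SIZE);  /* page-aligned user buffer */

	    lseek(hDma, 0, SEEK_SET);                       /* offset into the DMA buffer */
	    ssize_t actualBytesRead = read(hDma, pUserBuf, BUF_SIZE);

	    free(pUserBuf);
	    close(hDma);
	    return (actualBytesRead == (ssize_t)BUF_SIZE) ? 0 : 1;
	}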

With this process the performance is very slow. I tested it on the ZCU102 with a single thread, and the read performance is only about 130 MB/s.

What process do you recommend for using this driver to achieve high-performance DMA operation?

ikwzm commented

Thank you for the issue.

Does your system have hardware guarantees for cache coherency?
For example, by using the S_AXI_HPC0 port or the S_AXI_ACP port, cache coherency can be guaranteed by hardware.

If your system has hardware guarantees for cache coherency, you can enable CPU caching of the buffer by specifying the dma-coherent property in the u-dma-buf device tree node. This will improve read()/write() performance.

		udmabuf@0x00 {
			compatible = "ikwzm,u-dma-buf";
			size = <0x00100000>;
			dma-coherent;
		};

Unfortunately, it is not possible to enable CPU caching of buffers if your system does not guarantee cache coherence in hardware.

If your system does not guarantee cache coherency in hardware and you still want to access the buffer with the CPU cache enabled, use mmap() instead of read()/write().
mmap() can override the CPU cache settings for the buffer, so the buffer can be accessed with the CPU cache enabled even if the hardware does not guarantee cache coherency.
In that case, you need to manage the CPU cache manually. See "2. Manual cache management with the CPU cache still being enabled" in Readme.md for more information.
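
For example, a rough sketch of that flow might look like the following. The sysfs attribute path and the sync_for_cpu usage follow the description in the Readme; adjust the device name to your environment:

	/* Rough sketch: access the buffer through a cached mmap() mapping and
	 * manage the CPU cache manually via the sysfs attributes.
	 * The attribute path below assumes the device is named udmabuf0. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define BUF_SIZE 0x00100000

	static void sync_for_cpu(void)
	{
	    /* Invalidate the CPU cache before the CPU reads data written by the device. */
	    int fd = open("/sys/class/u-dma-buf/udmabuf0/sync_for_cpu", O_WRONLY);
	    if (fd < 0) { perror("sync_for_cpu"); return; }
	    if (write(fd, "1", 1) != 1)
	        perror("sync_for_cpu write");
	    close(fd);
	}

	int main(void)
	{
	    /* Open WITHOUT O_SYNC so that the mapping is created with the CPU cache enabled. */
	    int fd = open("/dev/udmabuf0", O_RDWR);
	    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	    /* ... start the DMA transfer here and wait for it to complete ... */

	    sync_for_cpu();   /* manual cache invalidation before reading */
	    /* the CPU can now read the transferred data through buf */

	    munmap(buf, BUF_SIZE);
	    close(fd);
	    return 0;
	}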

Hi,
I have not even connected the S_AXI_HPC0 or S_AXI_ACP port from the PL logic; there is actually no data transfer between PL and PS. I skipped that in my current test and assume the DMA transfer always completes immediately.

The device tree is like this:

reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;
	image_buf0: image_buf@0 {
		compatible = "shared-dma-pool";
		reusable;
		reg = <0x8 0x0 0x0 0x80000000>;
		linux,cma-default;
		label = "image_buf0";
	};
};
udmabuf@0 {
	compatible = "ikwzm,u-dma-buf";
	device-name = "udmabuf0";
	size = <0x70000000>; // 1792 MiB
	memory-region = <&image_buf0>;
	dma-mask = <64>;
	dma-coherent;
};

I tested the read()/write() performance with and without the O_SYNC flag, and it is almost the same, always less than 140 MB/s:

int hDma = open("/dev/udmabuf0", O_RDWR); // | O_SYNC

If I use mmap() to get the virtual address first and then use memcpy() to get the data from the CMA buffer, the performance is extremely high, almost 5000 MB/s.
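
For reference, my mmap() test is essentially the following (a simplified timing sketch; the buffer size and loop count are just what I used here):

	/* Simplified mmap() throughput test: map the CMA buffer once,
	 * then repeatedly memcpy() from it into a user buffer. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define BUF_SIZE 0x100000  /* 1 MiB per copy */
	#define LOOPS    1000

	int main(void)
	{
	    int hDma = open("/dev/udmabuf0", O_RDWR);   /* no O_SYNC: cached mapping */
	    void *pDmaBuf  = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, hDma, 0);
	    void *pUserBuf = aligned_alloc(4096, BUF_SIZE);

	    struct timespec t0, t1;
	    clock_gettime(CLOCK_MONOTONIC, &t0);
	    for (int i = 0; i < LOOPS; i++)
	        memcpy(pUserBuf, pDmaBuf, BUF_SIZE);
	    clock_gettime(CLOCK_MONOTONIC, &t1);

	    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	    printf("%.1f MiB/s\n", (double)LOOPS * BUF_SIZE / (1024.0 * 1024.0) / sec);

	    free(pUserBuf);
	    munmap(pDmaBuf, BUF_SIZE);
	    close(hDma);
	    return 0;
	}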

ikwzm commented

Thank you for the valuable information.

I had not checked whether the CPU cache is enabled when the u-dma-buf buffer is allocated from reserved-memory.
The case I checked was without the memory-region property.

As you can see from udmabuf_device_probe() in u-dma-buf.c, the relationship between of_reserved_mem_device_init() and of_dma_configure() is complicated, because the behavior differs depending on the CPU architecture and the Linux kernel version. Perhaps buffers allocated from reserved-memory ignore the dma-coherent property in some circumstances. It is very difficult to investigate this matter.

As for a solution, it is probably safer to use mmap() rather than read()/write() in situations where you cannot be sure how the CPU cache is configured for the buffer.

Thank you for your reply.

I will use mmap() instead of read()/write(); mmap() addresses my concern.

By the way, I reserved 2 GB of memory in the device tree. If I specify the udmabuf size as 0x80000000, it fails when the driver is loaded; if I specify 0x70000000, it succeeds. Does this mean the driver cannot be given all of the reserved 2 GB of memory, or does it just depend on the Linux kernel version? My tool is PetaLinux 2020.2.

ikwzm commented

By the way, I reserved 2 GB of memory in the device tree. If I specify the udmabuf size as 0x80000000, it fails when the driver is loaded; if I specify 0x70000000, it succeeds. Does this mean the driver cannot be given all of the reserved 2 GB of memory, or does it just depend on the Linux kernel version? My tool is PetaLinux 2020.2.

It may be the same as issue #88.

Yes, it is definitely the same issue as #88; your reply is really helpful!

Hi,
I tested without the linux,cma-default; property, and it still failed.

/include/ "system-conf.dtsi"
/ {
   memory {
       #address-cells = <2>;
       #size-cells = <2>;
       device_type = "memory";
       reg = <0x0 0x0 0x0 0x80000000>, <0x00000008 0x0 0x0 0x80000000>;
   };
   reserved-memory {
   	#address-cells = <2>;
   	#size-cells = <2>;
   	ranges;
   	image_buf0: image_buf@0 {
   		compatible = "shared-dma-pool";
   		reusable;
   		reg = <0x00000008 0x0 0x0 0x80000000>;
                alignment = <0x0 0x1000>;
   		label = "image_buf0";
   	};
   };
   udmabuf@0 {
   	compatible = "ikwzm,u-dma-buf";
   	device-name = "udmabuf0";
   	size = <0x80000000>;
        dma-coherent;
   	memory-region = <&image_buf0>;
        dma-mask = <64>;
   };

    chosen {
        bootargs = "earlycon clk_ignore_unused   uio_pdrv_genirq.of_id=generic-uio root=/dev/mmcblk0p2 rw rootwait";
        stdout-path = "serial0:115200n8";
    };

};

The error information:

root@xilinx-zcu102-2020_2:~# insmod /home/linaro/test/u-dma-buf.ko
[   29.126201] u_dma_buf: loading out-of-tree module taints kernel.
[   29.134407] u-dma-buf udmabuf@0: assigned reserved memory node image_buf@0
[   29.212638] cma: cma_alloc: alloc failed, req-size: 524288 pages, ret: -16
[   29.219538] cma: number of available pages: 524288@0=> 524288 free of 524288 total pages
[   29.227691] ------------[ cut here ]------------
[   29.232304] WARNING: CPU: 1 PID: 1828 at mm/page_alloc.c:4738 __alloc_pages_nodemask+0x158/0x240
[   29.241081] Modules linked in: u_dma_buf(O+)
[   29.245345] CPU: 1 PID: 1828 Comm: insmod Tainted: G           O      5.4.0-xilinx-v2020.2 #1
[   29.253859] Hardware name: ZynqMP ZCU102 Rev1.0 (DT)
[   29.258815] pstate: 20000005 (nzCv daif -PAN -UAO)
[   29.263599] pc : __alloc_pages_nodemask+0x158/0x240
[   29.268470] lr : __dma_direct_alloc_pages+0x118/0x1c8
[   29.273510] sp : ffff8000140ab780
[   29.276816] x29: ffff8000140ab780 x28: 0000000000000100
[   29.282121] x27: ffffffffffffffff x26: ffff0000634015c0
[   29.287425] x25: 0000000000000000 x24: 0000000080000000
[   29.292729] x23: 0000000080000000 x22: ffff000066b43810
[   29.298033] x21: 0000000000000013 x20: ffff000066042fc0
[   29.303337] x19: 0000000000000cc0 x18: 0000000000000030
[   29.308641] x17: 0000aaaad451aff0 x16: 0000aaaad451aff0
[   29.313945] x15: ffff800011195000 x14: ffff80001123b63a
[   29.319249] x13: 0000000000000000 x12: ffff80001123a000
[   29.324553] x11: ffff800011195000 x10: 0000000000000000
[   29.329857] x9 : 0000000000000007 x8 : 00000000000001c3
[   29.335161] x7 : 0000000000000001 x6 : 0000000000000001
[   29.340465] x5 : 0000000000000000 x4 : 000000000000003f
[   29.345769] x3 : 0000000000000000 x2 : 0000000000000000
[   29.351073] x1 : 0000000000000000 x0 : 0000000000000cc0
[   29.356377] Call trace:
[   29.358818]  __alloc_pages_nodemask+0x158/0x240
[   29.363341]  __dma_direct_alloc_pages+0x118/0x1c8
[   29.368037]  dma_direct_alloc_pages+0x28/0xe8
[   29.372386]  dma_direct_alloc+0x4c/0x58
[   29.376214]  dma_alloc_attrs+0x7c/0xe8
[   29.379965]  udmabuf_platform_driver_probe+0x480/0x968 [u_dma_buf]
[   29.386140]  platform_drv_probe+0x50/0xa0
[   29.390140]  really_probe+0xd8/0x2f8
[   29.393707]  driver_probe_device+0x54/0xe8
[   29.397795]  device_driver_attach+0x6c/0x78
[   29.401970]  __driver_attach+0x54/0xd0
[   29.405711]  bus_for_each_dev+0x6c/0xc0
[   29.409540]  driver_attach+0x20/0x28
[   29.413108]  bus_add_driver+0x148/0x1e0
[   29.416936]  driver_register+0x60/0x110
[   29.420765]  __platform_driver_register+0x44/0x50
[   29.425465]  u_dma_buf_init+0x22c/0x1000 [u_dma_buf]
[   29.430419]  do_one_initcall+0x50/0x190
[   29.434247]  do_init_module+0x50/0x1f0
[   29.437987]  load_module+0x1ca4/0x2218
[   29.441728]  __do_sys_finit_module+0xd0/0xe8
[   29.445991]  __arm64_sys_finit_module+0x1c/0x28
[   29.450516]  el0_svc_common.constprop.0+0x68/0x160
[   29.455298]  el0_svc_handler+0x6c/0x88
[   29.459038]  el0_svc+0x8/0xc
[   29.461910] ---[ end trace 16f0ccc79524ba8a ]---
[   29.466543] dma_alloc_coherent(size=2147483648) failed. return(0)
[   29.472638] u-dma-buf udmabuf@0: driver setup failed. return=-12
[   29.478824] u-dma-buf udmabuf@0: driver installed.
[   29.483623] u-dma-buf: probe of udmabuf@0 failed with error -12

ikwzm commented

Thank you for the valuable information.

This has spawned another new issue.
Please refer to issue #98 for further discussion.