udmabuf is a Linux device driver that allocates contiguous memory blocks in the kernel space as DMA buffers and makes them available from the user space. These memory blocks are intended to be used as DMA buffers when a user application implements a device driver in user space using UIO (User space I/O).
A DMA buffer allocated by udmabuf can be accessed from the user space by opening the device file (e.g. /dev/udmabuf0) and mapping it to the user memory space, or by using the read()/write() functions.
CPU cache for the allocated DMA buffer can be disabled by setting the O_SYNC flag when opening the device file. It is also possible to flush or invalidate CPU cache while keeping CPU cache enabled.
The physical address of a DMA buffer allocated by udmabuf can be obtained by reading /sys/class/udmabuf/udmabuf0/phys_addr.
The size of a DMA buffer and the device minor number can be specified when the device driver is loaded (e.g. when loaded via the insmod command). On some platforms they can also be specified in the device tree.
Figure 1. Architecture
- OS : Linux Kernel Version 3.6 - 3.8, 3.18, 4.4, 4.8, 4.12, 4.14, 4.19 (the author tested on 3.18, 4.4, 4.8, 4.12, 4.14, 4.19).
- CPU: ARM Cortex-A9 (Xilinx ZYNQ / Altera CycloneV SoC)
- CPU: ARM64 Cortex-A53 (Xilinx ZYNQ UltraScale+ MPSoC)
- CPU: x86 (64bit). Verification on this platform is still limited, so test reports are welcome.
In addition, the following features are limited on this platform at the moment.
- The CPU cache cannot be controlled by the O_SYNC flag. The CPU cache is always enabled.
- Settings cannot be made via the device tree.
Another kernel module with the same name as "udmabuf" was added in Linux Kernel 5.0. Therefore, since Linux Kernel 5.0, this "udmabuf" cannot be used. Instead, "u-dma-buf" is provided in this repository. If you use "u-dma-buf", see https://github.com/ikwzm/udmabuf/tree/u-dma-buf-master
The following Makefile is included in the repository.
HOST_ARCH ?= $(shell uname -m | sed -e s/arm.*/arm/ -e s/aarch64.*/arm64/)
ARCH ?= $(shell uname -m | sed -e s/arm.*/arm/ -e s/aarch64.*/arm64/)
KERNEL_SRC_DIR ?= /lib/modules/$(shell uname -r)/build
ifeq ($(ARCH), arm)
ifneq ($(HOST_ARCH), arm)
CROSS_COMPILE ?= arm-linux-gnueabihf-
endif
endif
ifeq ($(ARCH), arm64)
ifneq ($(HOST_ARCH), arm64)
CROSS_COMPILE ?= aarch64-linux-gnu-
endif
endif
u-dma-buf-obj := udmabuf.o
obj-$(CONFIG_U_DMA_BUF) += $(u-dma-buf-obj)
all:
make -C $(KERNEL_SRC_DIR) ARCH=$(ARCH) CROSS_COMPILE=$(CROSS_COMPILE) M=$(PWD) obj-m=$(u-dma-buf-obj) modules
clean:
make -C $(KERNEL_SRC_DIR) ARCH=$(ARCH) CROSS_COMPILE=$(CROSS_COMPILE) M=$(PWD) obj-m=$(u-dma-buf-obj) clean
Load the udmabuf kernel driver using insmod. The size of a DMA buffer should be provided as an argument as follows.
The device driver is created, and allocates a DMA buffer with the specified size.
The maximum number of DMA buffers that can be allocated using insmod is 8 (udmabuf0/1/2/3/4/5/6/7).
zynq$ insmod udmabuf.ko udmabuf0=1048576
udmabuf udmabuf0: driver installed
udmabuf udmabuf0: major number = 248
udmabuf udmabuf0: minor number = 0
udmabuf udmabuf0: phys address = 0x1e900000
udmabuf udmabuf0: buffer size = 1048576
udmabuf udmabuf0: dma coherent = 0
zynq$ ls -la /dev/udmabuf0
crw------- 1 root root 248, 0 Dec 1 09:34 /dev/udmabuf0
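When several buffers are requested at load time, the insmod arguments follow the udmabufN=SIZE pattern shown above. The following is a minimal sketch of a helper that formats such arguments and enforces the limit of 8 buffers; the function name insmod_args is hypothetical and the helper only builds strings, it does not load the module.

```python
MAX_INSMOD_BUFFERS = 8  # udmabuf0 .. udmabuf7, the limit stated in the text


def insmod_args(sizes):
    """Format 'udmabufN=SIZE' arguments for insmod.

    sizes maps a buffer index (0..7) to a buffer size in bytes.
    Hypothetical helper: it only formats strings.
    """
    args = []
    for index in sorted(sizes):
        if not 0 <= index < MAX_INSMOD_BUFFERS:
            raise ValueError("buffer index out of range: %d" % index)
        if sizes[index] <= 0:
            raise ValueError("buffer size must be positive")
        args.append("udmabuf%d=%d" % (index, sizes[index]))
    return " ".join(args)


if __name__ == "__main__":
    # 1 MiB for udmabuf0 and 4 MiB for udmabuf1
    print("insmod udmabuf.ko " + insmod_args({0: 1 << 20, 1: 4 << 20}))
```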
In the above result, the device is readable and writable only by root.
To change the permissions when the kernel module is loaded, create /etc/udev/rules.d/99-udmabuf.rules with the following content.
KERNEL=="udmabuf[0-9]*", GROUP="root", MODE="0666"
The module can be unloaded with the rmmod command.
zynq$ rmmod udmabuf
udmabuf udmabuf0: driver uninstalled
For details, refer to the following URL.
In addition to the allocation via the insmod command and its arguments, DMA buffers can be allocated by specifying the size in the device tree file.
When a device tree file contains an entry like the following, udmabuf will allocate buffers and create device drivers when loaded by insmod.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
device-name = "udmabuf0";
minor-number = <0>;
size = <0x00100000>;
};
zynq$ insmod udmabuf.ko
udmabuf udmabuf0: driver installed
udmabuf udmabuf0: major number = 248
udmabuf udmabuf0: minor number = 0
udmabuf udmabuf0: phys address = 0x1e900000
udmabuf udmabuf0: buffer size = 1048576
udmabuf udmabuf0: dma coherent = 0
zynq$ ls -la /dev/udmabuf0
crw------- 1 root root 248, 0 Dec 1 09:34 /dev/udmabuf0
The following properties can be set in the device tree.
compatible
size
minor-number
device-name
sync-mode
sync-always
sync-offset
sync-size
sync-direction
dma-coherent
memory-region
The compatible property is used to select the corresponding device driver when loading udmabuf. The compatible property is mandatory. Be sure to set the compatible property to "ikwzm,udmabuf-0.10.a".
The size property is used to set the capacity of the DMA buffer in bytes. The size property is mandatory.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
size = <0x00100000>;
};
The minor-number property is used to set the minor number.
The valid minor number range is 0 to 255. A minor number provided as an insmod argument has higher precedence; when a definition in the device tree uses a colliding number, creation of the device defined in the device tree fails.
The minor-number property is optional. When the minor-number property is not specified, udmabuf automatically assigns an appropriate one.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
minor-number = <0>;
size = <0x00100000>;
};
The device-name property is used to set the name of the device.
The device-name property is optional. The device name is determined as follows:
- If the device-name property is specified, the value of the device-name property is used.
- If the device-name property is not present and the minor-number property is specified, sprintf("udmabuf%d", minor-number) is used.
- If neither the device-name property nor the minor-number property is present, the entry name of the device tree is used (udmabuf@0x00 in this example).
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
device-name = "udmabuf0";
size = <0x00100000>;
};
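The three naming rules above can be sketched as a small function. This is only an illustration of the precedence described in the text, not the driver's code; the function name resolve_device_name and its parameters are hypothetical.

```python
def resolve_device_name(node_name, device_name=None, minor_number=None):
    """Return the device name according to the rules in the text.

    node_name    : entry name in the device tree, e.g. "udmabuf@0x00"
    device_name  : value of the device-name property, or None if absent
    minor_number : value of the minor-number property, or None if absent
    """
    if device_name is not None:
        # Rule 1: an explicit device-name always wins.
        return device_name
    if minor_number is not None:
        # Rule 2: fall back to "udmabuf<minor-number>".
        return "udmabuf%d" % minor_number
    # Rule 3: use the device tree entry name.
    return node_name
```

For example, a node named udmabuf@0x00 with only minor-number = <3> would be named udmabuf3 under these rules.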
The sync-mode property is used to configure the behavior when udmabuf is opened with the O_SYNC flag.
- sync-mode=<1>: If O_SYNC is specified or the sync-always property is specified, CPU cache is disabled. Otherwise CPU cache is enabled.
- sync-mode=<2>: If O_SYNC is specified or the sync-always property is specified, CPU cache is disabled, but the CPU uses write-combining when writing data to the DMA buffer, which improves performance by combining multiple write accesses. Otherwise CPU cache is enabled.
- sync-mode=<3>: If O_SYNC is specified or the sync-always property is specified, DMA coherency mode is used. Otherwise CPU cache is enabled.
The sync-mode property is optional. When the sync-mode property is not specified, sync-mode is set to <1>.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
size = <0x00100000>;
sync-mode = <2>;
};
Details on O_SYNC and cache management will be described in the next section.
If the sync-always property is specified, the operation selected by the sync-mode property is always performed when udmabuf is opened, regardless of whether O_SYNC is specified.
The sync-always property is optional.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
size = <0x00100000>;
sync-mode = <2>;
sync-always;
};
Details on O_SYNC and cache management will be described in the next section.
The sync-offset property is used to set the start of the buffer range when manually controlling the cache of udmabuf.
The sync-offset property is optional. When the sync-offset property is not specified, sync-offset is set to <0>.
Details on cache management will be described in the next section.
The sync-size property is used to set the size of the buffer range when manually controlling the cache of udmabuf.
The sync-size property is optional. When the sync-size property is not specified, sync-size is set to <0>.
Details on cache management will be described in the next section.
The sync-direction property is used to set the direction of DMA when manually controlling the cache of udmabuf.
- sync-direction=<0>: DMA_BIDIRECTIONAL
- sync-direction=<1>: DMA_TO_DEVICE
- sync-direction=<2>: DMA_FROM_DEVICE
The sync-direction property is optional. When the sync-direction property is not specified, sync-direction is set to <0>.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
size = <0x00100000>;
sync-offset = <0x00010000>;
sync-size = <0x000F0000>;
sync-direction = <2>;
};
Details on cache management will be described in the next section.
If the dma-coherent property is specified, it indicates that coherency between the DMA buffer and the CPU cache is guaranteed by hardware.
The dma-coherent property is optional. When the dma-coherent property is not specified, it indicates that coherency between the DMA buffer and the CPU cache is not guaranteed by hardware.
udmabuf@0x00 {
compatible = "ikwzm,udmabuf-0.10.a";
size = <0x00100000>;
dma-coherent;
};
Details on cache management will be described in the next section.
Linux allows a reserved memory area to be specified in the device tree. The Linux kernel excludes the physical memory space specified by the reserved-memory property from normal memory allocation.
To access this reserved memory area, it is necessary to use a general-purpose memory access driver such as /dev/mem, or to associate the area with a device driver in the device tree.
The memory-region property associates a reserved memory area with udmabuf.
reserved-memory {
#address-cells = <1>;
#size-cells = <1>;
ranges;
image_buf0: image_buf@0 {
compatible = "shared-dma-pool";
reusable;
reg = <0x3C000000 0x04000000>;
label = "image_buf0";
};
};
udmabuf@0 {
compatible = "ikwzm,udmabuf-0.10.a";
device-name = "udmabuf0";
size = <0x04000000>; // 64MiB
memory-region = <&image_buf0>;
};
In this example, 64MiB from 0x3C000000 to 0x3FFFFFFF is reserved as "image_buf0".
For "image_buf0", specify "shared-dma-pool" in the compatible property and specify the reusable property. With these properties, this reserved memory area is managed by the CMA. Be careful about the alignment of the address and the size.
The above "image_buf0" is associated with "udmabuf@0" by the memory-region property. With this association, "udmabuf@0" allocates physical memory from the CMA area specified by "image_buf0".
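The address range quoted for this example follows directly from the reg property (base 0x3C000000, size 0x04000000). The sketch below reproduces that arithmetic and adds a simple alignment check. The 4MiB alignment used here is only an assumed value for illustration; the alignment actually required depends on the kernel configuration and is not stated in this document.

```python
def region_end(base, size):
    """Last byte address covered by a region starting at base."""
    return base + size - 1


def is_aligned(value, alignment):
    """True if value is a multiple of alignment."""
    return value % alignment == 0


base, size = 0x3C000000, 0x04000000      # reg = <0x3C000000 0x04000000>
ASSUMED_ALIGNMENT = 4 << 20              # 4MiB, an assumption for illustration

print("size  : %d MiB" % (size >> 20))
print("range : 0x%08X - 0x%08X" % (base, region_end(base, size)))
print("base aligned :", is_aligned(base, ASSUMED_ALIGNMENT))
print("size aligned :", is_aligned(size, ASSUMED_ALIGNMENT))
```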
The memory-region property is optional. When the memory-region property is not specified, udmabuf allocates the DMA buffer from the CMA area allocated to the Linux kernel.
When udmabuf is loaded into the kernel, the following device files are created. <device-name> is a placeholder for the device name described in the previous section.
/dev/<device-name>
/sys/class/udmabuf/<device-name>/phys_addr
/sys/class/udmabuf/<device-name>/size
/sys/class/udmabuf/<device-name>/sync_mode
/sys/class/udmabuf/<device-name>/sync_offset
/sys/class/udmabuf/<device-name>/sync_size
/sys/class/udmabuf/<device-name>/sync_direction
/sys/class/udmabuf/<device-name>/sync_owner
/sys/class/udmabuf/<device-name>/sync_for_cpu
/sys/class/udmabuf/<device-name>/sync_for_device
/sys/class/udmabuf/<device-name>/dma_coherent
/dev/<device-name> is used when the buffer is mapped into the user space with mmap() or accessed via read()/write().
if ((fd = open("/dev/udmabuf0", O_RDWR)) != -1) {
buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
/* Do some read/write access to buf */
close(fd);
}
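The same mapping can be done from Python with the standard mmap module. The following is a minimal sketch; the helper name map_buffer is hypothetical, and the sync parameter simply adds O_SYNC to the open flags, whose effect on udmabuf depends on sync_mode as described later.

```python
import mmap
import os


def map_buffer(path, size, sync=False):
    """Open path and map size bytes as a shared read/write mapping."""
    flags = os.O_RDWR | (os.O_SYNC if sync else 0)
    fd = os.open(path, flags)
    try:
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # mmap keeps its own duplicate of the descriptor, so ours can go.
        os.close(fd)


if __name__ == "__main__" and os.path.exists("/dev/udmabuf0"):
    # Read the buffer size from sysfs, then map the whole buffer.
    with open("/sys/class/udmabuf/udmabuf0/size") as f:
        buf_size = int(f.read())
    buf = map_buffer("/dev/udmabuf0", buf_size)
    buf[0:4] = b"\x00\x01\x02\x03"   # write access to the buffer
    buf.close()
```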
The device file can be directly read/written by specifying the device as the target of dd in the shell.
zynq$ dd if=/dev/urandom of=/dev/udmabuf0 bs=4096 count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 3.07516 s, 1.4 MB/s
zynq$ dd if=/dev/udmabuf4 of=random.bin
8192+0 records in
8192+0 records out
4194304 bytes (4.2 MB) copied, 0.173866 s, 24.1 MB/s
The physical address of a DMA buffer can be retrieved by reading /sys/class/udmabuf/<device-name>/phys_addr.
char attr[1024] = {0};
unsigned long phys_addr;
if ((fd = open("/sys/class/udmabuf/udmabuf0/phys_addr", O_RDONLY)) != -1) {
    read(fd, attr, sizeof(attr) - 1); /* leave room for the terminating NUL */
    sscanf(attr, "%lx", &phys_addr);  /* %lx matches unsigned long */
    close(fd);
}
The size of a DMA buffer can be retrieved by reading /sys/class/udmabuf/<device-name>/size.
char attr[1024] = {0};
unsigned int buf_size;
if ((fd = open("/sys/class/udmabuf/udmabuf0/size", O_RDONLY)) != -1) {
    read(fd, attr, sizeof(attr) - 1); /* leave room for the terminating NUL */
    sscanf(attr, "%u", &buf_size);    /* %u matches unsigned int */
    close(fd);
}
The device file /sys/class/udmabuf/<device-name>/sync_mode is used to configure the behavior when udmabuf is opened with the O_SYNC flag.
char attr[1024];
unsigned long sync_mode = 2;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_mode", O_WRONLY)) != -1) {
    sprintf(attr, "%lu", sync_mode); /* %lu matches unsigned long */
    write(fd, attr, strlen(attr));
    close(fd);
}
Details on O_SYNC and cache management will be described in the next section.
The device file /sys/class/udmabuf/<device-name>/sync_offset is used to specify the start offset of the memory block whose cache is manually managed.
char attr[1024];
unsigned long sync_offset = 0x00000000;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_offset", O_WRONLY)) != -1) {
    sprintf(attr, "%lu", sync_offset); /* or sprintf(attr, "0x%lx", sync_offset); */
    write(fd, attr, strlen(attr));
    close(fd);
}
Details of manual cache management are described in the next section.
The device file /sys/class/udmabuf/<device-name>/sync_size is used to specify the size of the memory block whose cache is manually managed.
char attr[1024];
unsigned long sync_size = 1024;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_size", O_WRONLY)) != -1) {
    sprintf(attr, "%lu", sync_size); /* or sprintf(attr, "0x%lx", sync_size); */
    write(fd, attr, strlen(attr));
    close(fd);
}
Details of manual cache management are described in the next section.
The device file /sys/class/udmabuf/<device-name>/sync_direction is used to set the direction of DMA transfer to/from the DMA buffer whose cache is manually managed.
- 0: sets DMA_BIDIRECTIONAL
- 1: sets DMA_TO_DEVICE
- 2: sets DMA_FROM_DEVICE
char attr[1024];
unsigned long sync_direction = 1;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_direction", O_WRONLY)) != -1) {
    sprintf(attr, "%lu", sync_direction); /* %lu matches unsigned long */
    write(fd, attr, strlen(attr));
    close(fd);
}
Details of manual cache management are described in the next section.
The device file /sys/class/udmabuf/<device-name>/dma_coherent reports whether coherency between the DMA buffer and the CPU cache is guaranteed by hardware.
Whether hardware guarantees coherency can be specified with the dma-coherent property in the device tree, but this device file is read-only.
If this value is 1, coherency between the DMA buffer and the CPU cache is guaranteed by hardware. If this value is 0, it is not guaranteed by hardware.
char attr[1024] = {0};
int dma_coherent;
if ((fd = open("/sys/class/udmabuf/udmabuf0/dma_coherent", O_RDONLY)) != -1) {
    read(fd, attr, sizeof(attr) - 1); /* leave room for the terminating NUL */
    sscanf(attr, "%d", &dma_coherent);
    close(fd);
}
The device file /sys/class/udmabuf/<device-name>/sync_owner reports the owner of the memory block in the manual cache management mode.
If this value is 1, the buffer is owned by the device. If this value is 0, the buffer is owned by the CPU.
char attr[1024] = {0};
int sync_owner;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_owner", O_RDONLY)) != -1) {
    read(fd, attr, sizeof(attr) - 1); /* leave room for the terminating NUL */
    sscanf(attr, "%d", &sync_owner);
    close(fd);
}
Details of manual cache management are described in the next section.
In the manual cache management mode, the CPU becomes the owner of the buffer when a non-zero value is written to the device file /sys/class/udmabuf/<device-name>/sync_for_cpu.
This device file is write-only.
When '1' is written to this device file and sync_direction is 2(=DMA_FROM_DEVICE) or 0(=DMA_BIDIRECTIONAL), the CPU cache for the range specified by sync_offset and sync_size is invalidated.
char attr[1024];
unsigned long sync_for_cpu = 1;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_for_cpu", O_WRONLY)) != -1) {
    sprintf(attr, "%lu", sync_for_cpu); /* %lu matches unsigned long */
    write(fd, attr, strlen(attr));
    close(fd);
}
The value written to this device file can include sync_offset, sync_size, and sync_direction.
char attr[1024];
unsigned long sync_offset = 0;
unsigned long sync_size = 0x10000;
unsigned int sync_direction = 1;
unsigned long sync_for_cpu = 1;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_for_cpu", O_WRONLY)) != -1) {
    /* %08X expects unsigned int, so cast the unsigned long operands */
    sprintf(attr, "0x%08X%08X", (unsigned int)(sync_offset & 0xFFFFFFFF), (unsigned int)((sync_size & 0xFFFFFFF0) | (sync_direction << 2) | sync_for_cpu));
    write(fd, attr, strlen(attr));
    close(fd);
}
The sync_offset/sync_size/sync_direction values passed in a write to sync_for_cpu are temporary and do not affect the sync_offset, sync_size, or sync_direction device files.
Details of manual cache management are described in the next section.
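The packed value written in the C example above can be reproduced in Python to make the bit layout explicit. This sketch follows the format string and masks used in that example; the function name pack_sync_command is hypothetical, and the extra masking of direction and flag is a defensive assumption rather than documented driver behavior.

```python
def pack_sync_command(offset, size, direction, flag=1):
    """Build the '0x<offset:8 hex><size|direction|flag:8 hex>' string.

    Layout of the low 32 bits, as used in the C example:
      bits 31..4 : sync_size (low 4 bits are masked off)
      bits  3..2 : sync_direction (0, 1 or 2)
      bit      0 : the sync_for_cpu / sync_for_device flag
    """
    low = (size & 0xFFFFFFF0) | ((direction & 0x3) << 2) | (flag & 0x1)
    return "0x%08X%08X" % (offset & 0xFFFFFFFF, low)


if __name__ == "__main__":
    # Same values as the C example: offset 0, size 0x10000, DMA_TO_DEVICE
    print(pack_sync_command(0, 0x10000, 1))
```

Because the low four bits of the size are masked off, a size passed this way is effectively rounded down to a multiple of 16 bytes.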
In the manual cache management mode, the device becomes the owner of the buffer when a non-zero value is written to the device file /sys/class/udmabuf/<device-name>/sync_for_device.
This device file is write-only.
When '1' is written to this device file and sync_direction is 1(=DMA_TO_DEVICE) or 0(=DMA_BIDIRECTIONAL), the CPU cache for the range specified by sync_offset and sync_size is flushed (i.e. the cached data, if any, is written back to DDR memory).
char attr[1024];
unsigned long sync_for_device = 1;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_for_device", O_WRONLY)) != -1) {
    sprintf(attr, "%lu", sync_for_device); /* %lu matches unsigned long */
    write(fd, attr, strlen(attr));
    close(fd);
}
The value written to this device file can include sync_offset, sync_size, and sync_direction.
char attr[1024];
unsigned long sync_offset = 0;
unsigned long sync_size = 0x10000;
unsigned int sync_direction = 1;
unsigned long sync_for_device = 1;
if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_for_device", O_WRONLY)) != -1) {
    /* %08X expects unsigned int, so cast the unsigned long operands */
    sprintf(attr, "0x%08X%08X", (unsigned int)(sync_offset & 0xFFFFFFFF), (unsigned int)((sync_size & 0xFFFFFFF0) | (sync_direction << 2) | sync_for_device));
    write(fd, attr, strlen(attr));
    close(fd);
}
The sync_offset/sync_size/sync_direction values passed in a write to sync_for_device are temporary and do not affect the sync_offset, sync_size, or sync_direction device files.
Details of manual cache management are described in the next section.
The CPU usually accesses a DMA buffer in main memory through the cache, while a hardware accelerator accesses the data stored in the DMA buffer in main memory. In this situation, coherency between the data held in the CPU cache and the data in main memory must be considered carefully.
When hardware assures the coherency, the CPU cache can be turned on without additional treatment. For example, ZYNQ provides ACP (Accelerator Coherency Port), and coherency is maintained by hardware as long as the accelerator accesses main memory via this port.
In this case, accesses from the CPU to main memory can be fast because the CPU cache is used as usual. To enable the CPU cache on the DMA buffer allocated by udmabuf, open udmabuf without specifying the O_SYNC flag.
/* To enable CPU cache on the DMA buffer, */
/* open udmabuf without specifying the `O_SYNC` flag. */
if ((fd = open("/dev/udmabuf0", O_RDWR)) != -1) {
buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
/* Read/write access to the buffer */
close(fd);
}
The manual cache management described in the following section is not necessary when hardware maintains the coherency.
If the dma-coherent property is specified in the device tree, udmabuf assumes that coherency is guaranteed by hardware. In this case, the cache control described later in "2. Manual cache management with the CPU cache still being enabled" is not performed.
When hardware does not maintain coherency between the CPU cache and main memory, another coherency mechanism is necessary. udmabuf supports two ways of maintaining coherency: one is to disable the CPU cache, and the other is to flush/invalidate the cache manually while the CPU cache remains enabled.
To disable the CPU cache for the allocated DMA buffer, specify the O_SYNC flag when opening udmabuf.
/* To disable CPU cache on the DMA buffer, */
/* open udmabuf with the `O_SYNC` flag. */
if ((fd = open("/dev/udmabuf0", O_RDWR | O_SYNC)) != -1) {
buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
/* Read/write access to the buffer */
close(fd);
}
As listed below, sync_mode can be used to configure the cache behavior when the O_SYNC flag is present in open():
- sync_mode=0: CPU cache is enabled regardless of the O_SYNC flag presence.
- sync_mode=1: If O_SYNC is specified, CPU cache is disabled. If O_SYNC is not specified, CPU cache is enabled.
- sync_mode=2: If O_SYNC is specified, CPU cache is disabled, but the CPU uses write-combining when writing data to the DMA buffer, which improves performance by combining multiple write accesses. If O_SYNC is not specified, CPU cache is enabled.
- sync_mode=3: If O_SYNC is specified, DMA coherency mode is used. If O_SYNC is not specified, CPU cache is enabled.
- sync_mode=4: CPU cache is enabled regardless of the O_SYNC flag presence.
- sync_mode=5: CPU cache is disabled regardless of the O_SYNC flag presence.
- sync_mode=6: CPU uses write-combining to write data to the DMA buffer regardless of O_SYNC presence.
- sync_mode=7: DMA coherency mode is used regardless of O_SYNC presence.
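The eight sync_mode values above can be condensed into a small decision function. This is a sketch of the list as written, not the driver's implementation: it relies on the observation that modes 4 to 7 behave like modes 0 to 3 with O_SYNC always assumed. The function name mapping_type and the returned labels are hypothetical.

```python
def mapping_type(sync_mode, o_sync):
    """Return how the buffer is mapped for a given sync_mode and O_SYNC.

    Returns one of "cached", "noncached", "writecombine", "dmacoherent".
    """
    if not 0 <= sync_mode <= 7:
        raise ValueError("sync_mode must be in 0..7")
    kinds = {1: "noncached", 2: "writecombine", 3: "dmacoherent"}
    kind = sync_mode & 0x3          # which non-cached behavior is selected
    always = sync_mode >= 4         # modes 4..7 ignore the O_SYNC flag
    if kind == 0:
        return "cached"             # modes 0 and 4
    if always or o_sync:
        return kinds[kind]
    return "cached"


if __name__ == "__main__":
    for mode in range(8):
        print(mode, mapping_type(mode, False), mapping_type(mode, True))
```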
As a practical example, the execution times of a sample program listed below were measured under several test conditions as presented in the table.
int check_buf(unsigned char* buf, unsigned int size)
{
int m = 256;
int n = 10;
int i, k;
int error_count = 0;
while(--n > 0) {
for(i = 0; i < size; i = i + m) {
m = (i+256 < size) ? 256 : (size-i);
for(k = 0; k < m; k++) {
buf[i+k] = (k & 0xFF);
}
for(k = 0; k < m; k++) {
if (buf[i+k] != (k & 0xFF)) {
error_count++;
}
}
}
}
return error_count;
}
int clear_buf(unsigned char* buf, unsigned int size)
{
int n = 100;
int error_count = 0;
while(--n > 0) {
memset((void*)buf, 0, size);
}
return error_count;
}
Table-1 The execution time of the sample program checkbuf
| sync_mode | O_SYNC | DMA buffer size 1MByte | DMA buffer size 5MByte | DMA buffer size 10MByte |
|---|---|---|---|---|
| 0 | Not specified | 0.437[sec] | 2.171[sec] | 4.340[sec] |
| 0 | Specified | 0.437[sec] | 2.171[sec] | 4.340[sec] |
| 1 | Not specified | 0.434[sec] | 2.179[sec] | 4.337[sec] |
| 1 | Specified | 2.283[sec] | 11.414[sec] | 22.830[sec] |
| 2 | Not specified | 0.434[sec] | 2.169[sec] | 4.337[sec] |
| 2 | Specified | 1.616[sec] | 8.262[sec] | 16.562[sec] |
| 3 | Not specified | 0.434[sec] | 2.169[sec] | 4.337[sec] |
| 3 | Specified | 1.600[sec] | 8.391[sec] | 16.587[sec] |
| 4 | Not specified | 0.437[sec] | 2.171[sec] | 4.337[sec] |
| 4 | Specified | 0.437[sec] | 2.171[sec] | 4.337[sec] |
| 5 | Not specified | 2.283[sec] | 11.414[sec] | 22.809[sec] |
| 5 | Specified | 2.283[sec] | 11.414[sec] | 22.840[sec] |
| 6 | Not specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
| 6 | Specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
| 7 | Not specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
| 7 | Specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
Table-2 The execution time of the sample program clearbuf
| sync_mode | O_SYNC | DMA buffer size 1MByte | DMA buffer size 5MByte | DMA buffer size 10MByte |
|---|---|---|---|---|
| 0 | Not specified | 0.067[sec] | 0.359[sec] | 0.713[sec] |
| 0 | Specified | 0.067[sec] | 0.362[sec] | 0.716[sec] |
| 1 | Not specified | 0.067[sec] | 0.362[sec] | 0.718[sec] |
| 1 | Specified | 0.912[sec] | 4.563[sec] | 9.126[sec] |
| 2 | Not specified | 0.068[sec] | 0.360[sec] | 0.721[sec] |
| 2 | Specified | 0.063[sec] | 0.310[sec] | 0.620[sec] |
| 3 | Not specified | 0.068[sec] | 0.361[sec] | 0.715[sec] |
| 3 | Specified | 0.062[sec] | 0.310[sec] | 0.620[sec] |
| 4 | Not specified | 0.068[sec] | 0.360[sec] | 0.718[sec] |
| 4 | Specified | 0.067[sec] | 0.360[sec] | 0.710[sec] |
| 5 | Not specified | 0.913[sec] | 4.562[sec] | 9.126[sec] |
| 5 | Specified | 0.913[sec] | 4.562[sec] | 9.126[sec] |
| 6 | Not specified | 0.062[sec] | 0.310[sec] | 0.618[sec] |
| 6 | Specified | 0.062[sec] | 0.310[sec] | 0.619[sec] |
| 7 | Not specified | 0.062[sec] | 0.310[sec] | 0.620[sec] |
| 7 | Specified | 0.062[sec] | 0.310[sec] | 0.621[sec] |
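As a worked reading of the two tables, the relative cost of each mapping type can be computed from the 1MByte column. The sketch below uses the sync_mode=1 rows (cached when O_SYNC is not specified, noncached when specified) and the sync_mode=2 row with O_SYNC specified (write-combine); the dictionary keys and the function name slowdown are labels chosen here for illustration.

```python
# Execution times in seconds for the 1MByte buffer, taken from the tables.
checkbuf_1mb = {"cached": 0.434, "noncached": 2.283, "writecombine": 1.616}
clearbuf_1mb = {"cached": 0.067, "noncached": 0.912, "writecombine": 0.063}


def slowdown(times, kind):
    """Execution time of a mapping type relative to the cached case."""
    return times[kind] / times["cached"]


if __name__ == "__main__":
    for name, times in (("checkbuf", checkbuf_1mb), ("clearbuf", clearbuf_1mb)):
        for kind in ("noncached", "writecombine"):
            print("%s %-12s x%.2f" % (name, kind, slowdown(times, kind)))
```

In these measurements, disabling the cache costs roughly 5x for the read-and-verify workload and more than 13x for the write-only workload, while write-combining is about as fast as the cached case for write-only access.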
Note on using the O_SYNC flag on ARM64
In v1.4.4 and earlier, udmabuf used pgprot_writecombine() on ARM64 with sync_mode=1 (noncached). The reason was that a bus error occurred in memset() in udmabuf_test.c when using pgprot_noncached().
However, as reported in ikwzm#28, it was found that using pgprot_writecombine() on ARM64 causes a problem with cache coherency.
Therefore, since v1.4.5, sync_mode=1 uses pgprot_noncached(). Cache coherency issues are very difficult to understand and to debug, so a bus error, which is easier to recognize, was judged preferable to a silent coherency problem.
This change requires attention to alignment when using O_SYNC cache control on ARM64. You probably won't be able to use memset().
If a problem occurs, either have the hardware maintain cache coherency, or use the method described below, which manages the cache manually while the CPU cache remains enabled.
As explained above, by opening udmabuf without specifying the O_SYNC flag, CPU cache can be left turned on.
/* To enable CPU cache on the DMA buffer, */
/* open udmabuf without specifying the `O_SYNC` flag. */
if ((fd = open("/dev/udmabuf0", O_RDWR)) != -1) {
buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
/* Read/write access to the buffer */
close(fd);
}
To manually manage cache coherency, follow these steps:
- Specify the memory area shared between the CPU and the accelerator via the sync_offset and sync_size device files. sync_offset accepts an offset in bytes from the start address of the allocated buffer. The size of the shared memory area should be written to sync_size in bytes.
- Set the data transfer direction in sync_direction. If the accelerator performs only read accesses to the memory area, sync_direction should be set to 1(=DMA_TO_DEVICE), and to 2(=DMA_FROM_DEVICE) if it performs only write accesses.
- If the accelerator both reads and writes data in the memory area, sync_direction should be set to 0(=DMA_BIDIRECTIONAL).
Following the above configuration, sync_for_cpu and/or sync_for_device should be used to set the owner of the buffer specified by the above-mentioned offset and size.
When the CPU accesses the buffer, '1' should be written to sync_for_cpu to set the CPU as the owner. Upon the write to sync_for_cpu, the CPU cache is invalidated if sync_direction is 2(=DMA_FROM_DEVICE) or 0(=DMA_BIDIRECTIONAL). Once the CPU becomes the owner of the buffer, the accelerator must not access the buffer.
On the other hand, when the accelerator needs to access the buffer, '1' should be written to sync_for_device to change ownership of the buffer to the accelerator. Upon the write to sync_for_device, the CPU cache of the specified memory area is flushed to main memory if sync_direction is 1(=DMA_TO_DEVICE) or 0(=DMA_BIDIRECTIONAL).
However, if the dma-coherent property is specified in the device tree, the CPU cache is neither invalidated nor flushed.
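The ownership hand-off described above can be sketched as a sequence of writes to the sysfs device files. This is a minimal illustration under the assumption that class_dir points at /sys/class/udmabuf/<device-name>; the helper names (write_attr, configure_sync, hand_to_device, hand_to_cpu) are hypothetical, and starting and waiting for the accelerator is outside the scope of the sketch.

```python
import os

DMA_BIDIRECTIONAL, DMA_TO_DEVICE, DMA_FROM_DEVICE = 0, 1, 2


def write_attr(class_dir, name, value):
    """Write a value to one device file under class_dir."""
    with open(os.path.join(class_dir, name), "w") as f:
        f.write(str(value))


def configure_sync(class_dir, offset, size, direction):
    """Select the shared range and the DMA direction."""
    write_attr(class_dir, "sync_offset", offset)
    write_attr(class_dir, "sync_size", size)
    write_attr(class_dir, "sync_direction", direction)


def hand_to_device(class_dir):
    """Make the accelerator the owner (cache flush if the direction requires it)."""
    write_attr(class_dir, "sync_for_device", 1)


def hand_to_cpu(class_dir):
    """Make the CPU the owner (cache invalidate if the direction requires it)."""
    write_attr(class_dir, "sync_for_cpu", 1)


# Typical order for an accelerator that writes results into the buffer:
#   configure_sync(d, 0, size, DMA_FROM_DEVICE)
#   hand_to_device(d)   # before starting the accelerator
#   ... run the accelerator and wait for completion ...
#   hand_to_cpu(d)      # before the CPU reads the results
```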
The Python programming language has an extension library called "NumPy".
This section explains how to map the DMA buffer allocated in the kernel by udmabuf with NumPy's memmap, so that it can be used in the same way as an ndarray.
import numpy as np

class Udmabuf:
    """A simple udmabuf class"""

    def __init__(self, name):
        self.name = name
        self.device_name = '/dev/%s' % self.name
        self.class_path = '/sys/class/udmabuf/%s' % self.name
        self.phys_addr = self.get_value('phys_addr', 16)
        self.buf_size = self.get_value('size')
        self.sync_offset = None
        self.sync_size = None
        self.sync_direction = None

    def memmap(self, dtype, shape):
        self.item_size = np.dtype(dtype).itemsize
        self.array = np.memmap(self.device_name, dtype=dtype, mode='r+', shape=shape)
        return self.array

    def get_value(self, name, radix=10):
        # Read the first line of a sysfs device file as an integer.
        value = None
        with open(self.class_path + '/' + name) as f:
            for line in f:
                value = int(line, radix)
                break
        return value

    def set_value(self, name, value):
        # "with" closes the file (the original f.close lacked the call parentheses).
        with open(self.class_path + '/' + name, 'w') as f:
            f.write(str(value))

    def set_sync_area(self, direction=None, offset=None, size=None):
        if offset is None:
            self.sync_offset = self.get_value('sync_offset')
        else:
            self.set_value('sync_offset', offset)
            self.sync_offset = offset
        if size is None:
            self.sync_size = self.get_value('sync_size')
        else:
            self.set_value('sync_size', size)
            self.sync_size = size
        if direction is None:
            self.sync_direction = self.get_value('sync_direction')
        else:
            self.set_value('sync_direction', direction)
            self.sync_direction = direction

    def set_sync_to_device(self, offset=None, size=None):
        self.set_sync_area(1, offset, size)   # 1 = DMA_TO_DEVICE

    def set_sync_to_cpu(self, offset=None, size=None):
        self.set_sync_area(2, offset, size)   # 2 = DMA_FROM_DEVICE

    def set_sync_to_bidirectional(self, offset=None, size=None):
        self.set_sync_area(0, offset, size)   # 0 = DMA_BIDIRECTIONAL

    def sync_for_cpu(self):
        self.set_value('sync_for_cpu', 1)

    def sync_for_device(self):
        self.set_value('sync_for_device', 1)
from udmabuf import Udmabuf
import numpy as np
import time

def test_1(a):
    for i in range(0, 9):
        a *= 0
        a += 0x31

if __name__ == '__main__':
    udmabuf = Udmabuf('udmabuf0')
    test_dtype = np.uint8
    # Integer division keeps test_size usable as an array shape.
    test_size = udmabuf.buf_size // np.dtype(test_dtype).itemsize
    udmabuf.memmap(dtype=test_dtype, shape=(test_size,))
    comparison = np.zeros(test_size, dtype=test_dtype)
    print("test_size : %d" % test_size)
    start = time.time()
    test_1(udmabuf.array)   # the mapped array is stored in udmabuf.array
    elapsed_time = time.time() - start
    print("udmabuf0 : elapsed_time:{0}[sec]".format(elapsed_time))
    start = time.time()
    test_1(comparison)
    elapsed_time = time.time() - start
    print("comparison : elapsed_time:{0}[sec]".format(elapsed_time))
    if np.array_equal(udmabuf.array, comparison):
        print("udmabuf0 == comparison : OK")
    else:
        print("udmabuf0 != comparison : NG")
Install udmabuf. In this example, 8MiB DMA buffer is reserved as "udmabuf0".
zynq# insmod udmabuf.ko udmabuf0=8388608
[34654.622746] udmabuf udmabuf0: driver installed
[34654.627153] udmabuf udmabuf0: major number = 237
[34654.631889] udmabuf udmabuf0: minor number = 0
[34654.636685] udmabuf udmabuf0: phys address = 0x1f300000
[34654.642002] udmabuf udmabuf0: buffer size = 8388608
[34654.642002] udmabuf udmabuf0: dma-coherent = 0
Executing the script in the previous section gives the following results.
zynq# python udmabuf_test.py
test_size : 8388608
udmabuf0 : elapsed_time:1.53304982185[sec]
comparison : elapsed_time:1.536673069[sec]
udmabuf0 == comparison : OK
The execution times for "udmabuf0" (a buffer area allocated in the kernel) and for the same operation on an ordinary ndarray (comparison) were almost the same. That is, the CPU cache appears to be effective for "udmabuf0" as well.
I confirmed the contents of "udmabuf0" after running this script.
zynq# dd if=/dev/udmabuf0 of=udmabuf0.bin bs=8388608
1+0 records in
1+0 records out
8388608 bytes (8.4 MB) copied, 0.151531 s, 55.4 MB/s
shell#
shell# od -t x1 udmabuf0.bin
0000000 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
*
40000000
After executing the script, it was confirmed that the result of the execution remains in the buffer. Just to be sure, let's check that NumPy can read it.
zynq# python
Python 2.7.9 (default, Aug 13 2016, 17:56:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.memmap('/dev/udmabuf0', dtype=np.uint8, mode='r+', shape=(8388608))
>>> a
memmap([49, 49, 49, ..., 49, 49, 49], dtype=uint8)
>>> a.itemsize
1
>>> a.size
8388608
>>>