(Again) memory cache management and coherency
pmdaye opened this issue · 7 comments
Hello,
First of all, this is an AMAZING driver and I am really grateful that you provide it through github.
I am working with an Zynqmp and petalinux 2020.2
There is something unclear to me for the manual cache management.
The driver allocates a coherent memory region.
I have a second module which interacts with the PL to provide the physical address (given by u_dma_buf) where our IP must directly writes.
My module receives an interrupt when the data are written within the memory area by the PL (I am sure that there is no cache here as we are directly writing in the memory area).
When I read the data, I see that I do have cache issues, even if I use sync_for_cpu.
What am I missing? From my understanding, I should not have any cache issue as I only read some memory region.
Thanks,
PMD
By the way, we are not using the ACP. Could it be the problem?
Thank you for the issue.
However, I cannot answer because I do not know the situation you are encountering.
Can you provide me with source code or device tree?
Sorry for the lack of description.
This is my device tree
` reserved-memory {
#address-cells = <2>;
#size-cells = <2>;
ranges;
image_buf0: image_buf@0 {
compatible = "shared-dma-pool";
reusable;
reg = <0x0 0x40000000 0x0 0x2000000>;
label = "image_buf0";
};
};
udmabuf@0 {
compatible = "ikwzm,u-dma-buf";
device-name = "p3-udmabuf0";
size = <0x2000000>; // 64MiB
memory-region = <&image_buf0>;
dma-coherent;
};
To be more clear, we wrote a v4l2 driver that will use CMA buffers within the p3-udmabuf0 to access images written by the PL. Therefore we divide the p3-udmabuf0 into a series of sub memory range using thiis. I found in one of the issues that I have to initialize the memory region to ensure that It can be accessed by the PL afterwards
if (int fd; (fd = open(_buf_file.c_str(), O_RDWR | O_SYNC)) != -1) {
_mptr = mmap(NULL, _bufferSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (_mptr == MAP_FAILED) {
std::error_code ec{errno, std::system_category()};
throw std::system_error(ec, "Can't map CMA memory");
}
_fd = fd;
if (init) {
uint64_t* word_ptr = reinterpret_cast<uint64_t*>(_mptr);
size_t words = _bufferSize / sizeof(uint64_t);
for(int i = 0 ; i < words; i++) {
word_ptr[i] = idx << 0 | idx << 8 | idx << 16 | idx << 24 | idx << 32 | idx << 40 | idx << 48 | idx << 56;
}
}
get_prop("phys_addr") >> std::hex >> _phys_addr;
get_prop("size") >> std::dec >> _bufferSize;
_fd = -1;
} else {
return;
}
_singleBufferPointers.clear();
_singleBufferPointers.resize(_n_buffers);
uint64_t* ptr = reinterpret_cast<uint64_t*>(_mptr);
setSyncDirection();
for (size_t ii = 0; ii < _n_buffers; ii++) {
_singleBufferPointers[ii] = ptr + ii * _singleBufferSize / sizeof(uint64_t);
}`
Then we queued these buffers with v4l2.
Once one image had been written in one of the buffer, we set the corresponding buffer as being ready and we dequeued it from the userspace.
The results we have is that the first image is OK. However, if we have 32 buffers, the buffers from 1 to 31 are empty (they remain at their initialization value) but when we loop over the same buffers again, now we have our images...
From all the analyses we made, it seems that it is an issue with cache coherency but we are unable to correct the issue.
This part of the code is responsible for dequeueing/enqueueing the buffer.
` CLEAR(buf);
buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
buf.memory = V4L2_MEMORY_USERPTR;
xioctl(fd, VIDIOC_DQBUF, &buf); buffers.sync_for_cpu(buf.index);
buffers.sync_for_cpu(buf.index);
pfile.addSaccadeFrame(kk, buffers.getMappedPtr(buf.index));
xioctl(fd, VIDIOC_QBUF, &buf);`
This is the function that should clean the cache problems (if I understood correctly)
`void p3_udmabuf::sync_for_cpu(uint32_t buf_idx) const
{
char attr[1024];
const unsigned long sync_offset = buf_idx * _singleBufferSize;
const unsigned long sync_size = _singleBufferSize;
unsigned int sync_direction = 2;
unsigned long sync_for_cpu = 1;
auto path = (sys_prefix + "/").append(_requested_buf).append("/sync_for_cpu");
if (int fd; (fd = open(path.c_str(), O_WRONLY)) != -1) {
snprintf(attr, sizeof(attr), "0x%08X%08X", (sync_offset & 0xFFFFFFFF), (sync_size & 0xFFFFFFF0) | (sync_direction << 2) | sync_for_cpu);
write(fd, attr, strlen(attr));
close(fd);
}
}`
(We also tried with sync_direction=1 but it doesn't work.
I hope that this is clearer and thank you in advance for the time you would take to read this rather long post...
It seems that the cache is not enabled because you specified the O_SYNC flag when opening the file.
Maybe it's not a cache issue.
By the way, do you need buf.m.userptr = _singleBufferPointers [buf.index];
before xioctl (fd, VIDIOC_QBUF, &buf)
?
Thanks for your reply.
A little more context would maybe help you understanding what we are doing.
The PL IP reads the data from an ONSemi Python300 sensor.
The goal here is to minimize the amount of copy when trigger is emitted by the PL to signal that an image is ready.
Using the approach of the userptr in v4l2 seems like the perfect solution combined with your driver.
Therefore the temporal order we have is:
- We signal to v4l2 that it will have n buffers to work with (32 is the maximum)
- We create the corresponding number of buffers using u_dma_buf
- We queue all the created buffers with the corresponding index, the mmaped address and the size of the buffer (in userspace)
- Within the kernel, the function that adds the buffer to the queue extract the corresponding physical address and sends it to a FIFO within the PL (kernel space)
- Now, the PL has a FIFO with a series of physical addresses to which it has to write each image sequentially
- We start the video stream
- We wait for an interrupt (in kernel space)
- We signal to v4l2 that an buffer is ready to be exported to userspace
- We dequeue the buffer (in userspace)
- We memcpy the data (in userspace)
- We requeue the buffer within the PL FIFO as in point 3.
To validate that the data are correctly written by the PL to the physical addresses, we built a baremetal example that runs smoothly and in which we can check that the data are correctly written (using a test pattern instead of the actual image sensor).
HOWEVER!! To make working this baremetal software, we need to invalidate the DCacheLine using Xil_DCacheInvalidateRange();
We would like to keep the manual cache management as we have shown in the baremetal that it is way faster than without it (as expected I believe).
I believed that I could make it work using sync_for_cpu with the sync_direction set to 2 to force and invalidate cache as explained in your documentation.
However the pattern we observe is very strange.
To demonstrate the behavior, I logged different pieces of information.
- In this test, we read the data from the sensor with 32 buffers allocated through u_dma_buf.
- To initialize the buffers, I set the value of each byte to the corresponding index of the buffer (buffer 0 is filled with 0, buffer 1 is filled with 1, etc)
- Every time I read a frame, I print the number of the sequence (increasing monotonicaly), the index of the buffer (between 0 and 31), the mmaped address (ptr), the corresponding physical address and the value of the 32 first bytes,
The results are below:
sequence: 0, buf_idx: 0 ptr: 0xffff947dc000 phys: 40000000 val: 0x181e161e151d151c,171c161d161b161e,191c1a1d181b191b,1b1c1a1d1a1d1b1c
sequence: 1, buf_idx: 1 ptr: 0xffff94872000 phys: 40096000 val: 0x161e161f171d151f,161e171d161e181e,101010101010101,101010101010101
sequence: 2, buf_idx: 2 ptr: 0xffff94908000 phys: 4012c000 val: 0x161c171b151c161d,171b161a151c151d,202020202020202,202020202020202
sequence: 3, buf_idx: 3 ptr: 0xffff9499e000 phys: 401c2000 val: 0x161f171f171d151f,1620161e1620181e,303030303030303,303030303030303
sequence: 4, buf_idx: 4 ptr: 0xffff94a34000 phys: 40258000 val: 0x151c151d151d141c,171d161d151d161e,404040404040404,404040404040404
sequence: 5, buf_idx: 5 ptr: 0xffff94aca000 phys: 402ee000 val: 0x171e171d161e171e,1820181f181e1620,505050505050505,505050505050505
sequence: 6, buf_idx: 6 ptr: 0xffff94b60000 phys: 40384000 val: 0x151c171b141e151b,161c161d181e151d,606060606060606,606060606060606
sequence: 7, buf_idx: 7 ptr: 0xffff94bf6000 phys: 4041a000 val: 0x161e151d181f151d,1820171f181e181e,707070707070707,707070707070707
sequence: 8, buf_idx: 8 ptr: 0xffff94c8c000 phys: 404b0000 val: 0x161d161c151d151c,161c151d171c151c,808080808080808,808080808080808
sequence: 9, buf_idx: 9 ptr: 0xffff94d22000 phys: 40546000 val: 0x171e171d1520171f,1720171f1820181e,909090909090909,909090909090909
sequence: 10, buf_idx: 10 ptr: 0xffff94db8000 phys: 405dc000 val: 0x151d151c151b141d,161c171c151c161c,a0a0a0a0a0a0a0a,a0a0a0a0a0a0a0a
sequence: 11, buf_idx: 11 ptr: 0xffff94e4e000 phys: 40672000 val: 0x1620161f161d1620,1720181e1720161f,b0b0b0b0b0b0b0b,b0b0b0b0b0b0b0b
sequence: 12, buf_idx: 12 ptr: 0xffff94ee4000 phys: 40708000 val: 0x141c151d141e141b,151d151b141e151f,c0c0c0c0c0c0c0c,c0c0c0c0c0c0c0c
sequence: 13, buf_idx: 13 ptr: 0xffff94f7a000 phys: 4079e000 val: 0x182015201520151f,1620171f1620161f,d0d0d0d0d0d0d0d,d0d0d0d0d0d0d0d
sequence: 14, buf_idx: 14 ptr: 0xffff95010000 phys: 40834000 val: 0x151e151f151d131c,141d141d131c151e,e0e0e0e0e0e0e0e,e0e0e0e0e0e0e0e
sequence: 15, buf_idx: 15 ptr: 0xffff950a6000 phys: 408ca000 val: 0x1521141f15221620,1720162115211620,f0f0f0f0f0f0f0f,f0f0f0f0f0f0f0f
sequence: 16, buf_idx: 16 ptr: 0xffff9513c000 phys: 40960000 val: 0x131c121c151e141d,131e141c151d141d,1010101010101010,1010101010101010
sequence: 17, buf_idx: 17 ptr: 0xffff951d2000 phys: 409f6000 val: 0x1520162115211621,161f152015201420,1111111111111111,1111111111111111
sequence: 18, buf_idx: 18 ptr: 0xffff95268000 phys: 40a8c000 val: 0x141f131c131c131e,131c141e131d141d,1212121212121212,1212121212121212
sequence: 19, buf_idx: 19 ptr: 0xffff952fe000 phys: 40b22000 val: 0x1521141f14201420,1521152115201620,1313131313131313,1313131313131313
sequence: 20, buf_idx: 20 ptr: 0xffff95394000 phys: 40bb8000 val: 0x131e141c121e141d,131d141d141d131d,1414141414141414,1414141414141414
sequence: 21, buf_idx: 21 ptr: 0xffff9542a000 phys: 40c4e000 val: 0x1521141f1520151f,1620141f16201620,1515151515151515,1515151515151515
sequence: 22, buf_idx: 22 ptr: 0xffff954c0000 phys: 40ce4000 val: 0x131e141e121d141f,141c141d141f131c,1616161616161616,1616161616161616
sequence: 23, buf_idx: 23 ptr: 0xffff95556000 phys: 40d7a000 val: 0x15211421151e1421,1620161f141e151e,1717171717171717,1717171717171717
sequence: 24, buf_idx: 24 ptr: 0xffff955ec000 phys: 40e10000 val: 0x141c141d141b141d,151e141c141d131d,1818181818181818,1818181818181818
sequence: 25, buf_idx: 25 ptr: 0xffff95682000 phys: 40ea6000 val: 0x1520151f141f1521,151f181f151f1520,1919191919191919,1919191919191919
sequence: 26, buf_idx: 26 ptr: 0xffff95718000 phys: 40f3c000 val: 0x151d141d131b121c,161e131d141d141d,1a1a1a1a1a1a1a1a,1a1a1a1a1a1a1a1a
sequence: 27, buf_idx: 27 ptr: 0xffff957ae000 phys: 40fd2000 val: 0x162015211620161f,1720171e15201720,1b1b1b1b1b1b1b1b,1b1b1b1b1b1b1b1b
sequence: 28, buf_idx: 28 ptr: 0xffff95844000 phys: 41068000 val: 0x141b141d141d131d,151e151f141d151d,1c1c1c1c1c1c1c1c,1c1c1c1c1c1c1c1c
sequence: 29, buf_idx: 29 ptr: 0xffff958da000 phys: 410fe000 val: 0x1620152115201622,1820182017201620,1d1d1d1d1d1d1d1d,1d1d1d1d1d1d1d1d
sequence: 30, buf_idx: 30 ptr: 0xffff95970000 phys: 41194000 val: 0x151e131d131d131f,161d161c131b151d,1e1e1e1e1e1e1e1e,1e1e1e1e1e1e1e1e
sequence: 31, buf_idx: 31 ptr: 0xffff95a06000 phys: 4122a000 val: 0x1622162014201521,1622171f15211621,1f1f1f1f1f1f1f1f,1f1f1f1f1f1f1f1f
sequence: 32, buf_idx: 0 ptr: 0xffff947dc000 phys: 40000000 val: 0x141e131d131e141f,161e151f161e151d,1d1c1a1e1c1c1a1c,1b1d1d1e1c1e1c1c
sequence: 33, buf_idx: 1 ptr: 0xffff94872000 phys: 40096000 val: 0x1423172215201621,1621192017221620,1a1e181c191b181b,181c191d181f191c
sequence: 34, buf_idx: 2 ptr: 0xffff94908000 phys: 4012c000 val: 0x151e141e151e121e,151f161e141f1520,1e1d1b1e1d1d1d1e,1e1f1d1e1b1f1c1c
sequence: 35, buf_idx: 3 ptr: 0xffff9499e000 phys: 401c2000 val: 0x1423162316211522,1821182217231822,1a1c191b181b181c,191a181d191c191b
sequence: 36, buf_idx: 4 ptr: 0xffff94a34000 phys: 40258000 val: 0x161e151e141f141f,151e151e131e1520,1f1f1b1e1c1d1a1c,1d1f1d1d1d1d1c1d
sequence: 37, buf_idx: 5 ptr: 0xffff94aca000 phys: 402ee000 val: 0x1823162116221622,1623172217221621,1a1d1a1d181c191d,191c191c181b1a1b
- The first image (sequence 0) is perfect.
- Images 1 to 31: the first 16 bytes are OK but the following 16 are not changed (I highlighted the value in bold above). Their value remains the one from the initialization as if nothing was written in the DDR by the PL. However, we have validated that PL is correctly writing in the DDR through the baremetal application
- Images 32 to the end. The first 16 bytes are OK and the following ones corresponds to the image in the previous iteration of the buffer.
It seems that only the first 16 bytes are updated with the sync_to_cpu.
I am sincerely apologizing for the lack of clarity. I really would appreciate if you had any idea on how to solve this issue...
Thank you again!!
It seems that the cache is not enabled because you specified the O_SYNC flag when opening the file.
Maybe it's not a cache issue.
By the way, do you need buf.m.userptr = _singleBufferPointers [buf.index]; before xioctl (fd, VIDIOC_QBUF, &buf)?
What happened to the answer to this point?
What happened to the answer to this point?
excuse me,have you solved this problem? Can you share the solution