open-power/snap

Action hangs while writing on DDR

Closed this issue · 21 comments

Hi all,
I have written HDL code that works somewhat like hls_memcpy: it reads and/or writes data from/to host memory or the FPGA card DDR.
I have a problem when writing to DDR. When I run the application for the first time, it writes to DDR and the action finishes, but on the 2nd run the action hangs, and sometimes I need to run snap_maint again (in simulation it works properly). The main problem is that the last time, snap_maint returned this error:
Error: Can not open CAPI-SNAP Device: /dev/cxl/afu0.0m
I ran snap_maint many times and still get the same error. Unfortunately, the FPGA card is not detected anymore! Should I flash the FPGA with the factory bitstream?
Why would this happen?

Hi Abbas
The error you are referring to is displayed when:

  • you have no rights (anymore) to access /dev/cxl/afu0.0m
  • /dev/cxl/afu0.0m doesn't exist anymore => a reboot of the system should bring your cards back
    Could the root cause be a PCIe slot reset? Did you correctly detach the action and the card before reattaching them?
    Can you paste the result of dmesg | grep EEH, or send me the dmesg file at bruno.mesnet@fr.ibm.com?
    Thx
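
The two cases listed above (device node missing vs. no access rights) can be told apart with a quick check before calling snap_maint. This is only a minimal sketch: the function name diagnose_device and its return strings are made up for illustration, and it only looks at the device node itself, not at the driver state.

```python
import os

def diagnose_device(path="/dev/cxl/afu0.0m"):
    """Classify why opening a device node might fail (illustrative helper)."""
    if not os.path.exists(path):
        # Node is gone: matches the "doesn't exist anymore" case, reboot expected
        return "missing"
    if not os.access(path, os.R_OK | os.W_OK):
        # Node exists but the current user cannot read/write it
        return "no_permission"
    return "ok"
```

Calling diagnose_device() with no argument checks the path from the error message; any other path can be passed in for a different card.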

Hi Bruno,
Rebooting solved the problem.
I do not think a PCIe reset caused the problem.
When the action (or application) hangs, it does not detach the action. I remember that sometimes it times out, but sometimes I also had to kill it.
But why does this happen when the destination is DDR? Have you ever faced such a problem in your developments?
The AXI interface module is exactly the same for both the host memory and DDR interfaces.

Hi Abbas, this seems very similar to issue #882.
Are you using an AD9V3 card?
Are you using the latest git code from the master branch?
I'm surprised that you see the issue only on your second run.
Can you try and get a core dump by running ulimit -c unlimited before running your code?
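
If the application is started from a Python wrapper rather than directly from the shell (an assumption, the thread does not say how it is launched), the same limit can be raised with the standard resource module. A minimal sketch; it raises the soft limit only as far as the hard limit allows, so it matches ulimit -c unlimited only when the hard limit is itself unlimited.

```python
import resource

def enable_core_dumps():
    """Raise the soft core-file size limit up to the current hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Child processes started afterwards inherit this limit
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

The limit applies to the calling process and to anything it launches afterwards, for example through subprocess.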

Yes, I am using an AD9V3.
I just downloaded snap-master and tried one more time. Still the same problem; this time it happened on the 1st run.
I have 3 images with small differences but the same functionality. All have the same problem, and it happens in almost every run.

Thanks to your code, after some analysis, axi_card_mem0_wready seems, for some reason, to be related to the c0_ddr4_s_axi_bready signal, which goes low and prevents any new burst.
This signal, m_axi_bready, is an output of the axi_clock_convert module, which is the Xilinx IP axi_clock_converter_v2_1_16_axi_clock_converter.
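
To illustrate how a low bready can end up blocking wready, here is a toy cycle-level model in Python. It is a sketch under assumptions, not a description of the Xilinx IP: the limit max_outstanding and the one-burst-per-cycle behaviour are hypothetical, chosen only to show the dependency.

```python
def simulate_writes(cycles, bready_at, max_outstanding=4):
    """Toy model of an AXI write slave with a bounded number of pending responses.

    bready_at(cycle) -> bool tells whether the master asserts bready that cycle.
    max_outstanding is a hypothetical limit, not a value taken from the real IP.
    Returns (accepted bursts, per-cycle wready trace).
    """
    pending = 0          # write responses waiting for a bready handshake
    accepted = 0
    wready_trace = []
    for cycle in range(cycles):
        # The slave only accepts a new burst while it can still queue a response
        wready = pending < max_outstanding
        wready_trace.append(wready)
        if wready:
            accepted += 1
            pending += 1
        # A response is retired only when the master asserts bready
        if pending and bready_at(cycle):
            pending -= 1
    return accepted, wready_trace
```

With simulate_writes(20, lambda c: c < 10), bready drops after cycle 9, four more bursts are accepted, and wready stays low from cycle 14 onwards, which is the kind of stall described above.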

I am continuing to investigate what leads to this condition, and whether there is a relationship with the reset you get. Thanks

Hi @abbasBSC
The axi_clock_convert is just copying the axi_card_mem0_bvalid signal coming out of your action. Can you check why this signal goes low and never comes back to 1? This seems to be the reason why all writes to DDR stop.

Hi Bruno,
"axi_card_mem0_bvalid" is an input to my action, not an output.

Sorry, my fault. Please read axi_card_mem0_bready.

You are right, Bruno: when writing to DDR, some situation prevents axi_card_mem0_bready from being 1. I will force this signal to 1 always. But why should this kind of problem cause a reset?

Thanks Abbas. Did you observe this reset in simulation, or on real hardware only? On my side, I haven't been able to observe any reset in simulation other than the timeout which stops the action. From what I know of AXI, there is no retry, timeout, or reset if the slave takes more time than expected or never answers.
Debugging on a real FPGA means adding an ILA debugger on the different resets to understand what conditions lead to a PCIe reset. This takes some time, but I'll try to find time to catch this condition. I'll keep you posted.

I have seen a reset in simulation, but not with this code, only with other code.
This code causes a reset on real hardware almost always when it is run with some specific configurations.

In my code, when writing to DDR stops (due to the bready signal), reading from host memory also stops, because the FPGA FIFO becomes full and does not accept new data. Hence, the AXI bus on the CAPI side hangs.
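
The back-pressure chain described above can be sketched with a small Python model: a bounded FIFO sits between the host read side and the DDR write side, and once the DDR side stalls the FIFO fills and host reads stop. The FIFO depth and the one-word-per-cycle rates are assumptions for illustration, not values from the actual design.

```python
from collections import deque

def run_pipeline(cycles, ddr_stalled_from, fifo_depth=8):
    """Toy model: host reads feed a bounded FIFO that is drained by DDR writes.

    From cycle ddr_stalled_from onwards the DDR write side never drains again.
    Returns (host_reads, ddr_writes, final FIFO occupancy).
    """
    fifo = deque()
    host_reads = 0
    ddr_writes = 0
    for cycle in range(cycles):
        # DDR write side drains one word per cycle unless it is stalled
        if fifo and cycle < ddr_stalled_from:
            fifo.popleft()
            ddr_writes += 1
        # Host read side only issues a read while the FIFO has room
        if len(fifo) < fifo_depth:
            fifo.append(cycle)
            host_reads += 1
    return host_reads, ddr_writes, len(fifo)
```

For example, run_pipeline(100, 20) returns (27, 19, 8): after the stall at cycle 20 the host side manages only enough reads to fill the FIFO, then it stops for the remaining cycles.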

Based on my observations, the reset problem occurs when the AXI bus on the CAPI side hangs and never finishes. Then, on timeout, the attempt to detach the action causes the reset problem. It might not be the only case that causes a PCIe reset, but it is certainly one of them, as I tested with different code both in simulation and on the real system.

To solve the PCIe reset problem, I tried resetting the action code and the axi_read/write modules by setting a bit in the ACTION_CONTROL registers after each action timeout or action idle (before detach), and using it as a reset for my FPGA modules. It solved the problem in some cases, but not always.

When AXI hangs, the axi_read/write modules enter a state and never leave it. By resetting them on timeout, they return to their initial/idle state before the action detach (and also start from the initial/idle state in the next run). But this is only one side of the AXI bus. I am not able to reset the other side, which is the PSL part. So if the PSL is trapped in a state and nothing resets it after the timeout, in the next run it will continue from its previous unknown state.
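
The action-side half of this (a state machine that gets stuck waiting and is forced back to idle by a soft reset bit) can be modelled in a few lines of Python. This is a sketch only: the state names and the step() inputs are hypothetical and do not come from the actual axi_read/write modules, and it says nothing about the PSL side, which stays out of reach as explained above.

```python
from enum import Enum

class State(Enum):
    IDLE = 0
    BUSY = 1
    WAIT_RESP = 2

class AxiWriterModel:
    """Toy model of an action-side write FSM with a soft reset input."""

    def __init__(self):
        self.state = State.IDLE

    def step(self, start=False, data_done=False, resp=False, soft_reset=False):
        if soft_reset:
            # Reset has priority: return to IDLE from any state
            self.state = State.IDLE
        elif self.state is State.IDLE and start:
            self.state = State.BUSY
        elif self.state is State.BUSY and data_done:
            self.state = State.WAIT_RESP
        elif self.state is State.WAIT_RESP and resp:
            self.state = State.IDLE
        return self.state
```

Without the soft_reset input, a model left in WAIT_RESP stays there forever if resp never arrives, and the next run would start from that state instead of IDLE.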

So it is not only about when AXI (on the CAPI side) hangs; it could also happen if your action times out while it is sending or receiving data through AXI. In this case even your next runs are not valid, because the PSL state machine is probably in some undesired state, where it was stopped in the previous run. That's why you might sometimes face the reset problem even if your AXI is working correctly.

In a nutshell, if the AXI bus on the CAPI side does not work properly, it is possible to get a PCIe reset. In my opinion, resetting all FPGA modules after each action idle or timeout should solve the problem.

Try running "capi-reset 0 user" when you time out. This will reload the FPGA image and also put the PSL into an initial state.

But I still want to understand what happened on the DDR interface. Why doesn't axi_card_mem0_bready go up?

You get "capi-reset" after you clone https://github.com/ibm-capi/capi-utils and run make, then sudo make install. I think you already have that.

While writing to DDR, in some situations my code was not able to set bready again. It was a mistake in my code. This situation never happens when writing to host memory.

Yes, it is a manual reset. It is not nice to reset the FPGA card manually after each timeout. Besides, I think the reset should happen before the action detach, somewhere in the snap code, to avoid the PCIe reset.

Hi @abbasBSC,
I was wondering if this problem was solved and if we could close it or not. Thanks

Hi,
This is not a problem now, but it would be better if the snap_maint command could reset the CAPI FPGA module. Currently it only resets the user action.

snap_maint is just there to "discover" the action(s) present in a card, so it has to be executed only once. When you call it a 2nd time, it doesn't even try to re-access the card.
In the case where there are multiple actions (not yet implemented), we cannot reset the whole CAPI FPGA module, since other users/actions may be working at that time. This is the reason why we didn't follow this path.

It makes sense. But if one action expires, it affects the functionality of other actions, as well as the expired action itself in the next run, since the CAPI modules are not in the idle state any more.

Understood and agreed. This was my main difficulty when building actions/hls_latency: being sure that the action had a timeout in every case, so that it comes back to the initial state.
We need to think about whether this is feasible and how to implement it: a way to reset an action from the application. I'll put this on our to-do list. Thanks for your constructive feedback.