open-power/snap

hls_intersect: Hardware stuck and returns error on mmio_read()


R hw_snap_mmio_read32(0xc7e510, f000, 1) -1
R hw_snap_mmio_read32(0xc7e510, f000, 1) -1
R hw_snap_mmio_read32(0xc7e510, f000, 1) -1
R hw_snap_mmio_read32(0xc7e510, f000, 1) -1
R hw_snap_mmio_read32(0xc7e510, f000, 1) -1
R hw_snap_mmio_read32(0xc7e510, f000, 1) -1

I saw this in a failing intersect test case. We should stop and return an error, but instead we seem to keep polling for results on a broken card.

Seems I created a corner case again?

Hi Frank, would you check the log (sim.log) again?
I see this error with the following message:

top.ddr3_dimm.mem.rank[0].sodimm[8].ddr3.memory_write: at time 44414534.0 ps ERROR: Memory overflow.  Write to Address 0080008 with Data xxxxxxxx0000000a will be lost.
You must increase the MEM_BITS parameter or define MAX_MEM.
/afs/bb/u/luyong/capi/mysnap/puresnap/hardware/ip/ddr3sdram_ex/imports/ddr3.v:756                 if (STOP_ON_ERROR) $stop(0);
ncsim>  exit

For the simulation to pass, I have to delete the // in the first line of ip/ddr3sdram_ex/imports/ddr3.v, or modify sim/core/ddr3_dimm.sv to add MAX_MEM, and then make the model again.

But that generates many files and uses a lot of disk space.

We know that 3 processes are invoked for a simulation:

1. Simulator
2. PSLSE
3. application

When the first one, i.e. NCSIM, hits an error condition and exits (as in this example), the script hardware/sim/run_sim should have a way to terminate the application.
The run_sim script knows SIM_PID; is there a way to watch it? When the simulator is terminated in the middle of a run, the application should be killed as well.
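
A minimal sketch of the watcher idea above, not the actual run_sim code: poll the simulator PID with `kill -0` and, once it is gone, signal the remaining processes. The function name `watch_sim` and the one-second poll interval are assumptions made for illustration.

```shell
# Hypothetical helper (not taken from run_sim).
# Usage: watch_sim SIM_PID OTHER_PID...
# Blocks until SIM_PID no longer exists, then sends SIGTERM to every
# other PID given on the command line.
watch_sim() {
    local sim_pid=$1 pid
    shift
    # kill -0 delivers no signal; it only tests whether the PID still exists.
    while kill -0 "$sim_pid" 2>/dev/null; do
        sleep 1
    done
    for pid in "$@"; do
        # A process may already have exited on its own; ignore that case.
        kill "$pid" 2>/dev/null || true
    done
}
```

Note that this only signals the PIDs it is given; processes started by them are not covered.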

To reproduce this, you can run the hls_intersect simulation on the current master branch.
An even easier way is to kill the NCSIM process in the middle of a run. Then you will see tons of messages like the ones Frank posted.
Hi @joergkayser , would you also have a look at this? Thanks a lot.

I tried to reproduce this case with Vivado 2016.4, card=ku3, action=hls_intersect, SDRAM_USED=TRUE, Simulator=irun. The error was:
the application hls_intersect caused a memory overflow in the simulator irun, and irun stopped running.
run_sim detected that the simulator was gone, so it killed the other processes on its list, which were PSLSE and testlist.sh.
As hls_intersect was a child of testlist.sh, it continued to run and was only stopped later, after the timeout expired.

I changed run_sim so that it kills not only the recorded PIDs, but also all child processes spawned by those PIDs. Please test branch=sim_kill_childPID
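
One way such a cleanup could look, as a sketch only; the code in the sim_kill_childPID branch may differ. The helper `kill_tree` is a hypothetical name, and walking all descendants recursively with `pgrep -P` is an assumption that goes one step beyond direct children.

```shell
# Hypothetical helper (not taken from run_sim).
# Usage: kill_tree PID [SIGNAL]
# Signals every descendant of PID depth-first, then PID itself.
kill_tree() {
    local pid=$1 sig=${2:-TERM} child
    # pgrep -P lists the direct children of the given parent PID.
    for child in $(pgrep -P "$pid"); do
        kill_tree "$child" "$sig"
    done
    # The process may already be gone; ignore that case.
    kill -s "$sig" "$pid" 2>/dev/null || true
}
```

Calling `kill_tree` on the recorded PID of testlist.sh would then also reach an application it started. A process that forks between the `pgrep` call and the `kill` can still escape, so this is best effort rather than a guarantee.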

Mhm. You hijacked my bug to discuss intersect problems. So leave this open, even though I fixed the problem in the low-level code that kept it from aborting when the register read function returned a bad return code.

Hi Frank, I think this has been fixed, right?

@luyong6 closing as fixed. Please open a new bug if an issue remains.