JTAG halt request gets missed if it occures during an illegal instruction

Describe the bug
While debugging the OCD interface using a Segger J-Link we've noticed that if we issue a halt request and the core is executing an illegal instructions then the halt request gets discarded.

To Reproduce
The simplest way to reproduce it, which is how we found it, is to create an 0'ed ROM image which will cause the core to go into an infinite illegal instruction loop. This as mtvec as default points to the ROM whose first location is all 0s which triggers the illegal instruction trap.
When the core is in this state you can see the trap.cause shift to 0x63 (halt request seen) but it gets overwritten by the illegal instruction trap 0x02 as that has a higher priority.

I've attached a screen grab from a Xilinx FPGA image with ILAs showing the trap_ctrl.cause, the cpu_halt_req_o from the DM module and the PC. The PC is looping between the first 2 ROM instructions (all 0's) and you can the cause register flipping between 0x63 and 0x02 with 0x02 winning as we keep looping around the first ROM address.

Expected behavior
This I'm a bit unsure of as I'm not 100% into the RISCV specification regarding debug behavior. I'm assuming that halt request would take the highest priority to ensure that no matter the state of the core you can always halt it and get into debug mode.

Screenshots

Environment:

OS: RHEL 7
GCC Version RISCV gcc 13.1
Libraries used: newlib

Hardware:

Hardware version (1.9.3) bug patched

Additional context
I have a local fix to the issue which is to re-order the trap_ctrl.cause priority to make halt request the highest priority. With the fix we can successfully connect to the neorv32 core with Segger J-Link.

      -- if we are supposed to halt, always halt
      if (trap_ctrl.irq_buf(irq_db_halt_c)  = '1') then trap_ctrl.cause <= trap_db_halt_c;  -- external halt request (async)
--      if    (trap_ctrl.exc_buf(exc_ialign_c)   = '1') then trap_ctrl.cause <= trap_ima_c;  -- instruction address misaligned

I also have a local DM modification that I believe better meets the debug spec 1.0 with regards to halt_req. With the added benefit that its a lot easier to support haltresetreq. I'm also holding reset for a bit long than 1 clock cycle, just to ensure that any "stuff" that has randomly pipelined somewhere gets time to get flushed.
Do note that for my implementation the reset to the OCD (DM and DMT) is seperated out from the rest of the system to support ndmreset. I can't remember what the current version of the core does.

  -- CPU halt/resume request --
  -- Ensure we hold halt request until the core has entered debug state
  dmi_halt_core: process(rstn_i, clk_i)
  begin
    if (rstn_i = '0') then
      hold_halt_req <= '0';
    elsif rising_edge(clk_i) then
      if (cpu_debug_i = '0') then
        if ((dm_reg.halt_req = '1') or (dm_reg.setresethaltreq = '1')) then
          hold_halt_req <= '1';
        end if;
      end if;
      if (cpu_debug_i = '1') then
        hold_halt_req <= '0';
      end if;
    end if;
  end process dmi_halt_core;
  cpu_halt_req_o <= hold_halt_req;
  --cpu_halt_req_o <= (dm_reg.halt_req and dm_reg.dmcontrol_dmactive) or bool_to_ulogic_f((unsigned(resetHaltRequestHold) > 0)); -- single-shot
  dci.resume_req <= dm_ctrl.hart_resume_req; -- active until explicitly cleared

  -- SoC reset --
--  cpu_ndmrstn_o <= '0' when ((dm_reg.dmcontrol_ndmreset = '1') or (dm_reg.setresethaltreq = '1')) and (dm_reg.dmcontrol_dmactive = '1') else '1'; -- to processor's reset generator
  dmi_reset_extend: process(rstn_i, clk_i)
  begin
    if (rstn_i = '0') then
      resetHaltRequestHold <= (others => '0'); -- clear reset halt request
    elsif rising_edge(clk_i) then
      if ((dm_reg.dmcontrol_ndmreset = '1') or (dm_reg.setresethaltreq = '1')) then
        resetHaltRequestHold <= (others => '1'); -- Set hold register to max count if we have a reset halt req
      elsif (dm_reg.clrresethaltreq = '1') then
        resetHaltRequestHold <= (others => '0'); -- clear reset halt request
      end if;
      -- as long as we are not 0 decrement
      if (unsigned(resetHaltRequestHold) > 0) then
        resetHaltRequestHold <= std_ulogic_vector(unsigned(resetHaltRequestHold) - 1);
      end if;
    end if; 
  end process dmi_reset_extend;

  cpu_ndmrstn_o <= '0' when bool_to_ulogic_f((unsigned(resetHaltRequestHold) > 0)) else '1'; -- to processor's reset generator

Thanks for the detailed report!

The simplest way to reproduce it, which is how we found it, is to create an 0'ed ROM image which will cause the core to go into an infinite illegal instruction loop. This as mtvec as default points to the ROM whose first location is all 0s which triggers the illegal instruction trap.

This is a scenario I did not consider...

But yes, all CPU exception have higher priority than the debugger halt request. The RISC-V spec is not really clear about this but I see the problem.

Your fix looks great and I think we should adopt this. However, there is one problem I see: imagine the processor executes an instruction that causes some exception - like an illegal instruction exception. If the debugger halt request kicks in right in this specific moment the illegal instruction exception will get lost.

I also have a local DM modification that I believe better meets the debug spec 1.0 with regards to halt_req. With the added benefit that its a lot easier to support haltresetreq. I'm also holding reset for a bit long than 1 clock cycle, just to ensure that any "stuff" that has randomly pipelined somewhere gets time to get flushed.

I think this not required. 🤔 The halt request keeps active until the debugger (openOCD) explicitly clears it:

neorv32/rtl/core/neorv32_debug_dm.vhd

Line 416 in 183fa56

    
           dm_reg.halt_req           <= dmi_req_i.data(31); -- haltreq (-/w): write 1 to request halt; has to be cleared again by debugger

I like the way you have added support for the "halt on reset" feature. However, according to the RISC-V debug spec. this feature is optional and can (should) be emulated by the debugger?! 🙈

As always @stnolting you are on point :) I'm not disagreeing with the - what if we miss an illegal instruction - point. With the counter being, if they want to halt we halt. But yeah not an easy thing. In our case we just so randomly happen happen to have an infinite illegal instruction loop, which then never can get exited from.

My perspective here is that we halt as that is the users/debuggers request, independent of anything else. If there is an issue with e.g. illegal instruction it is likely that the sole reason for the debug session is to find that instruction to figure out why XYZ isn't doing what its doing.
The counter is that what I'm highlighting is indeed a crazy corner case where the code is all 0s and the MTVEC must point to the all 0s code. Which honestly is only likely to happen on a blank core in an FPGA, as all the BRAMs come up as all 0s as default. So why care? E.g. if we had the ROM pre-loaded with code this wouldn't be an issue. The counter to this, is that this an exact use case for an initial bring-up for a FW team to start development of a boot room anyways.
The alternative is to correct the halt on reset/halt then reset part of the core to allow the core to go straight to debug mode out of reset vs always fetching and decoding the first instruction.
This assumes that no user is "smart" enough to set themselves up for an infinite loop of illegal instructions and or traps. The other way btw. is setting MTVEC outside instruction space, so any trap will cause an infinite illegal access loop. I note this down as saying there are many ways for users to... make stuff break. Out of principle I never write defense hardware, if we had to write hardware to prevent software folks from themselves it would be a fools errand.

-- notes on the other comments
Yup the halt on reset is indeed optional. We hit a snag when debugging jtag using Segger J-link, and that is that Segger - for some reason - doesn't read the feature list and ends up issuing a halt on reset of halt doesn't work... which we reported to Segger.
So I added the halt on reset feature as part of a debug session.

Now even with a halt on reset "quick feature", we aren't really forcing the core to enter debug mode out of reset. Instead the core is still allowed to boot and we use the normal halt feature to enter debug mode - at least to my quick code read. As our funky corner case is that the very first instruction (all 0s) is illegal it seems the core never enters debug mode because of the above priority. So the when the Segger J-link falls back on a halt-on-reset to detect the core, that also failed.

Now I might have read the note wrong (halt and ndbreset is pulsed as its gated with dm.active). If it isn't pulsed any more - ignore my comment :) The other feature is to hold haltreq towards the core until the core is in debug mode, then we deassert halt request. This allows haltreq to survive across a core reset as we aren't relying on the core to remember the halt request across a reset boundary.
The reset thing was mostly to extend reset to more than 1 clock (32 in my case) to ensure anything thats pipelined gets cleaned out.

My perspective here is that we halt as that is the users/debuggers request, independent of anything else. If there is an issue with e.g. illegal instruction it is likely that the sole reason for the debug session is to find that instruction to figure out why XYZ isn't doing what its doing.

Good point.

The alternative is to correct the halt on reset/halt then reset part of the core to allow the core to go straight to debug mode out of reset vs always fetching and decoding the first instruction.

This is what I prefer. It somehow doesn't feel right to have this “corner case”... 😅
I am currently carrying out some tests and believe I have found a solution. PR is on the rise.

Yup the halt on reset is indeed optional. We hit a snag when debugging jtag using Segger J-link, and that is that Segger - for some reason - doesn't read the feature list and ends up issuing a halt on reset of halt doesn't work... which we reported to Segger.
So I added the halt on reset feature as part of a debug session.

Do you think we should add "support" for halt-on-reset to the upstream core?

Btw, did you implement the hasresethaltreq bit in dmstatus? Without this bit the debugger does not know there is any halt-on-reset functionality implemented.

We could add support for halt-on-reset request, it is in principle just setting halt request and reset at the same time.. so shouldn't be too burdensome.
Some of the feedback I'm getting it seems most debuggers just do the halt->ndbreset model as that is required instead of the halt-on-reset. I would flag it as a "nice to have" as there isn't general support for it in the debugging community.

I did set the hasresethaltreq bit in dmstatus. Didn't make a difference as the debugging tool in question never used the feature. Go figure 🤷

I'm not sure how openOCD is actually handling this... Is there any specific GDB command for "reset and halt"? 🤔

Most of the time I just upload the executable and then do a reset + halt by hand (openOCD's monitor reset halt command), which does not require the DM's halt-on-reset feature.