foss-for-synopsys-dwc-arc-processors/openocd

"Software breakpoint has been overwritten outside of debugger" warning when debugging linux

mrnuke-adaptrum opened this issue · 4 comments

When debugging a Linux kernel on an ARCompact target, hitting a breakpoint causes the next instruction to be replaced by a 'brk'.

OpenOCD prints the following warning:

Warn : Software breakpoint @0x802be9d4 has been overwritten outside of debugger.Expected: 0x256f003f, got: 0x00010000   

Examining the disassembly shows a 'brk' instruction has been inserted:

(gdb) x/5i $pc
=> 0x802be9d4 <instr_service+200>:      brk
   0x802be9d8 <instr_service+204>:      lr      r9,[status32]

From this point, trying to step or continue will not work, as the CPU continuously hits the 'brk':

(gdb) stepi
Breakpoint 7, instr_service () at arch/arc/kernel/entry.S:75
75              mov r1, sp
(gdb) stepi
Breakpoint 7, instr_service () at arch/arc/kernel/entry.S:75
75              mov r1, sp
(gdb) 

Hi,

I've tried to reproduce this on an AXS101, but the problem doesn't happen for me: there is no stray BRK in instr_service, and stepi works fine. Is anything specific needed to reproduce this problem? Does it reproduce reliably?

Off the top of my head, I don't know what part of OpenOCD could cause this issue. The kernel flushing caches and OpenOCD doing the same could clearly interfere with each other, but that doesn't seem likely in this function.

Also, there is a discrepancy in your log messages. OpenOCD expects a BRK at 0x802be9d4 but gets 0x00010000 instead, yet when GDB disassembles the same memory location, the BRK is there. So it looks like the first read of this memory location returns a wrong value; because OpenOCD cannot find its breakpoint there, it leaves the memory as-is, and the BRK becomes permanent. It might be a JTAG issue, since I've seen some problems with its behaviour. Could you please try a few things to see if they help:

  1. Reduce the JTAG frequency severalfold to lower the risk of timing issues.
  2. Add arc jtag wait-until-write-finished on to your OpenOCD configuration file, right after the target create command.
  3. The arc jtag always-check-status-rd on command might also change something. Both options are off by default because they are known to break things in some cases, while helping in others; I haven't been able to work out a rule for where either is needed and where it actually causes errors. Currently they are enabled only for certain AXS board targets where I determined that they are needed.
  4. You can also enable more verbose logging in OpenOCD with the -d 3 option; perhaps it will shed some light on what's going on.
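To make the failure mode described above concrete, here is a minimal C model of the removal check, assuming (as the warning suggests) that the debugger compares one 32-bit word against the BRK encoding before restoring the original instruction. `struct sim_word`, `read_word` and `remove_sw_breakpoint` are hypothetical names for illustration, not OpenOCD's actual code; the one-shot `glitch_next_read` flag stands in for the suspected bad first JTAG read.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit ARCompact BRK word, as reported in the warning above. */
#define ARC_BRK_32 UINT32_C(0x256f003f)

/* Hypothetical stand-in for the target word at the breakpoint address. */
struct sim_word {
    uint32_t value;        /* what is really in target memory */
    bool glitch_next_read; /* assumption: the next read returns garbage */
    uint32_t glitch_value; /* the garbage value that read returns */
};

/* Read the word; a pending glitch is consumed by exactly one read. */
static uint32_t read_word(struct sim_word *w)
{
    if (w->glitch_next_read) {
        w->glitch_next_read = false;
        return w->glitch_value;
    }
    return w->value;
}

/* Restore orig_instr only if our BRK is still in place. Otherwise warn and
 * leave memory untouched, assuming something else overwrote the breakpoint.
 * Returns true if the original instruction was restored. */
bool remove_sw_breakpoint(struct sim_word *w, uint32_t addr, uint32_t orig_instr)
{
    assert(w != NULL);
    uint32_t current = read_word(w);
    if (current != ARC_BRK_32) {
        fprintf(stderr,
                "Warn : Software breakpoint @0x%08" PRIx32
                " has been overwritten outside of debugger."
                " Expected: 0x%08" PRIx32 ", got: 0x%08" PRIx32 "\n",
                addr, (uint32_t)ARC_BRK_32, current);
        return false;
    }
    w->value = orig_instr;
    return true;
}
```

If the first read returns 0x00010000 instead of 0x256f003f, `remove_sw_breakpoint` prints a warning of the same form as the one in the report, returns false and leaves the BRK in memory, so every later read (such as GDB's disassembly) sees the BRK.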

I'm already running the JTAG at 1 MHz; running it any slower would make testing impractical. I do notice unexplained behavior at higher JTAG frequencies, but 1 MHz seems fine.

I tried point (4), and the OpenOCD output is in this gist.

Issuing arc jtag always-check-status-rd on and then stepi, or hitting a breakpoint, seems to make OpenOCD hang completely; it just prints:

Debug: 703902 3447389 arc_jtag.c:427 arc_wait_until_jtag_ready(): JTAG on core is not ready: reg=0x22: RA FL

If there's nothing informative in the OpenOCD log above, I could try to find an AXS101 and reproduce the issue there. Right now I'm running this on an emulation FPGA.

The log doesn't really help: as I now see, it doesn't record the actual values read from memory, because that would blow up the size of the log file. It would be nice if you could check this on an AXS101; if the problem can be reproduced there, troubleshooting it will be much easier for me.

I have a few more questions as well:

  1. Is your chip multicore?
  2. What is the FPGA target core frequency? When dealing with FPGAs and memory, even a 1 MHz JTAG clock could be a bit too much; to overcome this I had to insert extra delays at certain places so that the overall frequency could be left high. But the amount of delay is not a precise science, so it might not be enough in some cases, although I think that is unlikely.
  3. Could you try with caches disabled, or at least the data cache? One possible cause is that the memory read happens before the cache is completely flushed, so the debugger reads garbage over JTAG. We had such a problem with EM6 on AXS101, where problems started once D$ was more than 50% full. You can also try disabling D$ flushing in OpenOCD with arc has-dacache off. That makes the debugger unusable for inspecting variable values, but if it fixes the problem with the first memory read, that would point to an issue with how the debugger cooperates with the cache. I suspect this is cache-related because only the first read (right after the cache has been flushed) returns the wrong value; the later reads done for the disassembly show correct values.
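The suspected race in point 3, a memory read issued before the D$ flush has finished, could in principle be avoided by polling a flush-status bit before reading. Below is a minimal C sketch against a toy target, not OpenOCD code. The aux register numbers 0x47 (DC_IVDC) and 0x48 (DC_CTRL) come from the log; the bit values 0x40 (flush-on-invalidate mode, consistent with 0x82 becoming 0xc2) and 0x100 (flush status, the value we believe the Linux ARC cache code uses) are assumptions, as are all struct and function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register numbers from the log; the bit values are assumptions. */
#define AUX_DC_IVDC 0x47u
#define AUX_DC_CTRL 0x48u
#define DC_CTRL_INV_MODE_FLUSH 0x40u
#define DC_CTRL_FLUSH_STATUS 0x100u

/* Hypothetical toy target: an aux register file plus a flush that keeps
 * DC_CTRL.FS set for flush_latency reads of DC_CTRL after it starts. */
struct sim_target {
    uint32_t aux[0x100];
    int flush_latency;
    int busy_reads_left;
};

static uint32_t aux_read(struct sim_target *t, uint32_t reg)
{
    uint32_t value = t->aux[reg];
    if (reg == AUX_DC_CTRL && t->busy_reads_left > 0 &&
        --t->busy_reads_left == 0)
        t->aux[AUX_DC_CTRL] &= ~DC_CTRL_FLUSH_STATUS; /* flush finished */
    return value;
}

static void aux_write(struct sim_target *t, uint32_t reg, uint32_t value)
{
    t->aux[reg] = value;
    if (reg == AUX_DC_IVDC && t->flush_latency > 0) {
        /* Writing DC_IVDC starts a flush that takes a while to complete. */
        t->aux[AUX_DC_CTRL] |= DC_CTRL_FLUSH_STATUS;
        t->busy_reads_left = t->flush_latency;
    }
}

/* Flush and invalidate D$, then poll DC_CTRL.FS at most max_polls times.
 * Returns false if the flush did not complete in time, in which case an
 * immediate memory read could still return stale data. */
bool flush_dcache_and_wait(struct sim_target *t, int max_polls)
{
    assert(max_polls > 0);
    uint32_t saved = aux_read(t, AUX_DC_CTRL);
    aux_write(t, AUX_DC_CTRL, saved | DC_CTRL_INV_MODE_FLUSH);
    aux_write(t, AUX_DC_IVDC, 1);

    bool done = false;
    for (int i = 0; i < max_polls && !done; i++)
        done = (aux_read(t, AUX_DC_CTRL) & DC_CTRL_FLUSH_STATUS) == 0;

    aux_write(t, AUX_DC_CTRL, saved); /* restore the original DC_CTRL */
    return done;
}
```

With `flush_latency = 3` and `max_polls = 10`, the function returns true on the fourth poll; with a latency larger than `max_polls`, it returns false, which is the case where reading memory immediately could return stale data. The poll is bounded so that a stuck status bit cannot hang the debugger.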

While looking at the log, I've noted an interesting thing at the end.

Writing to aux registers: addr[0]=0x48;count=1;buffer[0]=0x000000c2
Writing to aux registers: addr[0]=0x47;count=1;buffer[0]=0x00000001
Writing to aux registers: addr[0]=0x48;count=1;buffer[0]=0x00000082
......
Writing to aux registers: addr[0]=0x10;count=1;buffer[0]=0xffffffff
Reading aux registers: addr[0]=0x48;count=1
Read from register: buf[0]=0x1
Writing to aux registers: addr[0]=0x48;count=1;buffer[0]=0x00000001
Writing to aux registers: addr[0]=0x47;count=1;buffer[0]=0x00000001
Writing to aux registers: addr[0]=0x48;count=1;buffer[0]=0x00000001

So when OpenOCD first reads memory, it flushes D$ as it should. To do this it reads 0x48 DC_CTRL, sets a bit there, flushes the cache, and restores DC_CTRL to its original value of 0x82. When a second request comes in, it invalidates the caches (on the assumption that OpenOCD is about to write the original memory contents back in place of the breakpoint). First it invalidates I$ by writing to 0x10 IC_IVIC, then it reads DC_CTRL again, but this time it reads 0x1 instead of the 0x82 that used to be there. So OpenOCD eventually writes 0x1 to DC_CTRL, disabling D$ completely from then on. Once again, a cache operation yields a bogus 0x1 result for the operation that follows it. That's why I think you need to try with caches disabled.
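The save/modify/restore sequence in the log excerpt can be replayed with a toy model to show how a single bad DC_CTRL read becomes permanent. In this C sketch, `dcache_invalidate`, `dcache_disabled` and `struct sim_target` are hypothetical; the register numbers come from the log, 0x40 is assumed to be the flush-on-invalidate mode bit (0x82 vs 0xc2) and is assumed to be cleared for invalidate-only requests (matching the write of 0x1 rather than 0x41), and 0x1 is assumed to be the D$ disable bit, consistent with the observation that writing 0x1 disables D$.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers from the log; the bit values are assumptions. */
#define AUX_DC_IVDC 0x47u
#define AUX_DC_CTRL 0x48u
#define DC_CTRL_DIS 0x01u
#define DC_CTRL_INV_MODE_FLUSH 0x40u
#define MAX_WRITES 16

/* Hypothetical toy target: aux registers, a log of aux writes, and an
 * optional one-shot corrupted value for the next DC_CTRL read. */
struct sim_target {
    uint32_t aux[0x100];
    bool corrupt_next_read;
    uint32_t corrupt_value;
    size_t n_writes;
    uint32_t write_reg[MAX_WRITES];
    uint32_t write_val[MAX_WRITES];
};

static uint32_t aux_read(struct sim_target *t, uint32_t reg)
{
    if (reg == AUX_DC_CTRL && t->corrupt_next_read) {
        t->corrupt_next_read = false;
        return t->corrupt_value; /* the suspected bad JTAG read */
    }
    return t->aux[reg];
}

static void aux_write(struct sim_target *t, uint32_t reg, uint32_t value)
{
    assert(t->n_writes < MAX_WRITES);
    t->write_reg[t->n_writes] = reg;
    t->write_val[t->n_writes] = value;
    t->n_writes++;
    t->aux[reg] = value;
}

/* Save DC_CTRL, select flush-or-invalidate mode, trigger DC_IVDC, then
 * restore the saved value. Whatever the first read returned is written
 * back verbatim, so one corrupted read permanently changes DC_CTRL. */
void dcache_invalidate(struct sim_target *t, bool flush_dirty_lines)
{
    uint32_t saved = aux_read(t, AUX_DC_CTRL);
    uint32_t mode = flush_dirty_lines ? (saved | DC_CTRL_INV_MODE_FLUSH)
                                      : (saved & ~DC_CTRL_INV_MODE_FLUSH);
    aux_write(t, AUX_DC_CTRL, mode);
    aux_write(t, AUX_DC_IVDC, 1);
    aux_write(t, AUX_DC_CTRL, saved);
}

bool dcache_disabled(const struct sim_target *t)
{
    return (t->aux[AUX_DC_CTRL] & DC_CTRL_DIS) != 0;
}
```

Calling it with an intact DC_CTRL of 0x82 and `flush_dirty_lines = true` writes 0xc2, 1, 0x82 (the first three lines of the excerpt). Injecting a corrupted read of 0x1 and calling it with `flush_dirty_lines = false` writes 0x1, 1, 0x1 (the last three lines), after which `dcache_disabled` returns true: the restore step faithfully writes back whatever the read returned.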

  1. Well, yes and no. It's supposed to have six CPUs (currently only two are present in the FPGA image), but we're treating it as a heterogeneous system, since it's not SMP. We have two JTAG taps at the moment.
  2. 12 MHz
  3. Turning off the data cache seems to have eliminated the brk issue. I say "seems" because I can't reliably reproduce the brk issue anyway, and the software has changed since last week.

I tactically acquired an AXS101, so I will try to reproduce the issue on it.