Weird behaviour: "list index out of range" CPU bug

Question

asierfdln opened this issue 10 months ago · 12 comments

Hi! While playing around with do_fuzzdesign.py for the cva6-c1 CPU, I noticed three program descriptors whose "bug" corresponds to list index out of range. The descriptors in question are as follows, all for design_name = 'cva6-c1' in either do_fuzzsingle.py or do_reducesingle.py:

(487340, design_name, 416, 72, False)
(111180, design_name, 2358, 85, False)
(850907, design_name, 3083, 24, False)

Apparently, all three errors are related to the verilated cva6-c1 not dumping register values in the output message of its subprocess.run call, under the runsim_verilator() function of fuzzsim.py (Python file under the cascade-meta repository).

Since there are no register dumps in the output of the Verilator subprocess.run call in lines 46 and 47 of fuzzsim.py (that is, nothing with the format "Dump of reg x{reg_id:02}: 0x"), a list index out of range exception is raised later, in line 60, when trying to retrieve the integer register values.

It seems like it should be as simple as replacing the itertools.count(current_index) logic with something simpler like range(curr_index, len(outlines)), as is the case in line 77, both for the integer registers and the floating-point registers (lines 59 and 66, respectively). With that change, however, trying to reduce the faulty programs results in all kinds of AssertionErrors in lines 181-184 of runtest_simulator(), where a certain number of integer-register values and floating-point-register values are expected.
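
For illustration, here is a minimal sketch of a bounded parser that fails with an explicit message instead of an IndexError. It is not the code from fuzzsim.py: the function name, the exception class and the assumption that the value is printed as plain hex digits after "0x" are all mine; only the "Dump of reg x{reg_id:02}: 0x" prefix comes from the description above.

```python
import re

# Line format quoted above: "Dump of reg x{reg_id:02}: 0x..."
# Assumption: the value follows "0x" as hexadecimal digits.
INT_DUMP_RE = re.compile(r"Dump of reg x(\d{2}): 0x([0-9a-fA-F]+)")


class MissingRegisterDump(RuntimeError):
    """Raised when the simulator output has fewer dumps than expected."""


def parse_int_reg_dumps(outlines, expected_count):
    """Return {reg_id: value} for the integer register dumps in outlines."""
    regs = {}
    for line in outlines:  # bounded iteration, never indexes past the end
        match = INT_DUMP_RE.search(line)
        if match:
            regs[int(match.group(1))] = int(match.group(2), 16)
    if len(regs) != expected_count:
        raise MissingRegisterDump(
            f"expected {expected_count} integer register dumps, found {len(regs)}"
        )
    return regs
```

With something like this, an output without any dumps is reported as a missing-dump condition up front, rather than surfacing later as an AssertionError about the number of register values.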

I'm not experienced enough with Verilator or the (faulty) Ariane processor to know why register values are/aren't being dumped, just thought I would point out this weird "CPU bug".

Answer 1 · 2024-02-28T09:58:02.000Z

Hi @asierfdln,

Thank you for opening this issue!
It sounds somehow familiar.
My first guess is that this means that the text provided by the Verilator testbench to Cascade is not of the expected format.
Maybe this is an issue with the cva6-c1 testbench.

Could you maybe start by dumping this text? It should be exec_out.stdout in fuzzsim.py.
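
A minimal sketch of how the text could be captured, assuming the simulator is launched through subprocess.run with captured output. The helper name run_and_dump and its parameters are hypothetical and are not part of fuzzsim.py.

```python
import subprocess
from pathlib import Path


def run_and_dump(cmd, dump_path, timeout_s=60):
    """Run cmd, save its stdout verbatim to dump_path, and return the result."""
    exec_out = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s
    )
    # Write the raw text before any filtering, so nothing is lost.
    Path(dump_path).write_text(exec_out.stdout)
    return exec_out
```

Inside fuzzsim.py itself, the equivalent would simply be writing exec_out.stdout to a file right after the existing subprocess.run call.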

While not impossible, this does not look like a CPU bug (it could be one, if the bug causes the CPU to write to the stop address, but I've never seen something like this, so unlikely a priori :) )

Thank you.

Answer 2 · 2024-02-29T08:39:31.000Z

Here is a .zip file with a series of .elf.dump files and outputs from exec_out.stdout. All dumps and outputs have been captured by stopping the execution of do_fuzzsingle.py at the point of exec_out = ... in fuzzsim.py, as you mention.

For each descriptor, there are two .elf.dump files: one .elf.dump that is generated and executed when calling the profile_get_medeleg_mask() function, and a second .elf.dump that is generated and executed when doing the normal fuzz_single_from_descriptor(), i.e. the "main" .elf. For each .elf.dump, in turn, there are two output .txt files: one before filtering the "Writing ELF word to" lines (called *_nofilter.txt), and another after filtering said lines (called *_yesfilter.txt).
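
The filtering step can be sketched roughly as below. This is my own approximation, assuming the filter only drops lines containing the "Writing ELF word to" marker; the function name is hypothetical and the real filter in fuzzsim.py may differ.

```python
ELF_WRITE_MARKER = "Writing ELF word to"


def filter_elf_write_lines(stdout_text):
    """Drop the ELF-loading lines and keep every other line, in order."""
    return [
        line
        for line in stdout_text.splitlines()
        if ELF_WRITE_MARKER not in line
    ]
```

The *_nofilter.txt files correspond to the raw text, and the *_yesfilter.txt files to the joined result of a filter like this one.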

exec_outputs/
├── descriptor_111180_cva6-c1_2358_85_False
│   ├── medelegprofilingcva6-c1.elf.dump
│   ├── medelegprofilingcva6-c1_lastbasicblockregisterdump_nofilter.txt
│   ├── medelegprofilingcva6-c1_lastbasicblockregisterdump_yesfilter.txt
│   ├── rtl111180_cva6-c1_2358_85.elf.dump
│   ├── rtl111180_cva6-c1_2358_85_lastbasicblockregisterdump_nofilter.txt
│   └── rtl111180_cva6-c1_2358_85_lastbasicblockregisterdump_yesfilter.txt
├── descriptor_487340_cva6-c1_416_72_False
│   ├── medelegprofilingcva6-c1.elf.dump
│   ├── medelegprofilingcva6-c1_lastbasicblockregisterdump_nofilter.txt
│   ├── medelegprofilingcva6-c1_lastbasicblockregisterdump_yesfilter.txt
│   ├── rtl487340_cva6-c1_416_72.elf.dump
│   ├── rtl487340_cva6-c1_416_72_lastbasicblockregisterdump_nofilter.txt
│   └── rtl487340_cva6-c1_416_72_lastbasicblockregisterdump_yesfilter.txt
└── descriptor_850907_cva6-c1_3083_24_False
    ├── medelegprofilingcva6-c1.elf.dump
    ├── medelegprofilingcva6-c1_lastbasicblockregisterdump_nofilter.txt
    ├── medelegprofilingcva6-c1_lastbasicblockregisterdump_yesfilter.txt
    ├── rtl850907_cva6-c1_3083_24.elf.dump
    ├── rtl850907_cva6-c1_3083_24_lastbasicblockregisterdump_nofilter.txt
    └── rtl850907_cva6-c1_3083_24_lastbasicblockregisterdump_yesfilter.txt

As you say, the text provided by the verilated cva6-c1 to Cascade is not in the expected format. Notice how the filtered dump .txt files for the "main" .elf's (the rtl*.elf.dump) do not contain any register dump messages. I'm guessing that these .elf.dump files somehow make it so that cva6-c1 doesn't write correctly into the regdumpaddr (0x10) and fpregdump (0x18) addresses but, for whatever reason, does write correctly into the stopsignaladdr (0x0)?

exec_outputs.zip

Answer 3 · 2024-02-29T09:33:41.000Z

Hi @asierfdln, thank you for the data! The ELF dumps look ok at first glance. You can see that it first intends to dump the registers to 0x10:

    8002dad4:	00000f37          	lui	t5,0x0
    8002dad8:	010f0f13          	add	t5,t5,16 # 0x10
    8002dadc:	0ff0000f          	fence
    8002dae0:	001f3023          	sd	ra,0(t5)
    8002dae4:	0ff0000f          	fence
    8002dae8:	002f3023          	sd	sp,0(t5)
    8002daec:	0ff0000f          	fence
    8002daf0:	003f3023          	sd	gp,0(t5)
    8002daf4:	0ff0000f          	fence
    8002daf8:	004f3023          	sd	tp,0(t5)
    8002dafc:	0ff0000f          	fence
    8002db00:	005f3023          	sd	t0,0(t5)
    8002db04:	0ff0000f          	fence
    8002db08:	006f3023          	sd	t1,0(t5)
    8002db0c:	0ff0000f          	fence
    8002db10:	007f3023          	sd	t2,0(t5)
    8002db14:	0ff0000f          	fence
    8002db18:	008f3023          	sd	s0,0(t5)
    8002db1c:	0ff0000f          	fence
    8002db20:	009f3023          	sd	s1,0(t5)
    8002db24:	0ff0000f          	fence
    8002db28:	00af3023          	sd	a0,0(t5)
    8002db2c:	0ff0000f          	fence
    8002db30:	00bf3023          	sd	a1,0(t5)
    8002db34:	0ff0000f          	fence
    8002db38:	00cf3023          	sd	a2,0(t5)
    8002db3c:	0ff0000f          	fence
    8002db40:	00df3023          	sd	a3,0(t5)
    8002db44:	0ff0000f          	fence
    8002db48:	00ef3023          	sd	a4,0(t5)
    8002db4c:	0ff0000f          	fence
    8002db50:	00ff3023          	sd	a5,0(t5)
    8002db54:	0ff0000f          	fence
    8002db58:	010f3023          	sd	a6,0(t5)
    8002db5c:	0ff0000f          	fence
    8002db60:	011f3023          	sd	a7,0(t5)
    8002db64:	0ff0000f          	fence
    8002db68:	012f3023          	sd	s2,0(t5)
    8002db6c:	0ff0000f          	fence
    8002db70:	013f3023          	sd	s3,0(t5)
    8002db74:	0ff0000f          	fence
    8002db78:	014f3023          	sd	s4,0(t5)
    8002db7c:	0ff0000f          	fence
    8002db80:	015f3023          	sd	s5,0(t5)
    8002db84:	0ff0000f          	fence
    8002db88:	016f3023          	sd	s6,0(t5)
    8002db8c:	0ff0000f          	fence
    8002db90:	017f3023          	sd	s7,0(t5)
    8002db94:	0ff0000f          	fence
    8002db98:	018f3023          	sd	s8,0(t5)
    8002db9c:	0ff0000f          	fence
    8002dba0:	00000f37          	lui	t5,0x0
    8002dba4:	000f0f13          	mv	t5,t5
    8002dba8:	000f3023          	sd	zero,0(t5) # 0x0
    8002dbac:	0ff0000f          	fence
    8002dbb0:	0000006f          	j	0x8002dbb0

Could you please run cva6-c1 with traces and see what happens on the memory side (i.e., whether the signals reach the top output)?
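
As a cross-check of the listing above, here is a small sketch that decodes the raw instruction words of the stores using the standard RISC-V S-type layout. It is independent of Cascade, and the helper name decode_store is my own.

```python
# ABI names of the 32 RISC-V integer registers, indexed by register number.
ABI_NAMES = (
    "zero ra sp gp tp t0 t1 t2 s0 s1 a0 a1 a2 a3 a4 a5 "
    "a6 a7 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 t3 t4 t5 t6"
).split()

STORE_OPCODE = 0x23  # major opcode shared by sb/sh/sw/sd
STORE_MNEMONICS = {0: "sb", 1: "sh", 2: "sw", 3: "sd"}  # keyed by funct3


def decode_store(word):
    """Decode a 32-bit S-type store into (mnemonic, rs2, offset, rs1)."""
    if word & 0x7F != STORE_OPCODE:
        raise ValueError(f"{word:#010x} is not a store instruction")
    funct3 = (word >> 12) & 0x7
    if funct3 not in STORE_MNEMONICS:
        raise ValueError(f"unsupported store width, funct3={funct3}")
    rs1 = (word >> 15) & 0x1F  # base address register
    rs2 = (word >> 20) & 0x1F  # register whose value is stored
    imm = (((word >> 25) & 0x7F) << 5) | ((word >> 7) & 0x1F)
    if imm & 0x800:  # sign-extend the 12-bit immediate
        imm -= 0x1000
    return STORE_MNEMONICS[funct3], ABI_NAMES[rs2], imm, ABI_NAMES[rs1]
```

For example, decode_store(0x001f3023) gives ("sd", "ra", 0, "t5"), which matches the disassembly: every dump is a store to offset 0 of t5, and only the value loaded into t5 distinguishes the register dump address from the stop address.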

Answer 4 · 2024-02-29T09:59:00.000Z

Just to check, by "with traces" do you mean that I should recompile cva6-c1 with make run_vanilla_trace instead of the make run_vanilla_notrace that make_all_designs.py uses by default? Or is there a flag for the Verilator executable Variane_tiny_soc that I'm missing somewhere?

Answer 5 · 2024-02-29T10:25:36.000Z

make run_vanilla_trace

Exactly 👍. You may be missing the .core file for that. The easiest is to duplicate the *_notrace.core, change its name (also at the top of the file contents) and add the trace lines to the Verilator options like here.

Answer 6 · 2024-02-29T12:03:13.000Z

Apparently SIMLEN is not defined when trying to run the make rules:

root@208de54b6df4:/cascade-cva6-c1/cascade# make run_vanilla_trace
rm -f fusesoc.conf
fusesoc library add run_vanilla_trace .
INFO: Interpreting sync-uri '.' as location for local provider.
fusesoc run --build run_vanilla_trace
INFO: Preparing ::run_vanilla_trace:0.1
INFO: Setting up project
INFO: Building simulation model
cp generated/out/vanilla.sv.log build/run_vanilla_trace_0.1/default-verilator/Variane_tiny_soc.log
mkdir -p traces
cd build/run_vanilla_trace_0.1/default-verilator && ./Variane_tiny_soc
Starting getting env variables.
SIMLEN environment variable not set.
make: *** [Makefile:173: run_vanilla_trace] Error 1
root@208de54b6df4:/cascade-cva6-c1/cascade# make run_vanilla_notrace
cd build/run_vanilla_notrace_0.1/default-verilator && ./Variane_tiny_soc
Starting getting env variables.
SIMLEN environment variable not set.
make: *** [Makefile:173: run_vanilla_notrace] Error 1

Can I safely declare some value of SIMLEN within /cascade-meta/env.sh or am I missing something? Each CPU apparently has some SIMLEN definitions within some other files (which aren't being touched, by the looks of the above messages):

root@135ecb749226:/# grep -r "export SIMLEN=" cascade-*
cascade-cva6/cascade/tests.sh:export SIMLEN=10000
cascade-cva6/cascade/env.sh:export SIMLEN=100000
cascade-cva6-c1/cascade/tests.sh:export SIMLEN=10000
cascade-cva6-c1/cascade/env.sh:export SIMLEN=100000
cascade-cva6-y1/cascade/tests.sh:export SIMLEN=10000
cascade-cva6-y1/cascade/env.sh:export SIMLEN=100000
cascade-kronos/cascade/env.sh:export SIMLEN=10000
cascade-kronos-k1/cascade/env.sh:export SIMLEN=10000
cascade-kronos-k2/cascade/env.sh:export SIMLEN=10000
cascade-picorv32/cascade/env.sh:export SIMLEN=200000
cascade-picorv32-p5/cascade/env.sh:export SIMLEN=200000

Answer 7 · 2024-02-29T12:09:07.000Z

Yeah, the lower-level parts are a bit less documented, sorry for that ^^'. You should set SIMSRAMELF to the path of your ELF, SIMLEN to the maximum number of cycles you'd like to run (you can safely overestimate it if you know the program will stop) and TRACEFILE to the path of the VCD that you want to obtain.
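
A minimal sketch of setting these three variables from Python before launching the simulator. The helper names build_sim_env and run_simulator are hypothetical; only the variable names SIMSRAMELF, SIMLEN and TRACEFILE come from this thread.

```python
import os
import subprocess


def build_sim_env(elf_path, simlen, tracefile=None, base_env=None):
    """Return a copy of the environment with the simulator variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SIMSRAMELF"] = str(elf_path)  # ELF to load into the simulated SRAM
    env["SIMLEN"] = str(int(simlen))   # upper bound on simulated cycles
    if tracefile is not None:
        env["TRACEFILE"] = str(tracefile)  # where the VCD should be written
    return env


def run_simulator(sim_binary, env, cwd=None):
    """Launch the simulator binary with the given environment."""
    return subprocess.run(
        [sim_binary], env=env, cwd=cwd, capture_output=True, text=True
    )
```

The same can of course be done in the shell by exporting the three variables before calling make run_vanilla_trace.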

Answer 8 · 2024-02-29T12:53:45.000Z

Managed to get the .vcd file, any clues about which signals to look out for? (Currently going through it in Questasim)

Edit: attached the .vcd and .wlf files
Uploading waveforms_487340_cva6-c1_416_72_False.zip…

Answer 9 · 2024-02-29T13:05:55.000Z

Nice. I'd recommend first looking at the signals that reach the toplevel memory. You may also want to try generating super short programs, so the VCDs will be shorter ;)
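
To narrow down the candidates before opening the waveform viewer, the signal names declared in the VCD header can be listed with a short script. This is a sketch that assumes each $scope, $var and $upscope declaration sits on its own line, which is how VCD writers usually emit them; the function name is my own, and the actual signal names in the cva6-c1 design are not known here.

```python
def list_vcd_signals(vcd_lines):
    """Return the hierarchical signal names declared in a VCD header."""
    scope = []
    names = []
    for line in vcd_lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "$scope" and len(tokens) >= 3:
            scope.append(tokens[2])  # $scope <type> <name> $end
        elif tokens[0] == "$upscope":
            if scope:
                scope.pop()
        elif tokens[0] == "$var" and len(tokens) >= 5:
            # $var <type> <width> <id code> <reference> [bit range] $end
            names.append(".".join(scope + [tokens[4]]))
        elif tokens[0] == "$enddefinitions":
            break  # header is over, value changes follow
    return names
```

Calling it on an open file object and filtering the result by a substring (for example "mem") gives a short list of memory-related signals to add to the wave window.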

Answer 10 · 2024-03-14T16:35:39.000Z

Any updates @asierfdln ? :)

Answer 11 · 2024-03-18T07:04:42.000Z

Hey! I have been busy trying to get my own CPU design working; hopefully I can take a look at this (and the other two issues) during the next two weeks or so :)