comsec-group/cascade-artifacts

Weird behaviour: cva6-c1 and descriptor (881540, design_name, 5000017, 51, True/False)

asierfdln opened this issue · 5 comments

Hi! While playing around with cva6-c1 and some program descriptors, I have found some weird behaviour for two variants of a descriptor:

  • Descriptor 1: (881540, design_name, 5000017, 51, True)
  • Descriptor 2: (881540, design_name, 5000017, 51, False)

For the first descriptor, running it in do_fuzzsingle.py for design_name cva6-c1 doesn't output any failure message, thus indicating that this descriptor does not result in the generation of a faulty program (i.e. one that finds a bug in the CPU). However, when running that same descriptor and design_name in do_reducesingle.py, an error seems to be detected for that program (mismatching basic blocks, expected register value X but instead got some other value Y, etc.).

The second descriptor runs into the following issue: when running it in do_fuzzsingle.py, the error message indicates it is a faulty program. However, running that same descriptor through do_reducesingle.py outputs a Spike timeout when trying to detect the faulty instruction. Bumping up the timeout threshold to 300 seconds (5 minutes) doesn't seem to be the way forward either. Are there many instances of this?

Hi @asierfdln,

Thank you for opening an issue!
Indeed, you may expect some imperfections in Cascade; after all, it is fairly recent and written by just one person at the moment :)

a) This scenario is possible, in principle. For example, a bug was triggered but immediately the buggy value was overwritten. During the reduction process, this bug may be re-discovered. I haven't checked whether this is what happens or something else.

b) This looks like an issue in do_reducesingle.py indeed. Would you be willing to look into it and make a pull request?
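
To make scenario a) concrete, here is a minimal toy sketch (not Cascade code). It assumes a hypothetical three-register machine and a made-up bug where `add` is off by one; the names `run` and `final_state_mismatch` are invented for illustration. The full program overwrites the buggy value before the end-of-test comparison, while a reduced prefix of the same program exposes it.

```python
def run(program, buggy):
    """Run a toy program and return the final register file."""
    regs = {"x1": 3, "x2": 4, "x3": 0}
    for op, rd, a, b in program:
        if op == "add":
            value = regs[a] + regs[b]
            if buggy:
                value += 1  # hypothetical hardware bug: add is off by one
        elif op == "li":
            value = a  # load immediate; operand b is unused
        else:
            raise ValueError(f"unknown op: {op}")
        regs[rd] = value
    return regs


def final_state_mismatch(program):
    """Compare the final state of a reference model against the buggy model."""
    return run(program, buggy=False) != run(program, buggy=True)


program = [
    ("add", "x3", "x1", "x2"),  # triggers the bug, wrong value lands in x3
    ("li", "x3", 0, None),      # immediately overwrites x3, masking the bug
]

print(final_state_mismatch(program))      # False: the bug did not propagate
print(final_state_mismatch(program[:1]))  # True: the reduced program exposes it
```

Whether this is what actually happens for Descriptor 1 is unverified; the sketch only shows that the two scripts can legitimately disagree.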

Thank you!
Flavien

No worries about the imperfections xD, at least you guys published the code for others to take a look at and use/contribute to it :)

Regarding a): I agree that this would be a pain to check xD, maybe I'll get back to it on the weekend. Could this type of issue be solved by tinkering with the probability with which consumed registers are picked from the more recently produced registers in the general pool of produced registers (e.g. in add x3,x1,x2, x3 is the produced register and x1 and x2 are the consumed registers)? Per the paper:

Some CPU fuzzers [32] dump the register values via storage instructions at the end of a test case and compare the results with the ones from the ISS. Such a comparison is only possible if a test case completes and intermediate deviations between the ISS and the CPU under test are propagated through time until the end of the test case.

and

Cascade biases the choice of operands by granting higher probabilities to registers recently used as outputs.

I dunno where exactly these probabilities are atm, but it seems like this could be the way to go, right? Plus, this seems like a cool idea for yet another paper (or short paper) to submit to another conference, something like "Generation of Sound and Deep Register-dependence Chains in Random Instruction Sequences: A Hyperparameter Study" ;) Congrats on being accepted to USENIX 2024 btw!! :)

Regarding b): If I manage to understand what is going on, you betcha :) In fact, I have already opened a PR on the cascade-meta repo with some quality-of-life improvements, so expect some more activity from my side for the time being xD

Hi!
The randomization is here.
There's a trade-off between having some non-propagations and being able to find more types of bugs (if you always use the same reg in a chain, you'll have fewer non-propagations, but the generated programs will be less diverse).
Given Cascade's speed in finding bugs, I don't see non-propagations as a big issue. Especially because non-propagations are not correlated to the underlying bug.
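
The trade-off can be sketched with a recency-weighted operand picker. This is only an illustration, not Cascade's actual randomization: the function `pick_source_register` and its `bias` parameter are hypothetical, and the geometric weighting is an assumption chosen for simplicity.

```python
import random


def pick_source_register(recent_outputs, bias, rng):
    """Pick a source register, favouring recently produced ones.

    recent_outputs: register names ordered oldest to newest.
    bias: weight multiplier per recency step (1.0 means uniform choice).
    rng: a random.Random instance, so runs are reproducible.
    """
    weights = [bias ** i for i in range(len(recent_outputs))]
    return rng.choices(recent_outputs, weights=weights, k=1)[0]


rng = random.Random(0)
pool = ["x5", "x6", "x7", "x8"]  # x8 is the most recently written register

for bias in (1.0, 2.0, 1000.0):
    picks = [pick_source_register(pool, bias, rng) for _ in range(1000)]
    share = picks.count("x8") / len(picks)
    print(f"bias={bias}: newest register picked {share:.0%} of the time")
```

A large `bias` keeps dependency chains tight, so deviations are more likely to propagate, but nearly every instruction then reads the same register. A `bias` of 1.0 spreads reads across the pool at the cost of more values being overwritten before anything consumes them.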

Also thanks for the little PR!

Especially because non-propagations are not correlated to the underlying bug.

I would argue that the risk of non-propagation could, in theory, mask certain bugs in specific .elf programs. The simple answer to this seems to be to generate more programs to reduce the chances that non-propagation of bugs takes place, ideally without much time penalty on the program-generation and program-execution fronts. And, so far, Cascade seems to check those two boxes, so I agree that non-propagation isn't a big issue in the grand scheme of things, for long enough runs of fuzzing. Also agree with the non-propagation vs randomization trade-off conundrum point.
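
As a back-of-the-envelope sketch of the "generate more programs" argument: assume each program that triggers a given bug masks it independently with probability `p` (an assumption, loosely motivated by the point that non-propagations are not correlated with the underlying bug). The value of `p` below is made up for illustration, not measured.

```python
import math


def miss_probability(p, n):
    """Chance that all n bug-triggering programs mask the bug."""
    return p ** n


def programs_needed(p, target):
    """Smallest n such that the miss probability drops to target or below."""
    return math.ceil(math.log(target) / math.log(p))


p = 0.5  # hypothetical per-program masking probability
print(miss_probability(p, 3))    # 0.125
print(programs_needed(p, 0.01))  # 7
```

Under these assumptions the miss probability shrinks geometrically with the number of programs, which is why fast generation and execution make non-propagation a minor concern for long fuzzing runs.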

I will try to take a look at both issues a) and b), see what I manage to find out.

Feel free to reopen!