sosy-lab/sv-benchmarks

Why is real-world busybox code rewritten into something artificial and weird?

zvonimir opened this issue · 4 comments

So I've been studying this weird if condition in a busybox benchmarks:
https://github.com/sosy-lab/sv-benchmarks/blob/master/c/busybox-1.22.0/basename-2.c#L145

Here is the source:

if(4294967294u + (unsigned int)argc >= 2u)
    return 1; /* was: bb_show_usage(); which should exit(1), but this would leak argv */

I looked at the actual real-world source code for this benchmark that you can find here. I believe that the above if condition matches this one:

	if (!argv[1])
		bb_show_usage();

The latter if condition is the one that looks realistic to me and is something that most programmers would probably write. The latter one is also what the actual benchmark uses.

Does someone know why the actual code from a real-world benchmarks would be artificially rewritten into something convoluted like this?

It makes no sense to me to do this in the system category, where the goal (I would guess) should be to be as faithful as possible to the real-world applications.

Any ideas?

@tautschnig could you shed some light on this?
The only explanation I have is that the code was extracted from some internal CBMC format, which does these rewrites automatically, or something like that. Thanks!

Ok, I found my answer in the README file. I guess these files were generated by goto-cc, which is why they look so weird.
I propose to plug in the actual real-world code instead, at least for some of these benchmarks. I think that would be an improvement over what we have right now since it would make them more realistic.

Thoughts?

I would say that the code in your example certainly comes from lines 59-61 of coreutils/basename.c in BusyBox 1.22.0, which are

	if ((unsigned)(argc-2) >= 2) {
		bb_show_usage();
	}

The line that you quote does not appear in basename.c in this BusyBox version, and it has different semantics.

You are referring only to the change in the condition of the if and not to the change in the branch, right?

Of course this is rewritten in a way that humans would not write, but it has the same semantics, right? And the difference is not that big, I would say this is not a problem for tasks.

Note that while the rewritten code requires tools to be able to handled unsigned-integer overflows, the original requires tools to be able to handle unsigned-integer underflows, so there is no difference in difficulty I would say.

(Edit: fixed)

Ah, got it! Sorry that I was looking at a wrong version! Ok, this is not perfect (in my opinion), but much better than what I was looking at. So I am good to close this.