euspectre/kedr

leak_check.flush.01 and leak_check.clear.01 tests fail

euspectre opened this issue · 4 comments

After I have merged "several_targets" branch, the 2 test mentioned above report failures.

The test script is generated from sources/tests/leak_check/test_several_targets.sh.in.

From the test log:

57/136 Testing: leak_check.flush.01
57/136 Test: leak_check.flush.01
Command: "/bin/bash" "test_flush.sh"
Directory: /var/opt/kedr/tests/4.1.16-nrj-desktop-1rosa-x86_64/leak_check
"leak_check.flush.01" start time: Mar 08 21:58 MSK
Output:
----------------------------------------------------------
Starting KEDR...
/sbin/insmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr.ko target_name=kedr_sample_target
mkdir -p "/var/tmp/kedr/test/4.1.16-nrj-desktop-1rosa-x86_64/leak_check/debugfs"
mount debugfs -t debugfs "/var/tmp/kedr/test/4.1.16-nrj-desktop-1rosa-x86_64/leak_check/debugfs"
/sbin/insmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_leak_check.ko
/sbin/insmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_lc_common_mm.ko
KEDR started.
Loaded the target.
Total number of allocations is 0 but it should be 1
The data in report_flush/01_target_loaded.log differs from what has been expected (expected: allocs=1, leaks=1, bad_frees=0)
Stopping KEDR...
/sbin/rmmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_lc_common_mm.ko
/sbin/rmmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_leak_check.ko
umount "/var/tmp/kedr/test/4.1.16-nrj-desktop-1rosa-x86_64/leak_check/debugfs"
/sbin/rmmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr.ko
KEDR stopped.

---------------------------
58/136 Testing: leak_check.clear.01
58/136 Test: leak_check.clear.01
Command: "/bin/bash" "test_clear.sh"
Directory: /var/opt/kedr/tests/4.1.16-nrj-desktop-1rosa-x86_64/leak_check
"leak_check.clear.01" start time: Mar 08 21:58 MSK
Output:
----------------------------------------------------------
Starting KEDR...
/sbin/insmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr.ko target_name=kedr_sample_target
mkdir -p "/var/tmp/kedr/test/4.1.16-nrj-desktop-1rosa-x86_64/leak_check/debugfs"
mount debugfs -t debugfs "/var/tmp/kedr/test/4.1.16-nrj-desktop-1rosa-x86_64/leak_check/debugfs"
/sbin/insmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_leak_check.ko
/sbin/insmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_lc_common_mm.ko
KEDR started.
Loaded the target.
Writing to /dev/cfake0.
Total number of allocations is 0 but it should be 1
The data in report_clear/02_write1.log differs from what has been expected (expected: allocs=1, leaks=1, bad_frees=0)
Stopping KEDR...
/sbin/rmmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_lc_common_mm.ko
/sbin/rmmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr_leak_check.ko
umount "/var/tmp/kedr/test/4.1.16-nrj-desktop-1rosa-x86_64/leak_check/debugfs"
/sbin/rmmod /lib/modules/4.1.16-nrj-desktop-1rosa-x86_64/extra/kedr.ko
KEDR stopped.

The files report_flush/01_target_loaded.log and report_clear/02_write1.log are empty.

OS: ROSA R7 x86_64, kernel 4.1.16.

8tab commented

There is a race between echo 1 > flush and cat info (simiar situation to clear test case).
Looks like the work queued in klc_do_flush is not finished yet when we trying to read 'info' file.
We could get rid of separate worker and move the code from work_func_flush to klc_do_flush, but I'm not sure if this is a correct solution (and I suppose this worker was introduced intentionally).

Do you have any ideas how to fix it?

Nice catch! Yes, that could be the reason. Need to take a closer look though.

Yes, the worker thread itself is there for a reason: it keeps the events ordered. As for flush, it might be done directly now, if I am not mistaken. We already do flush_workqueue(lc_object->wq); before it to make sure all the previous events have been processed by then.

Still, it is needed to review the code once more: I cannot say right now if the direct flushing might break something else.

I see it now. The problem is in the accesses to the tables of allocations and deallocations.
klc_flush_*() functions do not use locks for such accesses, that is why they are expected to be called from a work item or on unload. IOW, - when nothing else can access these tables simultaneously. We need a single-threaded ordered wq in LeakCheck anyway, so it was used here as well.

On the other hand, flush_workqueue(lc->wq); is not that necessary before the calls to klc_do_flush(lc);. Placing it after such calls instead might be a better idea: after writing to "flush" file returns, it will be guaranteed that flushing is complete.

Will do a bit more testing and commit the fix.

Again, thanks for pointing this out!

Fixed in KEDR 0.6.