0vercl0k/wtf

socket.cc `recv()` timeout

extf33 opened this issue · 23 comments

I implemented a custom mutator, and one mutation cycle takes about 1~5 minutes.

When I run wtf with the custom mutator, I get the following error:

Could not receive size (-1)
Receive failed

wtf/src/wtf/socket.cc

Lines 241 to 245 in 3ccad88

if (const int R = recv(Fd, (char *)CurrentBuffer, sizeof(uint32_t), 0);
R == -1 || R != sizeof(uint32_t)) {
fmt::print("Could not receive size ({})\n", R);
return std::nullopt;
}

Is there any part of the code that sets a timeout or a flag such as O_NONBLOCK?

Actually, I already checked O_NONBLOCK.

int is_block = !(fcntl(Fd, F_GETFL, 0) & O_NONBLOCK);
// Result: 1
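
For completeness, here's roughly the check I ran, extended to also look for a receive timeout (a minimal sketch, not wtf code; Fd is the already-connected socket and everything else is plain POSIX):

#include <cstdio>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/time.h>

// Sketch: verify the socket is blocking and has no receive timeout configured.
void CheckSocket(const int Fd) {
  // O_NONBLOCK would make recv() return right away with EAGAIN/EWOULDBLOCK.
  const int Flags = fcntl(Fd, F_GETFL, 0);
  printf("blocking: %d\n", !(Flags & O_NONBLOCK));

  // SO_RCVTIMEO would make recv() fail with EAGAIN once the timeout elapses.
  timeval Tv = {};
  socklen_t Len = sizeof(Tv);
  getsockopt(Fd, SOL_SOCKET, SO_RCVTIMEO, &Tv, &Len);
  printf("recv timeout: %lds %ldus (0/0 means none)\n", long(Tv.tv_sec), long(Tv.tv_usec));
}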

Oh interesting - if you start a server w/ no clients and let it run for 1-5 min, does the receive fail as well? If not, I don't think it comes from timing. Also, what OS are you running into this on?

And no, that recv call is meant to block (so O_NONBLOCK isn't set anywhere as far as I'm aware).

Cheers

Okay actually, just spinning up a server is not enough; I added a Sleep(5*60*1000); inside the fuzzer_dummy.cc module and it seems to work fine on Windows (no timeout).

Any special thing about your set-up?

Cheers

I worked on Ubuntu 22.04 with KVM; hmm, I'm gonna test it with sleep().

I tested with a sleep(5*60*1000); and the same error occurred.

strace result is as follows:

recvfrom(8, 0x563e8545c340, 4, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
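
For reference, a recv() wrapper that retries on EINTR would look roughly like this (just a sketch with names of my own, not what socket.cc does; it would mask the interruption, not explain why the SIGALRM fires during the receive):

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

// Sketch: read exactly Size bytes, retrying when a signal (e.g. SIGALRM)
// interrupts the blocking recv() with EINTR.
bool RecvAll(const int Fd, uint8_t *Buffer, const size_t Size) {
  size_t Received = 0;
  while (Received < Size) {
    const ssize_t R = recv(Fd, (char *)Buffer + Received, Size - Received, 0);
    if (R == -1 && errno == EINTR) {
      continue; // Interrupted by a signal; just retry.
    }
    if (R <= 0) {
      return false; // Hard error, or the peer closed the connection.
    }
    Received += size_t(R);
  }
  return true;
}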

Okay this is weird - I can't seem to repro this on an Ubuntu running in WSL2:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

I wonder what's going on 🤔

My bad, it was me.
I gave --limit 100 and thought it was only for execution time.

Okay sorry for the lag, I just started to look into this.

Do you remember which backend you were using?

From what I can see in the code, the bxcpu backend stops executing and sends a timeout to the server (as expected), the KVM backend uses a SIGALRM to stop KVM, and WHV uses a TimerQ_t to cancel the VCPU.
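
For context, the SIGALRM route follows roughly this pattern (a simplified sketch of the general approach, not the exact wtf code; kvm_run::immediate_exit is the real KVM API field, the other names are mine):

#include <cstdint>
#include <linux/kvm.h>
#include <signal.h>
#include <sys/time.h>

// Sketch: the handler flips kvm_run::immediate_exit. A pending signal interrupts
// a KVM_RUN ioctl already in flight, and immediate_exit guarantees that a KVM_RUN
// entered after the signal fired returns right away (closing the race). Without
// SA_RESTART, the same signal also interrupts blocking syscalls such as recv(),
// which is what the strace above shows.
static kvm_run *GlobalRun = nullptr;

static void AlarmHandler(int) {
  if (GlobalRun != nullptr) {
    GlobalRun->immediate_exit = 1;
  }
}

static void ArmTimeLimit(const uint32_t Seconds) {
  struct sigaction Sa = {};
  Sa.sa_handler = AlarmHandler;
  sigaction(SIGALRM, &Sa, nullptr);

  itimerval Timer = {};
  Timer.it_value.tv_sec = Seconds; // One-shot timer; it_interval stays at zero.
  setitimer(ITIMER_REAL, &Timer, nullptr);
}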

Cheers

I only tested on the KVM backend.

If I remember correctly, it was on the fuzzer side.

That's right, it matches what I saw :)

As with bochscpu, I thought about making the KVM/WHv backend (time) limit apply only to execution (not execution + mutation), but I think it is fine as it is now.

Yes, that's the issue here.

Okay, so it makes sense that the fuzzer side receives a SIGALRM; this should stop KVM from running the testcase and the code will send a timeout result to the server, but the server shouldn't stop.

In the above scenario, the SIGALRM does not occur while KVM is 'running' the testcase, but while the server is performing mutation (before KVM executes the testcase).

Okay, I set up a VM w/ nested virtualization to try this out, and I played with it some more. At first, I thought I had a repro when simulating a long InsertTestcase using a sleep like I recommended earlier. When doing that, the first testcase executed as expected, but the next ones would not sleep the full time and would get interrupted, which was weird. After reading the sleep manual more, I realized it doesn't play nice w/ signals, so it's not a good way to fake a slow testcase generation.

At that point, I removed the sleep and used a long loop to simulate a long workload. The first testcase executes as expected: the generation completes, the testcase is run and returns a cr3. Now here's what I think you observed. The second testcase takes a long time to generate like the first one, which is expected, but this time when it gets executed it ends up timing out before hitting the expected cr3 result and generates a timeout result.

# Server log
$ ../../src/build/wtf master --runs=100 --max_len=10000 --name dummy --target .
Seeded with 1181810723460784320
Iterating through the corpus..
Sorting through the 1 entries..
Running server on tcp://localhost:31337..
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 9223372036854.8m (1 nodes) lastcov: 2.0s crash: 0 timeout: 0 cr3: 0 uptime: 2.0s
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 0 uptime: 1.4min
#1 cov: 13 (+13) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 1 uptime: 2.9min
#2 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 2.8min crash: 0 timeout: 1 cr3: 1 uptime: 4.2min
#3 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 4.2min crash: 0 timeout: 2 cr3: 1 uptime: 5.6min

# Fuzzer log
$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x55a4bf491970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.4min
Generating the testcase, sleeping for 60s..
Done
#2 cov: 13 exec/s: 0.0 lastcov: 1.4min crash: 0 timeout: 1 cr3: 1 uptime: 2.8min
Generating the testcase, sleeping for 60s..
Done
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 2 cr3: 1 uptime: 4.2min
Generating the testcase, sleeping for 60s..
Done
#4 cov: 13 exec/s: 0.0 lastcov: 4.2min crash: 0 timeout: 3 cr3: 1 uptime: 5.6min
Generating the testcase, sleeping for 60s..

At this point I think there's definitely a bug related to the limit feature in KVM; maybe it comes from the way I handled SIGALRM or something. Maybe the signal from testcase n-1 is triggered when testcase n is executed, which shuts it down earlier than it should.

Anyways, now that I have a repro I can investigate :)

Cheers

Boy, this is a bad bug actually. The issue might be that I never reset immediate_exit to 0 🤦🏽‍♂️, which means any testcase that triggers a timeout will make all the following testcases trigger timeouts as well 😬

If you still have your environment to repro, I'd love it if you could verify that the below patch fixes your issue:

--- a/src/wtf/kvm_backend.cc
+++ b/src/wtf/kvm_backend.cc
@@ -1454,7 +1454,7 @@ std::optional<TestcaseResult_t> KvmBackend_t::Run(const uint8_t *Buffer,
   Stop_ = false;
   TestcaseRes_ = Ok_t();
   Coverage_.clear();
-
+  Run_->immediate_exit = 0;
   while (!Stop_) {
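
(For context: as far as I understand the KVM API, the kernel only reads immediate_exit when KVM_RUN is entered and never clears it itself, so once the SIGALRM handler sets it for a timed-out testcase, every subsequent KVM_RUN exits immediately until userspace zeroes it again, which is what the added line does at the top of Run.)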

It fixes it on my test bench:

$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
[sudo] password for over:
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x563171252970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.6min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#2 cov: 13 exec/s: 0.0 lastcov: 1.5min crash: 0 timeout: 0 cr3: 2 uptime: 3.1min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 0 cr3: 3 uptime: 4.4min

Cheers

FYI I will land #161 & close this issue next week if I don't hear anything back :)

Cheers