0vercl0k/wtf

socket.cc `recv()` timeout

extf33 opened this issue · 23 comments

I implemented a custom mutator, and one mutation cycle takes about 1~5 minutes.

When I run wtf with the custom mutator, I get the following error:

Could not receive size (-1)
Receive failed

wtf/src/wtf/socket.cc

Lines 241 to 245 in 3ccad88

if (const int R = recv(Fd, (char *)CurrentBuffer, sizeof(uint32_t), 0);
R == -1 || R != sizeof(uint32_t)) {
fmt::print("Could not receive size ({})\n", R);
return std::nullopt;
}

Is there any part of the code that sets a timeout or a flag such as O_NONBLOCK?

Actually, I already checked O_NONBLOCK.

int is_block = !(fcntl(Fd, F_GETFL, 0) & O_NONBLOCK);
// Result: 1
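
For completeness, here's roughly the check I ran, extended to also look for a receive timeout (a minimal sketch, not wtf code; Fd is the already-connected socket and everything else is plain POSIX):

#include <cstdio>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/time.h>

// Sketch: verify the socket is blocking and has no receive timeout configured.
void CheckSocket(const int Fd) {
  // O_NONBLOCK would make recv() return right away with EAGAIN/EWOULDBLOCK.
  const int Flags = fcntl(Fd, F_GETFL, 0);
  printf("blocking: %d\n", !(Flags & O_NONBLOCK));

  // SO_RCVTIMEO would make recv() fail with EAGAIN once the timeout elapses.
  timeval Tv = {};
  socklen_t Len = sizeof(Tv);
  getsockopt(Fd, SOL_SOCKET, SO_RCVTIMEO, &Tv, &Len);
  printf("recv timeout: %lds %ldus (0/0 means none)\n", long(Tv.tv_sec), long(Tv.tv_usec));
}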

Oh interesting - if you start a server w/ no clients and let it run for 1-5 min, does the receive fail as well? If not, I don't think it comes from timing. Also, what OS are you running into this on?

And no, that recv call is meant to block (so O_NONBLOCK isn't set anywhere as far as I'm aware).

Cheers

Okay actually, just spinning up a server is not enough; I added a Sleep(5*60*1000); inside the fuzzer_dummy.cc module and it seems to work fine on Windows (no timeout).

Any special thing about your set-up?

Cheers

I worked on Ubuntu 22.04 with KVM; hmm, I'm gonna test it with sleep().

I tested with a sleep(5*60*1000); and the same error occurred.

strace result is as follows:

recvfrom(8, 0x563e8545c340, 4, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
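
For reference, a recv() wrapper that retries on EINTR would look roughly like this (just a sketch with names of my own, not what socket.cc does; it would mask the interruption, not explain why the SIGALRM fires during the receive):

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

// Sketch: read exactly Size bytes, retrying when a signal (e.g. SIGALRM)
// interrupts the blocking recv() with EINTR.
bool RecvAll(const int Fd, uint8_t *Buffer, const size_t Size) {
  size_t Received = 0;
  while (Received < Size) {
    const ssize_t R = recv(Fd, (char *)Buffer + Received, Size - Received, 0);
    if (R == -1 && errno == EINTR) {
      continue; // Interrupted by a signal; just retry.
    }
    if (R <= 0) {
      return false; // Hard error, or the peer closed the connection.
    }
    Received += size_t(R);
  }
  return true;
}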

Okay this is weird - I can't seem to repro this on an Ubuntu running in WSL2:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

I wonder what's going on 🤔

My bad, it was me.
I gave --limit 100 and thought it was only for execution time.

Okay sorry for the lag, I just started to look into this.

Do you remember which backend you were using?

From what I can see in the code, the bxcpu backend stops executing and sends a timeout to the server (as expected), the KVM backend uses a SIGALRM to stop KVM, and WHV uses a TimerQ_t to cancel the VCPU.
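
For context, the SIGALRM route follows roughly this pattern (a simplified sketch of the general approach, not the exact wtf code; kvm_run::immediate_exit is the real KVM API field, the other names are mine):

#include <cstdint>
#include <linux/kvm.h>
#include <signal.h>
#include <sys/time.h>

// Sketch: the handler flips kvm_run::immediate_exit. A pending signal interrupts
// a KVM_RUN ioctl already in flight, and immediate_exit guarantees that a KVM_RUN
// entered after the signal fired returns right away (closing the race). Without
// SA_RESTART, the same signal also interrupts blocking syscalls such as recv(),
// which is what the strace above shows.
static kvm_run *GlobalRun = nullptr;

static void AlarmHandler(int) {
  if (GlobalRun != nullptr) {
    GlobalRun->immediate_exit = 1;
  }
}

static void ArmTimeLimit(const uint32_t Seconds) {
  struct sigaction Sa = {};
  Sa.sa_handler = AlarmHandler;
  sigaction(SIGALRM, &Sa, nullptr);

  itimerval Timer = {};
  Timer.it_value.tv_sec = Seconds; // One-shot timer; it_interval stays at zero.
  setitimer(ITIMER_REAL, &Timer, nullptr);
}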

Cheers

I only tested on the KVM backend.

If I remember correctly, it was on the fuzzer side.

That's right, it matches what I saw :)

As with bochscpu, I thought about making the KVM/WHv backend (time) limit apply only to execution (not execution + mutation), but I think it is fine as it is now.

Yes, that's the issue here.

Okay, so it makes sense that the fuzzer side receives a SIGALRM; this should stop KVM from running the testcase and the code will send a timeout result to the server, but the server shouldn't stop.

In the above scenario, the SIGALRM does not occur while KVM is 'running' the testcase, but while the server is performing mutation (before KVM executes the testcase).

Okay, I set up a VM w/ nested virtualization to try this out, and I played with it some more. At first, I thought I had a repro when simulating a long InsertTestcase using a sleep like I recommended earlier. When doing that, the first testcase executed as expected, but the next ones would not sleep the full time and would get interrupted, which was weird. After reading the sleep manual more, I realized it doesn't play nice w/ signals, so it's not a good way to fake a slow testcase generation.

At that point, I removed the sleep and used a long loop to simulate a long workload. The first testcase executes as expected: the generation completes, the testcase is run and returns a cr3. Now here's what I think you observed. The second testcase takes a long time to generate like the first one, which is expected, but this time when it gets executed it ends up timing out before hitting the expected cr3 result and generates a timeout result.

# Server log
$ ../../src/build/wtf master --runs=100 --max_len=10000 --name dummy --target .
Seeded with 1181810723460784320
Iterating through the corpus..
Sorting through the 1 entries..
Running server on tcp://localhost:31337..
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 9223372036854.8m (1 nodes) lastcov: 2.0s crash: 0 timeout: 0 cr3: 0 uptime: 2.0s
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 0 uptime: 1.4min
#1 cov: 13 (+13) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 1 uptime: 2.9min
#2 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 2.8min crash: 0 timeout: 1 cr3: 1 uptime: 4.2min
#3 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 4.2min crash: 0 timeout: 2 cr3: 1 uptime: 5.6min

# Fuzzer log
$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x55a4bf491970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.4min
Generating the testcase, sleeping for 60s..
Done
#2 cov: 13 exec/s: 0.0 lastcov: 1.4min crash: 0 timeout: 1 cr3: 1 uptime: 2.8min
Generating the testcase, sleeping for 60s..
Done
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 2 cr3: 1 uptime: 4.2min
Generating the testcase, sleeping for 60s..
Done
#4 cov: 13 exec/s: 0.0 lastcov: 4.2min crash: 0 timeout: 3 cr3: 1 uptime: 5.6min
Generating the testcase, sleeping for 60s..

At this point I think there's definitely a bug related to the limit feature in KVM; maybe it comes from the way I handled SIGALRM or something. Maybe the signal from testcase n-1 is triggered when testcase n is executed, which shuts it down earlier than it should.

Anyways, now that I have a repro I can investigate :)

Cheers

Boy, this is a bad bug actually. The issue might be that I never reset immediate_exit to 0 🤦🏽‍♂️, which means any testcase that triggers a timeout will make all the following testcases trigger timeouts as well 😬

If you still have your environment to repro, I'd love it if you could verify that the below patch fixes your issue:

--- a/src/wtf/kvm_backend.cc
+++ b/src/wtf/kvm_backend.cc
@@ -1454,7 +1454,7 @@ std::optional<TestcaseResult_t> KvmBackend_t::Run(const uint8_t *Buffer,
   Stop_ = false;
   TestcaseRes_ = Ok_t();
   Coverage_.clear();
-
+  Run_->immediate_exit = 0;
   while (!Stop_) {
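
(For context: as far as I understand the KVM API, the kernel only reads immediate_exit when KVM_RUN is entered and never clears it itself, so once the SIGALRM handler sets it for a timed-out testcase, every subsequent KVM_RUN exits immediately until userspace zeroes it again, which is what the added line does at the top of Run.)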

It fixes it on my test bench:

$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
[sudo] password for over:
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x563171252970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.6min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#2 cov: 13 exec/s: 0.0 lastcov: 1.5min crash: 0 timeout: 0 cr3: 2 uptime: 3.1min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 0 cr3: 3 uptime: 4.4min

Cheers

FYI I will land #161 & close this issue next week if I don't hear anything back :)

Cheers