zmap/zmap

ZMap crashes when the host clock goes backwards

maxmouchet opened this issue · 3 comments

Describe the bug

ZMap randomly crashes with an assertion failure:

3:22 16% (17m left); send: 7052636 35.0 Kp/s (34.9 Kp/s avg); recv: 733668 3.39 Kp/s (3.63 Kp/s avg); drops: 0 p/s (0 p/s avg); hitrate: 10.40%
3:23 17% (17m left); send: 7087436 34.8 Kp/s (34.9 Kp/s avg); recv: 736653 2.98 Kp/s (3.63 Kp/s avg); drops: 0 p/s (0 p/s avg); hitrate: 10.39%
zmap: /root/zmap/src/send.c:322: send_run: Assertion `t > last_time' failed.

CLI Arguments

zmap -r 35000 -M icmp_echoscan -w zmap_targets.csv -b /root/blacklist_v4 -o icmp.csv

Expected behavior

ZMap should not crash.

Environment:

Additional context

There is probably an issue on our host causing the clock to go backwards (e.g. NTP).

However, I noticed that ZMap uses gettimeofday which makes no guarantee on the monotonicity of the clock:

The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time).  If you need a monotonically increasing clock, see clock_gettime(2).

Here is a small program to reproduce the issue (on affected hosts):

#include <stdio.h>
#include <sys/time.h>

// From zmap/src/logger.c
double now(void) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double) now.tv_sec + (double) now.tv_usec / 1000000.;
}

int main() {
    double last_time = now();
    while (1) {
        double t = now();
        if (t < last_time) {
            printf("Non-monotonic clock: now=%f last=%f diff=%f\n", t, last_time, t - last_time);
            break;
        }
        last_time = t;
    }
    return 0;
}
gcc -O3 main.c && ./a.out
# After a few seconds:
# Non-monotonic clock: now=1671705231.484864 last=1671705232.725731 diff=-1.240867

This issue is bound to happen randomly when ZMap runs for a long time.
Would it makes sense for ZMap to use clock_gettime instead of gettimeofday?

I don't believe clock_gettime is POSIX standard, but I'd be fine using it on Linux (and other OSes where it's implemented). I'd be happy to look at a PR that changes this, but it's likely not going to rise the top of the priority list for the main development team to tackle in the foreseeable future compared to some other pending issues.

Sure, and that's a good point! We'll try this internally and I'll open a PR if it solves the problem.

Edit: it seems it is part of POSIX (https://pubs.opengroup.org/onlinepubs/9699919799/functions/clock_getres.html).

Fixed by #708