kilobyte/topline

glibc 2.33 does not change select's timeout, hence breaks topline

Closed this issue · 8 comments

SELECT (2) ...

       On Linux, select() modifies timeout to reflect the amount of time not slept; most  other  implementations
       do  not  do  this.   (POSIX.1  permits either behavior.)  This causes problems both when Linux code which
       reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct
       timeval for multiple select()s in a loop without reinitializing it.  Consider timeout to be undefined af‐
       ter select() returns

@rbalint Why do you think that select in glibc 2.33 does not update the timeout value? The implementation has code to do exactly that. Do you see problems on specific architectures? Thanks.

Yes topline broke on all architectures with glibc 2.33 in Ubuntu CI. Amd64:
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/t/topline/20210210_153219_c271c@/log.gz

autopkgtest [15:32:08]: test run-and-signal:  - - - - - - - - - - results - - - - - - - - - -
run-and-signal       FAIL non-zero exit status 1

There's a new bug in glibc. The code snippet is:

      if (r == 0 || errno != ENOSYS)
        {
          if (timeout != NULL)
            TIMEVAL_TO_TIMESPEC (timeout, &ts64);
          return r;
        }

Two pages down the result is handled:

  int r = __select64 (nfds, readfds, writefds, exceptfds, ptv64);
  if (r >= 0 && timeout != NULL)
    /* The remanining timeout will be always less the input TIMEOUT.  */
    *timeout = valid_timeval64_to_timeval (tv64);

So... it depends what the (undefined!) value of errno happens to be.

Hmm, that's not what the description of this issue suggests.
On amd64, the ENOSYS fallback code is not actually compiled in. (SYSCALL_CANCEL in glibc sets errno.)

SYSCALL_CANCEL sets errno only on failure; here we have a successful return with r>0

I don't think that's something that has changed in glibc. The old select did not set errno unconditionally, either.
Do you know what the run-and-signal test is doing that is reported as failing?

@fweimer-rh The test does this basically:

$ topline sleep 2
loop(⠀⠀⠀⠀⠀⠀⠀⠀)sd(⠀)dm(⠀⠀⠀)loop(⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀) (▁   )

With glibc 2.33 the line does now show up.

I've added a minimal debugging patch on top of topline:

diff --git a/topline.c b/topline.c
index 361d02a..480a53a 100644
--- a/topline.c
+++ b/topline.c
@@ -326,13 +326,16 @@ int main(int argc, char **argv)
                 fds|=1<<linebuf[i].fd;
         if (select(fds?NFDS:0, (void*)&fds, 0, 0, &delay)==-1)
         {
+            printf("(select() == -1) delay: {%ld, %ld}\n", delay.tv_sec, delay.tv_usec);
             if (errno!=EINTR)
                 die("select: %m\n");
         }
-        else for (int i=0; i<ARRAYSZ(linebuf); i++)
+        else {
+            printf("(select() != -1) delay: {%ld, %ld}\n", delay.tv_sec, delay.tv_usec);
+           for (int i=0; i<ARRAYSZ(linebuf); i++)
             if (linebuf[i].fd && linebuf[i].fd<sizeof(fds)*8 && fds&1<<linebuf[i].fd)
                 copy_line(&linebuf[i]);
-    }
+    }}
 
     if (child_pid)
     {

With glibc 2.32 this is the output:

./topline sleep 2
(select() != -1) delay: {0, 0}
loop(⠀⠀⠀⠀⠀⠀⠀⠀)sd(⠀)dm(⢀⢀⠀)loop(⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀) (    )
(select() == -1) delay: {0, 3588}

with glibc 2.33:

./topline sleep 2
(select() != -1) delay: {1, 0}
(select() == -1) delay: {1, 0}

Fixed in 8981b07 1½ years ago, by using the syscall directly.