golang/go

internal/poll: can get EINTR reading from CIFS

dsoprea opened this issue · 20 comments

What version of Go are you using (go version)?

1.14.2

Does this issue reproduce with the latest release?

(This is the latest release)

What operating system and processor architecture are you using (go env)?

linux, amd64

What did you do?

I'm running a Go tool of mine over a bunch of images. I'm getting periodic "interrupted system call" errors:

*os.PathError read /general_images/dense_album_staging/2015/google_photos/takeout-20150928-0047/dump/2012-12-05-/IMAG0507-edited.jpg: interrupted system call
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go:69 (0x7c8a4c)
	GetExif: log.PanicIf(err)

I've restarted several times. On earlier runs, I've gotten large clumps of these in the output (seems racy).

They always happen in the same place:

    65		} else {
    66			var err error
    67	
    68			data, err = ioutil.ReadFile(imageFilepath)
    69			log.PanicIf(err)
    70		}

Is it possible that ioutil.ReadFile() is not correctly managing signals?

(https://golang.org/pkg/os/signal/#hdr-Go_programs_that_use_cgo_or_SWIG)

If the non-Go code installs any signal handlers, it must use the SA_ONSTACK flag with sigaction. Failing to do so is likely to cause the program to crash if the signal is received. Go programs routinely run with a limited stack, and therefore set up an alternate signal stack. Also, the Go standard library expects that any signal handlers will use the SA_RESTART flag. Failing to do so may cause some library calls to return "interrupted system call" errors.

I believe this is expected due to runtime changes in Go 1.14, see: https://golang.org/doc/go1.14#runtime.

No, 1.14 triggers more signals, but Go installs signal handlers with SA_RESTART set, so any signals that occur during system calls called by ioutil.ReadFile should be restarted.

Is this a pure Go program?

Can you run the program under strace -f and see exactly what system call is returning EINTR, and show us a dozen lines around that call? Thanks.

Yes, it's definitely pure Go (a principal feature of the project).

Okay. I'm running it with strace right now. I don't know how long it will take to encounter the first one. Hopefully the slowdown won't completely stop it from occurring.

Got a couple.

1: Triggered in lines 423-427

   401  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}}) = 0
   402  [pid  2791] <... nanosleep resumed> NULL) = 0
   403  [pid  2791] epoll_pwait(4,  <unfinished ...>
   404  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}} <unfinished ...>
   405  [pid  2791] <... epoll_pwait resumed> [], 128, 0, NULL, 139628791573936) = 0
   406  [pid  2790] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   407  [pid  2791] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   408  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d7a8c) = -1 EPERM (Operation not permitted)
   409  [pid  2791] <... nanosleep resumed> NULL) = 0
   410  [pid  2791] nanosleep({tv_sec=0, tv_nsec=40000},  <unfinished ...>
   411  [pid  2790] fstat(3, {st_mode=S_IFREG|0755, st_size=2296983, ...}) = 0
   412  [pid  2791] <... nanosleep resumed> NULL) = 0
   413  [pid  2791] nanosleep({tv_sec=0, tv_nsec=80000},  <unfinished ...>
   414  [pid  2790] mmap(NULL, 1439992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
   415  [pid  2791] <... nanosleep resumed> NULL) = 0
   416  [pid  2790] <... mmap resumed> )        = 0x7efdd6e9e000
   417  [pid  2791] getpid()                    = 2790
   418  [pid  2791] tgkill(2790, 2790, SIGURG <unfinished ...>
   419  [pid  2790] read(3,  <unfinished ...>
   420  [pid  2791] <... tgkill resumed> )      = 0
   421  [pid  2791] nanosleep({tv_sec=0, tv_nsec=160000}, NULL) = 0
   422  [pid  2791] nanosleep({tv_sec=0, tv_nsec=320000},  <unfinished ...>

   423  [pid  2790] <... read resumed> 0xc000180000, 2297495) = -1 EINTR (Interrupted system call)
   424  [pid  2790] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=2790, si_uid=1000} ---
   425  [pid  2790] rt_sigreturn({mask=[]} <unfinished ...>
   426  [pid  2791] <... nanosleep resumed> NULL) = 0
   427  [pid  2790] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)

   428  [pid  2791] nanosleep({tv_sec=0, tv_nsec=640000},  <unfinished ...>
   429  [pid  2790] close(3)                    = 0
   430  [pid  2791] <... nanosleep resumed> NULL) = 0
   431  [pid  2791] nanosleep({tv_sec=0, tv_nsec=1280000},  <unfinished ...>
   432  [pid  2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go", O_RDONLY|O_CLOEXEC) = 3
   433  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -1 EPERM (Operation not permitted)
   434  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
   435  [pid  2790] read(3, "package exifknife\n\nimport (\n\t\"er"..., 4096) = 4096
   436  [pid  2790] close(3)                    = 0
   437  [pid  2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go", O_RDONLY|O_CLOEXEC) = 3
   438  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}} <unfinished ...>
   439  [pid  2791] <... nanosleep resumed> NULL) = 0
   440  [pid  2790] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   441  [pid  2791] nanosleep({tv_sec=0, tv_nsec=2560000},  <unfinished ...>
   442  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
   443  [pid  2790] read(3, "package exifkniferead\n\nimport (\n"..., 4096) = 4096
   444  [pid  2790] close(3)                    = 0
   445  [pid  2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/command/go-exif-knife/main.go", O_RDONLY|O_CLOEXEC) = 3
   446  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -1 EPERM (Operation not permitted)
   447  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
   448  [pid  2790] read(3, "package main\n\nimport (\n\t\"os\"\n\n\t\""..., 4096) = 3957
   449  [pid  2790] close(3)                    = 0
   450  [pid  2790] openat(AT_FDCWD, "/usr/local/go/src/runtime/proc.go", O_RDONLY|O_CLOEXEC) = 3
   451  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -

2: Triggered in lines 475-481

   452  [pid  3276] pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0
   453  [pid  3277] <... nanosleep resumed> NULL) = 0
   454  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}} <unfinished ...>
   455  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   456  [pid  3276] <... epoll_ctl resumed> )   = 0
   457  [pid  3277] <... nanosleep resumed> NULL) = 0
   458  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}} <unfinished ...>
   459  [pid  3277] epoll_pwait(4,  <unfinished ...>
   460  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   461  [pid  3277] <... epoll_pwait resumed> [], 128, 0, NULL, 140490176716208) = 0
   462  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000dfa8c <unfinished ...>
   463  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   464  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   465  [pid  3277] <... nanosleep resumed> NULL) = 0
   466  [pid  3276] fstat(3,  <unfinished ...>
   467  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   468  [pid  3276] <... fstat resumed> {st_mode=S_IFREG|0755, st_size=310986, ...}) = 0
   469  [pid  3277] <... nanosleep resumed> NULL) = 0
   470  [pid  3277] getpid()                    = 3276
   471  [pid  3277] tgkill(3276, 3276, SIGURG <unfinished ...>
   472  [pid  3276] read(3,  <unfinished ...>
   473  [pid  3277] <... tgkill resumed> )      = 0
   474  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0

   475  [pid  3276] <... read resumed> 0xc000180000, 311498) = -1 EINTR (Interrupted system call)
   476  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   477  [pid  3276] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=3276, si_uid=1000} ---
   478  [pid  3277] <... nanosleep resumed> NULL) = 0
   479  [pid  3276] rt_sigreturn({mask=[]} <unfinished ...>
   480  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   481  [pid  3276] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)

   482  [pid  3277] <... nanosleep resumed> NULL) = 0
   483  [pid  3276] close(3 <unfinished ...>
   484  [pid  3277] nanosleep({tv_sec=0, tv_nsec=40000}, NULL) = 0
   485  [pid  3277] nanosleep({tv_sec=0, tv_nsec=80000}, NULL) = 0
   486  [pid  3276] <... close resumed> )       = 0
   487  [pid  3277] nanosleep({tv_sec=0, tv_nsec=160000},  <unfinished ...>
   488  [pid  3276] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go", O_RDONLY|O_CLOEXEC <unfinished ...>
   489  [pid  3277] <... nanosleep resumed> NULL) = 0
   490  [pid  3276] <... openat resumed> )      = 3
   491  [pid  3277] nanosleep({tv_sec=0, tv_nsec=320000},  <unfinished ...>
   492  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}}) = -1 EPERM (Operation not permitted)
   493  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000df7b4) = -1 EPERM (Operation not permitted)
   494  [pid  3277] <... nanosleep resumed> NULL) = 0
   495  [pid  3276] read(3,  <unfinished ...>
   496  [pid  3277] nanosleep({tv_sec=0, tv_nsec=640000},  <unfinished ...>
   497  [pid  3276] <... read resumed> "package exifknife\n\nimport (\n\t\"er"..., 4096) = 4096
   498  [pid  3276] close(3)                    = 0
   499  [pid  3276] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go", O_RDONLY|O_CLOEXEC) = 3
   500  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}}) = -1 EPERM (Operation not permitted)
   501  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000df7b4 <unfinished ...>
   502  [pid  3277] <... nanosleep resumed> NULL) = 0

Thanks. I see that the read calls are being interrupted, but I don't see the calls that created those file descriptors. They are likely to be just shortly before the traces you show. What I'm wondering is whether those files are on an unusual file system, such as /sys.

This corresponds to trace (2) above (about twenty lines before):

   432  [pid  3276] openat(AT_FDCWD, "/general_images/dense_album_staging/2015/google_photos/takeout-20150928-0047/dump/2014-05-10 #2/20140510_161510-edited.jpg", O_RDONLY|O_CLOEXEC <unfinished ...>

   433  [pid  3277] <... nanosleep resumed> NULL) = 0
   434  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   435  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   436  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   437  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   438  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   439  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   440  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   441  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   442  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   443  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   444  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   445  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   446  [pid  3276] <... openat resumed> )      = 3
   447  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   448  [pid  3276] epoll_create1(EPOLL_CLOEXEC <unfinished ...>
   449  [pid  3277] <... nanosleep resumed> NULL) = 0
   450  [pid  3276] <... epoll_create1 resumed> ) = 4
   451  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   452  [pid  3276] pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0
   453  [pid  3277] <... nanosleep resumed> NULL) = 0
   454  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}} <unfinished ...>
   455  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   456  [pid  3276] <... epoll_ctl resumed> )   = 0
   457  [pid  3277] <... nanosleep resumed> NULL) = 0
   458  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}} <unfinished ...>
   459  [pid  3277] epoll_pwait(4,  <unfinished ...>
   460  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   461  [pid  3277] <... epoll_pwait resumed> [], 128, 0, NULL, 140490176716208) = 0
   462  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000dfa8c <unfinished ...>
   463  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   464  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   465  [pid  3277] <... nanosleep resumed> NULL) = 0
   466  [pid  3276] fstat(3,  <unfinished ...>
   467  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   468  [pid  3276] <... fstat resumed> {st_mode=S_IFREG|0755, st_size=310986, ...}) = 0
   469  [pid  3277] <... nanosleep resumed> NULL) = 0
   470  [pid  3277] getpid()                    = 3276
   471  [pid  3277] tgkill(3276, 3276, SIGURG <unfinished ...>
   472  [pid  3276] read(3,  <unfinished ...>
   473  [pid  3277] <... tgkill resumed> )      = 0
   474  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0

   475  [pid  3276] <... read resumed> 0xc000180000, 311498) = -1 EINTR (Interrupted system call)

As a kludge, I added a loop in the scripting that's calling this tool to loop whenever it sees this failure. It looks like, that, when it occurs, it may perfectly reproduce, at least under similar conditions. It was likely exposed by network latency. Your guess/suggestion was very good.

$ time ./parse_images.py      
interrupted system call; retrying (1): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (2): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (3): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (4): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (5): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
Too many tries.                                               
                                                    
ERROR: [dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
['/tmp/go-exif-knife', 'read', '--filepath', '/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg', '--ignore-no-exif']
                                                      
                                                       
Stack:                                                                                        
                                                                 
*os.PathError read /general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg: interrupted system call
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go:69 (0x7c8ddc)
        GetExif: log.PanicIf(err)                       
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go:39 (0x7cafe2)
        (*ExifRead).Read: mc, err := exifknife.GetExif(imageFilepath)      
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/command/go-exif-knife/main.go:79 (0x7cd6d9)
        main: err := er.Read(options.Filepath, options.JustTry, options.SpecificIfd, options.SpecificTags, options.JustValues, options.Json, options.IgnoreNoExif)
/usr/local/go/src/runtime/proc.go:203 (0x436d72)                                          
        main: fn()                                                       
/usr/local/go/src/runtime/asm_amd64.s:1373 (0x463491)
        goexit: BYTE    $0x90   // NOP

Thanks for the info. It's the use of CIFS that causes this. I think the only fix is to fix #20400.

That sucks. This is happening in the context of ioutil.ReadFile(), but I should assume it will happen regardless of the method?

Yes, it can happen in general.

For 1.14 you can make the problem less likely to occur by setting GODEBUG=asyncpreemptoff=1 in the environment. But it can still happen if your process receives a signal.

What's the source of the handler which didn't set SA_RESTART in this case?

The handler did set SA_RESTART. What happened here is that reading from CIFS didn't honor it. At least, that is my assumption, since CIFS is a networked file system.

Meaning a bug in Linux? Maybe a later kernel doesn't do this? Possibly related:
https://patchwork.kernel.org/patch/10887855/
https://spinics.net/lists/linux-cifs/msg16565.html

Change https://golang.org/cl/232862 mentions this issue: internal/poll, os: loop on EINTR

I posted a 1.14 backport request for this: #39026

tmm1 commented

I am running into EINTR failures with a CIFS mount as well, on go1.14.

I cherry-picked 8c1db77 onto upstream/release-branch.go1.14 and have been trying that.

Before the cherry-pick, the errors were always in the read or write syscalls:

read /Channels-Video/Streaming/channels-cache-114767979/b7fbdba671ed14a2cb5c0ad7eb058ce45ad63bff1c8923cc9ac30911e6658f13: interrupted system call
write /Channels-Video/Streaming/channels-cache-525223662/99607c430416930d925f9399b8763fb79528ff4147f36f18ea6d53d462b2e7a0: interrupted system call

After the cherry-pick, the situation has improved but I still see occasional errors from open:

open /Channels-Video/Streaming/channels-cache-385377817/162c68b4fce222635832db9996c5eb8841b1d4ae4beb4a08689aeb5c51133ef6: interrupted system call

This fix isn't supported in 1.14, sadly. Can you try the 1.15 beta, and if it fails, file a new issue?

tmm1 commented

Unfortunately I'm not able to do that since I don't control the machine/network where it occurs.

If there are other related CLs I would appreciate a pointer to them. But from what I saw only the read/write() paths were fixed, so I would guess 1.15 beta has the same problem with open()

You probably also want https://golang.org/cl/236997. See #39237.