internal/poll: can get EINTR reading from CIFS
dsoprea opened this issue · 20 comments
What version of Go are you using (go version)?
1.14.2
Does this issue reproduce with the latest release?
(This is the latest release)
What operating system and processor architecture are you using (go env)?
linux, amd64
What did you do?
I'm running a Go tool of mine over a bunch of images. I'm getting periodic "interrupted system call" errors:
*os.PathError read /general_images/dense_album_staging/2015/google_photos/takeout-20150928-0047/dump/2012-12-05-/IMAG0507-edited.jpg: interrupted system call
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go:69 (0x7c8a4c)
GetExif: log.PanicIf(err)
I've restarted several times. On earlier runs, I've gotten large clumps of these in the output (seems racy).
They always happen in the same place:
65 } else {
66 var err error
67
68 data, err = ioutil.ReadFile(imageFilepath)
69 log.PanicIf(err)
70 }
Is it possible that ioutil.ReadFile() is not correctly managing signals?
(https://golang.org/pkg/os/signal/#hdr-Go_programs_that_use_cgo_or_SWIG)
If the non-Go code installs any signal handlers, it must use the SA_ONSTACK flag with sigaction. Failing to do so is likely to cause the program to crash if the signal is received. Go programs routinely run with a limited stack, and therefore set up an alternate signal stack. Also, the Go standard library expects that any signal handlers will use the SA_RESTART flag. Failing to do so may cause some library calls to return "interrupted system call" errors.
I believe this is expected due to runtime changes in Go 1.14, see: https://golang.org/doc/go1.14#runtime.
No, 1.14 triggers more signals, but Go installs signal handlers with SA_RESTART set, so any signals that occur during system calls called by ioutil.ReadFile should be restarted.
Is this a pure Go program?
Can you run the program under strace -f and see exactly what system call is returning EINTR, and show us a dozen lines around that call? Thanks.
Yes, it's definitely pure Go (a principal feature of the project).
Okay. I'm running it with strace right now. I don't know how long it will take to encounter the first one. Hopefully the slowdown won't completely stop it from occurring.
Got a couple.
1: Triggered in lines 423-427
401 [pid 2790] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}}) = 0
402 [pid 2791] <... nanosleep resumed> NULL) = 0
403 [pid 2791] epoll_pwait(4, <unfinished ...>
404 [pid 2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}} <unfinished ...>
405 [pid 2791] <... epoll_pwait resumed> [], 128, 0, NULL, 139628791573936) = 0
406 [pid 2790] <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
407 [pid 2791] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
408 [pid 2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d7a8c) = -1 EPERM (Operation not permitted)
409 [pid 2791] <... nanosleep resumed> NULL) = 0
410 [pid 2791] nanosleep({tv_sec=0, tv_nsec=40000}, <unfinished ...>
411 [pid 2790] fstat(3, {st_mode=S_IFREG|0755, st_size=2296983, ...}) = 0
412 [pid 2791] <... nanosleep resumed> NULL) = 0
413 [pid 2791] nanosleep({tv_sec=0, tv_nsec=80000}, <unfinished ...>
414 [pid 2790] mmap(NULL, 1439992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
415 [pid 2791] <... nanosleep resumed> NULL) = 0
416 [pid 2790] <... mmap resumed> ) = 0x7efdd6e9e000
417 [pid 2791] getpid() = 2790
418 [pid 2791] tgkill(2790, 2790, SIGURG <unfinished ...>
419 [pid 2790] read(3, <unfinished ...>
420 [pid 2791] <... tgkill resumed> ) = 0
421 [pid 2791] nanosleep({tv_sec=0, tv_nsec=160000}, NULL) = 0
422 [pid 2791] nanosleep({tv_sec=0, tv_nsec=320000}, <unfinished ...>
423 [pid 2790] <... read resumed> 0xc000180000, 2297495) = -1 EINTR (Interrupted system call)
424 [pid 2790] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=2790, si_uid=1000} ---
425 [pid 2790] rt_sigreturn({mask=[]} <unfinished ...>
426 [pid 2791] <... nanosleep resumed> NULL) = 0
427 [pid 2790] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
428 [pid 2791] nanosleep({tv_sec=0, tv_nsec=640000}, <unfinished ...>
429 [pid 2790] close(3) = 0
430 [pid 2791] <... nanosleep resumed> NULL) = 0
431 [pid 2791] nanosleep({tv_sec=0, tv_nsec=1280000}, <unfinished ...>
432 [pid 2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go", O_RDONLY|O_CLOEXEC) = 3
433 [pid 2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -1 EPERM (Operation not permitted)
434 [pid 2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
435 [pid 2790] read(3, "package exifknife\n\nimport (\n\t\"er"..., 4096) = 4096
436 [pid 2790] close(3) = 0
437 [pid 2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go", O_RDONLY|O_CLOEXEC) = 3
438 [pid 2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}} <unfinished ...>
439 [pid 2791] <... nanosleep resumed> NULL) = 0
440 [pid 2790] <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
441 [pid 2791] nanosleep({tv_sec=0, tv_nsec=2560000}, <unfinished ...>
442 [pid 2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
443 [pid 2790] read(3, "package exifkniferead\n\nimport (\n"..., 4096) = 4096
444 [pid 2790] close(3) = 0
445 [pid 2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/command/go-exif-knife/main.go", O_RDONLY|O_CLOEXEC) = 3
446 [pid 2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -1 EPERM (Operation not permitted)
447 [pid 2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
448 [pid 2790] read(3, "package main\n\nimport (\n\t\"os\"\n\n\t\""..., 4096) = 3957
449 [pid 2790] close(3) = 0
450 [pid 2790] openat(AT_FDCWD, "/usr/local/go/src/runtime/proc.go", O_RDONLY|O_CLOEXEC) = 3
451 [pid 2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -
2: Triggered in lines 475-481
452 [pid 3276] pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0
453 [pid 3277] <... nanosleep resumed> NULL) = 0
454 [pid 3276] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}} <unfinished ...>
455 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
456 [pid 3276] <... epoll_ctl resumed> ) = 0
457 [pid 3277] <... nanosleep resumed> NULL) = 0
458 [pid 3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}} <unfinished ...>
459 [pid 3277] epoll_pwait(4, <unfinished ...>
460 [pid 3276] <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
461 [pid 3277] <... epoll_pwait resumed> [], 128, 0, NULL, 140490176716208) = 0
462 [pid 3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000dfa8c <unfinished ...>
463 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
464 [pid 3276] <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
465 [pid 3277] <... nanosleep resumed> NULL) = 0
466 [pid 3276] fstat(3, <unfinished ...>
467 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
468 [pid 3276] <... fstat resumed> {st_mode=S_IFREG|0755, st_size=310986, ...}) = 0
469 [pid 3277] <... nanosleep resumed> NULL) = 0
470 [pid 3277] getpid() = 3276
471 [pid 3277] tgkill(3276, 3276, SIGURG <unfinished ...>
472 [pid 3276] read(3, <unfinished ...>
473 [pid 3277] <... tgkill resumed> ) = 0
474 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
475 [pid 3276] <... read resumed> 0xc000180000, 311498) = -1 EINTR (Interrupted system call)
476 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
477 [pid 3276] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=3276, si_uid=1000} ---
478 [pid 3277] <... nanosleep resumed> NULL) = 0
479 [pid 3276] rt_sigreturn({mask=[]} <unfinished ...>
480 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
481 [pid 3276] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
482 [pid 3277] <... nanosleep resumed> NULL) = 0
483 [pid 3276] close(3 <unfinished ...>
484 [pid 3277] nanosleep({tv_sec=0, tv_nsec=40000}, NULL) = 0
485 [pid 3277] nanosleep({tv_sec=0, tv_nsec=80000}, NULL) = 0
486 [pid 3276] <... close resumed> ) = 0
487 [pid 3277] nanosleep({tv_sec=0, tv_nsec=160000}, <unfinished ...>
488 [pid 3276] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go", O_RDONLY|O_CLOEXEC <unfinished ...>
489 [pid 3277] <... nanosleep resumed> NULL) = 0
490 [pid 3276] <... openat resumed> ) = 3
491 [pid 3277] nanosleep({tv_sec=0, tv_nsec=320000}, <unfinished ...>
492 [pid 3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}}) = -1 EPERM (Operation not permitted)
493 [pid 3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000df7b4) = -1 EPERM (Operation not permitted)
494 [pid 3277] <... nanosleep resumed> NULL) = 0
495 [pid 3276] read(3, <unfinished ...>
496 [pid 3277] nanosleep({tv_sec=0, tv_nsec=640000}, <unfinished ...>
497 [pid 3276] <... read resumed> "package exifknife\n\nimport (\n\t\"er"..., 4096) = 4096
498 [pid 3276] close(3) = 0
499 [pid 3276] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go", O_RDONLY|O_CLOEXEC) = 3
500 [pid 3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}}) = -1 EPERM (Operation not permitted)
501 [pid 3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000df7b4 <unfinished ...>
502 [pid 3277] <... nanosleep resumed> NULL) = 0
Thanks. I see that the read calls are being interrupted, but I don't see the calls that created those file descriptors. They are likely to be just shortly before the traces you show. What I'm wondering is whether those files are on an unusual file system, such as /sys.
This corresponds to trace (2) above (about twenty lines before):
432 [pid 3276] openat(AT_FDCWD, "/general_images/dense_album_staging/2015/google_photos/takeout-20150928-0047/dump/2014-05-10 #2/20140510_161510-edited.jpg", O_RDONLY|O_CLOEXEC <unfinished ...>
433 [pid 3277] <... nanosleep resumed> NULL) = 0
434 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
435 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
436 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
437 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
438 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
439 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
440 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
441 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
442 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
443 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
444 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
445 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
446 [pid 3276] <... openat resumed> ) = 3
447 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
448 [pid 3276] epoll_create1(EPOLL_CLOEXEC <unfinished ...>
449 [pid 3277] <... nanosleep resumed> NULL) = 0
450 [pid 3276] <... epoll_create1 resumed> ) = 4
451 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
452 [pid 3276] pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0
453 [pid 3277] <... nanosleep resumed> NULL) = 0
454 [pid 3276] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}} <unfinished ...>
455 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
456 [pid 3276] <... epoll_ctl resumed> ) = 0
457 [pid 3277] <... nanosleep resumed> NULL) = 0
458 [pid 3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}} <unfinished ...>
459 [pid 3277] epoll_pwait(4, <unfinished ...>
460 [pid 3276] <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
461 [pid 3277] <... epoll_pwait resumed> [], 128, 0, NULL, 140490176716208) = 0
462 [pid 3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000dfa8c <unfinished ...>
463 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
464 [pid 3276] <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
465 [pid 3277] <... nanosleep resumed> NULL) = 0
466 [pid 3276] fstat(3, <unfinished ...>
467 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
468 [pid 3276] <... fstat resumed> {st_mode=S_IFREG|0755, st_size=310986, ...}) = 0
469 [pid 3277] <... nanosleep resumed> NULL) = 0
470 [pid 3277] getpid() = 3276
471 [pid 3277] tgkill(3276, 3276, SIGURG <unfinished ...>
472 [pid 3276] read(3, <unfinished ...>
473 [pid 3277] <... tgkill resumed> ) = 0
474 [pid 3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
475 [pid 3276] <... read resumed> 0xc000180000, 311498) = -1 EINTR (Interrupted system call)
As a kludge, I added a loop in the scripting that's calling this tool to loop whenever it sees this failure. It looks like, that, when it occurs, it may perfectly reproduce, at least under similar conditions. It was likely exposed by network latency. Your guess/suggestion was very good.
$ time ./parse_images.py
interrupted system call; retrying (1): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (2): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (3): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (4): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (5): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
Too many tries.
ERROR: [dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
['/tmp/go-exif-knife', 'read', '--filepath', '/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg', '--ignore-no-exif']
Stack:
*os.PathError read /general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg: interrupted system call
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go:69 (0x7c8ddc)
GetExif: log.PanicIf(err)
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go:39 (0x7cafe2)
(*ExifRead).Read: mc, err := exifknife.GetExif(imageFilepath)
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/command/go-exif-knife/main.go:79 (0x7cd6d9)
main: err := er.Read(options.Filepath, options.JustTry, options.SpecificIfd, options.SpecificTags, options.JustValues, options.Json, options.IgnoreNoExif)
/usr/local/go/src/runtime/proc.go:203 (0x436d72)
main: fn()
/usr/local/go/src/runtime/asm_amd64.s:1373 (0x463491)
goexit: BYTE $0x90 // NOP
Thanks for the info. It's the use of CIFS that causes this. I think the only fix is to fix #20400.
That sucks. This is happening in the context of ioutil.ReadFile(), but I should assume it will happen regardless of the method?
Yes, it can happen in general.
For 1.14 you can make the problem less likely to occur by setting GODEBUG=asyncpreemptoff=1 in the environment. But it can still happen if your process receives a signal.
What's the source of the handler which didn't set SA_RESTART in this case?
The handler did set SA_RESTART. What happened here is that reading from CIFS didn't honor it. At least, that is my assumption, since CIFS is a networked file system.
Meaning a bug in Linux? Maybe a later kernel doesn't do this? Possibly related:
https://patchwork.kernel.org/patch/10887855/
https://spinics.net/lists/linux-cifs/msg16565.html
Change https://golang.org/cl/232862 mentions this issue: internal/poll, os: loop on EINTR
I posted a 1.14 backport request for this: #39026
I am running into EINTR failures with a CIFS mount as well, on go1.14.
I cherry-picked 8c1db77 onto upstream/release-branch.go1.14 and have been trying that.
Before the cherry-pick, the errors were always in the read or write syscalls:
read /Channels-Video/Streaming/channels-cache-114767979/b7fbdba671ed14a2cb5c0ad7eb058ce45ad63bff1c8923cc9ac30911e6658f13: interrupted system call
write /Channels-Video/Streaming/channels-cache-525223662/99607c430416930d925f9399b8763fb79528ff4147f36f18ea6d53d462b2e7a0: interrupted system call
After the cherry-pick, the situation has improved but I still see occasional errors from open:
open /Channels-Video/Streaming/channels-cache-385377817/162c68b4fce222635832db9996c5eb8841b1d4ae4beb4a08689aeb5c51133ef6: interrupted system call
This fix isn't supported in 1.14, sadly. Can you try the 1.15 beta, and if it fails, file a new issue?
Unfortunately I'm not able to do that since I don't control the machine/network where it occurs.
If there are other related CLs I would appreciate a pointer to them. But from what I saw only the read/write() paths were fixed, so I would guess 1.15 beta has the same problem with open()
You probably also want https://golang.org/cl/236997. See #39237.