jacobsa/fuse

Running tests sometimes hangs at the FUSE level

stapelberg opened this issue · 6 comments

Steps to reproduce:

  1. cd ~/go/src/github.com/jacobsa/fuse
  2. go test -count=1 -v ./...

Sometimes, the above go test invocation hangs. ps shows:

[…]
michael        82587  20   0  0.3  0.0              \_ /home/michael/sdk/go1.16/bin/go test -count=1 -v ./...
michael        83396  20   0  3.9  0.1                  \_ /tmp/go-build2156820598/b126/memfs.test -test.paniconexit0 -test.timeout=10m0s -test.count=1 -test.v=true

(Sometimes it’s the statfs test process that hangs, so it’s not just memfs.)

Open file descriptors of the test process:

% ls -l /proc/83396/fd
total 0
lr-x------. 1 michael michael 64 2021-06-06 21:15 0 -> /dev/null
l-wx------. 1 michael michael 64 2021-06-06 21:15 1 -> 'pipe:[427286]'
lrwx------. 1 michael michael 64 2021-06-06 21:15 10 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 11 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 12 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 13 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 14 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 15 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 16 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 17 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 18 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 19 -> /dev/fuse
l-wx------. 1 michael michael 64 2021-06-06 21:15 2 -> 'pipe:[427286]'
lrwx------. 1 michael michael 64 2021-06-06 21:15 20 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 21 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 22 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 23 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 24 -> /dev/fuse
l-wx------. 1 michael michael 64 2021-06-06 21:15 26 -> /tmp/sample_test986118255/foo
lrwx------. 1 michael michael 64 2021-06-06 21:15 28 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 3 -> 'anon_inode:[eventpoll]'
l-wx------. 1 michael michael 64 2021-06-06 21:15 30 -> /tmp/sample_test986118255/foo
lr-x------. 1 michael michael 64 2021-06-06 21:15 4 -> 'pipe:[418737]'
l-wx------. 1 michael michael 64 2021-06-06 21:15 5 -> 'pipe:[418737]'
lr-x------. 1 michael michael 64 2021-06-06 21:15 6 -> /var/lib/sss/mc/passwd
lrwx------. 1 michael michael 64 2021-06-06 21:15 7 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 8 -> /dev/fuse
lrwx------. 1 michael michael 64 2021-06-06 21:15 9 -> /dev/fuse

Adding extra debug logging doesn’t seem to help, as go test doesn’t show the output until the test process finishes.

To get your computer unstuck, abort the hung FUSE connection:

% echo 1 | sudo tee /sys/fs/fuse/connections/43/abort

Possibly followed by fusermount -u <mountpoint> for all remaining FUSE mountpoints in /tmp.

This issue does not seem to be caused by a recent change, at least not within github.com/jacobsa/fuse (perhaps the cause is on the kernel side?). I can reproduce it with commit e7bcad2 from 2019, and also with the current version.

Any ideas/tips welcome.

I have no specific ideas or tips, but I will say (very generically, sorry) that with past issues like this I was able to get a lot of mileage out of digging around in /proc/$pid. For example, I think looking at the kernel stacks for the stuck threads would probably tell you something useful.
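
On Linux you can read those kernel stacks straight out of procfs (typically as root); for the stuck test process above, something like:

% sudo cat /proc/83396/task/*/stack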

gocryptfs developer here, https://github.com/hanwen/go-fuse had a similar problem.

Does the hang reproduce every time you run with GOMAXPROCS=1?
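
That is, something like:

% GOMAXPROCS=1 go test -count=1 -v ./...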

The workaround in go-fuse is this: https://github.com/hanwen/go-fuse/blob/master/fuse/poll.go

I encountered this problem and this is my working theory/understanding of what is happening. Thanks @rfjakob for the pointer.

Background: Go File IO

Go does file IO using epoll on Linux (or the equivalent on other targets). Go opens files in non-blocking mode and registers them with the runtime poller. The poller waits for IO events from the kernel on the registered files and, when one arrives, reports the goroutine waiting on that IO to the scheduler as ready to run.
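
As a generic illustration of that mechanism (a sketch using a pipe, which is pollable in the same way the FUSE device is; nothing here is jacobsa/fuse-specific):

package main

import (
	"fmt"
	"os"
)

func main() {
	// os.Pipe returns pollable files: the runtime puts them in non-blocking
	// mode and registers them with the poller.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		buf := make([]byte, 1)
		r.Read(buf) // parks this goroutine on the poller, not an OS thread
		close(done)
	}()

	w.Write([]byte{'x'}) // the readiness event wakes the reading goroutine
	<-done
	fmt.Println("reader woke up")
}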

The problem: FUSE + Test == Deadlock

The problem here happens because the FUSE server and the test run in the same process, and therefore share the same runtime poller. When the test does file IO that causes a poll on a file in the filesystem backed by the FUSE server, the test goroutine sleeps until the file is ready. Notice that at this point the FUSE server is also sleeping, waiting for messages from the kernel.

When the kernel receives the poll, it hands it to FUSE (kernel side), which forwards it to the FUSE server by writing a request to the connection file. The kernel then waits for the reply on that file before returning the poll call to the runtime poller. Unfortunately, the server never gets a chance to read the request: at this point the runtime poller is blocked waiting for the kernel to answer the original poll call, so it can't wake the FUSE server goroutine, which means the poll request is never answered back to the kernel. Deadlock.

Option 1: Don't poll from tests. (<--- This is what I'm doing)

The os package doesn't offer a straightforward way to do file IO operations in blocking mode (that I know of), but it's possible to implement one yourself. For example:

import (
	"os"
	"sort"
	"syscall"
)

// OpenNoPoll opens name with a plain blocking syscall.Open. Because the fd is
// not in non-blocking mode, os.NewFile leaves it off the runtime poller, so
// IO on it never goes through epoll.
func OpenNoPoll(name string, mode int, perm uint32) (*os.File, error) {
	fd, err := syscall.Open(name, mode|syscall.O_CLOEXEC, perm)
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fd), name), nil
}

// ReadDirNoPoll is a poll-free replacement for os.ReadDir.
func ReadDirNoPoll(name string) ([]os.DirEntry, error) {
	f, err := OpenNoPoll(name, syscall.O_RDONLY, 0)
	if err != nil {
		return nil, err
	}
	defer func() { _ = f.Close() }()

	dirs, err := f.ReadDir(-1)
	if err != nil {
		return nil, err
	}

	// os.ReadDir sorts its result, so do the same here.
	sort.Slice(dirs, func(i, j int) bool { return dirs[i].Name() < dirs[j].Name() })
	return dirs, nil
}

It's messy, but it works.
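
In a test this simply replaces the corresponding os call; for example (mountPoint and t are assumed to come from the surrounding test):

entries, err := ReadDirNoPoll(mountPoint) // instead of os.ReadDir
if err != nil {
	t.Fatal(err)
}
for _, e := range entries {
	t.Log(e.Name())
}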

Option 2: Run FUSE server and test in different processes.

Simple enough to do, but I find it not as convenient to debug.

Option 3: @rfjakob hack

The workaround posted by @rfjakob consists of: upon initialization, create a special file in the backed filesystem, force a poll on that file, and immediately answer it with ENOSYS. This tells FUSE (kernel side) that the backed filesystem doesn't support poll, so subsequent poll calls to the filesystem return immediately from the kernel, without ever reaching the FUSE server.
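
The triggering half of that hack is an ordinary poll(2) on the magic file. A minimal sketch of the idea, assuming a hypothetical ".poll-hack" file that the server recognizes and answers with ENOSYS (see go-fuse's poll.go for the real implementation):

import "golang.org/x/sys/unix"

// forcePollHack teaches the kernel that this filesystem doesn't support poll.
// The server must answer the FUSE_POLL request for this file with ENOSYS; the
// kernel caches that per connection and answers all later polls itself.
func forcePollHack(mountPoint string) error {
	fd, err := unix.Open(mountPoint+"/.poll-hack", unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	pfds := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
	_, err = unix.Poll(pfds, 0) // timeout 0: only the side effect matters
	return err
}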

Although this hack is very effective, I believe it to be a bit of overkill. This deadlock, as I understand it, only happens when the poll (to the backed filesystem) originates from the same process as the FUSE server, which seems to be a very sketchy thing to do (except in a test).

Thanks very much for the analysis!

I think I would actually prefer option 2 (server and test in separate processes), as it most closely mirrors how jacobsa/fuse will be run in practice.

The downside of option 2 is that the tests won't be able to mock out parts of the filesystem or do whitebox testing that examines the filesystem's state.

Update: I have since switched my project to run the file system under test in a separate process. The reason is that I could never really remove all sources of poll, and changes to the system/test would randomly cause the tests to start freezing again. The downside is that debugging is more cumbersome, as noted above: debugging from the IDE doesn't "just work" anymore; instead I have to "connect to remote process", which is not as convenient. Nevertheless, it works. Below is a snippet of code that shows how to run the file system in a separate process in a test.

import (
	"io"
	"os"
	"os/exec"
	"sync"
)

var daemon *exec.Cmd

func SetupTest() {
	daemon = exec.Command("path/to/daemon/binary")
	daemon.Stdout = os.Stdout
	// This assumes you have set the FUSE DebugLogger to stderr. Change accordingly.
	stderr := newMatchingWriter(os.Stderr)
	daemon.Stderr = stderr
	if err := daemon.Start(); err != nil {
		panic(err)
	}
	// We have to wait for the file system to actually mount before proceeding
	// with the test, to avoid race conditions.
	stderr.Wait()
}


// matchNext drives the matcher below. Index 0 is a placeholder that is never
// compared: state counts how many interesting bytes have been matched so far,
// and matchNext[state+1] is the next byte expected.
var matchNext = []byte("*initOK ()")

// initLen is the state reached once the full "init" prefix has been seen; the
// ".*" in the regular expression below lets arbitrary bytes follow it.
const initLen = 4

// matchingWriter is an io.Writer that checks whether a string matching the regular expression
// "init.*OK \(\)" has been written to it. All writes are forwarded to delegate.
// (The matcher is approximate, not full KMP, which is good enough for the
// fixed log line it looks for.)
type matchingWriter struct {
	delegate io.Writer
	state    int
	cond     *sync.Cond
	mu       *sync.Mutex
}

// newMatchingWriter returns a new matchingWriter that forwards writes to the given delegate.
func newMatchingWriter(delegate io.Writer) *matchingWriter {
	mu := &sync.Mutex{}
	return &matchingWriter{
		delegate: delegate,
		state:    0,
		cond:     sync.NewCond(mu),
		mu:       mu,
	}
}

func (writer *matchingWriter) Write(p []byte) (n int, err error) {
	writer.mu.Lock()
	for b := 0; b < len(p) && writer.state < len(matchNext)-1; b++ {
		if p[b] == matchNext[writer.state+1] {
			writer.state++
			continue
		}
		// Mismatch: fall back to the start, or to just after "init" if that
		// prefix has already been seen, then re-examine the current byte so
		// inputs like "iinit" or "OOK ()" still match.
		if writer.state < initLen {
			writer.state = 0
		} else {
			writer.state = initLen
		}
		if p[b] == matchNext[writer.state+1] {
			writer.state++
		}
	}
	if writer.state == len(matchNext)-1 {
		writer.cond.Broadcast()
	}
	writer.mu.Unlock()
	return writer.delegate.Write(p)
}

// Wait blocks until a string matching the regular expression "init.*OK \(\)"
// has been written to this matchingWriter.
func (writer *matchingWriter) Wait() {
	writer.mu.Lock()
	defer writer.mu.Unlock()
	for writer.state != len(matchNext)-1 {
		writer.cond.Wait()
	}
}
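
For completeness, the matching teardown might look like this (hypothetical, assuming the daemon unmounts and exits cleanly on SIGINT):

func TeardownTest() {
	// Ask the daemon to unmount and exit, then reap the process.
	_ = daemon.Process.Signal(os.Interrupt)
	_ = daemon.Wait()
}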