peterbourgon/diskv

completeFilename causes invalid memory address or nil pointer dereference

Closed this issue · 11 comments

Hi

We're running a component in Kubernetes that uses diskv under the hood. The problem is that the process occasionally crashes when it attempts to remove a key from the store. Here is the relevant stack trace:

github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).Erase(0xc42041c2d0, 0x0, 0x1b, 0x0, 0x0) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:409 +0xe7
github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).completeFilename(0xc42041c2d0, 0x0, 0x1b, 0x1b, 0x27fdf00) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:525 +0x98
path/filepath.Join(0xc42110b970, 0x2, 0x2, 0xc4208e18c0, 0x11) /usr/lib/go/src/path/filepath/path.go:210
path/filepath.join(0xc42110b970, 0x2, 0x2, 0x0, 0x0) /usr/lib/go/src/path/filepath/path_unix.go:45 +0x96
strings.Join(0xc42110b970, 0x2, 0x2, 0x18d6236, 0x1, 0xc42110b918, 0x2) /usr/lib/go/src/strings/strings.go:424

The data directory is mounted as a regular host path (/opt/spm/agent) and file names are ksuid-compatible identifiers.

diskv is initialized with the following configuration:

d := diskv.New(diskv.Options{
	BasePath:     c.Dir,
	Transform:    func(s string) []string { return []string{} },
	CacheSizeMax: 1024 * 1024,
})
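
For additional context, the component uses the store roughly like this (a simplified sketch; persistAndAck, key, and payload are placeholders, not our actual code):

// Simplified sketch of how the journal uses the store (names are placeholders).
func persistAndAck(d *diskv.Diskv, key string, payload []byte) error {
	// Persist the event under its ksuid-derived key.
	if err := d.Write(key, payload); err != nil {
		return fmt.Errorf("journal write for key %q: %v", key, err)
	}
	// Once the event has been acknowledged downstream, drop it from the store.
	return d.Erase(key)
}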

Do you have any pointers or ideas why this would happen?

@peterbourgon Do you have any smart ideas or suggestions where I should start looking in the code to find the root cause?

That stack trace identifies the call stack, but it doesn't give the specific error that gets bubbled up. Do you have that?

All I have is the panic error message:

panic: runtime error: invalid memory address or nil pointer dereference

And which version of diskv?

I'm using the latest version: 2.0.1

Hmm. There is not much to go on. According to your stack trace, the nil pointer dereference is in the stdlib filepath.Join, and I'm not sure how that makes sense. Does this happen when you pass an empty key, for example? Can you paste the complete verbatim crash log?

Yeah, the error is quite bizarre. We had 5 crashes of the k8s pod during the last 2 days. Here is the complete stack trace (with slight naming modifications to avoid exposing private data):

panic: runtime error: invalid memory address or nil pointer dereference

goroutine 16 [running]:
github.com/org/agent/outputs/lb.(*Output).Publish(0xc4204391e0, 0x1a8da00, 0xc420e72600, 0x0, 0x0) /home/usr/agent/src/github.com/org/agent/outputs/lb/lb.go:157 +0xa5d
github.com/org/agent/cmd/agent/app/pipeline.(*eventConsumer).ackJournal(0xc420690360, 0xc420e72600) /home/usr/agent/src/github.com/org/agent/cmd/agent/app/pipeline/consumer.go:211 +0xdd
github.com/org/agent/cmd/agent/app/journal/kv.(*dkv).ACK(0xc4202fa5a0, 0x0, 0x1b, 0x1a61d80, 0xc4200d9930) /home/usr/agent/src/github.com/org/agent/cmd/agent/app/journal/kv/kv.go:204 +0x42
github.com/org/agent/vendor/github.com/peterbourgon/diskv.(*Diskv).Erase(0xc4201d5050, 0x0, 0x1b, 0x0, 0x0) /home/usr/agent/src/github.com/org/agent/vendor/github.com/peterbourgon/diskv/diskv.go:409 +0xe7
github.com/org/agent/vendor/github.com/peterbourgon/diskv.(*Diskv).completeFilename(0xc4201d5050, 0x0, 0x1b, 0x1b, 0x27fcf00) /home/usr/agent/src/github.com/org/agent/vendor/github.com/peterbourgon/diskv/diskv.go:525 +0x98
path/filepath.Join(0xc42110b970, 0x2, 0x2, 0xc4208e18c0, 0x11) /usr/lib/go/src/path/filepath/path.go:210
path/filepath.join(0xc42110b970, 0x2, 0x2, 0x0, 0x0) /usr/lib/go/src/path/filepath/path_unix.go:45 +0x96
strings.Join(0xc42110b970, 0x2, 0x2, 0x18d6236, 0x1, 0xc42110b918, 0x2) /usr/lib/go/src/strings/strings.go:424

I already tried to reproduce it by passing an empty key, but the crash doesn't occur. For what it's worth, I also identified the exact place in the Go stdlib (path_unix.go) where it panics:

func join(elem []string) string {
	// If there's a bug here, fix the logic in ./path_plan9.go too.
	for i, e := range elem {
		if e != "" {
			return Clean(strings.Join(elem[i:], string(Separator))) // <- panics here
		}
	}
	return ""
}
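
For completeness, the empty-key reproduction attempt was roughly this (a minimal sketch using a temp dir rather than the real data dir):

d := diskv.New(diskv.Options{
	BasePath:     os.TempDir(),
	Transform:    func(s string) []string { return []string{} },
	CacheSizeMax: 1024 * 1024,
})
// No panic here in my tests; Erase simply returns, possibly with an error.
if err := d.Erase(""); err != nil {
	log.Println("erase of empty key:", err)
}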

The last thing that occurs to me is that this could be something k8s-specific, since the data dir is mapped to a host volume inside a DaemonSet.

I feel safe asserting that the crash in strings.Join is a red herring; the problem is elsewhere. If it's happening in a Kubernetes volume mount, my guess is something at the filesystem layer. My suggestion would be to come up with a patch for Erase that logs a bunch of debug information, have you run that build for a day or two, and see what it says when it goes kaput. Would that work?
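
Something along these lines, perhaps, wrapping the Erase call at the call site and re-panicking after logging (a rough sketch; adapt the log fields to whatever state you have at hand):

// debugErase logs the key before calling Erase, and dumps a stack trace if it panics.
func debugErase(d *diskv.Diskv, key string) error {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("diskv.Erase panicked: key=%q len=%d panic=%v\n%s",
				key, len(key), r, debug.Stack())
			panic(r) // re-panic so the crash is still visible
		}
	}()
	log.Printf("diskv.Erase: key=%q len=%d", key, len(key))
	return d.Erase(key)
}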

Sounds good. Let me reach out to you again after I gather some relevant debug info.
Thanks!

Hi @peterbourgon

The conclusion we reached is that a nasty race condition in our code is making the key invalid (the runtime attempts to dereference a nil string pointer), which leads to the crash. I'll close this issue since it doesn't seem to be related to diskv.
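
For anyone who hits something similar: a concurrent, unsynchronized write to a string field can be observed by a reader as a torn header (nil data pointer, nonzero length), which is consistent with the 0x0, 0x1b key arguments in the Erase frame above (0x1b = 27, the length of a ksuid). A hypothetical illustration of that class of bug, not our actual code, which go run -race flags immediately:

package main

import (
	"path/filepath"
	"strings"
	"sync"
)

// journalEntry stands in for any struct whose string field is shared
// between goroutines without synchronization.
type journalEntry struct {
	key string
}

func main() {
	e := &journalEntry{key: strings.Repeat("x", 27)}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer: replaces the key concurrently
		defer wg.Done()
		for i := 0; i < 1000000; i++ {
			e.key = strings.Repeat("y", 27)
		}
	}()
	go func() { // reader: may observe a torn string header
		defer wg.Done()
		for i := 0; i < 1000000; i++ {
			_ = filepath.Join("/opt/spm/agent", e.key)
		}
	}()
	wg.Wait()
}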

Thanks for the support.