sasha-s/go-deadlock

Question: potential deadlock with goroutine stuck on internal lock?

Thegaram opened this issue · 6 comments

Hey, first of all thanks for the hard work on this great lib!

I'm having trouble interpreting the output below. It suggests goroutine 77264 holds lock 0xc4202a60e0 for a long time, preventing others (like goroutine 77325 and many more) from acquiring it.

However, the stack trace shows that goroutine 77264 actually got stuck during unlock: raft.go:688 is a deferred mu.Unlock(), and deadlock.go:330 is a lock-acquire statement inside this lib.

Does this mean that the (potential) deadlock is coming from this lib in this case? What would make goroutine 77264 get stuck on that internal lock? (I reproduced the same output with a 1m30s lock timeout.)
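To illustrate what I mean, here is a rough sketch of the pattern the trace below seems to show (my own simplification, not this library's actual code): the instrumented Unlock does its bookkeeping under a shared internal mutex, so Unlock itself can block while the user-level lock is still held.

package sketch

import "sync"

// Shared bookkeeping state, analogous to the detector's internal l.mu
// (the names here are mine, not the library's).
var bookkeeping sync.Mutex

// Mutex stands in for an instrumented mutex wrapper.
type Mutex struct {
	mu sync.Mutex
}

func (m *Mutex) Lock() {
	m.mu.Lock()
	bookkeeping.Lock() // record "previous place where the lock was grabbed"
	// ... update shared state ...
	bookkeeping.Unlock()
}

func (m *Mutex) Unlock() {
	bookkeeping.Lock() // <- Unlock can block here (where deadlock.go:330 sits in the trace)
	// ... update shared state ...
	bookkeeping.Unlock()
	m.mu.Unlock() // the user-level lock is released only after the bookkeeping
}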

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 77264 lock 0xc4202a60e0
../raft/raft.go:618 raft.(*Raft).AppendEntriesRPCHandler { rf.mu.Lock() } <<<<<
/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:509 runtime.call32 { CALLFN(·call32, 32) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:434 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:302 reflect.Value.Call { return v.call("Call", in) }
../labrpc/labrpc.go:478 labrpc.(*Service).dispatch { function.Call([]reflect.Value{svc.rcvr, args.Elem(), replyv}) }
../labrpc/labrpc.go:402 labrpc.(*Server).dispatch { return service.dispatch(methodName, req) }
../labrpc/labrpc.go:229 labrpc.(*Network).ProcessReq.func1 { r := server.dispatch(req) }

Have been trying to lock it again for more than 30s
goroutine 77325 lock 0xc4202a60e0
../raft/raft.go:618 raft.(*Raft).AppendEntriesRPCHandler { rf.mu.Lock() } <<<<<
/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:509 runtime.call32 { CALLFN(·call32, 32) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:434 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:302 reflect.Value.Call { return v.call("Call", in) }
../labrpc/labrpc.go:478 labrpc.(*Service).dispatch { function.Call([]reflect.Value{svc.rcvr, args.Elem(), replyv}) }
../labrpc/labrpc.go:402 labrpc.(*Server).dispatch { return service.dispatch(methodName, req) }
../labrpc/labrpc.go:229 labrpc.(*Network).ProcessReq.func1 { r := server.dispatch(req) }

Here is what goroutine 77264 doing now
goroutine 77264 [semacquire]:
sync.runtime_SemacquireMutex(0xc420084744, 0xa900000000)
    /usr/local/Cellar/go/1.9/libexec/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420084740)
    /usr/local/Cellar/go/1.9/libexec/src/sync/mutex.go:134 +0x14c
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc420084740, 0x1288e00, 0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:330 +0x3f
github.com/sasha-s/go-deadlock.postUnlock(0x1288e00, 0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:167 +0x5f
github.com/sasha-s/go-deadlock.(*Mutex).Unlock(0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:97 +0x7f
raft.(*Raft).AppendEntriesRPCHandler(0xc4202a60e0, 0xc420beb840, 0xc421834620)
    .../src/raft/raft.go:688 +0xcc8

...

Hi.
Looking at the output, it seems you are right.
Somehow the internal l.mu got locked, so we are stuck on deadlock.go:330.
I do not see how that could happen.
I would look at a dump of the other goroutines to see if any of them are doing something in deadlock.go (while holding l.mu).
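If it helps, one way to grab such a dump automatically at the moment the report fires is to hook Opts.OnPotentialDeadlock (rough sketch; note that this replaces the default handler, which exits the process):

package main

import (
	"os"
	"runtime/pprof"

	"github.com/sasha-s/go-deadlock"
)

func main() {
	// Print the stacks of all goroutines (not just the two in the report)
	// whenever a potential deadlock is detected.
	deadlock.Opts.OnPotentialDeadlock = func() {
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}

	// ... rest of the program ...
}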
Or, if you have something reproducible, I'd love to look at it.

I found one bug in the library and hopefully fixed it in 1595213.
Not sure it will solve your problem, but could you pull and try again?

I just implemented a deadlock detector with easy-to-prove correctness:

https://github.com/zhiqiangxu/util/tree/master/deadlock

I'm facing the same problem while debugging Mercure:

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 5212824 lock 0xc0001aa780
hub/bolt_transport.go:120 hub.(*BoltTransport).Dispatch { t.Lock() } <<<<<
hub/subscribe.go:201 hub.(*Hub).dispatchSubscriptionUpdate { h.transport.Dispatch(u) }
hub/subscribe.go:180 hub.(*Hub).shutdown { h.dispatchSubscriptionUpdate(s, false) }
hub/subscribe.go:49 hub.(*Hub).SubscribeHandler { return }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 http.HandlerFunc.ServeHTTP { f(w, r) }
../go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 mux.(*Router).ServeHTTP { handler.ServeHTTP(w, req) }
../go/pkg/mod/github.com/gorilla/handlers@v1.4.2/cors.go:54 handlers.(*cors).ServeHTTP { ch.h.ServeHTTP(w, r) }
../go/pkg/mod/github.com/unrolled/secure@v1.0.8/secure.go:177 secure.(*Secure).Handler.func1 { h.ServeHTTP(w, r) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 http.HandlerFunc.ServeHTTP { f(w, r) }
../go/pkg/mod/github.com/gorilla/handlers@v1.4.2/logging.go:45 handlers.loggingHandler.ServeHTTP { h.handler.ServeHTTP(logger, req) }
../go/pkg/mod/github.com/gorilla/handlers@v1.4.2/recovery.go:78 handlers.recoveryHandler.ServeHTTP { h.handler.ServeHTTP(w, req) }
../go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 mux.(*Router).ServeHTTP { handler.ServeHTTP(w, req) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2807 http.serverHandler.ServeHTTP { handler.ServeHTTP(rw, req) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:1895 http.(*conn).serve { serverHandler{c.server}.ServeHTTP(w, w.req) }

Have been trying to lock it again for more than 30s
goroutine 5373639 lock 0xc0001aa780

Here is what goroutine 5212824 doing now
goroutine 5212824 [semacquire]:
sync.runtime_SemacquireMutex(0xc00000ee24, 0x0, 0x1)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00000ee20)
	/usr/local/Cellar/go/1.14.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/Cellar/go/1.14.3/libexec/src/sync/mutex.go:81
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc00000ee20, 0x16feaa0, 0xc0001aa780)
	/Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:330 +0xa6
github.com/sasha-s/go-deadlock.postUnlock(...)
	/Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:167
github.com/sasha-s/go-deadlock.(*RWMutex).Unlock(0xc0001aa780)
	/Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:129 +0x66
github.com/dunglas/mercure/hub.(*BoltTransport).Dispatch(0xc0001aa780, 0xc001522600, 0x0, 0x0)
	/Users/dunglas/workspace/mercure/hub/bolt_transport.go:133 +0x251
github.com/dunglas/mercure/hub.(*Hub).dispatchSubscriptionUpdate(0xc000298420, 0xc002c1f5e0, 0xc000f81000)
	/Users/dunglas/workspace/mercure/hub/subscribe.go:201 +0x2f2
github.com/dunglas/mercure/hub.(*Hub).shutdown(0xc000298420, 0xc002c1f5e0)
	/Users/dunglas/workspace/mercure/hub/subscribe.go:180 +0x4c
github.com/dunglas/mercure/hub.(*Hub).SubscribeHandler(0xc000298420, 0x2bb4068, 0xc00102f260, 0xc000fae900)
	/Users/dunglas/workspace/mercure/hub/subscribe.go:49 +0x66e
net/http.HandlerFunc.ServeHTTP(0xc000280ea0, 0x2bb4068, 0xc00102f260, 0xc000fae900)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001dc180, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 +0xe2
github.com/gorilla/handlers.(*cors).ServeHTTP(0xc0001942d0, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/handlers@v1.4.2/cors.go:54 +0x102a
github.com/unrolled/secure.(*Secure).Handler.func1(0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/unrolled/secure@v1.0.8/secure.go:177 +0xdf
net/http.HandlerFunc.ServeHTTP(0xc00000ff80, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 +0x44
github.com/gorilla/handlers.loggingHandler.ServeHTTP(0x18849a0, 0xc000010020, 0x1885560, 0xc00000ff80, 0x17b87b0, 0x18907a0, 0xc002c1f500, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/handlers@v1.4.2/logging.go:45 +0x243
github.com/gorilla/handlers.recoveryHandler.ServeHTTP(0x18850a0, 0xc000298de0, 0x1883d00, 0xc0002a1b20, 0x1, 0x18907a0, 0xc002c1f500, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/handlers@v1.4.2/recovery.go:78 +0xce
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001dc0c0, 0x18907a0, 0xc002c1f500, 0xc000a19e00)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 +0xe2
net/http.serverHandler.ServeHTTP(0xc0001c62a0, 0x18907a0, 0xc002c1f500, 0xc000a19e00)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2807 +0xa3
net/http.(*conn).serve(0xc0035ebcc0, 0x1892f60, 0xc002980200)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:1895 +0x86c
created by net/http.(*Server).Serve
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2933 +0x35c

I'm using the same version of the library. Thanks for writing such a useful tool, btw!

v0.2.1 (1595213)

This blocking occurs in the caching layer, which is heavily loaded at about 100,000 requests per second.
The issue does not appear immediately: sometimes it takes up to a week, and sometimes it shows up the day after a restart.

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 1595446618 lock 0xc00048e7d0
<project>/pkg/stracing/mutex.go:48 stracing.(*Mutex).Lock ??? <<<<<
...

Have been trying to lock it again for more than 5m0s
goroutine 1595163670 lock 0xc00048e7d0

Here is what goroutine 1595446618 doing now
goroutine 1595446618 [semacquire]:
sync.runtime_SemacquireMutex(0xc0001108a4, 0xd00, 0x1)
 runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0001108a0)
 sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
 sync/mutex.go:81
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc0001108a0, 0xc54900, 0xc00048e7d0)
 github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:330 +0xa6
github.com/sasha-s/go-deadlock.postUnlock(...)
 github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:167
github.com/sasha-s/go-deadlock.(*Mutex).Unlock(0xc00048e7d0)
 github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:97 +0x65
...

In the reproduction below the timeout is very small, while in my real code it is 5 minutes.

package main

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

func main() {
	deadlock.Opts.DeadlockTimeout = 1 * time.Second

	var mx deadlock.Mutex

	fn := func() {
		mx.Lock()
		mx.Unlock()
	}

	for i := 0; i < 100_000; i++ {
		go func() {
			for {
				fn()
			}
		}()
	}

	select {}
}

Perhaps the problem is not in this library but somewhere else. In my real case, though, the mutex only protected counting the elements of a slice and appending to it.

// code copied from https://github.com/jaegertracing/jaeger/blob/748bf213d1e5a03ebc4da63601473be3bbe193f4/examples/hotrod/pkg/tracing/mutex.go with my fixes
type Mutex struct {
	...
	waiters     []string
	waitersLock deadlock.Mutex // <--
}

func (sm *Mutex) Lock(ctx context.Context, name string) {
	activeSpan := opentracing.SpanFromContext(ctx) // "activeSpan == noopSpan{}" in production

	sm.waitersLock.Lock() // <-- "Previous place where the lock was grabbed"
	if waiting := len(sm.waiters); waiting > 0 && activeSpan != nil {
		activeSpan.LogFields(...) // <-- func (n noopSpan) LogFields(fields ...log.Field) {}
	}
	sm.waiters = append(sm.waiters, name)
	sm.waitersLock.Unlock() // <-- "Have been trying to lock it again for more than 5m0s"

	sm.realLock.Lock()
	sm.holder = name

	sm.waitersLock.Lock()
	behindLen := len(sm.waiters) - 1
	sm.waitersLock.Unlock()

	if activeSpan != nil {
		activeSpan.LogFields(...)
	}
}
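
In the meantime I am considering a workaround along these lines, so the detector does not get in the way of the heavily loaded service (just a sketch; Opts.DeadlockTimeout is the knob I already tune, and if I remember correctly there is also an Opts.Disable switch meant for production):

package main

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

func init() {
	// Option 1: give heavily contended locks much more headroom
	// before a "potential deadlock" report is produced.
	deadlock.Opts.DeadlockTimeout = 30 * time.Minute

	// Option 2: turn detection off entirely in production builds,
	// so the wrappers behave like plain sync mutexes.
	// deadlock.Opts.Disable = true
}

func main() {
	// ... the real service ...
}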