lni/dragonboat

pprof --goroutine count gradually increases (even with the dragonboat-example)

wh-afra opened this issue · 5 comments

The lib is very useful, thank you for sharing it.
At present I have run into a problem, and I cannot tell whether it is normal.
With pprof deployed in my environment, I notice that the goroutine count gradually increases, even when no data transfer is going on. As time goes on, the number of goroutines keeps growing.
I also tried the dragonboat-example (ondisk), and the same thing happened.
I tracked down where the goroutines accumulate, as follows:
173 @ 0x43a596 0x406d1b 0x406818 0xab3870 0x52bfe5 0x46b3a1

0xab386f github.com/lni/dragonboat/v4/internal/transport.(*TCP).Start.func2.2+0x2f d:/GOPATH/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20220604122422-e23d27bb8ff4/internal/transport/tcp.go:476

0x52bfe4 github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1+0xc4 d:/GOPATH/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/syncutil/stopper.go:79

It seems that many transport connections cannot be stopped normally?

Dragonboat version

github.com/lni/dragonboat/v4@v4.0.0-20220604122422-e23d27bb8ff4

Expected behavior

Actual behavior

When pprof is deployed, the goroutine count gradually increases.

Steps to reproduce the behavior

Use the dragonboat-example https://github.com/lni/dragonboat-example/tree/master/ondisk
and deploy pprof.
BTW, I ran the program on different machines (the OS is Ubuntu).

Could you take a moment to look at this? Is it a potential bug, and if so, how should it be handled?
Thanks a lot.

@lni I tried to modify the code to handle the problem; could you take some time to look at the commit
https://github.com/wh-afra/dragonboat-1/commit/718dee75997d185d024dc0ad067175bb8981973b ?
Thank you.

lni commented

What do you mean by the number of goroutines keeps increasing? Each TCP connection has its own goroutine; do you have anything showing that the goroutine never gets stopped after the TCP connection is done?

Your change above is obviously incorrect. Please read the code first to see what that goroutine is used for.

lni commented

Your proposed change above is obviously wrong, but you are right that it will leak the helper goroutine. I think the fix should be something more like the code below. Care to change your program, prepare a couple of tests, and send in a PR?

Thanks for raising the bug.

var once sync.Once // shared by both exit paths below
connCloseCh := make(chan struct{})
closeFn := func() {
	once.Do(func() {
		// wake the watcher goroutine below, without blocking
		// if it has already been released
		select {
		case connCloseCh <- struct{}{}:
		default:
		}
		if err := conn.Close(); err != nil {
			plog.Errorf("failed to close the connection %v", err)
		}
	})
}
// watcher: closes the connection on shutdown, and exits once the
// connection is done so the goroutine itself is not leaked
t.connStopper.RunWorker(func() {
	select {
	case <-t.stopper.ShouldStop():
	case <-connCloseCh:
	}
	closeFn()
})
t.connStopper.RunWorker(func() {
	t.serveConn(conn)
	closeFn()
})

Thanks for your reply. Yes, my modification was just a workaround. I will try your solution and run some tests.

lni commented

this has been fixed in cda0760