gcash/bchd

Max RPC clients exceeded

infertux opened this issue · 12 comments

I'm regularly bumping into this error:

[INF] RPCS: Max RPC clients exceeded [10] - disconnecting client 127.0.0.1:XXXX

I'm the only one connecting to the gRPC interface and I'm pretty sure I never actually hit the max limit. I tried to bump it to 100 and got the same error after a few days. I suspect bchd doesn't decrement the client count in some cases. Is there a way to query bchd for the client list so I could double check?

This is particularly annoying because bchd stops syncing the blockchain when that happens - which seems like another bug(?) - and I have to restart it to get things going again.

Hmm, the gRPC interface should never throw this error; there is no maximum number of connections on the public gRPC API. This error only comes from the JSON-RPC interface. The way it's tracked is super simple. Perhaps you have HTTP keepalives enabled in your client and the connections are persisting for longer than you expect? Which RPC calls are you making?

The relevant code in rpcserver.go:

	rpcServeMux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")

		// Limit the number of connections to max allowed.
		if s.limitConnections(w, r.RemoteAddr) {
			return
		}

		// Keep track of the number of connected clients.
		s.incrementClients()
		defer s.decrementClients()

		_, isAdmin, err := s.checkAuth(r, true)
		if err != nil {
			jsonAuthFail(w)
			return
		}

		// Read and respond to the request.
		s.jsonRPCRead(w, r, isAdmin)
	})
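
For context, the helpers that handler calls are just a shared atomic counter, roughly like this (paraphrased from the btcd code bchd is forked from; the exact bchd version may differ slightly):

	// limitConnections responds with a 503 and returns true if adding another
	// client would exceed the maximum allowed RPC clients.
	func (s *rpcServer) limitConnections(w http.ResponseWriter, remoteAddr string) bool {
		if int(atomic.LoadInt32(&s.numClients)+1) > cfg.RPCMaxClients {
			rpcsLog.Infof("Max RPC clients exceeded [%d] - "+
				"disconnecting client %s", cfg.RPCMaxClients, remoteAddr)
			http.Error(w, "503 Too busy.  Try again later.",
				http.StatusServiceUnavailable)
			return true
		}
		return false
	}

	// incrementClients adds one to the number of connected RPC clients.
	func (s *rpcServer) incrementClients() {
		atomic.AddInt32(&s.numClients, 1)
	}

	// decrementClients subtracts one from the number of connected RPC clients.
	func (s *rpcServer) decrementClients() {
		atomic.AddInt32(&s.numClients, -1)
	}

Each request bumps the counter on the way in and drops it when the handler returns.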

perhaps you have HTTP keepalives on in your client, and the connections are persisting for longer than you are expecting

You're spot on! Fulcrum is querying this bchd instance, and it uses HTTP keepalives. Actually, this may be related to this old issue I opened about bchd compatibility with Fulcrum.

So it sounds like Fulcrum isn't closing HTTP connections properly? Any ideas how I could investigate this? I'm not very good with C++.

Ahh, so you are making a short request to Fulcrum, it forwards the call to BCHD, and then the request completes but the connection isn't closed? Usually with Fulcrum you want keepalives, since it is constantly polling for new blocks and data. Ideally it should use just one connection for all BCHD calls.

What exact calls are you making to fulcrum? We probably want to loop in @cculianu!

Hmm. Strange.

Just to chime in: Fulcrum uses a pool of 3 connections by default, and it issues requests in parallel when it can (because this performs better on BCHN and BU) -- but it definitely closes them if they go down or whatever, and re-creates the socket(s) in that case after having killed them/shut them down. I don't see such connection leaks on other bitcoin daemons (BCHN, BU, and Bitcoin Core are what I regularly test with).

Hmm.

What exact calls are you making to fulcrum?

@zquestz I just run one of the Electron Cash public servers using Fulcrum with bchd as backend. I'm not making any custom calls.

Here are the logs from bchd and Fulcrum. Notice the 13-minute gap with no output from bchd, which seems very suspicious to me. Restarting bchd fixes the problem (Fulcrum then reconnects to bchd automatically), but it consistently runs into the same error again after a few days.

bchd log:

Feb 04 14:38:34 bchd[579]: 2021-02-04 14:38:34.103 [WRN] SYNC: No sync peer candidates available                                                                      
Feb 04 14:38:39 bchd[579]: 2021-02-04 14:38:39.420 [INF] SYNC: New valid peer 54.153.49.188:35328 (inbound) (/bitcoin-seeder:0.01/)                                   
Feb 04 14:38:40 bchd[579]: 2021-02-04 14:38:40.564 [INF] SYNC: Lost peer 54.153.49.188:35328 (inbound)                                                                
Feb 04 14:51:56 bchd[579]: 2021-02-04 14:51:56.432 [INF] RPCS: Max RPC clients exceeded [50] - disconnecting client 127.0.0.1:51942                                   
Feb 04 14:52:00 bchd[579]: 2021-02-04 14:52:00.431 [INF] RPCS: Max RPC clients exceeded [50] - disconnecting client 127.0.0.1:51948                                   
Feb 04 14:52:01 bchd[579]: 2021-02-04 14:52:01.431 [INF] RPCS: Max RPC clients exceeded [50] - disconnecting client 127.0.0.1:51952

fulcrum log:

Feb 04 14:38:57 Fulcrum[552]: [2021-02-04 14:38:57.427] <BitcoinD.1> Lost connection to bitcoind, will retry every 5 seconds ...                                      
Feb 04 14:38:57 Fulcrum[552]: [2021-02-04 14:38:57.427] <Task.GetChainInfo> 1017323: FAIL: bitcoind connection lost                                                   
Feb 04 14:38:57 Fulcrum[552]: [2021-02-04 14:38:57.428] <Controller> Failed to synch blocks and/or mempool

/cc @cculianu

Hmm yeah -- probably a leak on the bchd side. Maybe some sockets/connection objects/whatever are kept around and never gc'd?

It got stuck again, and strace -p $(pidof bchd) shows nothing but futex syscalls, so it does look like we're stuck waiting on some lock. How can I find out where that is in the code, though?

futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00ec7c948, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00ec7c948, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0003c0148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000051148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x14c88c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000051148, FUTEX_WAKE_PRIVATE, 1) = 1
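
If bchd still carries btcd's --profile option (an assumption on my part, since bchd is a btcd fork), a full goroutine dump from the pprof listener would show exactly where everything is blocked. A minimal sketch, assuming profiling was enabled on a hypothetical port 6061:

	// goroutinedump.go: fetch a full goroutine stack dump from a running bchd,
	// assuming it was started with a btcd-style --profile=6061 option so the
	// net/http/pprof handlers are served on that port (hypothetical setup).
	package main

	import (
		"io"
		"net/http"
		"os"
	)

	func main() {
		// debug=2 asks pprof for unabridged stack traces of every goroutine,
		// which shows which lock or channel each one is blocked on.
		resp, err := http.Get("http://127.0.0.1:6061/debug/pprof/goroutine?debug=2")
		if err != nil {
			os.Stderr.WriteString(err.Error() + "\n")
			os.Exit(1)
		}
		defer resp.Body.Close()
		io.Copy(os.Stdout, resp.Body)
	}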

@cculianu I thought we tested a full sync with Fulcrum and it worked fine for you? I would like to get to the bottom of this. I remember we added the ping, and that made syncing work smoothly.

@infertux so it works fine for days and then all of a sudden has an issue? We are actually looking into a syncing issue we have seen that is similar to this. Hopefully when we address that, this locking issue will get resolved as well.

Also note, it doesn't limit the number of connections, but the number of in-flight RPC calls. BCHD didn't support keepalives before. When a request is sent, the counter is incremented, and when the request completes, it is decremented. The logic is super simple. However, with keepalives, if you send 50 requests and they are all still pending, you will get the error you are seeing.

I suspect there is a locking issue, and when that happens the RPC stops responding but Fulcrum keeps sending requests. Then the server gets overloaded by the backlog.

I would be curious whether getblock calls to the RPC work from another host, or whether they hang! If it is a locking issue, then another client should get blocked too. Any chance you can test, @infertux?
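
Something along these lines would show whether the JSON-RPC interface is still answering. A rough sketch only: the address and credentials are placeholders, and it skips TLS verification for brevity (load the node's rpc.cert for anything beyond a quick test):

	// rpccheck.go: quick JSON-RPC liveness probe against bchd from another host.
	package main

	import (
		"bytes"
		"crypto/tls"
		"encoding/json"
		"fmt"
		"io"
		"net/http"
		"time"
	)

	const rpcURL = "https://192.168.1.10:8334" // placeholder address

	var client = &http.Client{
		// Short timeout so a wedged server shows up as an error, not a hang.
		Timeout: 15 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// call issues a single JSON-RPC request and returns the raw result field.
	func call(method string, params ...interface{}) (json.RawMessage, error) {
		if params == nil {
			params = []interface{}{}
		}
		payload, err := json.Marshal(map[string]interface{}{
			"jsonrpc": "1.0",
			"id":      "check",
			"method":  method,
			"params":  params,
		})
		if err != nil {
			return nil, err
		}
		req, err := http.NewRequest("POST", rpcURL, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.SetBasicAuth("rpcuser", "rpcpass") // placeholder credentials
		req.Header.Set("Content-Type", "application/json")

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, err
		}
		var reply struct {
			Result json.RawMessage `json:"result"`
			Error  json.RawMessage `json:"error"`
		}
		if err := json.Unmarshal(body, &reply); err != nil {
			return nil, fmt.Errorf("unexpected response %q: %v", body, err)
		}
		if len(reply.Error) > 0 && string(reply.Error) != "null" {
			return nil, fmt.Errorf("rpc error: %s", reply.Error)
		}
		return reply.Result, nil
	}

	func main() {
		// getbestblockhash is served from in-memory chain state, while getblock
		// has to read the block database, so getblock is the call that should
		// hang if the node is blocked on I/O.
		hashRaw, err := call("getbestblockhash")
		if err != nil {
			fmt.Println("getbestblockhash failed:", err)
			return
		}
		var hash string
		json.Unmarshal(hashRaw, &hash)
		fmt.Println("best block:", hash)

		if _, err := call("getblock", hash); err != nil {
			fmt.Println("getblock failed:", err)
			return
		}
		fmt.Println("getblock responded OK")
	}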

After further investigation, I realised it happens whenever the HDD's I/O gets saturated by other programs running on this machine. So it's difficult to reproduce consistently and I guess it won't happen with fast SSDs.

if getblock calls to the rpc work from another host, or if they hang! If it is a locking issue then another client should also get blocked

Thanks! I'll try that next time it gets stuck...

Hmm, there are some throttling options in Fulcrum that could perhaps help: they limit the number of RPC threads as well as the number of concurrent RPC requests from clients. The two options are bitcoind_clients and bitcoind_throttle. Refer to this example config for documentation on their meaning and syntax: https://github.com/cculianu/Fulcrum/blob/d7c194231e44fd1c1132ee06599e70ba2396663d/doc/fulcrum-example-config.conf#L526

Not sure if this will help you, but setting these options to something extremely conservative, say bitcoind_clients = 1 and bitcoind_throttle = 5 3 1 or something like that, might make it extremely improbable that the I/O falls behind?
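
Concretely, that would just be these two lines in the Fulcrum config file (the same values as above):

	bitcoind_clients = 1
	bitcoind_throttle = 5 3 1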

Just a thought...