facebook/wangle

Too many dangling CLOSE_WAIT connections

edsburke opened this issue · 5 comments

Hey Wangle team,

thank you for the wonderful work. We have been using wangle to build an RPC layer in our projects quite successfully, except that too many dangling CLOSE_WAIT connections pile up on Ubuntu 16, eventually leaving the server unable to respond at all. After much debugging effort we are turning here for hints. Note that all destructors and close methods are called properly. I'll outline the code structure below. Your help on how to debug this is greatly appreciated!

Btw, wangle v2018.10.22.00 is used in this case.

Specifically, the client sockets end up in FIN_WAIT2 and the server sockets in CLOSE_WAIT. The client sockets soon disappear (probably forced closed by the OS); the server sockets in CLOSE_WAIT, however, keep accumulating and eventually leave CLOSE_WAIT to become dangling sockets.

netstat output of the client sockets:

tcp        0      0 172.31.38.97:60445      54.118.66.170:15515      FIN_WAIT2
tcp        0      0 172.31.38.97:60447      54.118.66.170:15515      FIN_WAIT2
tcp        0      0 172.31.38.97:60449      54.118.66.170:15515      FIN_WAIT2

netstat output of the server sockets:

tcp6       1      0 54.118.66.170:15515     172.31.38.97:60445     CLOSE_WAIT
tcp6       1      0 54.118.66.170:15515     172.31.38.97:60447     CLOSE_WAIT
tcp6       1      0 54.118.66.170:15515     172.31.38.97:60449     CLOSE_WAIT

After waiting a while, the CLOSE_WAIT sockets transition to the dangling state (lsof no longer reports any TCP state for them):
lsof -p 5489 | grep TCPv6

rpc-server     5489 ubuntu   20u     sock                0,8      0t0 2007789 protocol: TCPv6
rpc-server     5489 ubuntu   25u     sock                0,8      0t0 2006998 protocol: TCPv6
rpc-server     5489 ubuntu   26u     sock                0,8      0t0 2009216 protocol: TCPv6
rpc-server     5489 ubuntu   28u     sock                0,8      0t0 2009218 protocol: TCPv6
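
For context on what CLOSE_WAIT means here: the kernel has received the peer's FIN, but the owning process has never called close() on its file descriptor, so the socket can sit in that state forever. Below is a minimal sketch outside of wangle (plain POSIX sockets, hypothetical port, error handling omitted) that reproduces exactly this state:

// Minimal sketch, not wangle: shows how a socket gets stuck in CLOSE_WAIT
// when the application never closes its fd after the peer's FIN.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(9999);  // hypothetical port
  bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listener, 16);

  int fd = accept(listener, nullptr, nullptr);
  char buf[4096];
  while (read(fd, buf, sizeof(buf)) > 0) {
    // echo / handle data here
  }
  // read() returned 0: the peer sent FIN and is now in FIN_WAIT2.
  // Because close(fd) is never called, this socket stays in CLOSE_WAIT,
  // exactly like the server sockets in the netstat output above.
  pause();  // keep the process alive so the CLOSE_WAIT socket is observable
  return 0;
}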

Code-wise, on the client side, RpcClient forwards a request to an RpcConnection maintained by the ConnectionPool, keyed by a connection id (host and port). RpcConnection internally holds an RpcService, which is in fact the ClientDispatcher created by ConnectionFactory.

class RpcClient {
public:
  // internally calls RpcConnection::SendRequest
  virtual ... AsyncCall() {
    cp_->GetConnection(remote_id)->SendRequest(...);
  }

private:
  std::shared_ptr<ConnectionPool> cp_;
  std::shared_ptr<folly::IOThreadPoolExecutor> io_executor_;
  std::shared_ptr<folly::CPUThreadPoolExecutor> cpu_executor_;
};
class RpcConnection {
public:
virtual folly::Future<std::unique_ptr<Response>> SendRequest(std::unique_ptr<Request> req);

private:
  std::recursive_mutex mutex_;
  std::shared_ptr<folly::IOThreadPoolExecutor> io_executor_;
  std::shared_ptr<folly::CPUThreadPoolExecutor> cpu_executor_;

  // ConnectionId used by ConnectionPool
  std::shared_ptr<ConnectionId> connection_id_; 

  // initialized by ConnectionFactory::connect, it's ClientDispatcher indeed where 
  // Promise<Response> and Future are handled
  std::shared_ptr<RpcService> rpc_service_; 

  std::shared_ptr<ConnectionFactory> cf_;
  std::shared_ptr<wangle::ClientBootstrap<RpcClientSerializePipeline>> client_bootstrap_;
};
class ConnectionPool {
...
private:
  std::shared_ptr<ConnectionFactory> cf_;
  std::shared_ptr<Configuration> conf_;
  std::unordered_map<std::shared_ptr<ConnectionId>, std::shared_ptr<RpcConnection>> connections_;
};
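
// Aside (an assumption worth double-checking, not necessarily the bug): with
// std::shared_ptr<ConnectionId> as the key, the unordered_map above hashes and
// compares the pointer, not the host/port it refers to, so the pool can
// silently create a new RpcConnection per lookup. A sketch of value-based
// hashing, assuming ConnectionId exposes host() and port() (illustrative
// names; needs <functional>, <memory>, <string>):
struct ConnectionIdHash {
  std::size_t operator()(const std::shared_ptr<ConnectionId>& id) const {
    return std::hash<std::string>{}(id->host()) ^
           (std::hash<uint16_t>{}(id->port()) << 1);
  }
};
struct ConnectionIdEqual {
  bool operator()(const std::shared_ptr<ConnectionId>& a,
                  const std::shared_ptr<ConnectionId>& b) const {
    return a->host() == b->host() && a->port() == b->port();
  }
};
// std::unordered_map<std::shared_ptr<ConnectionId>, std::shared_ptr<RpcConnection>,
//                    ConnectionIdHash, ConnectionIdEqual> connections_;
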
class ConnectionFactory {
  virtual std::shared_ptr<RpcService> Connect(
      std::shared_ptr<wangle::ClientBootstrap<RpcClientSerializePipeline>> client_bootstrap,
      const std::string &hostname, uint16_t port);
};
class RpcClientPipelineFactory
    : public wangle::PipelineFactory<RpcClientSerializePipeline> {
public:
  RpcClientSerializePipeline::Ptr newPipeline(
      std::shared_ptr<folly::AsyncTransportWrapper> sock) override {
    auto pipeline = RpcClientSerializePipeline::create();
    pipeline->setTransport(sock);
    pipeline->addBack(wangle::AsyncSocketHandler{sock});
    pipeline->addBack(wangle::EventBaseHandler{});
    pipeline->addBack(wangle::LengthFieldBasedFrameDecoder{});
    pipeline->addBack(RpcClientSeralizeHandler{...});
    pipeline->finalize();
    return pipeline;
  }
};
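
The part most relevant to the lingering sockets on the client side is where the wangle pipeline actually gets closed when a connection is retired from the pool. A sketch of what that might look like (Close() and the body are illustrative, not the exact code; the key detail is that Pipeline::close() must run on the connection's EventBase thread):

// Illustrative sketch: explicitly close the wangle pipeline when the
// connection is retired, on its EventBase thread, so that AsyncSocketHandler
// closes the underlying fd. Assumes this is not called from that IO thread.
void RpcConnection::Close() {
  std::lock_guard<std::recursive_mutex> lock(mutex_);
  if (!client_bootstrap_ || !client_bootstrap_->getPipeline()) {
    return;
  }
  auto* pipeline = client_bootstrap_->getPipeline();
  auto transport = pipeline->getTransport();
  if (!transport) {
    return;
  }
  transport->getEventBase()->runInEventBaseThreadAndWait([pipeline] {
    pipeline->close();  // fires close() through the handlers down to the socket
  });
}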

The server side is very straightforward:

class RpcServer {
public:
  void StartListening(int port) {
    auto factory = std::make_shared<RpcServerPipelineFactory>();
    server_ = std::make_shared<
        wangle::ServerBootstrap<RpcServerSerializePipeline>>();
    server_->childPipeline(factory);
    server_->bind(port);
  }

private:
  // kept as a member so the bootstrap (and its acceptors) outlive StartListening
  std::shared_ptr<wangle::ServerBootstrap<RpcServerSerializePipeline>> server_;
};
class RpcServerPipelineFactory
    : public wangle::PipelineFactory<RpcServerSerializePipeline> {
public:
  RpcServerSerializePipeline::Ptr newPipeline(
      std::shared_ptr<folly::AsyncTransportWrapper> sock) override {
    auto pipeline = RpcServerSerializePipeline::create();
    pipeline->addBack(wangle::AsyncSocketHandler(sock));
    pipeline->addBack(wangle::EventBaseHandler());
    pipeline->addBack(wangle::LengthFieldBasedFrameDecoder());
    pipeline->addBack(RpcServerSerializeHandler());
    pipeline->addBack(wangle::MultiplexServerDispatcher<
        std::unique_ptr<Request>, std::unique_ptr<Response>>(service_.get()));
    pipeline->finalize();
    return pipeline;
  }

private:
  // service_ was not declared in the snippet; the dispatcher needs a
  // wangle::Service<Req, Resp>, so one plausible declaration is:
  std::shared_ptr<wangle::Service<std::unique_ptr<Request>,
                                  std::unique_ptr<Response>>> service_;
};
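
For completeness, the service_ handed to MultiplexServerDispatcher above is a wangle::Service; a minimal sketch of what it might look like (RpcServiceImpl is an illustrative name, and Request/Response are the project's own message types):

#include <folly/futures/Future.h>
#include <wangle/service/Service.h>

// Illustrative sketch: the dispatcher calls operator() once per decoded
// Request and writes the returned Response back through the pipeline.
class RpcServiceImpl : public wangle::Service<std::unique_ptr<Request>,
                                              std::unique_ptr<Response>> {
public:
  folly::Future<std::unique_ptr<Response>> operator()(
      std::unique_ptr<Request> req) override {
    auto resp = std::make_unique<Response>();
    // ... populate resp from *req ...
    return folly::makeFuture(std::move(resp));
  }
};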

I also ran the EchoServer and EchoClient from the wangle examples, and this confirms the CLOSE_WAIT issue exists there as well: CLOSE_WAIT sockets accumulate, transition to the dangling state, and are never released until the process is killed.

Could anyone suggest what should be done to debug/fix this issue? Thanks.

netstat output of EchoClient

tcp        0      0 172.31.38.97:64115      54.118.66.170:8080       FIN_WAIT2
tcp        0      0 172.31.38.97:64117      54.118.66.170:8080       FIN_WAIT2

netstat output of EchoServer

tcp6       0      0 54.118.66.170:8080      172.31.38.97:64115     CLOSE_WAIT
tcp6       0      0 54.118.66.170:8080      172.31.38.97:64117     CLOSE_WAIT

lsof -p 2825 | grep TCPv6

EchoServe 2825 root   20u     sock                0,8      0t0 2044556 protocol: TCPv6
EchoServe 2825 root   23u     sock                0,8      0t0 2044795 protocol: TCPv6

Just verified that the latest code, v2020.04.06.00, has the same issue on Ubuntu 16.04.
Running ./EchoClient multiple times from one machine (172.31.38.97) against ./EchoServer running on another machine (172.26.1.197), many CLOSE_WAIT server sockets are left lingering once EchoClient is done.

Note that tcp_fin_timeout on the client machine has been raised to 120 seconds, long enough for EchoServer to send its own FIN and reach LAST_ACK.

TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT

netstat -anp | grep EchoServer

tcp6       0      0 :::8080                 :::*                    LISTEN      26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48758     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48756     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48762     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48754     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48770     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48764     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48760     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48766     CLOSE_WAIT  26314/EchoServer
tcp6       0      0 172.26.1.197:8080       172.31.38.97:48768     CLOSE_WAIT  26314/EchoServer

Trying to bump this thread. Hey Wangle team, could you please suggest how this should be debugged or fixed? Thanks a lot.

Has the Wangle team noticed this big issue from the user community? Please kindly advise how to fix it or work around it. Appreciated!

You should close the socket after the client leaves. Using EchoServer as an example:

class EchoHandler : public wangle::HandlerAdapter<std::string> {
public:
  void read(Context* ctx, std::string msg) override {
    std::cout << "handling " << msg << std::endl;
    write(ctx, msg + "\r\n");
  }

  // close the socket once the peer has sent EOF (FIN)
  void readEOF(Context* ctx) override {
    ctx->fireClose();
  }
};
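
Applying the same idea to the RPC server pipeline earlier in the thread: none of the handlers in RpcServerPipelineFactory close the transport on EOF, so a small pass-through handler that does (a sketch; the class name is illustrative) could be added right after RpcServerSerializeHandler:

// Sketch only: closes the pipeline when the peer sends FIN, so the server-side
// socket does not linger in CLOSE_WAIT. Read/write types match its position in
// the pipeline (after RpcServerSerializeHandler, before the dispatcher).
class CloseOnReadEOFHandler
    : public wangle::HandlerAdapter<std::unique_ptr<Request>,
                                    std::unique_ptr<Response>> {
public:
  void readEOF(Context* ctx) override {
    ctx->fireClose();  // closes the underlying AsyncSocket, releasing the fd
  }
};

// In RpcServerPipelineFactory::newPipeline():
//   pipeline->addBack(RpcServerSerializeHandler());
//   pipeline->addBack(CloseOnReadEOFHandler());
//   pipeline->addBack(wangle::MultiplexServerDispatcher<...>(service_.get()));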