Too many dangling CLOSE_WAIT connections
edsburke opened this issue · 5 comments
Hey Wangle team,
thank you guys for the wonderful work. We have been using Wangle to build an RPC layer in our projects and it has worked well, except that too many dangling CLOSE_WAIT connections accumulate on Ubuntu 16, eventually leaving the server unable to respond at all. After a lot of debugging effort, we are turning here for hints. Note that all destructors and close methods are called properly. I'll outline the code structure below. Your help on how to debug this is greatly appreciated!
Btw, wangle v2018.10.22.00 is used in this case.
Specifically, the client socket ends up in FIN_WAIT2 and the server socket in CLOSE_WAIT. The client sockets soon disappear (probably forced closed by the OS), but the server sockets in CLOSE_WAIT accumulate and eventually turn into dangling sockets that have left CLOSE_WAIT. Since CLOSE_WAIT means the peer has closed its end and the local application has not yet called close() on the socket, the accumulation suggests the server never closes these connections.
netstat output of the client sockets:
tcp 0 0 172.31.38.97:60445 54.118.66.170:15515 FIN_WAIT2
tcp 0 0 172.31.38.97:60447 54.118.66.170:15515 FIN_WAIT2
tcp 0 0 172.31.38.97:60449 54.118.66.170:15515 FIN_WAIT2
netstat output of the server sockets:
tcp6 1 0 54.118.66.170:15515 172.31.38.97:60445 CLOSE_WAIT
tcp6 1 0 54.118.66.170:15515 172.31.38.97:60447 CLOSE_WAIT
tcp6 1 0 54.118.66.170:15515 172.31.38.97:60449 CLOSE_WAIT
After waiting a while, the CLOSE_WAIT sockets transition to a dangling state:
lsof -p 5489 | grep TCPv6
rpc-server 5489 ubuntu 20u sock 0,8 0t0 2007789 protocol: TCPv6
rpc-server 5489 ubuntu 25u sock 0,8 0t0 2006998 protocol: TCPv6
rpc-server 5489 ubuntu 26u sock 0,8 0t0 2009216 protocol: TCPv6
rpc-server 5489 ubuntu 28u sock 0,8 0t0 2009218 protocol: TCPv6
Code-wise, on the client side, RpcClient forwards each request to an RpcConnection maintained by ConnectionPool and looked up by connection id (e.g. host and port). RpcConnection internally holds an RpcService, which is actually the ClientDispatcher created by ConnectionFactory.
class RpcClient {
 public:
  // Internally calls RpcConnection::SendRequest on the pooled connection.
  virtual ... AsyncCall(...) {
    return cp_->GetConnection(remote_id)->SendRequest(...);
  }

 private:
  std::shared_ptr<ConnectionPool> cp_;
  std::shared_ptr<folly::IOThreadPoolExecutor> io_executor_;
  std::shared_ptr<folly::CPUThreadPoolExecutor> cpu_executor_;
};
class RpcConnection {
 public:
  virtual folly::Future<std::unique_ptr<Response>> SendRequest(std::unique_ptr<Request> req);

 private:
  std::recursive_mutex mutex_;
  std::shared_ptr<folly::IOThreadPoolExecutor> io_executor_;
  std::shared_ptr<folly::CPUThreadPoolExecutor> cpu_executor_;
  // ConnectionId used by ConnectionPool.
  std::shared_ptr<ConnectionId> connection_id_;
  // Initialized by ConnectionFactory::Connect; it is actually the ClientDispatcher,
  // where the Promise<Response>/Future pair is handled.
  std::shared_ptr<RpcService> rpc_service_;
  std::shared_ptr<ConnectionFactory> cf_;
  std::shared_ptr<wangle::ClientBootstrap<RpcClientSerializePipeline>> client_bootstrap_;
};
class ConnectionPool {
  ...
 private:
  std::shared_ptr<ConnectionFactory> cf_;
  std::shared_ptr<Configuration> conf_;
  // Keyed by connection id (host and port); values are live connections.
  std::unordered_map<std::shared_ptr<ConnectionId>, std::shared_ptr<RpcConnection>>
      connections_;
};
class ConnectionFactory {
 public:
  virtual std::shared_ptr<RpcService> Connect(
      std::shared_ptr<wangle::ClientBootstrap<RpcClientSerializePipeline>> client_bootstrap,
      const std::string &hostname, uint16_t port);
};
class RpcClientPipelineFactory
    : public wangle::PipelineFactory<RpcClientSerializePipeline> {
 public:
  RpcClientSerializePipeline::Ptr newPipeline(
      std::shared_ptr<folly::AsyncTransportWrapper> sock) override {
    auto pipeline = RpcClientSerializePipeline::create();
    pipeline->setTransport(sock);
    pipeline->addBack(wangle::AsyncSocketHandler{sock});
    pipeline->addBack(wangle::EventBaseHandler{});
    pipeline->addBack(wangle::LengthFieldBasedFrameDecoder{});
    pipeline->addBack(RpcClientSerializeHandler{/* ... */});
    pipeline->finalize();
    return pipeline;
  }
};
The server side is very straightforward:
class RpcServer {
 public:
  void StartListening(int port) {
    auto factory = std::make_shared<RpcServerPipelineFactory>();
    auto server =
        std::make_shared<wangle::ServerBootstrap<RpcServerSerializePipeline>>();
    server->childPipeline(factory);
    server->bind(port);
  }
};
class RpcServerPipelineFactory
    : public wangle::PipelineFactory<RpcServerSerializePipeline> {
 public:
  RpcServerSerializePipeline::Ptr newPipeline(
      std::shared_ptr<folly::AsyncTransportWrapper> sock) override {
    auto pipeline = RpcServerSerializePipeline::create();
    pipeline->addBack(wangle::AsyncSocketHandler(sock));
    // Ensure that writes are performed on the connection's event base.
    pipeline->addBack(wangle::EventBaseHandler());
    pipeline->addBack(wangle::LengthFieldBasedFrameDecoder());
    pipeline->addBack(RpcServerSerializeHandler());
    pipeline->addBack(wangle::MultiplexServerDispatcher<
                      std::unique_ptr<Request>, std::unique_ptr<Response>>(
        service_.get()));
    pipeline->finalize();
    return pipeline;
  }
  // service_ is the server-side Service implementation (declaration elided).
};
I happened to run the EchoServer and EchoClient from the Wangle examples, and that confirms the CLOSE_WAIT issue exists there as well. The CLOSE_WAIT sockets just accumulate, transition to the dangling state, and are never released unless the process is killed.
Could anyone suggest what should be done to debug/fix this issue? Thanks.
netstat output of EchoClient:
tcp 0 0 172.31.38.97:64115 54.118.66.170:8080 FIN_WAIT2
tcp 0 0 172.31.38.97:64117 54.118.66.170:8080 FIN_WAIT2
netstat output of EchoServer:
tcp6 0 0 54.118.66.170:8080 172.31.38.97:64115 CLOSE_WAIT
tcp6 0 0 54.118.66.170:8080 172.31.38.97:64117 CLOSE_WAIT
lsof -p 2825 | grep TCPv6
EchoServe 2825 root 20u sock 0,8 0t0 2044556 protocol: TCPv6
EchoServe 2825 root 23u sock 0,8 0t0 2044795 protocol: TCPv6
Just verified that the latest release, v2020.04.06.00, has the same issue on Ubuntu 16.04.
I ran ./EchoClient multiple times from one machine (172.31.38.97) against ./EchoServer running on another machine (172.26.1.197). After the EchoClient runs finish, many CLOSE_WAIT server sockets keep lingering.
Note that tcp_fin_timeout has been raised on the client machine to 120 seconds, long enough that EchoServer has a chance to send its own FIN and reach LAST_ACK.
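For reference, this is roughly how the timeout was raised on the client machine (the exact command is an assumption; any equivalent way of writing net.ipv4.tcp_fin_timeout works):
sudo sysctl -w net.ipv4.tcp_fin_timeout=120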
For background, see: TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT
netstat -anp | grep EchoServer
tcp6 0 0 :::8080 :::* LISTEN 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48758 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48756 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48762 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48754 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48770 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48764 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48760 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48766 CLOSE_WAIT 26314/EchoServer
tcp6 0 0 172.26.1.197:8080 172.31.38.97:48768 CLOSE_WAIT 26314/EchoServer
Trying to bump this thread. Hey Wangle team, could you please suggest how this should be debugged or fixed? Thanks a lot.
Has the Wangle team noticed this big issue from the user community? Please kindly advise how to fix it or work around it. Appreciated!
You should close the socket after the client leaves. Using EchoServer as an example:
class EchoHandler : public wangle::HandlerAdapter<std::string> {
 public:
  void read(Context* ctx, std::string msg) override {
    std::cout << "handling " << msg << std::endl;
    write(ctx, msg + "\r\n");
  }

  // Close the socket once the client has closed its end.
  void readEOF(Context* ctx) override {
    ctx->fireClose();
  }
};
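The same idea should carry over to the RPC server pipeline from the original report: some handler in RpcServerSerializePipeline has to react to readEOF by closing the pipeline, otherwise the server keeps its end of the connection open and the socket stays in CLOSE_WAIT. A minimal sketch, assuming RpcServerSerializeHandler derives from wangle::Handler and that the template parameters below match your Request/Response framing (both are assumptions about your code; only the readEOF override is the point):

// Hypothetical sketch: base class and template parameters are assumed.
class RpcServerSerializeHandler
    : public wangle::Handler<std::unique_ptr<folly::IOBuf>,   // assumed Rin (from frame decoder)
                             std::unique_ptr<Request>,        // assumed Rout
                             std::unique_ptr<Response>,       // assumed Win
                             std::unique_ptr<folly::IOBuf>> { // assumed Wout
 public:
  // ... existing read()/write() serialization logic elided ...

  // When the client sends FIN, readEOF is fired through the pipeline.
  // Propagating a close lets wangle::AsyncSocketHandler close the fd, so the
  // server-side socket leaves CLOSE_WAIT instead of dangling forever.
  void readEOF(Context* ctx) override {
    ctx->fireClose();
  }
};

Alternatively, a small dedicated handler that only overrides readEOF can be added to the back of the pipeline, so the serialize handler stays untouched.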