Tencent/TSeer

TSeer Agent Api works in a synchronous way which is slow while TSeerServer is down

Closed this issue · 1 comments

The TSeer API becomes very slow when TSeerServer is down (it occasionally core dumps while writing logs, as mentioned in issue #12). The API currently works synchronously: each call first sends a request to sync information from TSeerServer and only then returns the result, so when the server has a problem the call blocks until the timeout (500 ms). In a high-traffic, latency-sensitive online system this causes requests to pile up and throughput to drop sharply, because every request that uses the agent API to get a backend service host is slowed down.

An asynchronous approach should be better: the agent is only responsible for syncing the host list and updating the cache, while the API just reads the result that is already cached, which keeps it fast. TSeer has presumably been battle-tested as an internal system, so there must already be a solution for this; please share it or point out how the API should be used.
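To make the proposal concrete, here is a minimal sketch of the asynchronous design, not TSeer code. `Host`, `CachedRouteTable` and the `Fetcher` callback are hypothetical names introduced for illustration; the fetcher stands in for whatever call actually syncs the host list. A background thread refreshes the cache, and the lookup path only copies the last good snapshot, so a fetch that fails or times out never delays a caller.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical host record; not a TSeer type.
struct Host {
    std::string ip;
    int port;
};

// Hypothetical cache: a background thread refreshes the host list,
// and readers only ever copy the last good snapshot.
class CachedRouteTable {
public:
    // Returns true and fills the vector on success, false on failure/timeout.
    using Fetcher = std::function<bool(std::vector<Host>&)>;

    CachedRouteTable(Fetcher fetch, std::chrono::milliseconds interval)
        : fetch_(std::move(fetch)), interval_(interval), stop_(false) {}

    ~CachedRouteTable() { stop(); }

    // One refresh attempt; a failed fetch keeps the stale snapshot.
    void refreshOnce() {
        std::vector<Host> fresh;
        if (fetch_(fresh)) {
            std::lock_guard<std::mutex> lock(mu_);
            hosts_.swap(fresh);
        }
    }

    // Starts the background refresh loop; call at most once.
    void start() {
        assert(!worker_.joinable());
        worker_ = std::thread([this] {
            while (!stop_) {
                refreshOnce();
                std::this_thread::sleep_for(interval_);
            }
        });
    }

    // Asks the loop to exit and waits for it (up to one interval plus one fetch).
    void stop() {
        stop_ = true;
        if (worker_.joinable()) worker_.join();
    }

    // Request path: never calls the fetcher, only copies the cached snapshot.
    std::vector<Host> get() const {
        std::lock_guard<std::mutex> lock(mu_);
        return hosts_;
    }

private:
    Fetcher fetch_;
    std::chrono::milliseconds interval_;
    mutable std::mutex mu_;
    std::vector<Host> hosts_;
    std::atomic<bool> stop_;
    std::thread worker_;
};
```

In this sketch a process would call `start()` once at startup and `get()` on every request. The trade-off is staleness: callers may see a host list that is up to one refresh interval old, or older while the server is unreachable.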

A simple way to reproduce the problem:

CODE:

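// Needs <sys/time.h> for gettimeofday, <cstdio> for fprintf and <iostream> for cout.
// req, sErr and iRet are set up earlier (not shown).
struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);
iRet = ApiGetRoute(req, sErr);
gettimeofday(&tv2, NULL);
// Elapsed wall-clock time of the ApiGetRoute call, in milliseconds.
fprintf(stderr, "cost %f\n", (tv2.tv_sec - tv1.tv_sec)*1000.0 + (tv2.tv_usec - tv1.tv_usec)/1000.0);
cout << "[out]iRet: " << iRet << " sErr: " << sErr << endl;
cout << "[out]ip: " << req.ip << endl;
cout << "[out]port: " << req.port << endl;
cout << "[out]isTcp: " << req.isTcp << endl;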

OUTPUT:

cost 500.177000
[out]iRet: 0 sErr: /home/tcheng/tools/TSeer/api/cplus/src/conn.cpp:QueryAndRecvRouterFromAgent:136|socket recvfrom error|ip:127.0.0.1|port:8865|ret:-1|timeOut:500|fd:4|errno:11|info:Resource temporarily unavailable
[out]ip: 10.181.32.11
[out]port: 8724
[out]isTcp: 1
[report]sErr: /home/tcheng/tools/TSeer/api/cplus/src/conn.cpp:QueryAndRecvRouterFromAgent:136|socket recvfrom error|ip:127.0.0.1|port:8865|ret:-1|timeOut:500|fd:4|errno:11|info:Resource temporarily unavailable

After tracing the code, I found that the RouterManager::getRouter* methods check _remoteProvider->isAvailable() and _remoteProvider->addFailedNumAndCheckAvailable(); when _remoteProvider is not available, the result comes from the local cache. The failure threshold is 3, but in my test loop of 100 calls the first 10 were slow and the remaining 90 were as fast as normal.
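For comparison, here is a simplified model of a failure-threshold fallback. It is an assumption-laden sketch, not TSeer's implementation: `RemoteHealth` and `countSlowCalls` are hypothetical, and only the two method names are borrowed from the code mentioned above. It counts how many calls would pay the remote timeout if every remote attempt fails and nothing ever resets the counter.

```cpp
#include <cassert>

// Hypothetical stand-in for the availability check; not TSeer's implementation.
class RemoteHealth {
public:
    explicit RemoteHealth(int threshold) : threshold_(threshold), failed_(0) {
        assert(threshold > 0);
    }

    // The remote counts as available until `threshold` failures are recorded.
    bool isAvailable() const { return failed_ < threshold_; }

    // Records one failure; returns whether the remote still counts as available.
    bool addFailedNumAndCheckAvailable() {
        ++failed_;
        return isAvailable();
    }

private:
    int threshold_;
    int failed_;
};

// Counts how many of `calls` lookups would wait for the remote timeout
// when every remote attempt fails.
inline int countSlowCalls(int calls, int threshold) {
    RemoteHealth health(threshold);
    int slow = 0;
    for (int i = 0; i < calls; ++i) {
        if (health.isAvailable()) {
            ++slow;                                  // would block until the timeout
            health.addFailedNumAndCheckAvailable();  // simulated remote failure
        }
        // Otherwise the lookup is served from the local cache without waiting.
    }
    return slow;
}
```

Under this model `countSlowCalls(100, 3)` returns 3, so the 10 slow calls observed in the test suggest the real logic differs from the sketch in some way not identified here.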

I will close this issue and dig deeper for more details on how to tune it.