CLUSTER: Request non-existent HTTP streams, log flooding, thread does not exit, FD leakage.
MaxLoveThree opened this issue · 13 comments
SRS version:
[root@localhost trunk]# ./objs/srs -v
2.0.214
Configuration:
listen              1936;
max_connections     1000;
pid                 ./objs/play_edge.pid;
srs_log_level       info;
srs_log_file        ./objs/play_edge.log;
chunk_size          4096;
http_server {
    enabled         on;
    listen          8090;
    dir             ./objs/nginx/html;
}
vhost __defaultVhost__ {
    mode            remote;
    origin          127.0.0.1:19350;
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].aac;
        hstrs       on;
    }
}
Operation:
Without any stream being published, initiate a pull request to the server, e.g. with curl:
curl http://172.16.198.129:8090/my_test/test.aac -o /dev/null
Phenomenon:
The following logs are printed repeatedly, and they keep appearing even after the client's HTTP connection is closed:
[2016-09-06 04:24:24.713][warn][56189][113][62] origin disconnected, retry. ret=1011
[2016-09-06 04:24:25.715][trace][56189][113] edge pull connected, url=rtmp://127.0.0.1:19350/my_test/test, server=127.0.0.1:19350
[2016-09-06 04:24:25.728][trace][56189][113] complex handshake success.
[2016-09-06 04:24:25.728][trace][56189][113] edge ingest from 127.0.0.1:19350 at rtmp://127.0.0.1:19350/my_test
[2016-09-06 04:24:25.808][trace][56189][113] input chunk size to 60000
[2016-09-06 04:24:25.808][trace][56189][113] connected, version=2.0.214, ip=127.0.0.1, pid=56192, id=117, dsu=1
[2016-09-06 04:24:25.809][trace][56189][113] out chunk size to 60000
[2016-09-06 04:24:28.856][warn][56189][113][62] origin disconnected, retry. ret=1011
[2016-09-06 04:24:29.857][trace][56189][113] edge pull connected, url=rtmp://127.0.0.1:19350/my_test/test, server=127.0.0.1:19350
[2016-09-06 04:24:29.873][trace][56189][113] complex handshake success.
[2016-09-06 04:24:29.873][trace][56189][113] edge ingest from 127.0.0.1:19350 at rtmp://127.0.0.1:19350/my_test
[2016-09-06 04:24:29.953][trace][56189][113] input chunk size to 60000
[2016-09-06 04:24:29.953][trace][56189][113] connected, version=2.0.214, ip=127.0.0.1, pid=56192, id=120, dsu=1
[2016-09-06 04:24:29.953][trace][56189][113] out chunk size to 60000
According to code tracing, the HTTP client thread stays inside SrsLiveStream::serve_http and never returns, so it never reaches the code that checks whether the HTTP connection still exists, even after the client disconnects. Judging by how other features are implemented, when the edge server pulls the stream from the origin it should run that work in a separate thread instead of doing it on the HTTP client thread.
The while loop in SrsLiveStream::serve_http is missing logic to check whether the HTTP client connection still exists.
With an HTTP long-lived connection, there is no way to know the client has disconnected unless data is written, because nothing is read from the client once FLV streaming starts.
The only practical fix is: if the stream does not exist when fetching from the origin, return 404 to the client.
This comes from the origin-fetch strategy. If the stream does not exist, the origin server should return 404, that 404 should be propagated from the origin to the edge, and the edge should return 404 to the player. However, this requires significant changes and cannot land in SRS2 in time.
Postponed to SRS3+.
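As a rough illustration (not SRS code) of what "return 404 to the player" means at the wire level, the edge only needs to write a plain HTTP status response and close the connection; build_404_response() is a hypothetical helper name:

```cpp
#include <string>

// Hypothetical helper: the raw HTTP/1.1 response an edge could send
// when the origin has no such stream. "Connection: close" tells the
// player not to keep the connection open.
std::string build_404_response() {
    return "HTTP/1.1 404 Not Found\r\n"
           "Connection: close\r\n"
           "Content-Length: 0\r\n"
           "\r\n";
}
```

A real server would write this to the socket and then close the fd, which also avoids the stuck-open connections described later in this thread.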
On closer inspection, this also causes the fd to never be closed.
You should start a coroutine to receive data from the fd; if the client closes the fd, the reading coroutine will return an error.
On each iteration of the send loop, check whether that coroutine has hit an error, and stop the loop if it has.
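A minimal sketch of this fix, with a plain POSIX thread standing in for an ST coroutine (names like ReaderGuard are illustrative, not SRS code): the reader blocks on recv(); when the client closes the fd, recv() returns 0 and the reader sets a flag that the send loop checks on every iteration.

```cpp
#include <atomic>
#include <thread>
#include <sys/socket.h>
#include <unistd.h>

// Watches a connection fd from a dedicated thread and records when the
// peer disconnects. An HTTP-FLV client sends nothing after its request,
// so any return from recv() means EOF (0) or a socket error (<0).
class ReaderGuard {
public:
    explicit ReaderGuard(int fd)
        : disconnected_(false),
          trd_([this, fd] {
              char buf[128];
              while (recv(fd, buf, sizeof(buf), 0) > 0) {
                  // discard anything the client sends
              }
              disconnected_.store(true);
          }) {}
    ~ReaderGuard() { trd_.join(); }
    bool disconnected() const { return disconnected_.load(); }
private:
    std::atomic<bool> disconnected_;
    std::thread trd_;
};

// Demo: the "server" watches one end of a socketpair; closing the
// "client" end makes the send loop stop. Returns true on success.
bool demo_client_disconnect() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return false;
    ReaderGuard guard(fds[0]);
    close(fds[1]);  // client disconnects
    while (!guard.disconnected()) {
        // here the real server would write FLV data
    }
    close(fds[0]);
    return true;
}
```

In SRS the same shape is achieved with ST coroutines rather than OS threads, but the invariant is identical: exactly one reader owns the fd, and the writer loop polls its state each iteration.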
This is similar to how the RTMP connection is read (srs/trunk/src/app/srs_app_rtmp_conn.cpp, line 717 at commit ff87318):
For RTMP playback clients, only a very small number of requests are sent from the client to the server, while the majority are sent from the server to the client. Therefore, a new coroutine is created for reading, while the main coroutine is mainly responsible for sending.
For FLV players, there are no read requests, only write requests. Therefore, a new coroutine can be created to block at the read location, and if the client closes the file descriptor, the read coroutine will return an error. The main coroutine's loop is mainly for sending data, and it checks the read coroutine at each iteration.
Some players, after connecting successfully and requesting a non-existent HTTP-FLV stream, never actively disconnect; the connection just stays open. In that case there is no CLOSE_WAIT, but both sides are stuck, so it is better for SRS to actively respond with 404.
The SRS configuration is as follows:
listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          8080;
}
vhost __defaultVhost__ {
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].flv;
        hstrs       on;
    }
}
Directly access the player multiple times: http://www.ossrs.net/players/srs_player.html?app=live&stream=livestream.flv&server=localhost&port=8080&autostart=true&vhost=localhost&schema=http
Many connections are left in CLOSE_WAIT:
winlin:srs winlin$ netstat -an|grep 8080|grep CLOSE_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52870 CLOSE_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52866 CLOSE_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52864 CLOSE_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52862 CLOSE_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52855 CLOSE_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52852 CLOSE_WAIT
You can see many FDs (10-15) not closed:
winlin:srs winlin$ lsof |grep 10671|grep CLOSE_WAIT
srs 10671 winlin 10u IPv4 0x17dcbc0eb6e11347 0t0 TCP localhost:http-alt->localhost:52852 (CLOSE_WAIT)
srs 10671 winlin 11u IPv4 0x17dcbc0eb6ab6a4f 0t0 TCP localhost:http-alt->localhost:52855 (CLOSE_WAIT)
srs 10671 winlin 12u IPv4 0x17dcbc0eb6f32347 0t0 TCP localhost:http-alt->localhost:52862 (CLOSE_WAIT)
srs 10671 winlin 13u IPv4 0x17dcbc0eb6a98f67 0t0 TCP localhost:http-alt->localhost:52864 (CLOSE_WAIT)
srs 10671 winlin 14u IPv4 0x17dcbc0ea4a5485f 0t0 TCP localhost:http-alt->localhost:52866 (CLOSE_WAIT)
srs 10671 winlin 15u IPv4 0x17dcbc0eb6aca157 0t0 TCP localhost:http-alt->localhost:52870 (CLOSE_WAIT)
Open a new ST receiving thread to read from the HTTP connection. Since HTTP-FLV has no follow-up requests, the receiving thread will hit an error and exit when the client closes the connection. Normally an HTTP request is handled like this:

SrsHttpConn::do_cycle
    parser->parse_message(&req)
    process_request(writer, req)

However, inside process_request we need to start another thread to detect whether the fd has been closed:

process_request(writer, req)
    trd->start()
    while trd->error_code() == ERROR_SUCCESS
        write FLV data

In the thread trd, we call the function that reads an HTTP message directly from the connection:

SrsHttpConn::pop_message(&req)

Note: this API may only be used on connections that carry no further requests, i.e. purely for fd-closure detection. It would be a disaster if two threads read the same fd.
One change: HTTP streaming (FLV/TS) is actually served by SrsResponseOnlyHttpConn instead of SrsHttpConn; when it reads a request, the entire body is discarded.
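A sketch (illustrative, not the actual SrsResponseOnlyHttpConn code) of "discard the whole body": read and drop everything the client sends until EOF or error, so a single reader owns the fd and doubles as close detection. drain_until_close() and demo_drain() are hypothetical names:

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Reads and discards everything from fd until the peer closes.
// Returns the number of bytes discarded, or -1 on a socket error
// (e.g. ECONNRESET when the player is killed).
long drain_until_close(int fd) {
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n == 0) return total;   // peer closed the connection
        if (n < 0) return -1;       // socket error
        total += n;
    }
}

// Demo with a socketpair: the "client" sends 5 bytes and closes;
// the drain sees the 5 bytes, then EOF. Returns true on success.
bool demo_drain() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return false;
    send(fds[1], "hello", 5, 0);
    close(fds[1]);
    long n = drain_until_close(fds[0]);
    close(fds[0]);
    return n == 5;
}
```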
When the player is closed, the receiving thread detects that the socket has been RESET, i.e. closed by the client, and breaks out of the loop:
[2017-04-30 11:58:43.928][trace][16163][107] HTTP client ip=127.0.0.1
[2017-04-30 11:58:43.928][trace][16163][107] HTTP GET http://localhost:8080/live/livestream.flv, content-length=-1
[2017-04-30 11:58:43.963][trace][16163][107] http: mount flv stream for vhost=/live/livestream, mount=/live/livestream.flv
[2017-04-30 11:58:43.964][trace][16163][107] hstrs: source url=/live/livestream, is_edge=0, source_id=-1[-1]
[2017-04-30 11:58:43.964][trace][16163][107] dispatch cached gop success. count=0, duration=-1
[2017-04-30 11:58:43.964][trace][16163][107] create consumer, queue_size=30.00, jitter=1
[2017-04-30 11:58:46.988][warn][16163][107][54] client disconnect peer. ret=1004
No leaked FDs remain:
winlin:srs winlin$ lsof |grep 16163|grep CLOSE_WAIT
winlin:srs winlin$ netstat -an|grep 8080|grep CLOSE_WAIT
It actually took 52 minutes to solve this. Even with such a simple ST architecture, this counts as a fairly troublesome problem...
Compilation failed under Ubuntu.
src/app/srs_app_recv_thread.cpp:557:9: error: ‘ISrsHttpMessage’ was not declared in this scope.
@chenliang2017 Please file another bug.