ossrs/srs

CLUSTER: Requesting a non-existent HTTP stream causes log flooding, a thread that never exits, and FD leakage.

MaxLoveThree opened this issue · 13 comments

SRS version:
[root@localhost trunk]# ./objs/srs -v
2.0.214
Configuration:

listen              1936;
max_connections     1000;
pid                 ./objs/play_edge.pid;
srs_log_level       info;
srs_log_file        ./objs/play_edge.log;
chunk_size          4096;

http_server {
    enabled         on;
    listen          8090;
    dir             ./objs/nginx/html;
}

vhost __defaultVhost__ {
    mode            remote;
    origin          127.0.0.1:19350;
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].aac;
        hstrs       on;
    }
}

Operation:
Without publishing any stream, initiate a pull (playback) request to the server, e.g. from a browser or with curl:

curl http://172.16.198.129:8090/my_test/test.aac -o /dev/null

Phenomenon:
The following logs are printed repeatedly, and they keep appearing even after the browser's HTTP connection is closed:

[2016-09-06 04:24:24.713][warn][56189][113][62] origin disconnected, retry. ret=1011
[2016-09-06 04:24:25.715][trace][56189][113] edge pull connected, url=rtmp://127.0.0.1:19350/my_test/test, server=127.0.0.1:19350
[2016-09-06 04:24:25.728][trace][56189][113] complex handshake success.
[2016-09-06 04:24:25.728][trace][56189][113] edge ingest from 127.0.0.1:19350 at rtmp://127.0.0.1:19350/my_test
[2016-09-06 04:24:25.808][trace][56189][113] input chunk size to 60000
[2016-09-06 04:24:25.808][trace][56189][113] connected, version=2.0.214, ip=127.0.0.1, pid=56192, id=117, dsu=1
[2016-09-06 04:24:25.809][trace][56189][113] out chunk size to 60000
[2016-09-06 04:24:28.856][warn][56189][113][62] origin disconnected, retry. ret=1011
[2016-09-06 04:24:29.857][trace][56189][113] edge pull connected, url=rtmp://127.0.0.1:19350/my_test/test, server=127.0.0.1:19350
[2016-09-06 04:24:29.873][trace][56189][113] complex handshake success.
[2016-09-06 04:24:29.873][trace][56189][113] edge ingest from 127.0.0.1:19350 at rtmp://127.0.0.1:19350/my_test
[2016-09-06 04:24:29.953][trace][56189][113] input chunk size to 60000
[2016-09-06 04:24:29.953][trace][56189][113] connected, version=2.0.214, ip=127.0.0.1, pid=56192, id=120, dsu=1
[2016-09-06 04:24:29.953][trace][56189][113] out chunk size to 60000

Tracing the code shows that the HTTP client thread stays inside SrsLiveStream::serve_http and never returns. As a result, it never reaches the logic that checks whether the HTTP connection still exists, even after the connection has been closed. Judging from how other features are implemented, when the edge server pulls the stream from the origin it should start a separate thread for that, instead of piling the work onto the HTTP client thread.
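
A rough sketch of that control flow, reconstructed from the tracing above (simplified pseudocode, not the actual source):

SrsLiveStream::serve_http(w, r)
    while true
        pull stream from origin    // blocks and retries: "origin disconnected, retry. ret=1011"
        write FLV/AAC data to client
        // no check anywhere that the HTTP client is still connected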

The while loop in SrsLiveStream::serve_http is missing logic to check whether the HTTP client connection still exists.

On an HTTP long-lived connection there is no way to know the client has disconnected unless data is written, because nothing is read from the client once FLV streaming starts.
The only way to solve it along these lines: if the stream does not exist when sourcing from the origin, return 404 to the client.
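
A minimal sketch of that idea, assuming a Go-style response writer like SRS's ISrsHttpResponseWriter (the sourcing call and error handling here are placeholders, not the actual API):

// Hypothetical: if the edge cannot source the stream from the origin,
// answer 404 right away instead of retrying while the client waits.
if ((ret = fetch_stream_from_origin(r)) != ERROR_SUCCESS) {
    w->write_header(404);
    return ret;
}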

This is caused by the back-to-origin strategy. If the stream does not exist, the origin server should return 404, the 404 should propagate from the origin to the edge, and the edge should return 404 to the player. That requires significant changes, however, and cannot be done in SRS2 in time.
Postponed to SRS3+.

Looking again, it also causes the FD not to be closed.

A coroutine should be opened to receive data from the FD; if the client closes the FD, the reading coroutine returns an error.
On each iteration of the send loop, check whether that coroutine has hit an error, and stop the loop if it has.

Similar to how an RTMP connection is read:

if ((ret = trd->error_code()) != ERROR_SUCCESS) {

For RTMP playback clients, only a tiny number of requests go from client to server, while the bulk of the data flows from server to client. A separate coroutine is therefore created for reading, and the main coroutine is mainly responsible for sending.

For FLV players there are no requests to read at all, only data to write. So a new coroutine can be created that blocks on the read; if the client closes the file descriptor, that read coroutine returns an error. The main coroutine's loop mainly sends data, checking the read coroutine on each iteration.
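
The same pattern can be demonstrated outside SRS. Below is a self-contained sketch using POSIX sockets and std::thread in place of ST coroutines (serve_flv and the fake payload are inventions for illustration; MSG_NOSIGNAL assumes Linux): the reader blocks in recv(), which returns 0 or -1 once the player closes the connection, and the send loop checks that state on every pass.

#include <atomic>
#include <chrono>
#include <thread>
#include <sys/socket.h>
#include <unistd.h>

// Reader/writer split: the reader exists only to notice that the
// client went away; the writer stands in for the FLV send loop.
void serve_flv(int client_fd)
{
    std::atomic<bool> client_alive(true);

    // HTTP-FLV playback sends no further requests, so anything read
    // here is discarded; recv() <= 0 means the client closed or reset.
    std::thread reader([client_fd, &client_alive]() {
        char buf[4096];
        while (recv(client_fd, buf, sizeof(buf), 0) > 0) {
        }
        client_alive = false;
    });

    // The equivalent of: while trd->error_code() == ERROR_SUCCESS.
    while (client_alive.load()) {
        const char tag[] = "fake-flv-tag"; // stands in for real FLV data
        if (send(client_fd, tag, sizeof(tag) - 1, MSG_NOSIGNAL) < 0) {
            break; // send failed: peer is gone
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(40));
    }

    shutdown(client_fd, SHUT_RDWR); // unblock recv() if send() failed first
    reader.join();
    close(client_fd);
}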

Some players, after connecting successfully and requesting a non-existent HTTP-FLV stream, never actively disconnect; the connection just stays open. In that case there is no CLOSE_WAIT state, but both sides are stuck, so it is better for SRS to actively respond with a 404.

The SRS configuration is as follows:

listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          8080;
}
vhost __defaultVhost__ {
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].flv;
        hstrs       on;
    }
}

Directly access the player multiple times: http://www.ossrs.net/players/srs_player.html?app=live&stream=livestream.flv&server=localhost&port=8080&autostart=true&vhost=localhost&schema=http

You can see many CLOSE_WAIT:

winlin:srs winlin$ netstat -an|grep 8080|grep CLOSE_WAIT
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52870        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52866        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52864        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52862        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52855        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52852        CLOSE_WAIT 

You can see several FDs (numbers 10-15) that are not closed:

winlin:srs winlin$ lsof |grep 10671|grep CLOSE_WAIT
srs       10671 winlin   10u     IPv4 0x17dcbc0eb6e11347       0t0      TCP localhost:http-alt->localhost:52852 (CLOSE_WAIT)
srs       10671 winlin   11u     IPv4 0x17dcbc0eb6ab6a4f       0t0      TCP localhost:http-alt->localhost:52855 (CLOSE_WAIT)
srs       10671 winlin   12u     IPv4 0x17dcbc0eb6f32347       0t0      TCP localhost:http-alt->localhost:52862 (CLOSE_WAIT)
srs       10671 winlin   13u     IPv4 0x17dcbc0eb6a98f67       0t0      TCP localhost:http-alt->localhost:52864 (CLOSE_WAIT)
srs       10671 winlin   14u     IPv4 0x17dcbc0ea4a5485f       0t0      TCP localhost:http-alt->localhost:52866 (CLOSE_WAIT)
srs       10671 winlin   15u     IPv4 0x17dcbc0eb6aca157       0t0      TCP localhost:http-alt->localhost:52870 (CLOSE_WAIT)

Open a new ST receiving thread to read from the HTTP connection. Since HTTP-FLV has no subsequent requests, the receiving thread will hit an error and exit when the client closes the connection. Normally, HTTP requests are handled this way:

SrsHttpConn::do_cycle
    parser->parse_message(&req)
    process_request(writer, req)

However, in process_request we need to open another thread to detect whether the FD has been closed:

process_request(writer, req)
    trd->start()
    while trd->error_code() == ERROR_SUCCESS 
        write FLV data.

In the thread trd, we need to call the function that directly reads the HTTP message:

SrsHttpConn::pop_message(&req)

Note: this API may only be used on connections that carry no further requests, i.e. purely for detecting FD closure. It would be a disaster if two threads read from the same FD.

One change: HTTP streaming (FLV/TS) actually uses SrsResponseOnlyHttpConn instead of SrsHttpConn; when reading requests, the entire body is discarded.
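
The receive thread's loop then looks roughly like this (a sketch modeled on the description above, reusing pop_message and ISrsHttpMessage from it; recv_cycle and conn are placeholder names):

// Hypothetical body of the ST receive thread: block reading HTTP
// messages and throw them away; it exists only to detect client close.
int recv_cycle()
{
    int ret = ERROR_SUCCESS;
    for (;;) {
        ISrsHttpMessage* req = NULL;
        if ((ret = conn->pop_message(&req)) != ERROR_SUCCESS) {
            return ret; // e.g. client RESET: the send loop sees error_code()
        }
        srs_freep(req); // response-only connection: discard any request
    }
}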

When the player is closed, the receiving thread detects that the socket has been reset, i.e. closed by the client, and interrupts the loop.

[2017-04-30 11:58:43.928][trace][16163][107] HTTP client ip=127.0.0.1
[2017-04-30 11:58:43.928][trace][16163][107] HTTP GET http://localhost:8080/live/livestream.flv, content-length=-1
[2017-04-30 11:58:43.963][trace][16163][107] http: mount flv stream for vhost=/live/livestream, mount=/live/livestream.flv
[2017-04-30 11:58:43.964][trace][16163][107] hstrs: source url=/live/livestream, is_edge=0, source_id=-1[-1]
[2017-04-30 11:58:43.964][trace][16163][107] dispatch cached gop success. count=0, duration=-1
[2017-04-30 11:58:43.964][trace][16163][107] create consumer, queue_size=30.00, jitter=1
[2017-04-30 11:58:46.988][warn][16163][107][54] client disconnect peer. ret=1004

No FD leakage is shown:

winlin:srs winlin$ lsof |grep 16163|grep CLOSE_WAIT
winlin:srs winlin$ netstat -an|grep 8080|grep CLOSE_WAIT

It actually took 52 minutes to solve this problem. For such a simple ST architecture, that makes it a fairly troublesome one...

Compilation failed under Ubuntu.
src/app/srs_app_recv_thread.cpp:557:9: error: ‘ISrsHttpMessage’ was not declared in this scope.

@chenliang2017 Please file another bug.

Fixed by f2b4bc7