ossrs/srs

Incidental: Retrieving client through the http interface/api/v1/clients will cause the SRS service to restart.

yangxiangzhi opened this issue · 6 comments

Accessing the client through the HTTP interface /api/v1/clients/?start=0&count=1000 may cause the SRS service to restart intermittently.

Please describe the issue you encountered.

1. SRS version: SRS/3.0.137(OuXuli)
1. SRS log:

[2021-04-25 06:17:00.391][Trace][1][19550] HTTP API GET http://docker-srs:1985/api/v1/clients?start=0&count=1000, content-length=-1, chunked=0/0
233621 [2021-04-25 06:17:00.391][Trace][1][19550] client finished.
233622 [2021-04-25 06:17:00.393][Trace][1][19551] API server client, ip=172.21.0.3
233623 [2021-04-25 06:17:00.393][Trace][1][19551] HTTP API GET http://docker-srs:1985/api/v1/clients/?start=0&count=1000, content-length=-1, chunked=0/0
233624 [2021-04-25 06:17:02.941][Trace][1][0] SRS/3.0.137(OuXuli), The MIT License (MIT)
233625 [2021-04-25 06:17:02.941][Trace][1][0] contributors: winlin<winlin@vip.126.com> wenjie.zhao<740936897@qq.com> xiangcheng.liu<liuxc0116@foxmail.com> naijia.liu<youngco       w@youngcow.net> alcoholyi<alcoholyi@qq.com> byteman<wangchen2011@gmail.com> chad.wang<chad.wang.cn@gmail.com> suhetao<suhetao@gmail.com> Johnny<fengjihu@163.com> kart       hikeyan<keyanmca@gmail.com> StevenLiu<lq@chinaffmpeg.org> zhengfl<zhengfl_1989@126.com> tufang14<breadbean1449@gmail.com> allspace<allspace@gmail.com> niesongsong<nie       950@gmail.com> rudeb0t<nimrod@themanxgroup.tw> CallMeNP<np.liamg@gmail.com> synote<synote@qq.com> lovecat<littlefawn@163.com> panda1986<542638787@qq.com> YueHonghui<h       ongf.yue@hotmail.com> ThomasDreibholz<dreibh@simula.no> JuntaoLiu<juntliu@gmail.com> RocFang<fangpeng1986@gmail.com> MakarovYaroslav<yaroslav.makarov.97@mail.ru> Mirk       oVelic<mvelic@inoxx.net> HuiZhang(huzhang2)<huzhang2@cisco.com> OtterWa<simpleotter23@gmail.com> walkermi<172192667@qq.com> haofz<fuzhuang.hao@vhall.com> ME_Kun_Han<h       anvskun@hotmail.com> ljx0305<ljx0305@gmail.com> cenxinwei<censhanhe@163.com> StarBrilliant<m13253@hotmail.com> xubin<xubin@chnvideo.com> intliang<yintiliang@gmail.com       > flowerwrong<sysuyangkang@gmail.com> YLX<568414379@qq.com> J<guotaojiang@qq.com> Harlan<hailiang@gvrcraft.com> hankun<hankun@bravovcloud.com> JonathanBarratt<jonatha       n.barratt@gmail.com> KeeganH<keeganwharris@gmail.com> StevenLiu<lingjiujianke@gmail.com> liuxc0116<liuxc0116@gmail.com> ChengdongZhang<lmajzcd@sina.com> lovacat<lovec       at@china.sina.com> qiang.li<qiang.li@verycdn.com.cn> HungMingWu<u9089000@gmail.com> Himer<xishizhaohua@qq.com> xialixin<xlx0625@163.com> alphonsetai<tyh_123@163.com>        Michael.Ma<wnpllr@gmail.com> lam2003<linmin3@yy.com>
233626 [2021-04-25 06:17:02.941][Trace][1][0] cwd=/usr/local/srs, work_dir=./, build: 2020-04-07 02:35:32, configure: --x86-x64 , uname: Linux 3ebf9b107f9a 4.9.87-linuxkit-a       ufs #1 SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
233627 [2021-04-25 06:17:02.941][Trace][1][0] configure detail: --prefix=/usr/local/srs --with-hls --with-hds --with-dvr --with-ssl --with-transcode --with-ingest --with-sta       t --with-http-callback --with-http-server --with-stream-caster --with-http-api --with-librtmp --without-research --without-utest --without-gperf --without-gmc --witho       ut-gmd --without-gmp --without-gcp --without-gprof --log-trace --cc=gcc --cxx=g++ --ar=ar --ld=ld --randlib=randlib
233628 [2021-04-25 06:17:02.941][Trace][1][0] srs checking config...

1. SRS configuration:

# main config for srs.
# @see full.conf for detail config.

listen 1935;
max_connections 1000;
ff_log_dir ./objs;
srs_log_tank file;
srs_log_level trace;
srs_log_file ./objs/srs.log
daemon off;
http_api {
    enabled on;
    listen 1985;
    crossdomain on;
}
http_server {
    enabled on;
    listen 8080;
    dir ./objs/nginx/html;
}
stats {
    network 0;
    disk sda sdb xvda xvdb;
}
vhost __defaultVhost__ {
    #tcp_nodelay on;
**# Minimum delay open, it is open by default. When this option is enabled, MR is disabled by default.**
    #min_latency on;
**# Merged-Write, SRS always uses Merged-Write, which means sending packets to the client once every N milliseconds. This algorithm can improve the efficiency of RTMP downstream by about 5 times, with a range of [350-1800].**
    #mw_latency on;

    #play{
**# Cache a group of video frames**
        #gop_cache off;
**# Configure the length of the live queue. The server will place data in the live queue, and if it exceeds this length, it will be cleared to the last I-frame.**
        #queue_length 10;
**# Merged-Write, SRS always uses Merged-Write, which means sending packets to the client once every N milliseconds. This algorithm can improve the efficiency of RTMP downstream by about 5 times, within the range of [350-1800].**
        #mw_latency 100;
    #}

    #publish {
        #mr off;
    #}

    http_hooks {
        enabled on;
        on_publish http://docker-nginx/check/publish;
        on_play http://docker-nginx/check/publish;
        on_stop http://docker-nginx/check/publish;
    }
    hls {
        enabled on;
        hls_path ./objs/nginx/html;
**# Clean up expired m3u8 and ts files**
        hls_dispose 1;
        hls_fragment 2;
        hls_window 5;
    }
    http_remux {
        enabled on;
        mount [vhost]/[app]/[stream].flv;
        hstrs on;
    }
}

Replay

How to replay bug?

Steps to reproduce the bug

Bug Reproduction Steps:

  1. After starting the SRS service, frequent requests to the /api/v1/clients?start=0&count=1000 interface will cause the service to restart.
    • However, the bug does not occur every time. I have tested it locally and it has not been reproduced consistently.
    • On the production server, the issue occurs approximately 15 times per day, causing the service to restart when retrieving client information.

Please note that the translation provided is a direct translation and may require further clarification or adjustment based on the context.

TRANS_BY_GPT3

Verified, this problem does exist.
Scenario: RTMP push streaming, FLV pull streaming.
Reason: If FLV pull streaming occurs before RTMP push streaming, it will trigger an update source id, causing the SrsRequest pointer to be reallocated without notifying the statistic. When calling the /api/v1/clients/ interface, it will result in an illegal pointer access.

Reproduction method: Modify the srs_app_http_stream.cpp file.

srs_error_t SrsLiveStream::update(SrsSource* s, SrsRequest* r)
{
    source = s;
    
    srs_freep(req);
    SrsRequest* test = new SrsRequest();  // Add this line of garbage code, it will trigger 100%
    req = r->copy()->as_http();
    
    return srs_success;
}

The problem is very subtle. Due to the optimization of the operating system, the address of 'req' does not change after each 'realloc', making it difficult to reproduce. Only when that block of memory is coincidentally occupied and modified by another program, will it cause a crash.

TRANS_BY_GPT3

The bug has been fixed, please refer to 497c2d5

TRANS_BY_GPT3

6689d88 Fix the bug from the previous commit

TRANS_BY_GPT3

Thank you @duiniuluantanqin for analyzing and identifying the root cause, and providing a PR. Let's summarize this issue.

The essence of this issue is that stat directly references req.

srs_error_t SrsStatistic::on_client(std::string id, SrsRequest* req, SrsConnection* conn, SrsRtmpConnType type)
{
    client->req = req;

void SrsStatistic::on_disconnect(std::string id)
{
    clients.erase(it);

// RTMP references Conn's req.
srs_error_t SrsRtmpConn::stream_service_cycle()
{
    SrsRequest* req = info->req;
    if ((err = stat->on_client(srs_int2str(_srs_context->get_id()), req, this, info->type)) != srs_success) {

// FLV references SrsLiveStream.
srs_error_t SrsLiveStream::do_serve_http(ISrsHttpResponseWriter* w, ISrsHttpMessage* r)
{
    if ((err = stat->on_client(srs_int2str(_srs_context->get_id()), req, hc, SrsRtmpConnPlay)) != srs_success) {

srs_error_t SrsLiveStream::update(SrsSource* s, SrsRequest* r)
{
    srs_freep(req);
    req = r->copy()->as_http();

This issue does not exist in RTMP because SrsStatistic references SrsRtmpConn's req, and the latter has a longer lifespan and is always available.

This issue exists in HTTP-FLV and causes a crash. The reproduction steps are provided in the previous comment. The problem arises because SrsStatistic references SrsLiveStream's req, and the lifespans of these two objects are uncertain, resulting in SrsStatistic referencing a dangling pointer.

Solution: Copy req in SrsStatistic. Since SrsStatistic and conn are corresponding, their req objects are also corresponding. Therefore, stat can be considered as one of conn's members and can be copied. Refer to 71dda68 for more details.

Once again, thank you @duiniuluantanqin for spending a lot of time analyzing and locating this issue. This kind of wild pointer problem is very subtle, and the most difficult part is how to locate it. Once located, it is easy to fix.

TRANS_BY_GPT3

Thank you two experts, the problem has been resolved.

TRANS_BY_GPT3