ossrs/srs

Too many coroutines caused memory allocation failure, terminate called after throwing an instance of 'std::bad_alloc'.

han4235 opened this issue · 16 comments

SRS exits on its own while a stream is being pulled.
srs: src/app/srs_app_edge.cpp:766: virtual int SrsPlayEdge::on_ingest_play(): Assertion `state == SrsEdgeStatePlay' failed.

jarod commented

I also ran into this on version 2.0a2.

Please provide the configuration, logs, version, and steps to reproduce. Thank you~

jarod commented

My configuration is very simple: the default origin.conf and edge.conf, with the "origin" in edge.conf changed to my own server's domain name. There is one origin and two edges, with about 5 streams being pushed and 50 being pulled. Both the origin and the edges have crashed. The edge log is the same as the one above, and the origin log is as follows:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

If needed, I can provide the core dump files.

Please send me the core and its corresponding SRS. You can also put it on a file sharing platform. Is it CentOS?

jarod commented

centos 7 64bit, related files can be found at http://pan.baidu.com/s/1pJGLnyN

Hmm, I'll find time to take a look.

[winlin@centos7 srs]$ ./objs/srs -v
2.0.195
[winlin@centos7 srs]$ ls -lh core.*
-rw-------. 1 winlin winlin 1.1G Oct 22 21:10 core.13964
-rw-------. 1 winlin winlin 2.1G Oct 22 22:16 core.31521


(gdb) f 2
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
warning: Source file is more recent than executable.
138     if ((ret = client->handshake()) != ERROR_SUCCESS) {
(gdb) p this[0]
$3 = {<ISrsReusableThread2Handler> = {_vptr.ISrsReusableThread2Handler = 0x898e50 <vtable for SrsEdgeIngester+16>}, stream_id = 1, _source = 
    0x2327650, _edge = 0x2193580, _req = 0x23982a0, pthread = 0x34d09c0, stfd = 0x16ee220, io = 0x3b63e20, kbps = 0x3b5e190, client = 0x34cd3a0, 
  origin_index = 0}

The edge object is clearly not corrupted.

(gdb) f 0
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
1341        if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {


(gdb) p hs_bytes[0]
$7 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}

It is evident that the object has already been released, so using it again will definitely cause problems.

(gdb) bt
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
#3  0x00000000004a355d in SrsReusableThread2::cycle (this=0x34d09c0) at src/app/srs_app_thread.cpp:533
#4  0x00000000004a2557 in internal::SrsThread::thread_cycle (this=0x1b5b710) at src/app/srs_app_thread.cpp:203
#5  0x00000000004a2769 in internal::SrsThread::thread_fun (arg=0x1b5b710) at src/app/srs_app_thread.cpp:244
#6  0x000000000051643e in _st_thread_main () at sched.c:327
#7  0x0000000000516bae in st_thread_create (start=0x12f5105, arg=0xfbad8001, joinable=32608, stk_size=974285335) at sched.c:591
#8  0x0000000000000000 in ?? ()
(gdb) 

This is the stack from the core.

(gdb) p hs_bytes[0]
$4 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}

Explanation: c0c1 is set, but s0s1s2 is still NULL, as if S0S1S2 had not been received yet. Given the code below, that should be an impossible execution path.





    // s0s1s2
    if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {
        return ret;
    }


    // plain text required.
    if (hs_bytes->s0s1s2[0] != 0x03) {
        ret = ERROR_RTMP_HANDSHAKE;
        srs_warn("handshake failed, plain text required. ret=%d", ret);
        return ret;
    }


int SrsHandshakeBytes::read_s0s1s2(ISrsProtocolReaderWriter* io)
{
    int ret = ERROR_SUCCESS;


    if (s0s1s2) {
        return ret;
    }


    ssize_t nsize;


    s0s1s2 = new char[3073];
    if ((ret = io->read_fully(s0s1s2, 3073, &nsize)) != ERROR_SUCCESS) {
        srs_warn("read s0s1s2 failed. ret=%d", ret);
        return ret;
    }
    srs_verbose("read s0s1s2 success.");


    return ret;
}

Explanation: When SrsHandshakeBytes::read_s0s1s2 returns, s0s1s2 is definitely non-NULL.
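
The reasoning is that `new char[3073]` either returns a valid pointer or throws, and the pointer is stored before the read is attempted. A minimal self-contained model of that invariant is sketched below; `FakeReader` and `HandshakeBytesModel` are hypothetical stand-ins written for illustration, not SRS classes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the io object: fills the buffer or reports failure.
struct FakeReader {
    bool fail;
    int read_fully(char* buf, std::size_t size) {
        if (fail) {
            return -1;               // simulated read error
        }
        std::memset(buf, 0x03, size);
        return 0;
    }
};

// Hypothetical model of the lazy allocation done in read_s0s1s2.
struct HandshakeBytesModel {
    char* s0s1s2 = nullptr;
    ~HandshakeBytesModel() { delete[] s0s1s2; }

    int read_s0s1s2(FakeReader& io) {
        if (s0s1s2) {
            return 0;                // already read
        }
        s0s1s2 = new char[3073];     // throws on failure, never yields NULL
        return io.read_fully(s0s1s2, 3073);
    }
};
```

Whether the read succeeds or fails, `s0s1s2` is non-NULL once the function returns, which is why the NULL value seen in the core cannot come from a live object.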

Observing hs_bytes again:


(gdb) p hs_bytes[0]
$5 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}
(gdb) x /12xb hs_bytes->c0c1
0x7f603a4727d8 <main_arena+120>:    0xc8    0x27    0x47    0x3a    0x60    0x7f    0x00    0x00
0x7f603a4727e0 <main_arena+128>:    0xc8    0x27    0x47    0x3a

Here c0 should be 0x03, but it is actually 0xc8.
The c0c1 pointer is 0x7f603a4727d8, which gdb resolves to main_arena+120, inside glibc's malloc arena, whereas it should point to a heap buffer allocated for c0c1.
From these two observations, hs_bytes is a wild pointer.
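
The first observation (c0 must be 0x03) can be written down as a small check. This is only a sketch: `looks_like_c0c1` is a hypothetical helper, not an SRS function, and it can only inspect the bytes, not tell whether the pointer itself is valid. The 0x03 version byte and the 1537-byte size come from this thread.

```cpp
#include <cstddef>

// Hypothetical helper, not part of SRS: returns true when buf could hold a
// plain-text c0c1 packet (1537 bytes whose first byte, c0, is 0x03).
bool looks_like_c0c1(const unsigned char* buf, std::size_t size) {
    const std::size_t kC0C1Size = 1537;       // c0 (1 byte) + c1 (1536 bytes)
    const unsigned char kPlainVersion = 0x03; // RTMP plain-text version byte
    if (buf == nullptr || size < kC0C1Size) {
        return false;
    }
    return buf[0] == kPlainVersion;
}
```

Fed the bytes dumped above (0xc8 0x27 0x47 ...), the check fails, while the buffer seen in frame 1 (starting with 0x03) passes.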


Looking at the stack:

(gdb) f 1
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
1978        if ((ret = complex_hs.handshake_with_server(hs_bytes, io)) != ERROR_SUCCESS) {
(gdb) p hs_bytes[0]
$9 = {_vptr.SrsHandshakeBytes = 0x8917f0 <vtable for SrsHandshakeBytes+16>, c0c1 = 0x3ce7ab0 "\003V(\340D\200", 
  s0s1s2 = 0x4080ec0 "\003V(\340B\001", c2 = 0x0}

Here the observed hs_bytes differs from what frame 0 showed, which indicates a problem inside complex_hs.handshake_with_server. In frame 1, c0c1 is a heap pointer, and the data starts with 03 without any corruption.

(gdb) p ((SrsStSocket*)io)[0]
$15 = {<ISrsProtocolReaderWriter> = {<ISrsProtocolReader> = {<ISrsBufferReader> = {
        _vptr.ISrsBufferReader = 0x895e20 <vtable for SrsStSocket+96>}, <ISrsProtocolStatistic> = {
        _vptr.ISrsProtocolStatistic = 0x895eb0 <vtable for SrsStSocket+240>}, <No data fields>}, <ISrsProtocolWriter> = {<ISrsBufferWriter> = {
        _vptr.ISrsBufferWriter = 0x895f18 <vtable for SrsStSocket+344>}, <No data fields>}, <No data fields>}, recv_timeout = 30000000, 
  send_timeout = 30000000, recv_bytes = 3073, send_bytes = 1537, stfd = 0x16ee220}

From the data of io, it can be seen that 3073 bytes (s0s1s2) were received and 1537 bytes (c0c1) were sent. There may have been a problem while processing s0s1s2.

This may be a problem caused by allocating objects on the stack. Change it to allocate on the heap.

https://stackoverflow.com/a/2504601
bad_alloc basically means that an allocation failed. Judging from the size of the core, this was a long-running service.

If you are running on a typical embedded processor running Linux without virtual memory it is quite likely 
your process will be terminated by the operating system before new fails if you allocate too much memory.

If you are running your program on a machine with less physical memory than the maximum of virtual 
memory (2 GB on standard Windows) you will find that once you have allocated an amount of memory 
approximately equal to the available physical memory, further allocations will succeed but will cause 
paging to disk. This will bog your program down and you might not actually be able to get to the point 
of exhausting virtual memory. So you might not get an exception thrown.

If you have more physical memory than the virtual memory, and you simply keep allocating memory, 
you will get an exception when you have exhausted virtual memory to the point where you can not 
allocate the block size you are requesting.

If you have a long-running program that allocates and frees in many different block sizes, including 
small blocks, with a wide variety of lifetimes, the virtual memory may become fragmented to the point 
where new will be unable to find a large enough block to satisfy a request. Then new will throw an 
exception. If you happen to have a memory leak that leaks the occasional small block in a random 
location that will eventually fragment memory to the point where an arbitrarily small block allocation 
will fail, and an exception will be thrown.

If you have a program error that accidentally passes a huge array size to new[], new will fail and throw 
an exception. This can happen for example if the array size is actually some sort of random byte pattern, 
perhaps derived from uninitialized memory or a corrupted communication stream.
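
The last case is easy to demonstrate. The sketch below uses a hypothetical helper, `can_allocate`, that calls the global `operator new[]` directly and reports whether it threw; it is an illustration of the quoted behaviour, not code from SRS.

```cpp
#include <cstddef>
#include <new>

// Hypothetical helper: returns true if n bytes could be allocated,
// false if the allocator threw std::bad_alloc.
bool can_allocate(std::size_t n) {
    try {
        void* p = ::operator new[](n);   // same allocator that new char[n] uses
        ::operator delete[](p);
        return true;
    } catch (const std::bad_alloc&) {
        return false;
    }
}
```

A size such as `static_cast<std::size_t>(-1) / 2`, which is what a corrupted length field might look like, makes the helper return false on a system with plenty of free memory, while a 1 KB request succeeds.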

This paper argues that bad_alloc does not always mean Out of Memory (OOM): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1404r1.html

I wrote an example, as follows:

/*
ulimit -S -v 204800
g++ -g -O0 t.cpp -o t && ./t
*/
#include <stdio.h>
int main(){
    char* p1 = new char[193000 * 1024]; // huge allocation
    char* p0 = new char[100 * 1024]; // small allocation
    printf("OK\n");
}

Running it crashes:

[root@SRS tmp]# ulimit -S -v 204800
[root@SRS tmp]# g++ -g -O0 t.cpp -o t && ./t
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

[root@SRS tmp]# ll core.21082 
-rw------- 1 root root 198045696 Oct 26 21:04 core.21082

The stack shows that the failure is not in the huge allocation but in the small one.

[root@SRS tmp]# gdb t -c core.21082 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffeae793000
Core was generated by `./t'.
Program terminated with signal 6, Aborted.
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 libstdc++-4.4.7-23.el6.x86_64
(gdb) bt
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
#1  0x00007fd17ff0fcd5 in abort () from /lib64/libc.so.6
#2  0x00007fd1807c8a8d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00007fd1807c6be6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007fd1807c6c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x00007fd1807c6d32 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007fd1807c712d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#7  0x00007fd1807c71e9 in operator new[](unsigned long) () from /usr/lib64/libstdc++.so.6
#8  0x0000000000400624 in main () at t.cpp:8
(gdb) f 8
#8  0x0000000000400624 in main () at t.cpp:8
8	    char* p0 = new char[100 * 1024]; // small allocation
(gdb) 

I added a gdb script that counts the coroutines in the core. Download the script srs.py first:

(gdb) source gdb/srs.py 
(gdb) nn_coroutines 
this coroutine(&_st_this_thread->tlink) is: 0x7f43ba761e78
next is 0x7f43b92d9e78, total 500
next is 0x7f43b5c37e78, total 1000
next is 0x7f43bfd71e78, total 31500
next is 0x7f43bdad9e78, total 32000
next is 0x7f43bd8f3e78, total 32500
total coroutines: 32717
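
Judging from the output, the script starts at the current coroutine's tlink and follows the next pointers until it is back at the start. A minimal sketch of that traversal is shown below. `Link` and `count_ring` are hypothetical names used for illustration (ST's `_st_clist_t` has the same next/prev shape), and whether srs.py also counts a list head node is not something this sketch tries to reproduce.

```cpp
#include <cstddef>

// Stand-in for a circular doubly linked list node.
struct Link {
    Link* next;
    Link* prev;
};

// Counts the nodes on the ring that contains start, start included.
std::size_t count_ring(const Link* start) {
    if (start == nullptr) {
        return 0;
    }
    std::size_t total = 1;
    for (const Link* p = start->next; p != start; p = p->next) {
        ++total;
    }
    return total;
}
```

The result does not depend on which node the walk starts from, which is why starting at the current coroutine is enough.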

By default, ST uses mmap to allocate the stack space for coroutines. Therefore, if the number exceeds a certain limit, it will fail. You can check this limit using the following:

[root@05ff04a933cd st]# sysctl vm.max_map_count
vm.max_map_count = 65530

Note: This limit does not apply in Docker, and you can open up to 650162 coroutines with a memory usage of around 40GB. Generally, this limit is enabled on production machines.

Then compile huge-threads.cpp and run it:

g++ huge-threads.cpp ../../objs/st/libst.a -g -O0 -o huge-threads && 
./huge-threads 60000

It usually fails at around 30,000 coroutines:

[root@05ff04a933cd st]# ./huge-threads 60000
pid=77682, create 60000 coroutines
create thread fail, i=32749

There are two ways to address this.

  1. Investigate why there are so many coroutines when the Source is not cleaned up.
  2. Enable MALLOC_STACK at compile time, so that ST does not use mmap for coroutine stacks.
