Too many coroutines cause memory allocation failure: terminate called after throwing an instance of 'std::bad_alloc'
han4235 opened this issue · 16 comments
SRS exits automatically when pulling a stream:

```
srs: src/app/srs_app_edge.cpp:766: virtual int SrsPlayEdge::on_ingest_play(): Assertion `state == SrsEdgeStatePlay' failed.
```
I also encountered this on version 2.0a2.
Please provide the configuration, logs, version, and steps to reproduce. Thank you~
My configuration is very simple: the default origin.conf and edge.conf, with the origin in edge.conf changed to my own server's domain name. There is one origin and two edges, with about 5 publishing streams and 50 playing streams. Both origin and edge have crashed. The edge log is the same as above; the origin log is:

```
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
```

If needed, I can provide the core dump files.
Please send me the core file and the corresponding SRS binary; a file-sharing service works too. Is this CentOS?
CentOS 7, 64-bit. The related files are at http://pan.baidu.com/s/1pJGLnyN
Hmm, I'll find time to take a look.
```
[winlin@centos7 srs]$ ./objs/srs -v
2.0.195
[winlin@centos7 srs]$ ls -lh core.*
-rw-------. 1 winlin winlin 1.1G Oct 22 21:10 core.13964
-rw-------. 1 winlin winlin 2.1G Oct 22 22:16 core.31521
```

```
(gdb) f 2
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
warning: Source file is more recent than executable.
138         if ((ret = client->handshake()) != ERROR_SUCCESS) {
(gdb) p this[0]
$3 = {<ISrsReusableThread2Handler> = {_vptr.ISrsReusableThread2Handler = 0x898e50 <vtable for SrsEdgeIngester+16>}, stream_id = 1, _source = 0x2327650, _edge = 0x2193580, _req = 0x23982a0, pthread = 0x34d09c0, stfd = 0x16ee220, io = 0x3b63e20, kbps = 0x3b5e190, client = 0x34cd3a0, origin_index = 0}
```
This shows the SrsEdgeIngester object itself is not corrupted.
```
(gdb) f 0
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
1341        if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {
(gdb) p hs_bytes[0]
$7 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}
```
Clearly the object has already been freed, so using it again is bound to cause problems.
The backtrace:

```
(gdb) bt
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
#3  0x00000000004a355d in SrsReusableThread2::cycle (this=0x34d09c0) at src/app/srs_app_thread.cpp:533
#4  0x00000000004a2557 in internal::SrsThread::thread_cycle (this=0x1b5b710) at src/app/srs_app_thread.cpp:203
#5  0x00000000004a2769 in internal::SrsThread::thread_fun (arg=0x1b5b710) at src/app/srs_app_thread.cpp:244
#6  0x000000000051643e in _st_thread_main () at sched.c:327
#7  0x0000000000516bae in st_thread_create (start=0x12f5105, arg=0xfbad8001, joinable=32608, stk_size=974285335) at sched.c:591
#8  0x0000000000000000 in ?? ()
(gdb)
```
```
(gdb) p hs_bytes[0]
$4 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}
```

Explanation: c0c1 has been read, but s0s1s2 is still NULL even though read_s0s1s2 has already returned. Given the code below, this is an impossible execution path.
```cpp
// s0s1s2
if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {
    return ret;
}

// plain text required.
if (hs_bytes->s0s1s2[0] != 0x03) {
    ret = ERROR_RTMP_HANDSHAKE;
    srs_warn("handshake failed, plain text required. ret=%d", ret);
    return ret;
}
```

```cpp
int SrsHandshakeBytes::read_s0s1s2(ISrsProtocolReaderWriter* io)
{
    int ret = ERROR_SUCCESS;

    if (s0s1s2) {
        return ret;
    }

    ssize_t nsize;

    s0s1s2 = new char[3073];
    if ((ret = io->read_fully(s0s1s2, 3073, &nsize)) != ERROR_SUCCESS) {
        srs_warn("read s0s1s2 failed. ret=%d", ret);
        return ret;
    }
    srs_verbose("read s0s1s2 success.");

    return ret;
}
```
Explanation: when SrsHandshakeBytes::read_s0s1s2 returns, s0s1s2 is definitely non-NULL, because it is assigned before any of the subsequent return paths.
Observing hs_bytes again:

```
(gdb) p hs_bytes[0]
$5 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}
(gdb) x /12xb hs_bytes->c0c1
0x7f603a4727d8 <main_arena+120>: 0xc8 0x27 0x47 0x3a 0x60 0x7f 0x00 0x00
0x7f603a4727e0 <main_arena+128>: 0xc8 0x27 0x47 0x3a
```

Two things stand out: c0 should be 0x03 but is actually 0xc8, and the c0c1 pointer is 0x7f603a4727d8, which points into glibc's main_arena rather than being a normal heap allocation address, as happens when the memory has been freed back to the allocator. From these two observations, hs_bytes is a wild pointer.
Looking up one frame:

```
(gdb) f 1
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
1978        if ((ret = complex_hs.handshake_with_server(hs_bytes, io)) != ERROR_SUCCESS) {
(gdb) p hs_bytes[0]
$9 = {_vptr.SrsHandshakeBytes = 0x8917f0 <vtable for SrsHandshakeBytes+16>, c0c1 = 0x3ce7ab0 "\003V(\340D\200", s0s1s2 = 0x4080ec0 "\003V(\340B\001", c2 = 0x0}
```

Here hs_bytes looks different from what frame 0 saw, which points to a problem inside complex_hs.handshake_with_server itself. In frame 1, c0c1 is a proper heap pointer and the data starts with 0x03, with no corruption.
```
(gdb) p ((SrsStSocket*)io)[0]
$15 = {<ISrsProtocolReaderWriter> = {<ISrsProtocolReader> = {<ISrsBufferReader> = {
      _vptr.ISrsBufferReader = 0x895e20 <vtable for SrsStSocket+96>}, <ISrsProtocolStatistic> = {
      _vptr.ISrsProtocolStatistic = 0x895eb0 <vtable for SrsStSocket+240>}, <No data fields>}, <ISrsProtocolWriter> = {<ISrsBufferWriter> = {
      _vptr.ISrsBufferWriter = 0x895f18 <vtable for SrsStSocket+344>}, <No data fields>}, <No data fields>}, recv_timeout = 30000000,
  send_timeout = 30000000, recv_bytes = 3073, send_bytes = 1537, stfd = 0x16ee220}
```

The io counters show 3073 bytes received (s0+s1+s2 = 1+1536+1536) and 1537 bytes sent (c0+c1 = 1+1536), so the handshake I/O itself completed; the problem likely occurred while processing s0s1s2.
This may be a problem caused by allocating the object on the stack; it should be allocated on the heap instead.
https://stackoverflow.com/a/2504601

bad_alloc basically means an allocation could not be satisfied; judging from the size of the core, this is a long-running service. Quoting that answer:

> If you are running on a typical embedded processor running Linux without virtual memory it is quite likely your process will be terminated by the operating system before new fails if you allocate too much memory.
>
> If you are running your program on a machine with less physical memory than the maximum of virtual memory (2 GB on standard Windows) you will find that once you have allocated an amount of memory approximately equal to the available physical memory, further allocations will succeed but will cause paging to disk. This will bog your program down and you might not actually be able to get to the point of exhausting virtual memory. So you might not get an exception thrown.
>
> If you have more physical memory than the virtual memory, and you simply keep allocating memory, you will get an exception when you have exhausted virtual memory to the point where you can not allocate the block size you are requesting.
>
> If you have a long-running program that allocates and frees in many different block sizes, including small blocks, with a wide variety of lifetimes, the virtual memory may become fragmented to the point where new will be unable to find a large enough block to satisfy a request. Then new will throw an exception. If you happen to have a memory leak that leaks the occasional small block in a random location that will eventually fragment memory to the point where an arbitrarily small block allocation will fail, and an exception will be thrown.
>
> If you have a program error that accidentally passes a huge array size to new[], new will fail and throw an exception. This can happen for example if the array size is actually some sort of random byte pattern, perhaps derived from uninitialized memory or a corrupted communication stream.
This paper analyzes why bad_alloc is not always Out of Memory (OOM): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1404r1.html

I wrote an example:
```cpp
/*
ulimit -S -v 204800
g++ -g -O0 t.cpp -o t && ./t
*/
#include <stdio.h>
int main() {
    char* p1 = new char[193000 * 1024]; // huge allocation, succeeds
    char* p0 = new char[100 * 1024]; // small allocation, throws bad_alloc
    printf("OK\n");
}
```
Execution crashes:

```
[root@SRS tmp]# ulimit -S -v 204800
[root@SRS tmp]# g++ -g -O0 t.cpp -o t && ./t
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
[root@SRS tmp]# ll core.21082
-rw------- 1 root root 198045696 Oct 26 21:04 core.21082
```
Looking at the stack, it is not the huge allocation that fails, but the small one:
```
[root@SRS tmp]# gdb t -c core.21082
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffeae793000
Core was generated by `./t'.
Program terminated with signal 6, Aborted.
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 libstdc++-4.4.7-23.el6.x86_64
(gdb) bt
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
#1  0x00007fd17ff0fcd5 in abort () from /lib64/libc.so.6
#2  0x00007fd1807c8a8d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00007fd1807c6be6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007fd1807c6c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x00007fd1807c6d32 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007fd1807c712d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#7  0x00007fd1807c71e9 in operator new[](unsigned long) () from /usr/lib64/libstdc++.so.6
#8  0x0000000000400624 in main () at t.cpp:8
(gdb) f 8
#8  0x0000000000400624 in main () at t.cpp:8
8           char* p0 = new char[100 * 1024]; // small allocation
(gdb)
```
I added a gdb script that counts the coroutines in a core. Download gdb/srs.py first, then:
```
(gdb) source gdb/srs.py
(gdb) nn_coroutines
this coroutine(&_st_this_thread->tlink) is: 0x7f43ba761e78
next is 0x7f43b92d9e78, total 500
next is 0x7f43b5c37e78, total 1000
next is 0x7f43bfd71e78, total 31500
next is 0x7f43bdad9e78, total 32000
next is 0x7f43bd8f3e78, total 32500
total coroutines: 32717
```
By default, ST allocates each coroutine's stack with mmap, so once the number of mappings exceeds a certain limit, creation fails. You can check this limit with:
```
[root@05ff04a933cd st]# sysctl vm.max_map_count
vm.max_map_count = 65530
```
Note: this limit was not hit in my Docker environment, where I could create up to 650162 coroutines using about 40GB of memory; on production machines the limit is generally in effect.
Then compile huge-threads.cpp and run it:

```
g++ huge-threads.cpp ../../objs/st/libst.a -g -O0 -o huge-threads &&
./huge-threads 60000
```
It usually fails at around 32,000 coroutines here (each stack plus its guard pages consumes roughly two map entries, so about 65530/2):

```
[root@05ff04a933cd st]# ./huge-threads 60000
pid=77682, create 60000 coroutines
create thread fail, i=32749
```
There are two directions to fix this:

- Investigate why there are so many coroutines, i.e. why Sources are not being cleaned up.
- Enable MALLOC_STACK at compile time, so that ST allocates coroutine stacks with malloc instead of mmap.