ossrs/srs

Support Multiple-CPUs(or Threads) to improve concurrency.

winlinvip opened this issue · 11 comments

Remark

For now, let's put the multi-threading preparation on hold. Although ST already supports multi-threading and the RTC multi-threading part has been almost completed, there are several factors that make me think we should reconsider whether multi-threading is necessary at this stage.

First of all, the multi-threading branch has been deleted from the SRS repository, but it is still preserved in my repository feature/threads, which mainly includes the following commits:

The main reasons for reconsidering multi-threading support are:

  • RTC cascading can extend SRS's capacity, providing expansion that is as simple and reliable as the Edge/Origin mechanism is for live streaming.
  • Single-process concurrency is currently between 1000 and 1200 concurrent connections. After further optimizing QoS, it is estimated that it can maintain around 1000 concurrent connections, which is sufficient for open-source use.
  • Only ST and RTC have relatively complete multi-threading support; live streaming and the API are not there yet. Finishing them would take considerable effort, which our time-limited community cannot easily afford.
  • The main goal of multi-threading is to expand the single-machine concurrency to the tens of thousands level, which is generally used in large-scale commercial systems. Generally, open-source SFUs do not have this scenario (besides, SRS can use cascading for expansion).

However, simplifying ST and improving its performance can still be considered for merging, including:

Summary

SRS's support for multi-threading is a significant architectural upgrade, essentially aimed at addressing performance issues.

Regarding performance issues, the following points can be expanded:

  • In the live streaming playback scenario, a single-process single-thread can run up to 3000 concurrent connections or higher, as live streaming has no encryption and can directly distribute audio and video data in a one-to-many scenario. Moreover, it can be horizontally scaled through Edge.
  • In the live streaming push scenario, it is impossible to achieve complete horizontal scaling, and the source station cluster scale cannot be very large, as referenced in #464. If it is a high-bitrate stream, a single-process single-thread can hardly achieve more than 1000 connections.
  • In the RTC scenario, in addition to encryption and QoS, the performance of UDP sending and receiving will also be lower, and it is roughly estimated that it is difficult to reach 500 concurrent connections. This makes SRS no longer a purely IO-intensive server, but an IO and CPU-intensive server.

Why is this issue important?

  • In the RTC scenario, if a single process can only reach 100 or 300 concurrent connections, then 10 times the number of cores is needed. The amount of traffic between these servers is also ten times higher. At this point, the concurrency capability is entirely insufficient.
  • In the live streaming scenario, the current multi-process, Origin, and Edge cluster expansion capabilities can be supported by multi-threading, allowing a single machine to achieve high concurrency, reducing the number of machines to manage, and lowering system complexity.
  • In the monitoring (surveillance) scenario, it is no longer one stream consumed by many viewers, but many streams that all require encryption. If a single process cannot support enough connections, a very large number of processes becomes necessary.

Therefore, the multi-threading architecture can be considered a revolution after the multi-coroutine architecture, but this time it is a self-revolution.

Arch

The previous SRS single-threaded architecture (SRS/1/2/3/4):

image

  • Single-process single-threaded architecture, live streaming supports Origin-Edge cluster, and if deployed on a single machine, it is a multi-process single-threaded architecture.
  • RTC is computationally and IO-intensive, so multi-core capabilities are essential, and the main problem is the issue described in this post.
  • Despite this, SRS4 has also made many single-core performance optimizations, as referenced in b431ad7, c5d2027, 14bfc98, and 36ea673

The ultimate goal architecture is horizontally scalable Hybrid threads, also known as low-lock multi-threaded structure (SRS/v5.0.3):

image

  • The SRTP thread is merged into the Hybrid thread, because having both threads call OpenSSL simultaneously poses a crash risk.
  • UDP RECV and SEND threads are optional, disabled by default, and Hybrid threads are responsible for sending and receiving data.
  • ST is changed to a thread-local structure (ST#19), with each thread having its isolated ST. It is crucial to ensure that the ST structure of one thread does not pass to another thread, i.e., not reading or writing other threads' FDs.
  • Hybrid threads default to 1, which is almost identical to SRS4's completely single-threaded structure, maintaining structural simplicity. If high performance is needed, multiple threads can be enabled, and the architecture remains essentially unchanged (from one thread to multiple independent threads).
  • Hybrid threads support horizontal scaling using a multi-port approach: RTC returns different ports through the SDP, while RTMP and HTTP require a 302 redirect; see the documentation for the specific implementation.
  • When Hybrid scales horizontally, connections are still locked to a single thread by stream, which allows publishing to scale. It can scale horizontally to hundreds or even thousands of cores; as the number of cores grows, inter-thread communication grows as well, so it is not entirely cost-free, but the cost is relatively small.
  • When Hybrid scales horizontally, in general, multiple push streams and multiple play streams can be well supported, such as 1000 push streams with 10 play streams each, totaling 10,000 streams.

The disadvantages of this architecture:

  • Single-stream playback expansion issue: When Hybrid scales horizontally, since each stream is locked to a single thread, the number of play streams for a single stream is limited by the number of connections supported by a single thread. Solution: In the future, cascading will be used to solve the downstream expansion issue, such as supporting 1000 connections for a single stream on a single machine (regardless of the number of cores), and cascading 1000 servers to support 1 million play streams.
  • Global variable and static variable cleanup issue: Although multi-threading is isolated, they still have a chance to make mistakes through global and static variables. Therefore, all global variables must be checked and modified, either changed to thread-local or thread-safe. This brings risks to stability and is often difficult to troubleshoot when problems arise. Solution: By default, only 1 thread is enabled, allowing for a long enough transition and improvement period.
  • Library thread-safety issue: For example, OpenSSL has multi-threading issues. OpenSSL 1.1 claims to be thread-safe, but that requires modifying the build script so that the -no-threads option is not used; forgetting to change this option can cause problems. Solution: By default, only 1 SRTP thread is enabled, allowing for a long enough transition and improvement period.

Single-stream playback expansion issue: If multi-threading must be modified so that a single multi-core machine can scale playback horizontally, the publishing thread can broadcast to the playing threads. Such a change is acceptable, but the open-source project does not pursue performance above all else; it must keep performance optimizations within a very simple architecture. Personally, I think supporting 1000 play streams per machine and using cascading to scale out is reasonable, so this optimization will not be used in the future.

Note: Image source is here.

Communication Mechanism

There are two ways for threads to communicate: the first is a lock-protected chan, and the second is passing an fd. The second can be built on top of the first.

Both methods should avoid passing audio and video data. Of course, it can be done, but it is not efficient. For example, you can start a transcoding thread and communicate with it via chan, since it does not require much concurrency.

SRS will have multiple ST threads, which communicate through chan, but they do not pass audio and video data, only some coordination messages.

Currently, SRS's thread communication is implemented with a pipe to avoid locks. When using it, be aware that it is a low-throughput mechanism and should not pass audio and video packets directly. It is mainly used for communication between the Master (API) thread and the Hybrid (service) thread, for example when Hybrid returns the SDP to the API.
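
For illustration, here is a minimal sketch of a pipe-based control channel between two threads. It is not the SRS implementation; std::thread and the ControlMsg layout are assumptions, and it only shows the idea of passing a small coordination message (never media packets) through a pipe fd:

// Minimal sketch of a pipe-based control channel (not the SRS code).
// The Hybrid thread writes a small coordination message; the Master
// thread reads it. Media packets are never sent through this channel.
#include <unistd.h>
#include <cstdio>
#include <thread>

struct ControlMsg { int type; int value; }; // hypothetical message layout

int main() {
    int fds[2];
    if (pipe(fds) != 0) { perror("pipe"); return 1; }

    std::thread hybrid([&]() {
        ControlMsg msg = {1, 42}; // e.g. a "SDP ready" notification
        ssize_t nn = ::write(fds[1], &msg, sizeof(msg));
        (void)nn;
    });

    ControlMsg msg;
    if (::read(fds[0], &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
        printf("master got message: type=%d value=%d\n", msg.type, msg.value);
    }

    hybrid.join();
    close(fds[0]);
    close(fds[1]);
    return 0;
}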

Thread Types

Each thread will have its ST, and ST is thread-local, i.e., independent and isolated ST.

Remark: The most critical risk and change is to avoid passing FD (or other ST resources) created by one thread to another thread for processing, which will definitely cause problems.

In the end, there will be several types of threads:

  1. Main thread. It mainly manages configuration, manages threads, passes messages, and listens to APIs.
  2. Log thread. Read logs from the queue and write them to disk. To avoid disk IO blocking ST, recording and HLS writing to disk can also be placed in this thread.
  3. One or more Hybrid threads. Listen to network FDs, create epoll, start ST, and serve as the main audio and video business thread. Threads do not communicate with each other. Live streaming uses REUSE PORT, and RTC uses multi-port isolation.
  4. Optional, SRT thread. As it is now: an independent SRT thread that pushes RTMP to the ST thread via a local socket.
  5. Optional, if implementing the SRT protocol yourself, the ST thread can handle SRT clients.
  6. Optional, transcoding or AI thread, pulling audio and video data from the ST thread via local socket, or passing data through Chan, implementing capabilities such as mixing and streaming.

Milestones

4.0 will not enable multi-threading, maintaining single-threaded capabilities.

5.0 will implement most of the multi-threading capabilities, including improving ST's thread-local capabilities. However, Hybrid will only default to 1 thread, and although the process has multiple threads, the overall difference from the previous single-thread is not significant.

6.0 will enable as many threads as there are CPU cores by default, completing the entire multi-threaded architecture transformation.

Differences from Go

Go's multi-threading overhead is too high, and its performance is not sufficient, as it is designed for general services.

With multiple cores, for example 16 cores, Go spends roughly 5 of them on switching, because there are locks and data copies between threads even though chan is used.

In addition, Go is genuinely multi-threaded, requiring constant attention to contention and thread switching, while SRS effectively remains single-threaded. Go is more complicated to use, while SRS keeps the simplicity of single-threading.

SRS is a multi-threaded and coroutine-based architecture optimized for business, essentially still single-threaded, with threads being essentially unrelated.

Relationship with Source

A single ST thread will have multiple sources.

A source, which is a push stream and its corresponding consumer (playback), is only in one ST thread.

In this way, both push and play are completed in a single ST thread, without the need for locks or switching.

Since the client's URL is unknown when connecting, it is also unknown which stream it belongs to, so it may be accepted by the wrong ST thread, requiring FD migration.

Migrating FD between multiple threads is relatively simple. The difficulty lies in ST, which needs to support multi-threading and consider rebuilding the FD in the new ST thread's epoll when migrating FD. However, this is not particularly difficult, and it is much easier than multi-process.
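
For illustration, a minimal sketch of what moving an fd between two threads' epolls could look like (an assumed approach, not the SRS code): the source thread deregisters the fd, the plain integer is handed over through the inter-thread channel, and the target thread registers it in its own epoll:

#include <sys/epoll.h>

// Source thread: remove the fd from its epoll before handing it over.
// (Passing nullptr for the event is fine on Linux >= 2.6.9.)
int detach_fd(int epfd, int fd) {
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr);
}

// Target thread: register the received fd in its own epoll.
int attach_fd(int epfd, int fd) {
    epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

The harder part, as noted above, is doing the equivalent bookkeeping inside ST itself.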

Why not Multi-process

FD migration between multi-processes is too difficult to implement, and communication between processes is not as easy as communication between threads, nor is it as efficient as threads.

Nginx can use multi-process because, for its workloads, there is no need to migrate FDs between processes. That is why, for live streaming, nginx-rtmp has processes push streams to each other, which is too difficult to maintain.

Without migration, audio and video packets have to be forwarded between workers; for streaming media, migrating the FD based on the stream is clearly better and more suitable.

Thread Local

Each thread has its own ST; this is similar to the Envoy Threading Model, and the C++ thread_local keyword is used to mark such variables.

I wrote an example SRS: thread-local.cpp, with the following results:

$ ./thread-local
PFN1: tl_g_nn(0x7fbd59504080)=1, g_obj(0x7fbd59504084)=1, gp_obj(0x7fbd59504088,0x7fbd595040a0)=1, gp_obj2(0x7fbd59504090,0x7fbd595040b0)=1
PFN2: tl_g_nn(0x7fbd59604080)=2, g_obj(0x7fbd59604084)=2, gp_obj(0x7fbd59604088,0x7fbd596040a0)=2, gp_obj2(0x7fbd59604090,0x7fbd596040b0)=2
MAIN: tl_g_nn(0x7fbd59704080)=100, g_obj(0x7fbd59704084)=100, gp_obj(0x7fbd59704088,0x7fbd597040a0)=100, gp_obj2(0x7fbd59704090,0x7fbd597040b0)=100

It can be used to qualify global variables:

// Global thread local int variable.
thread_local int tl_g_nn = 0;

It also works for global objects and pointers:

thread_local MyClass g_obj(0);                 // Thread-local global object.
thread_local MyClass* gp_obj = new MyClass(0); // Thread-local global pointer.
thread_local MyClass* gp_obj2 = NULL;          // Lazily created per thread.
MyClass* get_gp_obj2()
{
    if (!gp_obj2) {
        gp_obj2 = new MyClass(0);
    }
    return gp_obj2;
}

The addresses and values of these pointers are different in each thread.

GCC __thread

GCC has extended the keyword __thread, which has the same effect as C++11's thread_local.

A multi-threaded version of ST was implemented before using GCC's __thread keyword; see toffaletti and ST#19.
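
A tiny sketch for illustration (hypothetical variable names), showing that the GCC extension is written the same way as thread_local:

// GCC extension: each thread gets its own copy of the variable.
__thread int tl_counter = 0;

// Equivalent C++11 spelling.
thread_local int tl_counter2 = 0;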

UDP Binding

Note: Please note that we ultimately chose to implement RTC multi-thread isolation with multiple ports instead of using UDP binding, so I have collapsed related comments by default.

RTC's UDP is connectionless, and with REUSE_PORT multiple threads can each bind an fd to the same port and receive packets sent to it.

The kernel will perform a five-tuple binding. When the kernel delivers to a certain listen fd, it will continue to deliver to this fd. Refer to udp-client and udp-server:

Start 3 workers, at 127.0.0.1:8000
listen at 127.0.0.1:8000, fd=3 ok
listen at 127.0.0.1:8000, fd=4 ok
listen at 127.0.0.1:8000, fd=5 ok
fd #5, peer 127.0.0.1:50331, got 13B, Hello world!
fd #5, peer 127.0.0.1:50331, got 13B, Hello world!
fd #5, peer 127.0.0.1:50331, got 13B, Hello world!

Note: There are three fds listening on port 8000 above. After the client 50331 delivers to fd=5, it will continue to deliver to this fd.
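
For illustration, a minimal sketch of how each worker might bind its own UDP fd to the same port via SO_REUSEPORT (an assumption about the setup, not the referenced udp-server sample itself):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Create a UDP socket bound to ip:port with SO_REUSEPORT, so that several
// workers can each own an fd listening on the same port.
int listen_udp_reuseport(const char* ip, int port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    int reuse = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &reuse, sizeof(reuse)) != 0) {
        close(fd); return -1;
    }

    sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr(ip);
    if (bind(fd, (const sockaddr*)&addr, sizeof(addr)) != 0) {
        close(fd); return -1;
    }
    return fd;
}

int main() {
    // Three "workers", each with its own fd on the same port, as in the log above.
    for (int i = 0; i < 3; i++) {
        int fd = listen_udp_reuseport("127.0.0.1", 8000);
        printf("listen at 127.0.0.1:8000, fd=%d %s\n", fd, fd >= 0 ? "ok" : "failed");
    }
    return 0;
}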

UDP Migration

If we receive a client packet from a certain fd, such as 3, and find that this client should be received by another fd, such as 4, we can use connect to bind the delivery relationship.

Refer to the examples udp-connect-client.cpp and udp-connect-server.cpp. The server receives a packet and then repeatedly connects the peer to another fd. The behavior differs across platforms.

CentOS 7 server, listening on 0.0.0.0:8000, as shown below, can achieve migration twice:

Start 2 workers, at 0.0.0.0:8000
listen at 0.0.0.0:8000, fd=4, migrate_fd=3 ok
listen at 0.0.0.0:8000, fd=3, migrate_fd=4 ok

fd #3, peer 172.16.239.217:37846, got 13B, Hello world!
Transfer 172.16.239.217:37846 from #3 to #4, r0=0, errno=0

fd #4, peer 172.16.239.217:37846, got 13B, Hello world!
Transfer 172.16.239.217:37846 from #4 to #3, r0=0, errno=0

fd #3, peer 172.16.239.217:37846, got 13B, Hello world!
Transfer 172.16.239.217:37846 from #3 to #4, r0=0, errno=0
fd #3, peer 172.16.239.217:37846, got 13B, Hello world!

CentOS 7 server, if bound to a fixed address, such as eth0 or lo, will not migrate:

Start 2 workers, at 172.16.123.121:8000
listen at 172.16.123.121:8000, fd=4, migrate_fd=3 ok
listen at 172.16.123.121:8000, fd=3, migrate_fd=4 ok

fd #3, peer 120.227.88.168:43015, got 13B, Hello world!
Transfer 120.227.88.168:43015 from #3 to #4, r0=0, errno=0

fd #3, peer 120.227.88.168:43015, got 13B, Hello world!
Transfer 120.227.88.168:43015 from #3 to #4, r0=0, errno=0
fd #3, peer 120.227.88.168:43015, got 13B, Hello world!

Note: Linux multiple migrations will not return an error, but will not take effect.

Mac server, regardless of which address is bound, will migrate once:

Start 2 workers, at 127.0.0.1:8000
listen at 127.0.0.1:8000, fd=4, migrate_fd=3 ok
listen at 127.0.0.1:8000, fd=3, migrate_fd=4 ok

fd #3, peer 127.0.0.1:61448, got 13B, Hello world!
Transfer 127.0.0.1:61448 from #3 to #4, r0=0, errno=0

fd #4, peer 127.0.0.1:61448, got 13B, Hello world!
Transfer 127.0.0.1:61448 from #4 to #3, r0=-1, errno=48
fd #4, peer 127.0.0.1:61448, got 13B, Hello world!

Note: On Mac, multiple migrations will return an error [ERRNO 48] ADDRESS ALREADY IN USE.

After discussing with @wasphin: we don't want to migrate. Instead, after connect we want the 5-tuple bound to this fd, so that other FDs no longer receive its packets.

In this case, a more suitable thread model is:

  1. By default, a public thread listens on the UDP port and sends and receives packets; the processing threads send and receive through this public thread.
  2. If a processing thread finds that a 5-tuple is not one it should handle, it transfers the processing to another thread through the message queue.
  3. If a processing thread finds that the packets of a 5-tuple are to be processed by itself, it opens an FD and connects to that address, so that from then on only this thread sends and receives those packets directly.

This model is actually a hybrid model:

  1. Most of the time, there is no need to pass packets through locks and inter-thread communication queues.
  2. In a few cases, especially when a new address has just started, packets can be passed through the message queue.
  3. In the early stage of architecture evolution, it is possible not to connect, which means that packets will be passed between threads.

This hybrid model does not depend on UDP connect, but performance is much higher when connect works.

In addition, the encryption and decryption problem can also be solved by a similar hybrid model:

  1. Start multiple independent encryption and decryption threads and pass packets through the queue.
  2. If the performance of the working thread is sufficient, it can directly encrypt and decrypt by itself.
  3. In the early stage of architecture evolution, there can be independent encryption and decryption threads, which means that packets will be passed between threads.

The disk-IO threads are special: they always use a queue to receive messages (see the sketch after this list):

  1. The log writing thread collects logs from other threads through the queue and writes them to the disk.
  2. The TS, FLV, and MP4 file writing threads, also known as recording threads, collect the file content to be written through the queue and write the content to the disk.
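
A minimal sketch of this pattern, using std::thread primitives as an assumption rather than the actual SRS classes: worker threads append lines to a shared queue, and the dedicated log thread drains it and writes to disk, so disk IO never blocks the coroutines:

#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>

// A shared queue: workers push log lines, the log thread drains them to disk.
class LogQueue {
public:
    // Called by any worker thread.
    void push(std::string line) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(line));
        cond_.notify_one();
    }

    // Drain loop of the dedicated log thread: pop lines and write them to disk.
    void drain(FILE* file) {
        for (;;) {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this] { return !queue_.empty(); });
            std::string line = std::move(queue_.front());
            queue_.pop_front();
            lock.unlock();
            fprintf(file, "%s\n", line.c_str());
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<std::string> queue_;
};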

In the early days, we will still pass packets between multiple threads and divide different threads according to the business. As the evolution progresses, we will gradually eliminate the communication and dependencies between threads and turn them into independent threads that do not rely on each other, achieving higher performance.

On CentOS 8 (4.18.0-193.28.1.el8_2.x86_64), listening on 0.0.0.0:8000 with the original code also only migrates twice. However, following the manual, by first connecting with AF_UNSPEC before each migration, it can migrate continuously:

$ ./udp-connect-server 0.0.0.0 8000 2
Start 2 workers, at 0.0.0.0:8000
listen at 0.0.0.0:8000, fd=4, migrate_fd=3 ok
listen at 0.0.0.0:8000, fd=3, migrate_fd=4 ok

fd #4, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #4, r0=0, errno=0
Transfer 127.0.0.1:51161 from #4 to #3, r0=0, errno=0

fd #3, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #3, r0=0, errno=0
Transfer 127.0.0.1:51161 from #3 to #4, r0=0, errno=0

fd #4, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #4, r0=0, errno=0
Transfer 127.0.0.1:51161 from #4 to #3, r0=0, errno=0

fd #3, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #3, r0=0, errno=0
Transfer 127.0.0.1:51161 from #3 to #4, r0=0, errno=0

fd #4, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #4, r0=0, errno=0
Transfer 127.0.0.1:51161 from #4 to #3, r0=0, errno=0

fd #3, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #3, r0=0, errno=0
Transfer 127.0.0.1:51161 from #3 to #4, r0=0, errno=0

fd #4, peer 127.0.0.1:51161, got 13B, Hello world!
dissolve the association with 127.0.0.1:51161 of #4, r0=0, errno=0
Transfer 127.0.0.1:51161 from #4 to #3, r0=0, errno=0

The same test also works when bound to a fixed address:

$ ./udp-connect-server 127.0.0.1 8000 2
Start 2 workers, at 127.0.0.1:8000
listen at 127.0.0.1:8000, fd=4, migrate_fd=3 ok
listen at 127.0.0.1:8000, fd=3, migrate_fd=4 ok

fd #3, peer 127.0.0.1:44481, got 13B, Hello world!
dissolve the association with 127.0.0.1:44481 of #3, r0=0, errno=0
Transfer 127.0.0.1:44481 from #3 to #4, r0=0, errno=0

fd #4, peer 127.0.0.1:44481, got 13B, Hello world!
dissolve the association with 127.0.0.1:44481 of #4, r0=0, errno=0
Transfer 127.0.0.1:44481 from #4 to #3, r0=0, errno=0

fd #3, peer 127.0.0.1:44481, got 13B, Hello world!
dissolve the association with 127.0.0.1:44481 of #3, r0=0, errno=0
Transfer 127.0.0.1:44481 from #3 to #4, r0=0, errno=0

fd #4, peer 127.0.0.1:44481, got 13B, Hello world!
dissolve the association with 127.0.0.1:44481 of #4, r0=0, errno=0
Transfer 127.0.0.1:44481 from #4 to #3, r0=0, errno=0

fd #3, peer 127.0.0.1:44481, got 13B, Hello world!
dissolve the association with 127.0.0.1:44481 of #3, r0=0, errno=0
Transfer 127.0.0.1:44481 from #3 to #4, r0=0, errno=0

https://man7.org/linux/man-pages/man2/connect.2.html

   Some protocol sockets (e.g., TCP sockets as well as datagram
   sockets in the UNIX and Internet domains) may dissolve the
   association by connecting to an address with the sa_family member
   of sockaddr set to AF_UNSPEC; thereafter, the socket can be
   connected to another address.  (AF_UNSPEC is supported on Linux
   since kernel 2.2.)
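
For illustration, a minimal sketch of the dissolve-then-reconnect step described above (an assumed shape, not the referenced udp-connect-server.cpp): the currently bound fd connects to an AF_UNSPEC address to dissolve its association, and the target fd then connects to the peer, so the kernel delivers subsequent packets of this 5-tuple to the new fd:

#include <sys/socket.h>
#include <netinet/in.h>
#include <cstring>

// Dissolve the existing association of a connected UDP socket (AF_UNSPEC).
int udp_dissolve(int fd) {
    sockaddr addr;
    memset(&addr, 0, sizeof(addr));
    addr.sa_family = AF_UNSPEC;
    return connect(fd, &addr, sizeof(addr));
}

// Migrate a peer (5-tuple) from old_fd to new_fd, both bound with REUSE_PORT.
int udp_migrate(int old_fd, int new_fd, const sockaddr_in* peer) {
    if (udp_dissolve(old_fd) != 0) return -1;
    return connect(new_fd, (const sockaddr*)peer, sizeof(*peer));
}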

Note: For the final architecture, we settled on horizontally scalable Hybrid threads and will not split out separate SRTP and SEND/RECV threads, so the related comments are collapsed.

Dedicated threads were started for packet sending/receiving and for SRTP. The related commits are as follows:

  • Threads-SRTP: Config and add files for the async-srtp : 28504c0
  • Threads-SRTP: Support decrypt RTP by async SRTP: 9a82baf
  • Threads: Fix bug for SRTP and Log thread nanosleep: 85570cf
  • Threads-RECV: Support dedicate thread to recv UDP packets: e20eedf
  • Threads-RECV: Refine the stat for SNMP UDP recv/error: 93fd8bc
  • Threads: Set the threads name display in top: 3c9c9b1
  • Threads: Support cpu affinity for threads: a3ea734
  • Threads-RECV: Drop received packet if exceed max queue size: 981446a
  • Threads-RECV: Show the dropped packets pps: 1d8d8ab
  • Threads: Use thread-local buffer for log: d49928d
  • Threads: Use coroutine to consume recv/srtp packets: a7e7ba2
  • Threads: Support Circuit-Breaker to work in storms: 89b7ef5
  • Threads: Refine variables and do dispose: a179a40
  • Threads: Merge recv and srtp consume to one timer: f603611
  • Threads-RECV: Change UDP recv max size from 6k to 1500 bytes: 246f727
  • Threads-SRTP: Support async decrypt RTCP: ac7c277
  • Threads-SEND: Support async send UDP packets: 5bfbc4a
  • Threads-SRTP: Use async encrypt SRTP packet: 837a04a

With encryption/decryption, NACK, and TWCC enabled, a single thread can support up to 500 concurrent streams (publishing or playing), and multi-threading can support 1000 concurrent streams (as a starting point). Future optimization directions:

  • Currently the Hybrid thread, i.e. the former SRS main thread, concentrates the server's processing logic, and the other threads communicate through it (batched queues).
  • Minimize relaying through the Hybrid thread; for example, the RECV/SEND threads and the SRTP thread can communicate directly, and only new connections need the relay to establish the association.
  • Refactor ST into a thread-local ST. ST is currently not thread-safe, and only after this refactor can correctness be guaranteed with multiple threads. Communication needs a Go-like chan mechanism; thread-plus-coroutine communication is inefficient but probably necessary.
  • The RECV thread can inspect packets and forward those of the same stream to a specific Hybrid thread, so that Hybrid threads can also scale horizontally.
  • Use the kernel's UDP five-tuple binding so that each Hybrid thread serves its connections independently and locks between threads are avoided (a large change that needs time and stability considerations); whether this outperforms the current batched-queue approach also needs to be verified.

Note: For the final architecture, we settled on horizontally scalable Hybrid threads and will not split out separate SRTP and SEND/RECV threads, so the related comments are collapsed.

Minimize relaying through the Hybrid thread; for example, the RECV/SEND threads and the SRTP thread can communicate directly, and only new connections need the relay to establish the association.

  • Threads-RECV: Support tunnel for recv-srtp. f1c6726
  • Threads-SEND: Support tunnel for srtp-send. 35e209e

Note: For the final architecture, we settled on horizontally scalable Hybrid threads and will not split out separate SRTP and SEND/RECV threads, so the related comments are collapsed.

To enable the multi-threaded performance optimizations, the following configuration must be turned on; these options are all disabled by default:

threads {
    async_srtp on;
    async_recv on;
    async_send on;
    async_tunnel on;
    cpu_affinity {
        hybrid 1; srtp 2; recv 3; send 3; master 3; log 3;
    }
}

Note: In the thread affinity settings, CPU-0 is reserved for soft interrupts (softirq).

The circuit breaker is enabled by default; it protects the server from being brought down under high load:

circuit_breaker {
    max_recv_queue 5000;
    high_threshold 90;
    high_pulse 2;
    critical_threshold 95;
    critical_pulse 1;
    dying_threshold 99;
    dying_pulse 5;
}

Note: For the details of these configuration items, refer to full.conf. The circuit breaker can be disabled, but keeping it enabled is recommended.

Note: This is an intermediate result that has already been merged into the description of this issue, so the related comments are collapsed.

The previous SRS single-threaded architecture (SRS/1/2/3/4):

image

  • Single-process single-threaded architecture; live streaming supports the Origin-Edge cluster, and if deployed on a single machine it becomes a multi-process single-threaded architecture.
  • RTC is computation- and IO-intensive, so multi-core capability matters a lot; the main problem is exactly what this post describes.
  • Despite this, SRS4 has also made many single-core performance optimizations; see b431ad7, c5d2027, 14bfc98, and 36ea673.

A locked multi-threaded structure (SRS/v5.0.2) was implemented as an intermediate version:

image

  • The TUNNEL is a channel for direct communication between the RECV-SRTP and SRTP-SEND threads.
  • The TUNNEL is set up by the Hybrid thread, because only Hybrid knows the necessary context.
  • There is no TUNNEL at startup; packets are relayed through Hybrid, and after the DTLS handshake succeeds, Hybrid establishes the TUNNEL.

Note: The direction of optimization is to scale out the Hybrid threads into a multi-threaded (low-lock) architecture, reaching top performance like Envoy.

Remark: The advantage of this version is that the changes are small and stability is high, but the Hybrid thread is still the bottleneck and cannot scale horizontally. The code of this version is kept only as a temporary reference and will be removed from the trunk; see feature/threads_with_locks.

The RTC benchmark data, by srs-bench:

Update      Server      Clients  Type        CPU       Memory  Threads  Commit
2021-03-31  SRS/5.0.2   1400     publishers  ~90% x 4  3.1GB   6        #2188
2021-03-31  SRS/5.0.2   1400     players     ~93% x 4  1.0GB   6        #2188
2021-03-31  SRS/4.0.87  550      publishers  ~86% x 1  1.3GB   1
2021-03-31  SRS/4.0.87  800      players     ~94% x 1  444MB   1

Note: The benchmark tool for Janus is srs-bench, and the startup script is janus-docker.

Note: This is an intermediate result that has already been merged into the description of this issue, so the related comments are collapsed.

The improved low-lock multi-threaded structure (SRS/v5.0.3):

image

  • The SRTP thread is merged into the Hybrid thread, because having both threads call OpenSSL simultaneously poses a crash risk.
  • UDP RECV and SEND threads are optional and disabled by default, so the Hybrid threads send and receive data themselves.
  • ST is changed to a thread-local structure (ST#19), with each thread having its own isolated ST. It must be strictly guaranteed that one thread's ST structures are never passed to another thread, i.e., one thread never reads or writes another thread's FDs.
  • Hybrid threads default to 1, which is essentially the same as SRS4's completely single-threaded structure, keeping the architecture simple. If high performance is needed, multiple threads can be enabled, and the architecture stays essentially unchanged (from one thread to multiple independent threads).
  • Hybrid threads support horizontal scaling using a multi-port approach: RTC returns different ports through the SDP, while RTMP and HTTP require a 302 redirect; see the documentation for the specific implementation.
  • When Hybrid scales horizontally, connections are still locked to a single thread by stream, which allows publishing to scale. It can scale horizontally to hundreds or even thousands of cores; as the number of cores grows, inter-thread communication grows as well, so it is not entirely cost-free, but the cost is relatively small.
  • When Hybrid scales horizontally, many publishers with many players are generally well supported, for example 1000 publishers with 10 players each, 10,000 streams in total.

The disadvantages of this architecture:

  • Single-stream playback expansion issue: When Hybrid scales horizontally, since each stream is locked to one thread, the number of players of a single stream is limited by what one thread can support. Solution: In the future, cascading will solve downstream scaling; for example, if one machine (regardless of core count) supports 1000 players per stream, cascading 1000 servers supports 1 million players.
  • Global and static variable cleanup issue: Although the threads are isolated, they can still make mistakes through global and static variables, so every global variable must be checked and changed to either thread-local or thread-safe. This is a risk to stability, and such problems are often hard to troubleshoot. Solution: Only 1 thread is enabled by default, allowing a long enough transition and improvement period.
  • Library thread-safety issue: For example, OpenSSL has multi-threading issues. OpenSSL 1.1 claims to be thread-safe, but that requires modifying the build script so that the -no-threads option is not used; forgetting to change this option can cause problems. Solution: Only 1 SRTP thread is enabled by default, allowing a long enough transition and improvement period.

Single-stream playback expansion issue: If multi-threading must be modified so that a single multi-core machine can scale playback horizontally, a virtual Consumer can be used to share Packet pointers between threads and consume the Packet in another thread. Such a change is acceptable, but the open-source project does not pursue performance above all else; it must keep performance optimizations within a very simple architecture. Personally, I think supporting 1000 players per machine and using cascading to scale out is reasonable, so this optimization will not be used in the future.

So, does it support multi-threading for WebRTC now? After starting multiple processes and adding port reuse, only one core is used on a multi-core machine.

I've been working with Node.js for almost a year now, and I found that it is very similar to SRS's coroutine + multithreading, and I can basically see the future of SRS's multithreading.

The simplicity is quite good, which is what we want. We can't take Go as the reference for multi-threading, because Go is truly multi-threaded, while Node.js's multi-threading is really single-threaded within each thread, without locks and the like; Go does have locks.

Multithreading without thread synchronization is more suitable for maintenance.

I still insist on splitting threads by business, with a stream still handled in one thread. This doesn't solve performance issues, but it does solve some CPU-intensive and blocking issues, such as:

  1. Audio transcoding: For example, AAC to Opus, and vice versa.
  2. DNS resolution: Currently implemented using system functions, multithreading can avoid blocking (of course, using UDP to implement it yourself is also a solution, but it's more complicated).
  3. Writing logs: Writing to disk is usually not a problem, but who knows? As the veterans joke, if the disk could be relied on, pigs could climb trees.
  4. HLS, DASH, and DVR: Writing a lot of data to disk, some friends have tested that it has an impact when there are more than 10 channels, because the disk is really unreliable.

Solving these problems will improve stability; at times the server is definitely affected by these factors.

I don't want to do multithreading for streams, because from the ease-of-use perspective the API doubles the maintenance cost of clustering, scheduling logic is required (no matter how simple), troubleshooting takes more steps, and the load can no longer be evaluated simply. All of these greatly reduce the maintainability of the whole project.

The only advantage of per-stream multithreading is better multi-core utilization, which can also be achieved through cascading (to be supported in the future) and business-level scheduling. If running one process per machine feels too wasteful, you can use multiple ports or deploy with Pods. In any case, if you have reached the point of chasing high performance, the business volume must be in the tens or even hundreds of thousands; if a business of that size has no R&D capability of its own, it is headed for trouble either way.