ossrs/srs

HLS+: Support HLS Edge Cluster

winlinvip opened this issue · 18 comments

Currently, only RTMP has edge and origin servers that form a load-balancing, fault-tolerant cluster. HLS distribution supports origin servers only: after SRS slices the stream into HLS files, it serves them as an HTTP origin server. If SRS also supported HTTP edge servers, it would offer both HTTP origin servers and edge clusters, allowing more comprehensive statistics and control.


This feature essentially aims to solve the problems of an HLS edge, which come from the following properties of HLS:

  1. Inability to identify users: Some players close the connection after playing a single TS file, so the server cannot tell which requests belong to the same client. IP addresses are not reliable because multiple clients can sit behind a single NAT. Without user identification, dynamic parsing and packaging cannot be performed, and only physical slicing is possible.

  2. Temporary slicing: The server always lags by at least one slice, because it must finish writing the temporary file before sending it to the client.

  3. Startup delay: When the client starts, it fetches several slices in advance, so playback begins from earlier slices rather than the newest one.

Therefore, SRS's edge mode cannot generate HLS dynamically and cannot serve clients straight from a real-time origin pull, as it can with HTTP FLV. This is also the fundamental reason why HLS cannot achieve low latency.


If HLS is made into a stream, the delay can be reduced to the same level as FLV. When the client requests the M3U8, it is redirected to an M3U8 with a UUID parameter, for example, m3u8?uuid=154698. This UUID is the user's identifier, and the TS files also carry this identifier.
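The redirect described here could look roughly like the following. This is only a minimal sketch, not how SRS does it: the function names `redirect_target` and `tag_segment_uri` are hypothetical, and the identifier is simply a random UUID unless one is passed in.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def redirect_target(request_url, new_id=None):
    """Return the M3U8 URL to redirect to, or None if a uuid is already present."""
    parts = urlsplit(request_url)
    query = parse_qs(parts.query, keep_blank_values=True)
    if "uuid" in query:
        return None  # the client already carries its identifier
    # Assumption: any unique token works as the user's identifier.
    query["uuid"] = [new_id or uuid.uuid4().hex]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def tag_segment_uri(segment_uri, session_id):
    """Append the same identifier to a TS URI listed in the playlist."""
    separator = "&" if "?" in segment_uri else "?"
    return f"{segment_uri}{separator}uuid={session_id}"
```

The first playlist request gets a redirect, every later request already carries the identifier, and each TS URI written into the playlist is tagged with it so segment requests can be tied to the same user.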

In this way, the edge server can dynamically generate HLS when the user connects. When the user closes the connection, this structure is not destroyed; it is kept for the next time the user reconnects.

This means that there is no need to do HTTP origin fetching, only RTMP origin fetching. The efficiency is not as high as HTTP, but the delay is much lower.


This method solves the problem of the HLS edge not being able to recognize users. Some players close the connection after playing one TS, so the server has no idea which requests belong to the same session.
It also solves the HLS startup delay. If the first few segments are very short, the player can start with less data, which reduces the startup delay.
It also removes the need for the server to cache at least one segment. The server can use chunked encoding to send data continuously until the segment ends, without temporary files.
In theory, the delay in HLS comes mainly from the delivery protocol itself, not from transmission or encoding delay. Optimizing the way the protocol is applied can solve this problem.
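The chunked-encoding point can be illustrated with the HTTP/1.1 chunk framing itself. The sketch below is an assumption about how a server might stream a segment that is still being produced; `stream_segment` is a hypothetical name and the packet source is any iterable of bytes.

```python
def encode_chunk(payload: bytes) -> bytes:
    """Frame one HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    return b"%x\r\n" % len(payload) + payload + b"\r\n"


# A zero-length chunk terminates the chunked body.
LAST_CHUNK = b"0\r\n\r\n"


def stream_segment(packets):
    """Yield wire bytes for a TS segment as its data becomes available."""
    for packet in packets:
        if packet:  # skip empty payloads, which would end the body early
            yield encode_chunk(packet)
    yield LAST_CHUNK
```

Because each chunk is sent as soon as its data exists, the response can begin before the segment is complete, and no temporary file is required.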


After all this discussion, the conclusion is that there is no need to build a traditional HTTP edge. It is enough to make the edge support HLS the same way it supports FLV, with both pulling from the origin over RTMP.
In other words, adding HLS to the HTTP stream types is sufficient, which greatly reduces complexity. HTTP streams currently support FLV, MP3, AAC, and TS; adding HLS is all that is needed.
The fundamental cause of this issue is that HLS has always been produced in DVR mode, which is the wrong path. Why can't HLS be served as an HTTP stream? It can definitely be done.


The advantages of this solution are as follows:

  1. Hot standby and fault tolerance are taken care of, because the solution leverages all the advantages of the existing RTMP cluster.
  2. Statistics and analysis work the same as for RTMP, so the connection information of HLS clients becomes available.
  3. Lower latency, even for regular HLS players.
  4. Standard HTTP and HLS, compatible with all players.

Operational advantages:

  1. Distribution from memory, no disk I/O.
  2. Saves origin bandwidth: a single RTMP origin pull feeds RTMP, FLV, and HLS distribution.
  3. Low latency, almost on par with FLV.
  4. Because HLS is streamed and the user's connection is known, more powerful restrictions and authentication can be implemented.

The weaknesses of this solution are:

  1. Compared to DVR-based HLS, i.e., physical slicing, there is a higher computational overhead.


Remux and segment TS in real time on the edge machine, cache the latest 3-5 segments, and update the M3U8 in real time (in an in-memory cache). There is also no need to identify individual TS players, as long as the current machine has an absolute reference for timestamps. In practice, the cost of real-time remuxing is minimal and manageable.


Converting to TS in real time and caching it is indeed a good solution. It amounts to sharing one connection's slices with the other connections. When should the slicing stop? On a timeout?


Generate TS slices for as long as the upstream stream is flowing: keep them in a ring buffer, evict old slices, and update the corresponding M3U8. The M3U8 should list fewer segments than the cache holds, so that a TS segment is not evicted just as a request for it arrives. When the upstream stream stops flowing, slicing naturally pauses. This slice cache is bound to the lifecycle of the stream on this machine, not to any player's connection.
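A minimal sketch of such a ring buffer, assuming a cache of 5 segments and a playlist window of 3. The class name `SegmentCache`, the `seg-N.ts` naming, and the default sizes are illustrative assumptions, not SRS code.

```python
import math
from collections import deque


class SegmentCache:
    """In-memory ring buffer of TS segments for one stream (hypothetical helper)."""

    def __init__(self, cache_size=5, playlist_size=3):
        if playlist_size >= cache_size:
            raise ValueError("playlist must list fewer segments than the cache holds")
        self.playlist_size = playlist_size
        # deque(maxlen=...) drops the oldest entry automatically on append.
        self.segments = deque(maxlen=cache_size)
        self.next_seq = 0

    def push(self, duration, data):
        """Store a finished segment; the oldest one is evicted when full."""
        self.segments.append((self.next_seq, duration, data))
        self.next_seq += 1

    def get(self, seq):
        """Return segment bytes, or None if the segment was already evicted."""
        for cached_seq, _, data in self.segments:
            if cached_seq == seq:
                return data
        return None

    def playlist(self):
        """Build a live M3U8 that lists only the newest playlist_size segments."""
        window = list(self.segments)[-self.playlist_size:]
        if not window:
            return ""
        target = math.ceil(max(duration for _, duration, _ in window))
        lines = [
            "#EXTM3U",
            "#EXT-X-VERSION:3",
            f"#EXT-X-TARGETDURATION:{target}",
            f"#EXT-X-MEDIA-SEQUENCE:{window[0][0]}",
        ]
        for seq, duration, _ in window:
            lines.append(f"#EXTINF:{duration:.3f},")
            lines.append(f"seg-{seq}.ts")  # hypothetical segment naming
        return "\n".join(lines) + "\n"
```

With these sizes, a player holding a slightly stale playlist can still fetch a segment that has dropped out of the newest playlist, because the cache keeps two more segments than the playlist advertises.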


In other words, as long as someone is pushing a stream to the origin server, the edge starts to slice, right?


Only for streams the edge is currently synchronizing from the source station; streams it has not synchronized are left unchanged. Add idle reclamation and that is sufficient. By "the upstream stream is flowing" I mean the stream that the edge synchronizes from the source, and whether that stream is flowing. There are two options for when slicing stops. One, as mentioned above, binds it to the lifecycle of the stream on the current machine and stops when the stream is reclaimed. The other starts slicing when a user accesses HLS and stops when no one has accessed it for a certain period of time. The downside of the second option is that the first user on each machine experiences a cold-start delay, but this is not a major issue and depends on what the application scenario can tolerate.


Let me confirm one thing with you first. Our terminology may differ, so we might not be talking about the same thing. When you mention "synchronizing the source station at the edge," are you referring to someone playing a stream on the edge, where the edge retrieves the stream from the source station and provides the service?

If that's the case, let's use the term "origin pull" to refer to this situation. However, there is one issue:

  • When a user plays an RTMP stream on the edge, that stream is in an "origin pull" state.
  • If a new user plays the HLS of that stream, the "origin pull" stream provides HLS service.
  • If all users playing the RTMP stream disconnect, but the HLS stream remains connected, it should still provide service.
  • In this case, it is necessary to maintain the "origin pull," which contradicts the situation where all connections are disconnected and there is no "origin pull."

I'm not sure if I understood correctly.


When a user plays an RTMP stream on the edge, that stream is in an "origin pull" state.
Yes.

If a new user plays the HLS of that stream, the "origin pull" stream provides HLS service.
Yes.

If all users playing the RTMP stream disconnect, but the HLS stream remains connected, it should still provide service.
Yes.

In this case, it is necessary to maintain the "origin pull," which contradicts the situation where all connections are disconnected and there is no "origin pull."
No contradiction.

The misunderstanding on the fourth point stems from a difference in our structural designs. In my earlier design, the program has a ring buffer that stores the raw stream data (either an FLV-encapsulated stream or a custom format). Every component that talks to it falls into one of two categories: INPUT PLUGIN (pulls from the source, or receives an external PUSH) and OUTPUT PLUGIN (provides playback and synchronization to the outside). A stream can serve the outside only while at least one INPUT PLUGIN is working, and it can be reclaimed only when no OUTPUT PLUGIN has any users. HLS is just one of the OUTPUT PLUGINs, so in the fourth point, "all disconnect" means that users in every OUTPUT direction have disconnected, including RTMP and FLV OVER HTTP users. Only then can the idle stream be reclaimed.
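The two lifecycle rules in this design can be sketched as follows. The `Stream` class and its method names are hypothetical and only mirror the INPUT PLUGIN / OUTPUT PLUGIN description above, not any actual implementation.

```python
class Stream:
    """Tracks which inputs feed a stream and how many users each output has."""

    def __init__(self):
        self.active_inputs = set()  # names of working INPUT PLUGINs
        self.output_users = {}      # OUTPUT PLUGIN name -> current user count

    def input_started(self, name):
        self.active_inputs.add(name)

    def input_stopped(self, name):
        self.active_inputs.discard(name)

    def user_joined(self, output):
        self.output_users[output] = self.output_users.get(output, 0) + 1

    def user_left(self, output):
        self.output_users[output] = max(0, self.output_users.get(output, 0) - 1)

    def can_serve(self):
        """Rule 1: at least one INPUT PLUGIN must be working."""
        return bool(self.active_inputs)

    def reclaimable(self):
        """Rule 2: no OUTPUT PLUGIN may have any users left."""
        return not any(self.output_users.values())
```

For example, after the last RTMP user leaves while an HLS user remains, `reclaimable()` stays false, so the origin pull is kept alive.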


In addition, the phrase "stop when no one visits for a certain period of time" means that when no HLS user has visited for a certain period of time, HLS slicing can stop. Whether the origin pull itself stops depends on whether every OUTPUT direction is without users.


Hmm, understood.

If we treat HLS and HTTP-FLV on the edge as the same structure, where slicing is triggered (FLV slices are deleted when playback stops, while HLS can use timeouts), and HLS slices can be shared (FLV does not need sharing), then there is no conflict. In fact, we should wait until all clients are gone (RTMP disconnected, HLS timed out, HTTP-FLV disconnected) before considering cleaning up this origin pull.

In other words, for RTMP and FLV, disconnection can be used, while HLS uses timeouts as disconnection. This way, the three types of distribution can be unified on the edge without any conflicts.
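A minimal sketch of that unification, assuming HLS users are keyed by the UUID discussed earlier and that a 30-second timeout counts as a disconnect. The class `EdgeClients`, its method names, and the timeout value are all assumptions for illustration.

```python
class EdgeClients:
    """Unifies RTMP/HTTP-FLV disconnects and HLS timeouts for one edge stream."""

    def __init__(self, hls_timeout=30.0):
        self.hls_timeout = hls_timeout  # assumed value, seconds
        self.connections = set()        # live RTMP / HTTP-FLV connection ids
        self.hls_last_seen = {}         # HLS user id -> time of last request

    def connect(self, conn_id):
        self.connections.add(conn_id)

    def disconnect(self, conn_id):
        self.connections.discard(conn_id)

    def hls_request(self, session_id, now):
        """Any M3U8 or TS request refreshes the HLS user's last-seen time."""
        self.hls_last_seen[session_id] = now

    def expire_hls(self, now):
        """Treat HLS users silent for longer than the timeout as disconnected."""
        expired = [sid for sid, seen in self.hls_last_seen.items()
                   if now - seen > self.hls_timeout]
        for sid in expired:
            del self.hls_last_seen[sid]
        return expired

    def can_stop_origin_pull(self, now):
        """True only when no client of any protocol is left."""
        self.expire_hls(now)
        return not self.connections and not self.hls_last_seen
```

Time is passed in explicitly so the behavior is deterministic; a real server would supply a monotonic clock.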

Thank you~


Yes, that is what I meant.
I am also studying the SRS code that handles multimedia protocols. If I have questions in the future, I hope I can ask you for advice. ^_^


I named this feature HLS+.
Right now, SRS supports an origin and edge cluster: we can push RTMP to the origin and deliver RTMP and HTTP-FLV from the edge cluster.
But HLS is written to disk in DVR fashion and delivered by an HTTP cluster. I think it is possible to remux HLS on the edge server. That is, HLS+ delivers HLS from the edge server, which remuxes it in real time.
With HLS+, we can push RTMP to the origin and deliver RTMP, HTTP-FLV, and HLS from the edge server, without any external cluster.

It seems that MSE and WebRTC are getting stronger and stronger, so HLS+ no longer has a place: we can use MSE for live streaming at 3s+ latency and WebRTC for communication at 300ms.

HLS+ is quite complex, so it is worth considering implementing LLHLS, as well as more standard protocols like H3-FLV.

Enabling HLS on the edge should result in an error. Please refer to issue #1066 for more information.
