harlanc/xiu

CPU占用异常

ningnao opened this issue · 16 comments

XIU version
matser b2fe9af

Describe the bug
长时间使用后CPU占用率异常升高

配置文件

[rtmp]
enabled = true
port = 1935

[httpnotify]
enabled = true
on_publish = "http://localhost:8085/publish"
on_unpublish = "http://localhost:8085/unpublish"
on_play = "http://localhost:8085/play"
on_stop = "http://localhost:8085/stop"

[log]
level = "info"

在之前的沟通中#66,发现我的设备会在每个I帧发送一次metaData 和 sequence header,也是因此日志中大量打印了metadata相关的日志
此外,日志中还偶尔出现以下错误

[2023-10-31T06:45:04Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: net io error: time out error
----------------------------
[2023-10-31T06:45:15Z WARN  rtmp::handshake::handshake_server] complex handshake failed.. err:digest error: cannot generate digest
----------------------------
[2023-10-31T07:12:11Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: pack error: bytes writer error: not enough bytes to write: io error

这里有一些进程/线程的信息不知道对解决问题有没有一些帮助
image
image
image

查看了最新的日志,发现其中大量出现推流失败的日志如下

[2023-11-01T02:50:42Z ERROR streamhub] event_loop Publish err: exists

在大量推流错误的日志出现之前打印了如下的错误日志

[2023-11-01T02:50:32Z ERROR rtmp::messages::parser] the msg_type_id is not supported: 10
[2023-11-01T02:50:32Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: message error: unknown message type

服务端还未重启,可以继续提供解决问题必要的信息

感谢反馈。

[2023-10-31T06:45:04Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: net io error: time out error

这个错误日志是前几天你修复的,读数据超时了。

[2023-10-31T06:45:15Z WARN  rtmp::handshake::handshake_server] complex handshake failed.. err:digest error: cannot generate digest

这个日志是RTMP握手日志,先尝试复杂握手,失败后会用简单握手,如果简单握手成功就没有问题。

[2023-10-31T07:12:11Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: pack error: bytes writer error: not enough bytes to write: io error

这个错误没有遇到过,应该是packetizer.rs里面打包的时候,写发送数据出错了,能复现吗?能够提供录制文件?

[2023-11-01T02:50:42Z ERROR streamhub] event_loop Publish err: exists

这个错误的意思是推流的时候发现服务端已经有这个流ID(app name + stream name)了。有两种可能:
- 一种是确实推流的时候用的流地址中的app name 和 stream name 被占用了;
- 一种是旧的流资源没有被正确释放,导致app name 和 stream name一直占用。

[2023-11-01T02:50:32Z ERROR rtmp::messages::parser] the msg_type_id is not supported: 10
[2023-11-01T02:50:32Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: message error: unknown message type

这个日志看着逻辑有问题,碰到不识别的msg type id 不应该返回错误,否则整个session就退出了,这个我改下。

看你三个xiu进程的CPU使用率都不太正常,流的转发不太应该占用这么多CPU,你的一个进程中在处理多少路流?之前碰到的CPU高的问题,大多数是读/写数据异常,碰到死循环了,你的最后一个线程栈的图左边是CPU使用率吗,看这个图可能是 receive_frame_data_loop 这个函数里面有问题,有可能有死循环。

关于性能问题,也可以看下项目的wiki中关于如何profile一个xiu 进程,生成 profile文件后可以分析一下。

这个错误没有遇到过,应该是packetizer.rs里面打包的时候,写发送数据出错了,能复现吗?能够提供录制文件?

这个出现的频率不高,不太容易复现.. 我会再想想办法

这个错误的意思是推流的时候发现服务端已经有这个流ID(app name + stream name)了。有两种可能:

这里应该是第二种可能,在出现了后面的 err: message error: unknown message type后,我观察日志发现设备开始尝试重连,但并没有看到之前的连接被关闭

你的一个进程中在处理多少路流?

我目前只在处理一路流,平时CPU占用非常低

关于性能问题,也可以看下项目的wiki中关于如何profile一个xiu 进程,生成 profile文件后可以分析一下。

刚才我想尝试生成 profile 时发现服务器 SSH 已经没反应了.... 图三中左侧是CPU占用率

有可能CPU高的问题也是由session异常退出导致的,我提交了一次代码,你看下 https://github.com/harlanc/xiu/tree/fix_71

补充一个火焰图 perf

我去试一下修复后的版本

目前没再复现成功这个问题.. 但出现了新的问题
依旧是设备长时间推流

ffmpeg拉流的时候出现以下错误

DTS 4261412847, next:-46000 st:1 invalid dropping
PTS 4261412907, next:-46000 invalid dropping st:1
DTS 4261412928, next:-13000 st:1 invalid dropping
PTS 4261413009, next:-13000 invalid dropping st:1
[flv @ 0000027de782d040] Timestamps are unset in a packet for stream 0. This is deprecated and will stop working in the future. Fix your code to set the timestamps properly

DTS 4261412973, next:20000 st:1 invalid dropping
PTS 4261413018, next:20000 invalid dropping st:1
DTS 4261413008, next:53000 st:1 invalid dropping
PTS 4261413043, next:53000 invalid dropping st:1
DTS 4261413089, next:86000 st:1 invalid dropping
PTS 4261413170, next:86000 invalid dropping st:1
DTS 4261413146, next:119000 st:1 invalid dropping
PTS 4261413203, next:119000 invalid dropping st:1
DTS 4261413227, next:152000 st:1 invalid dropping
PTS 4261413308, next:152000 invalid dropping st:1
DTS 4261413273, next:185000 st:1 invalid dropping
PTS 4261413319, next:185000 invalid dropping st:1
DTS 4261413371, next:218000 st:1 invalid dropping
PTS 4261413469, next:218000 invalid dropping st:1
DTS 4261413407, next:251000 st:1 invalid dropping
PTS 4261413443, next:251000 invalid dropping st:1
DTS 4261413540, next:284000 st:1 invalid dropping
PTS 4261413673, next:284000 invalid dropping st:1
DTS 4261413679, next:317000 st:1 invalid dropping
PTS 4261413818, next:317000 invalid dropping st:1
DTS 4261413807, next:350000 st:1 invalid dropping
PTS 4261413935, next:350000 invalid dropping st:1
DTS 4261413901, next:383000 st:1 invalid dropping
PTS 4261413995, next:383000 invalid dropping st:1
DTS 4261413974, next:416000 st:1 invalid dropping
PTS 4261414047, next:416000 invalid dropping st:1
DTS 4261414075, next:449000 st:1 invalid dropping
PTS 4261414176, next:449000 invalid dropping st:1
DTS 4261414202, next:482000 st:1 invalid dropping
PTS 4261414329, next:482000 invalid dropping st:1
DTS 4261414240, next:515000 st:1 invalid dropping
PTS 4261414278, next:515000 invalid dropping st:1
DTS 4261414275, next:548000 st:1 invalid dropping
PTS 4261414310, next:548000 invalid dropping st:1
DTS 4261414382, next:581000 st:1 invalid dropping
PTS 4261414489, next:581000 invalid dropping st:1
DTS 4261414441, next:614000 st:1 invalid dropping
PTS 4261414500, next:614000 invalid dropping st:1
DTS 4261414536, next:647000 st:1 invalid dropping
PTS 4261414631, next:647000 invalid dropping st:1
DTS 4261414678, next:680000 st:1 invalid dropping
PTS 4261414820, next:680000 invalid dropping st:1
DTS 4275582106, next:713000 st:1 invalid dropping
PTS 4272972316, next:713000 invalid dropping st:1
DTS 4242027805, next:746000 st:1 invalid dropping
PTS 4242027934, next:746000 invalid dropping st:1
DTS 4208473408, next:779000 st:1 invalid dropping
PTS 4208473441, next:779000 invalid dropping st:1
DTS 4208473541, next:812000 st:1 invalid dropping
PTS 4208473674, next:812000 invalid dropping st:1
DTS 4208473574, next:845000 st:1 invalid dropping
PTS 4208473607, next:845000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19385051; changing to -16775393. This may result in incorrect timestamps in the output file.
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384885; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4208473667, next:878000 st:1 invalid dropping
PTS 4208473760, next:878000 invalid dropping st:1
DTS 4275582568, next:911000 st:1 invalid dropping
PTS 4275582609, next:911000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384759; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275582601, next:944000 st:1 invalid dropping
PTS 4275582634, next:944000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384685; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275582636, next:977000 st:1 invalid dropping
PTS 4275582671, next:977000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384650; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275582883, next:1010000 st:1 invalid dropping
PTS 4275583130, next:1010000 invalid dropping st:1
DTS 4275583010, next:1043000 st:1 invalid dropping
PTS 4275583137, next:1043000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384276; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275583073, next:1076000 st:1 invalid dropping
PTS 4275583136, next:1076000 invalid dropping st:1
DTS 4275583153, next:1109000 st:1 invalid dropping
PTS 4275583233, next:1109000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384133; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275583201, next:1142000 st:1 invalid dropping
PTS 4275583249, next:1142000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19384085; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275583298, next:1175000 st:1 invalid dropping
PTS 4275583395, next:1175000 invalid dropping st:1
DTS 4275583334, next:1208000 st:1 invalid dropping
PTS 4275583370, next:1208000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19383952; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4275583566, next:1241000 st:1 invalid dropping
PTS 4275583798, next:1241000 invalid dropping st:1
DTS 4275583707, next:1274000 st:1 invalid dropping
PTS 4275583848, next:1274000 invalid dropping st:1
DTS 4289751135, next:1307000 st:1 invalid dropping
PTS 4287141345, next:1307000 invalid dropping st:1
[flv @ 0000027de782d040] Non-monotonous DTS in output stream 0:1; previous: -16775393, current: -19383720; changing to -16775393. This may result in incorrect timestamps in the output file.
DTS 4256196834, next:1340000 st:1 invalid dropping
PTS 4256196963, next:1340000 invalid dropping st:1
FFmpeg:
FFmpeg:
[q] command received. Exiting.
FFmpeg:
frame=   44 fps=0.0 q=-1.0 Lsize=     675kB time=00:00:01.34 bitrate=4125.0kbits/s speed=12.1x
video:666kB audio:7kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.251740%

用potplayer播放时,时长不正确
image
image

ffmpeg中断时,服务器出现以下错误

[2023-11-05T12:14:10Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: pack error: bytes writer error: not enough bytes to write: io error

除此之外没有别的错误信息了

[2023-11-05T12:14:10Z ERROR rtmp::rtmp] session run error: session_type: server, app_name: live, stream_name: 001, err: pack error: bytes writer error: not enough bytes to write: io error

这个错误,我这也复现了,在拉流的时候,直接断开就会报这个错误,意思是直接断开了TCP连接,拉流的session也会退出,这个流程看着是正常的,可以先不管吧,我在想要不要把这个error改成info。

ffmpeg拉流失败,应该是packetize rtmp chunk的时候时间戳写错了,推流多长时间才会报这个错误?最好在出错的时候可以抓个网络包,上下行的都要,我这边再分析分析,谢谢

我是第三天下午才发现的这个问题,第二天晚上试的时候还是好的

我在服务端抓了推流包和拉流的包,这个传国内网盘吧
链接:https://pan.baidu.com/s/1lp8sTm5qrs2zbbiORm_VZg?pwd=bpwl
提取码:bpwl

把rtmp 组包和拆包逻辑梳理了一下,提交了一次代码,看看问题还能复现吗?不过看你提供的抓包记录,发现推上来的音频和视频时间戳差别比较大,而且视频的时间戳有点怪:
image

试了一下最新版本,panic

thread 'tokio-runtime-worker' panicked at protocol/rtmp/src/chunk/unpacketizer.rs:312:21:
assertion `left == right` failed
  left: 1
 right: 0

试了一下 03737a0,obs和action推流正常,摄像头推流依然 panic
我打断点观察到第一次format_id为0,csid为3,第二次format_id为1,csid为2,放行后panic

看了下标准文档:
image
一个新的chunk stream 第一个chunk 肯定是 Type 0呀 也就是format==0(csid为2 的第一个 chunk应该是format_id==0,但是你提供的信息是1),可以抓下包吗?再看看具体情况

不过看样子这个csid 2应该不是音视频包,可以把

assert_eq!(format_id, 0);

这个断言注释掉再试试,这个chunk的msg_streamd_id 就是随机值了,可能不影响功能。

抓包数据:
链接:https://pan.baidu.com/s/1sv_nokgwnyjUTxjz87ZYsQ?pwd=z9j5
提取码:z9j5

这个断言注释掉再试试,这个chunk的msg_streamd_id 就是随机值了,可能不影响功能。

这个我之前尝试过了,packetizer.rs:51

cur_msg_header.timestamp_delta = cur_msg_header.timestamp - pre_msg_header.timestamp;

这里会有问题.. 减出来一个负数,debug下panic了

然后我也试着求他俩的差值,但是播放的时候不太正常,具体表现是前一两秒正常,后面都是慢放

加微信吧,github上效率太低,harlancc

加您了