cloudwu/skynet

sharedata.flush causes a coredump

ghost90240 opened this issue · 2 comments

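The skynet version is 1.5.0 (2021-11-09).
After a hot update of the configuration, one service crashed with a coredump while executing sharedata.flush.
I looked through the sharedata-related issues; two of them were fixed after 1.5.0:
#1820: I checked with a tool and no key exceeds the int32 range.
#1797: this one is probably unrelated, but I checked the number of tasks handled by the service that cored. It handles about 1,000,000 per day, and the core happened during a hot update on day 11, which is nowhere near enough to wrap around.
I have spent two days on this without any leads. (The problem is very rare and cannot be reproduced: out of more than a thousand servers in that hot update, only one cored.)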
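
The wraparound estimate above can be checked with a little arithmetic. The sketch below assumes the counter in question is a signed 32-bit value incremented once per task; that is an assumption, since the issue does not say how the counter is stored. The helper names `tasks_after` and `days_until` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Total tasks handled after `days` days at `per_day` tasks per day. */
uint64_t tasks_after(uint64_t days, uint64_t per_day) {
    return days * per_day;
}

/* Whole days before a counter growing by `per_day` reaches `limit`. */
uint64_t days_until(uint64_t limit, uint64_t per_day) {
    assert(per_day > 0);   /* avoid division by zero */
    return limit / per_day;
}
```

Under these assumptions, `tasks_after(11, 1000000)` is 11,000,000, while `days_until(INT32_MAX, 1000000)` is 2147 days, so a service running for 11 days is far from the limit.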

(gdb) bt
#0  luaS_remove (L=0x7f68628bf388, ts=0x7f6882f28450) at lstring.c:211
#1  0x0000000000418a35 in freeobj (L=0x7f68628bf388, o=0x7f6882f28450) at lgc.c:795
#2  0x0000000000418c8e in sweep2old (L=0x7f68628bf388, p=0x7f6815d4a340) at lgc.c:1082
#3  0x000000000041a1f0 in atomic2gen (L=0x7f68628bf388, g=0x7f6847b7d4d0) at lgc.c:1294
#4  0x000000000041a5d6 in entergen (L=0x7f68628bf388, g=0x7f6847b7d4d0) at lgc.c:1332
#5  0x000000000041a6d3 in fullgen (L=0x7f68628bf388, isemergency=<value optimized out>) at lgc.c:1375
#6  luaC_fullgc (L=0x7f68628bf388, isemergency=<value optimized out>) at lgc.c:1730
#7  0x0000000000413057 in lua_gc (L=0x7f68628bf388, what=<value optimized out>) at lapi.c:1195
#8  0x00000000004306a7 in luaB_collectgarbage (L=0x7f68628bf388) at lbaselib.c:248
#9  0x000000000041705e in precallC (L=0x7f68628bf388, func=<value optimized out>, nresults=0) at ldo.c:510
#10 luaD_precall (L=0x7f68628bf388, func=<value optimized out>, nresults=0) at ldo.c:576
#11 0x00000000004263ff in luaV_execute (L=<value optimized out>, ci=<value optimized out>) at lvm.c:1684
#12 0x0000000000416e63 in unroll (L=0x7f68628bf388, ud=<value optimized out>) at ldo.c:725
#13 0x0000000000415dec in luaD_rawrunprotected (L=0x7f68628bf388, f=0x417190 <resume>, ud=0x7f68a1635dcc) at ldo.c:144
#14 0x0000000000416c84 in lua_resume (L=0x7f68628bf388, from=<value optimized out>, nargs=3, nresults=0x7f68a1635e2c) at ldo.c:830
#15 0x00007f68a343e455 in lua_resumeX (L=0x7f684754ca68, co_index=1, n=3) at service-src/service_snlua.c:90
#16 auxresume (L=0x7f684754ca68, co_index=1, n=3) at service-src/service_snlua.c:146
#17 timing_resume (L=0x7f684754ca68, co_index=1, n=3) at service-src/service_snlua.c:198
#18 0x00007f68a343e760 in luaB_coresume (L=0x7f684754ca68) at service-src/service_snlua.c:217
#19 0x00000000004175bf in precallC (L=0x7f684754ca68, ci=<value optimized out>, func=<value optimized out>, 
    narg1=<value optimized out>, delta=<value optimized out>) at ldo.c:510
#20 luaD_pretailcall (L=0x7f684754ca68, ci=<value optimized out>, func=<value optimized out>, narg1=<value optimized out>, 
    delta=<value optimized out>) at ldo.c:531
#21 0x0000000000425c33 in luaV_execute (L=<value optimized out>, ci=<value optimized out>) at lvm.c:1709
#22 0x00000000004172e7 in ccall (L=0x7f684754ca68, func=<value optimized out>, nResults=-1) at ldo.c:618
#23 luaD_callnoyield (L=0x7f684754ca68, func=<value optimized out>, nResults=-1) at ldo.c:636
#24 0x0000000000415dec in luaD_rawrunprotected (L=0x7f684754ca68, f=0x4134b0 <f_call>, ud=0x7f68a1636160) at ldo.c:144
#25 0x0000000000416a8f in luaD_pcall (L=0x7f684754ca68, func=<value optimized out>, u=<value optimized out>, old_top=224, 
    ef=<value optimized out>) at ldo.c:934
#26 0x00000000004133c9 in lua_pcallk (L=0x7f684754ca68, nargs=<value optimized out>, nresults=-1, errfunc=<value optimized out>, 
    ctx=<value optimized out>, k=<value optimized out>) at lapi.c:1063
#27 0x000000000042f8ff in luaB_xpcall (L=0x7f684754ca68) at lbaselib.c:494
#28 0x000000000041705e in precallC (L=0x7f684754ca68, func=<value optimized out>, nresults=2) at ldo.c:510
#29 luaD_precall (L=0x7f684754ca68, func=<value optimized out>, nresults=2) at ldo.c:576
#30 0x00000000004263ff in luaV_execute (L=<value optimized out>, ci=<value optimized out>) at lvm.c:1684
#31 0x00000000004172e7 in ccall (L=0x7f684754ca68, func=<value optimized out>, nResults=0) at ldo.c:618
#32 luaD_callnoyield (L=0x7f684754ca68, func=<value optimized out>, nResults=0) at ldo.c:636
#33 0x0000000000415dec in luaD_rawrunprotected (L=0x7f684754ca68, f=0x4134b0 <f_call>, ud=0x7f68a1636490) at ldo.c:144
#34 0x0000000000416a8f in luaD_pcall (L=0x7f684754ca68, func=<value optimized out>, u=<value optimized out>, old_top=48, 
    ef=<value optimized out>) at ldo.c:934
#35 0x00000000004133c9 in lua_pcallk (L=0x7f684754ca68, nargs=<value optimized out>, nresults=0, errfunc=<value optimized out>, 
    ctx=<value optimized out>, k=<value optimized out>) at lapi.c:1063
#36 0x00007f689b61a05d in _cb (context=0x7f683649ed60, ud=<value optimized out>, type=9, session=318, source=3, 
    msg=<value optimized out>, sz=16) at lualib-src/lua-skynet.c:67
#37 0x0000000000409e3d in dispatch_message (ctx=0x7f683649ed60, msg=0x7f68a1636650) at skynet-src/skynet_server.c:286
#38 0x000000000040a6bf in skynet_context_message_dispatch (sm=0x7f68a5452140, q=0x7f68365f8e00, weight=-1)
    at skynet-src/skynet_server.c:414
#39 0x000000000040b53d in thread_worker (p=<value optimized out>) at skynet-src/skynet_start.c:163
#40 0x0000003262807aa1 in start_thread () from /lib64/libpthread.so.0
#41 0x00000032624e8c4d in clone () from /lib64/libc.so.6

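First, the problem occurs during Lua GC, and it looks like the internal state of the Lua VM is corrupted. Although the crash was triggered by calling sharedata.flush, that only shows that

https://github.com/cloudwu/skynet/blob/v1.5.0/lualib/skynet/sharedata.lua#L60

the sharedata.flush operation calls a full GC ( collectgarbage() ), nothing more. I don't think the problem is in sharedata itself.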

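P.S. In any case, there is no reason to stay on an old version of skynet unless you are able to maintain it on your own. Lua itself keeps evolving too, so there is likewise no reason to stay on an old Lua version. For example, https://lua.org/bugs.html shows that every minor release fixes a large number of bugs.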

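From the coredump log, https://github.com/cloudwu/skynet/blob/v1.5.0/3rd/lua/lstring.c#L211 points at this: while the GC was sweeping short strings, a linked-list pointer in the VM's short-string hash table turned out to be corrupted. The sharedata library has no ability to corrupt it either.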
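
To illustrate why a bad chain pointer crashes at that point, here is a simplified model of unlinking a node from a chained hash table, loosely following what luaS_remove does. `Node`, `Table`, `table_insert` and `table_remove` are hypothetical names for this sketch, not Lua's real types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a short-string object; not Lua's real TString. */
typedef struct Node {
    unsigned int hash;
    struct Node *hnext;    /* next node in the same bucket */
} Node;

typedef struct Table {
    Node **bucket;         /* array of chain heads */
    unsigned int size;     /* number of buckets, a power of two */
    int nuse;              /* number of nodes currently stored */
} Table;

/* Push n onto the front of its bucket's chain. */
void table_insert(Table *tb, Node *n) {
    assert(tb->size > 0 && (tb->size & (tb->size - 1)) == 0);
    Node **head = &tb->bucket[n->hash & (tb->size - 1)];
    n->hnext = *head;
    *head = n;
    tb->nuse++;
}

/* Unlink n from its chain. Precondition: n is in the table. */
void table_remove(Table *tb, Node *n) {
    Node **p = &tb->bucket[n->hash & (tb->size - 1)];
    while (*p != n)        /* walk the chain until the link to n is found */
        p = &(*p)->hnext;  /* a NULL or garbage link faults on this dereference */
    *p = n->hnext;         /* bypass n */
    tb->nuse--;
}
```

The walk has no NULL check because it relies on the node being present. If some other code has overwritten an `hnext` field or the node's `hash`, the loop follows an invalid pointer, which is consistent with a crash inside the removal function rather than in the code that did the overwrite.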

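I think what you need to examine is all of the C code, looking for out-of-bounds memory access or other memory errors. At the very least, you can first check for simple problems such as double free: https://github.com/cloudwu/skynet/wiki/MemoryHook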
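
As one way to hunt for out-of-bounds writes in a C module, the sketch below wraps malloc with a guard region after each block and checks it on free. It is a generic illustration, not skynet's MemoryHook implementation; `guard_malloc`, `guard_intact`, `guard_free`, `GUARD_SIZE` and `GUARD_BYTE` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 8       /* bytes of guard placed after the user block */
#define GUARD_BYTE 0xA5    /* pattern written into the guard */

/* Header stored before the user block; the union keeps the user pointer aligned. */
typedef union Header {
    size_t size;           /* requested user size */
    max_align_t align;
} Header;

/* Allocate `size` bytes followed by a guard region. */
void *guard_malloc(size_t size) {
    unsigned char *raw = malloc(sizeof(Header) + size + GUARD_SIZE);
    if (raw == NULL)
        return NULL;
    ((Header *)raw)->size = size;
    memset(raw + sizeof(Header) + size, GUARD_BYTE, GUARD_SIZE);
    return raw + sizeof(Header);
}

/* Return 1 if the guard after ptr's block is untouched, 0 otherwise. */
int guard_intact(const void *ptr) {
    const unsigned char *user = ptr;
    const Header *h = (const Header *)(user - sizeof(Header));
    const unsigned char *guard = user + h->size;
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (guard[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}

/* Free a block from guard_malloc, aborting if its guard was overwritten. */
void guard_free(void *ptr) {
    if (ptr == NULL)
        return;
    assert(guard_intact(ptr));
    free((unsigned char *)ptr - sizeof(Header));
}
```

A check like this only catches small overruns past the end of a block, and only when the block is freed. For rare corruption it is usually combined with tools such as AddressSanitizer or Valgrind where the workload allows.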

Because the coredump is so rare, you should focus on the C code paths that are rarely executed.

Thanks for your advice, cloudwu.
MEMORY_CHECK is already enabled. I will update to the latest version and keep observing.