Investigate queue backlog
jonathanmetzman opened this issue · 8 comments
There is a 100k unacked message backlog.
Interestingly, many of these tasks appear to have been leased but never finished.
Two bots that have been running their tasks for weeks are stuck at the same step: Testing for crash
I've looked at two instances suffering from this issue. Their stacktraces looked roughly similar:
(gdb) bt
#0 0x00005cbe7d5d4b34 in sre_ucs1_match (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7c014, toplevel=toplevel@entry=0) at ./Modules/sre_lib.h:590
#1 0x00005cbe7d5db30d in sre_ucs1_search (pattern=<optimized out>, state=0x7ffee26f5bd0) at ./Modules/sre_lib.h:1443
#2 sre_search (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7bfe8) at ./Modules/_sre.c:578
#3 0x00005cbe7d5dd414 in pattern_subx (self=self@entry=0x5cbe7fc7bf90, ptemplate=<optimized out>, string=0x5cbe840925b0, count=0, subn=subn@entry=0) at ./Modules/_sre.c:1060
#4 0x00005cbe7d5ddbf5 in _sre_SRE_Pattern_sub_impl (count=<optimized out>, string=<optimized out>, repl=<optimized out>, self=0x5cbe7fc7bf90) at ./Modules/_sre.c:1181
#5 _sre_SRE_Pattern_sub (self=0x5cbe7fc7bf90, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ./Modules/clinic/_sre.c.h:416
#6 0x00005cbe7d4b9ea7 in _PyMethodDef_RawFastCallKeywords (method=0x5cbe7d741320 <pattern_methods+96>, self=self@entry=0x5cbe7fc7bf90, args=args@entry=0x7c43d107cf10, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at Objects/call.c:660
#7 0x00005cbe7d6346be in _PyMethodDescr_FastCallKeywords (descrobj=descrobj@entry=0x7c43d311a550, args=0x7c43d107cf08, nargs=nargs@entry=4, kwnames=kwnames@entry=0x0) at Objects/descrobject.c:288
#8 0x00005cbe7d4a01b2 in call_function (pp_stack=pp_stack@entry=0x7ffee26f5e90, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4593
#9 0x00005cbe7d4a119c in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110
#10 0x00005cbe7d568560 in PyEval_EvalFrameEx (throwflag=0, f=0x7c43d107cd70) at Python/ceval.c:547
It looks like Python is stuck inside a regex match, likely catastrophic backtracking.
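The frames above are all inside CPython's sre engine, which is consistent with catastrophic backtracking: a pattern with nested quantifiers can take exponential time on a non-matching input. This is a minimal sketch (not ClusterFuzz's actual fix) of how such a match can be bounded with a deadline, using `signal.alarm` (Unix only); the pattern and helper names are hypothetical:

```python
import re
import signal

# A pattern with nested quantifiers, e.g. (a+)+$, exhibits catastrophic
# backtracking: on a non-matching input the engine tries exponentially
# many ways to split the string before giving up.
PATHOLOGICAL = re.compile(r"(a+)+$")


class RegexTimeout(Exception):
  pass


def _on_alarm(signum, frame):
  raise RegexTimeout()


def match_with_deadline(pattern, text, seconds=1):
  """Run pattern.match(text), but abort after `seconds` (Unix only)."""
  old_handler = signal.signal(signal.SIGALRM, _on_alarm)
  signal.alarm(seconds)
  try:
    return pattern.match(text)
  except RegexTimeout:
    return None  # Treat a timed-out match as "no match".
  finally:
    signal.alarm(0)
    signal.signal(signal.SIGALRM, old_handler)


# 35 'a's followed by 'b' never matches, and without the alarm the
# engine would spin for a very long time; with it, we get None quickly.
result = match_with_deadline(PATHOLOGICAL, "a" * 35 + "b")
```

A stricter fix is rewriting the offending regexes to avoid nested quantifiers, but a deadline is a useful backstop either way.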
Got a stacktrace from a third:
Traceback (most recent call last):
File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 249, in <module>
main()
File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 212, in main
error_stacktrace, clean_exit, task_payload = task_loop()
File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 146, in task_loop
commands.process_command(task)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 249, in process_command
task.high_end, task.is_command_override)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 159, in wrapper
return func(task_name, task_argument, job_name, *args, **kwargs)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 431, in process_command_impl
preprocess)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 218, in run_command
result = task.execute(task_argument, job_name, uworker_env)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 127, in execute
self.execute_locally(task_argument, job_type, uworker_env)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 63, in execute_locally
uworker_output = utasks.uworker_main_no_io(self.module, uworker_input)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py", line 194, in uworker_main_no_io
uworker_output = utask_module.utask_main(uworker_input)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 368, in utask_main
fuzz_target, testcase, testcase_file_path, test_timeout)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 197, in test_for_crash_with_retries
compare_crash=False)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 801, in test_for_crash_with_retries
testcase.flaky_stack)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 688, in reproduce_with_retries
state = self._get_crash_state(round_number, crash_result)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 664, in _get_crash_state
state = crash_result.get_symbolized_data()
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/crash_result.py", line 48, in get_symbolized_data
self.output, symbolize_flag=True)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/stack_parsing/stack_analyzer.py", line 113, in get_crash_data
result = stack_parser.parse(crash_stacktrace_without_inlines)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 472, in parse
self.match_assert(line, state, ASSERT_REGEX_GLIBC_SUFFIXED)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 322, in match_assert
regex, line, state, new_type='ASSERT', new_frame_count=1)
File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 184, in update_state_on_match
match = compiled_regex.match(line)
KeyboardInterrupt
I think a few hundred bots are blocked by this.
We should also have better task killing: a bot should kill a task process that has been running for too long.
@oliverchang Do you know if we have this feature already? I think we do right?
Another lesson we can learn is to alert when queue backlogs grow too large. A backlog is both a symptom of other problems and an issue in and of itself.
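A minimal sketch of such an alert, assuming we periodically poll the queue's unacked-message count (for Cloud Pub/Sub, the subscription's `num_undelivered_messages` metric) - the threshold and function name are illustrative, and requiring the threshold to be exceeded for several consecutive polls avoids paging on brief spikes:

```python
def should_alert(samples, threshold, sustained_checks):
  """Return True if the backlog stayed above threshold.

  samples: most-recent-last backlog counts from the monitoring system.
  sustained_checks: how many consecutive recent polls must exceed
  the threshold before we page.
  """
  if len(samples) < sustained_checks:
    return False
  return all(s > threshold for s in samples[-sustained_checks:])


# Backlog held above 50k for the last 3 polls -> alert.
fires = should_alert([10_000, 60_000, 70_000, 100_000], 50_000, 3)
# A single spike that recovered in between -> no alert.
quiet = should_alert([100_000, 5_000, 100_000], 50_000, 3)
```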
This has been mitigated in ClusterFuzz and fuzztest no longer does this.
But we should still deal with backlogs better.