google/clusterfuzz

Investigate queue backlog

jonathanmetzman opened this issue · 8 comments

There is a 100k unacked message backlog.
Interestingly, many of these tasks appear to have been picked up but never finished.

Two bots that have been running their tasks for weeks are stuck here: testing for a crash

I've looked at two instances suffering from this issue. Their stacktraces looked roughly similar:

(gdb) bt                                                                                                                                                                                                                                                                                  
#0  0x00005cbe7d5d4b34 in sre_ucs1_match (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7c014, toplevel=toplevel@entry=0) at ./Modules/sre_lib.h:590                                                                                                                   
#1  0x00005cbe7d5db30d in sre_ucs1_search (pattern=<optimized out>, state=0x7ffee26f5bd0) at ./Modules/sre_lib.h:1443                                                                                                                                                                     
#2  sre_search (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7bfe8) at ./Modules/_sre.c:578                                                                                                                                                                           
#3  0x00005cbe7d5dd414 in pattern_subx (self=self@entry=0x5cbe7fc7bf90, ptemplate=<optimized out>, string=0x5cbe840925b0, count=0, subn=subn@entry=0) at ./Modules/_sre.c:1060                                                                                                            
#4  0x00005cbe7d5ddbf5 in _sre_SRE_Pattern_sub_impl (count=<optimized out>, string=<optimized out>, repl=<optimized out>, self=0x5cbe7fc7bf90) at ./Modules/_sre.c:1181                                                                                                                   
#5  _sre_SRE_Pattern_sub (self=0x5cbe7fc7bf90, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ./Modules/clinic/_sre.c.h:416                                                                                                                                     
#6  0x00005cbe7d4b9ea7 in _PyMethodDef_RawFastCallKeywords (method=0x5cbe7d741320 <pattern_methods+96>, self=self@entry=0x5cbe7fc7bf90, args=args@entry=0x7c43d107cf10, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at Objects/call.c:660                                             
#7  0x00005cbe7d6346be in _PyMethodDescr_FastCallKeywords (descrobj=descrobj@entry=0x7c43d311a550, args=0x7c43d107cf08, nargs=nargs@entry=4, kwnames=kwnames@entry=0x0) at Objects/descrobject.c:288                                                                                      
#8  0x00005cbe7d4a01b2 in call_function (pp_stack=pp_stack@entry=0x7ffee26f5e90, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4593                                                                                                                                 
#9  0x00005cbe7d4a119c in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110                                                                                                                                                                  
#10 0x00005cbe7d568560 in PyEval_EvalFrameEx (throwflag=0, f=0x7c43d107cd70) at Python/ceval.c:547

It looks like Python is stuck in a regex match that never completes, most likely catastrophic backtracking in the stack-parsing regexes.
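
As an illustration (a made-up pattern, not the one ClusterFuzz actually uses), nested quantifiers plus a non-matching suffix force the backtracking engine to try exponentially many ways to split the input, so the match effectively never returns on long lines:

import re
import time

# Hypothetical pathological pattern: the nested quantifiers mean every way of
# grouping the 'a's gets tried before the trailing 'b' makes the match fail.
pattern = re.compile(r'^(a+)+$')

for n in (20, 24, 28):
  start = time.time()
  pattern.match('a' * n + 'b')
  print(n, round(time.time() - start, 2))
# Runtime grows roughly 16x for every 4 extra characters, so a long enough
# stacktrace line against a pattern like this simply never finishes.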

I got a stacktrace from a third instance:

Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 249, in <module>
    main()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 212, in main
    error_stacktrace, clean_exit, task_payload = task_loop()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 146, in task_loop
    commands.process_command(task)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 249, in process_command
    task.high_end, task.is_command_override)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 159, in wrapper
    return func(task_name, task_argument, job_name, *args, **kwargs)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 431, in process_command_impl
    preprocess)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 218, in run_command
    result = task.execute(task_argument, job_name, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 127, in execute
    self.execute_locally(task_argument, job_type, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 63, in execute_locally
    uworker_output = utasks.uworker_main_no_io(self.module, uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py", line 194, in uworker_main_no_io
    uworker_output = utask_module.utask_main(uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 368, in utask_main
    fuzz_target, testcase, testcase_file_path, test_timeout)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 197, in test_for_crash_with_retries
    compare_crash=False)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 801, in test_for_crash_with_retries
    testcase.flaky_stack)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 688, in reproduce_with_retries
    state = self._get_crash_state(round_number, crash_result)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 664, in _get_crash_state
    state = crash_result.get_symbolized_data()
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/crash_result.py", line 48, in get_symbolized_data
    self.output, symbolize_flag=True)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/stack_parsing/stack_analyzer.py", line 113, in get_crash_data
    result = stack_parser.parse(crash_stacktrace_without_inlines)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 472, in parse
    self.match_assert(line, state, ASSERT_REGEX_GLIBC_SUFFIXED)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 322, in match_assert
    regex, line, state, new_type='ASSERT', new_frame_count=1)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 184, in update_state_on_match
    match = compiled_regex.match(line)
KeyboardInterrupt

I think a few hundred bots are blocked by this.

Another thing we should have is better task killing: a bot should kill a process that runs a task for too long.
@oliverchang Do you know if we already have this feature? I think we do, right?
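
As a rough sketch of what that could look like (a hypothetical helper, not existing ClusterFuzz code), the bot could run each task in a child process and terminate it once it exceeds a deadline:

import multiprocessing

def run_task_with_deadline(task_func, args, deadline_seconds):
  """Run a task in a child process; kill it if it exceeds the deadline."""
  proc = multiprocessing.Process(target=task_func, args=args)
  proc.start()
  proc.join(deadline_seconds)
  if not proc.is_alive():
    return True
  # The task is stuck (e.g. in a pathological regex match); terminate it so
  # the bot can move on and the queue message can eventually be redelivered.
  proc.terminate()
  proc.join()
  return False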

Another lesson we can learn is to have alerts when queue backlogs get too long. A large backlog is both a symptom of other problems and an issue in and of itself.
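
The backlog is already exported as the Cloud Monitoring metric pubsub.googleapis.com/subscription/num_undelivered_messages, so a periodic check along these lines could page us when it grows too large (sketch only; the project, subscription name, and threshold below are made up):

import time

from google.cloud import monitoring_v3

PROJECT = 'projects/my-clusterfuzz-project'  # hypothetical project
SUBSCRIPTION_ID = 'jobs-linux'  # hypothetical subscription name
BACKLOG_THRESHOLD = 50000  # hypothetical alerting threshold

def check_backlog():
  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval({
      'start_time': {'seconds': now - 300},
      'end_time': {'seconds': now},
  })
  results = client.list_time_series(
      request={
          'name': PROJECT,
          'filter': ('metric.type="pubsub.googleapis.com/subscription/'
                     'num_undelivered_messages" AND '
                     f'resource.labels.subscription_id="{SUBSCRIPTION_ID}"'),
          'interval': interval,
          'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
      })
  for series in results:
    backlog = series.points[0].value.int64_value
    if backlog > BACKLOG_THRESHOLD:
      # Hook this into whatever paging/alerting channel we use.
      print(f'ALERT: {SUBSCRIPTION_ID} backlog is {backlog} messages')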

This has been mitigated in ClusterFuzz, and fuzztest no longer triggers the issue.

But we should still deal with backlogs better.