gdb dashboard

motivation

debugging parallel and/or distributed programs is hard. we would've saved many hours of developer time by having a tool like this (and saved even more with similar tool integrations based on the tools server: looking at you, perf, ss, strace, etc.)

concrete problems we/i have faced in debugging

coredumps are not enough
1. you don't necessarily get the full process memory
2. gdb often says "you can't do this without a process"
3. they are often truncated
4. they can overwrite each other in the same path
5. there's no good way to inspect many coredumps at once
coredumpctl only addresses problems 4 and 5
distributed programs are even harder to debug (coredumpctl over ssh is okay)
- it's a pain to ssh into every box
- it's a pain to attach to every process in the program
- when one process crashes it often cascades, causing others to crash
  - deciding what happened first is hard
    - unless ntp is doing a great job
    - or gdb gets to the process before it can cascade
manually attaching to processes is hard
- especially on a live system
- need some way to identify the bad process before it goes bad
  - either by guesswork
  - or often by adding printf("pid %d\n", getpid()); sleep(60); to leave time to attach
    - this involves editing the code, hoping the crash is reproducible, and going again

the gdb dashboard aims to solve all of these problems with a simple approach:

every new process in the program is spawned with a gdb "harness" attached at startup
the stdin, stdout, and stderr streams of each newly spawned gdb instance are connected to the tools server by the harness
the tools server serves a web dashboard that the gdb instances can be controlled from
by parsing the text output, the dashboard provides structured information to the user, such as
- the unique backtraces among many processes
- the crashes sorted by timestamp
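
as a rough illustration of the parsing step above, here is a minimal python sketch that groups crashes by backtrace and orders them by time. it is not the dashboard's actual code: the record keys (host, pid, timestamp, backtrace) and both function names are made up for this example, and the regex assumes the usual layout of gdb "bt" output with simple function names (no spaces or templates).

```python
import re
from collections import defaultdict

# Matches gdb "bt" frame lines such as
#   #0  0x00007ffff7e3c00b in raise () from /lib/libc.so.6
#   #1  main (argc=1, argv=0x7ffe0) at main.c:12
# Assumption: function names contain no spaces or parentheses.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)\s*\(")


def backtrace_signature(bt_text):
    """Reduce raw backtrace text to a tuple of function names, innermost first.

    Addresses and argument values are dropped so that the same crash in
    different processes produces the same signature.
    """
    frames = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2))
    return tuple(frames)


def summarize_crashes(crashes):
    """Group crashes by signature and sort them by timestamp.

    crashes: iterable of dicts with the hypothetical keys
    "host", "pid", "timestamp" and "backtrace".
    Returns (groups, ordered): groups maps each signature to the list of
    (host, pid) pairs that hit it; ordered is the crash list, earliest first.
    """
    crashes = list(crashes)
    groups = defaultdict(list)
    for crash in crashes:
        signature = backtrace_signature(crash["backtrace"])
        groups[signature].append((crash["host"], crash["pid"]))
    ordered = sorted(crashes, key=lambda crash: crash["timestamp"])
    return dict(groups), ordered
```

the ordering is only as good as the timestamps: if they come from different machines, the caveat about ntp above still applies.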

alternatively, the harness supports another mode where the gdb instance is only started when a signal is caught

this may have lower overhead, particularly on process startup
but it rules out using breakpoints, watchpoints, catchpoints, or anything more complicated than "break on signal"
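
the two harness modes described above could look roughly like this in python. this is a sketch under assumptions, not the real harness: the function names are hypothetical, and how the piped streams get forwarded to the tools server is left out because it isn't specified here. the gdb options used (-q, -ex, --args, -p) are standard ones.

```python
import subprocess


def startup_command(argv):
    """Mode 1: launch the program under gdb from the start.

    -q suppresses the startup banner, -ex run starts the program right away,
    and --args passes everything after it to the program being debugged.
    """
    return ["gdb", "-q", "-ex", "run", "--args", *argv]


def on_signal_command(pid):
    """Mode 2: attach gdb to an already running process after a signal is caught.

    -p attaches to the process with the given PID.
    """
    return ["gdb", "-q", "-p", str(pid)]


def spawn_gdb(command):
    """Start gdb with stdin, stdout and stderr piped.

    The caller (a hypothetical harness) would relay these three streams
    to and from the tools server.
    """
    return subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
```

for example, spawn_gdb(startup_command(["./worker", "--rank", "3"])) would cover the always-attached mode, while spawn_gdb(on_signal_command(pid)) would be called from whatever catches the signal in the second mode.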

this solves the problems outlined above

there is no more manual attaching, because everything is attached to by default, with minimal runtime overhead
there is no more recompilation and hoping it's reproducible, because gdb is always attached
a crash won't cascade, because gdb gets control as soon as the crash happens, stopping any communication
- the cascade will only happen if the absence of timely communication causes other crashes
there is no more relying on partial coredumps, and the process is still around to inspect

avagordon01/tools-server
