Maintenance Job Observability

Question

sumerman opened this issue 2 years ago · 4 comments

This is a top-level tracking issue.

Required metrics (some may exist already):

The amount of work that remains to be done
- The size of the job queue, if there is one
- The number of chunks to be removed/compressed
- Break down by the job type
A histogram of job completion times
- Break down by the job type
A failure rate
- Break down by the job type
- Break down by the failure type
  - I don't want to over-specify it. The goal here is to be able to make common failure cases (if there are any) easily distinguishable.
The number of jobs running at any given moment
The time of the last job activity, so that inactivity time can be tracked on a dashboard
Long-running statements
- Ideally, we should be able to limit them to only those issued by maintenance jobs
  - I expect pg_stat_activity to suffice for what I intend here. pg_stat_statements is shipped with the HA image, but not enabled by default. We can also make an optional metric based on pg_stat_statements.
Locks by type, if that metric doesn't exist already
- Ideally, we should be able to limit them to only those held by maintenance jobs
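
To make the breakdowns above concrete, here is a minimal in-memory sketch of a per-job-type duration histogram and a failure rate split by failure type. It is only an illustration, not how Promscale collects or exports these metrics. The bucket bounds, the `JobStats` class, and the job and failure type names (`compression`, `retention`, `lock_timeout`) are all hypothetical.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical upper bounds (in seconds) for the duration histogram buckets.
BUCKETS = (1, 10, 60, 600, float("inf"))


class JobStats:
    """In-memory tallies for maintenance job runs, broken down by job type."""

    def __init__(self):
        # One list of per-bucket counts for each job type (not cumulative).
        self.durations = defaultdict(lambda: [0] * len(BUCKETS))
        self.runs = Counter()      # keyed by job_type
        self.failures = Counter()  # keyed by (job_type, failure_type)

    def record(self, job_type, seconds, failure_type=None):
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        self.durations[job_type][bisect_left(BUCKETS, seconds)] += 1
        self.runs[job_type] += 1
        if failure_type is not None:
            self.failures[(job_type, failure_type)] += 1

    def failure_rate(self, job_type):
        # Share of recorded runs of this job type that ended in any failure.
        failed = sum(n for (jt, _), n in self.failures.items() if jt == job_type)
        return failed / self.runs[job_type] if self.runs[job_type] else 0.0


stats = JobStats()
stats.record("compression", 4.2)
stats.record("compression", 75.0, failure_type="lock_timeout")
stats.record("retention", 0.3)
```

Keeping the failure type as a free-form label, rather than a fixed enumeration, leaves room to tell common failure cases apart without over-specifying them up front.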

Logs

We want to log into a (hyper)table, but transactional control may prevent us from extracting data from transactions that were rolled back. To circumvent that we can try:

Using an FDW pointed at the logging table
Coming up with some other C/Rust-based trickery
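
The rollback problem, and why writing over a separate connection sidesteps it, can be shown with a small stand-in. The sketch below uses SQLite from the Python standard library purely as an analogy. It is not PostgreSQL, not an FDW, and not the Promscale implementation. The `job_log` table, the function name, and the use of two database files are assumptions made for the illustration. Two files are used because SQLite allows only one writer per database at a time, whereas an FDW in PostgreSQL could point back at the same database.

```python
import os
import sqlite3
import tempfile


def run_failing_job(main_db_path, log_db_path):
    """Simulate a job that logs in-band and out-of-band, then rolls back."""
    # isolation_level=None selects autocommit mode, so transactions are
    # controlled only by the explicit BEGIN/ROLLBACK statements below.
    job_conn = sqlite3.connect(main_db_path, isolation_level=None)
    log_conn = sqlite3.connect(log_db_path, isolation_level=None)
    try:
        job_conn.execute("CREATE TABLE IF NOT EXISTS job_log (msg TEXT)")
        log_conn.execute("CREATE TABLE IF NOT EXISTS job_log (msg TEXT)")

        job_conn.execute("BEGIN")
        # In-band row: part of the job's own transaction.
        job_conn.execute("INSERT INTO job_log VALUES ('in-band: job started')")
        # Out-of-band row: written over a separate connection and committed
        # independently of the job's transaction.
        log_conn.execute("INSERT INTO job_log VALUES ('out-of-band: job started')")
        # The job fails, so its transaction is rolled back.
        job_conn.execute("ROLLBACK")

        in_band = [row[0] for row in job_conn.execute("SELECT msg FROM job_log")]
        out_of_band = [row[0] for row in log_conn.execute("SELECT msg FROM job_log")]
        return in_band, out_of_band
    finally:
        job_conn.close()
        log_conn.close()


with tempfile.TemporaryDirectory() as tmp:
    in_band, out_of_band = run_failing_job(
        os.path.join(tmp, "main.db"), os.path.join(tmp, "log.db")
    )
```

After the rollback, `in_band` is empty while `out_of_band` still holds its row. That is the property any transaction-transcending approach has to provide.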

Tasks

Pick a single simple-to-implement metric and add it all the way from the job description in SQL to a colorful panel in Grafana and a corresponding alert (if applicable)
- timescale/promscale#1597
- timescale/promscale#1598
- Survey (and mark as done) existing metrics while at it
Implement support for collecting data from rolled-back transactions
- #495
- #496
- #497
Add every other metric in the list above

Logging improvements (optional)

TBD after the "Benchmark shmem vs FDW approaches" task is done. We can either leave logging as is or use one of the transaction-transcending data collection approaches to implement logging into a (hyper)table.

Answer 1 · 2022-09-05T15:09:19.000Z

One thing I would find interesting to see is per-metric stats on the last start time, last successful completion time, and last duration of the maintenance job processing for a specific metric (i.e. how much time was spent in _prom_catalog.drop_metric_chunks for that metric).

Answer 2 · 2022-09-06T13:27:56.000Z

Another idea: the number of chunks we need to freeze

Answer 3 · 2022-09-07T16:14:31.000Z

FTR timescale/promscale#1232

Answer 4 · 2022-10-03T08:41:53.000Z

timescale/timescaledb#4678 landed in TimescaleDB