timescale/promscale_extension

Maintenance Job Observability

sumerman opened this issue · 4 comments

This is a top-level tracking issue.

Required metrics (some may exist already):

  • The amount of work that remains to be done
    • The size of the job queue, if there is one
    • The number of chunks to be removed/compressed
    • Break down by the job type
  • A histogram of job completion times
    • Break down by the job type
  • A failure rate
    • Break down by the job type
    • Break down by the failure type
      • I don't want to over-specify it. The goal here is to be able to make common failure cases (if there are any) easily distinguishable.
  • The number of jobs running at any given moment
  • Time of the last job activity, with inactivity time tracked on a dashboard
  • Long-running statements
    • Ideally, we should be able to limit them to only those issued by maintenance jobs
      • I expect pg_stat_activity to suffice for what I intend here. pg_stat_statements is shipped with the HA image, but not enabled by default. We can also make an optional metric based on pg_stat_statements.
  • Locks by type, if that metric doesn't already exist
    • Ideally, we should be able to limit them to only those held by maintenance jobs
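
To make the histogram and failure-rate breakdowns above more concrete, here is a minimal Python sketch of how they could be derived from a list of finished job runs. The `JobRun` record, its field names, the bucket bounds, and the sample failure types are all assumptions made for illustration; none of them come from the extension.

```python
from bisect import bisect_left
from collections import Counter
from dataclasses import dataclass
from typing import Optional


# Hypothetical record of one finished maintenance job run.
# Field names are illustrative, not the extension's schema.
@dataclass
class JobRun:
    job_type: str                        # e.g. "retention" or "compression"
    duration_s: float                    # wall-clock duration in seconds
    failure_type: Optional[str] = None   # None means the run succeeded


# Assumed histogram bucket upper bounds, in seconds.
BUCKETS = [1.0, 10.0, 60.0, 600.0]


def duration_histogram(runs):
    """Per job type, count runs per bucket; the last slot is the overflow bucket."""
    hist = {}
    for run in runs:
        counts = hist.setdefault(run.job_type, [0] * (len(BUCKETS) + 1))
        # Index of the first bucket whose upper bound is >= the duration.
        counts[bisect_left(BUCKETS, run.duration_s)] += 1
    return hist


def failure_rate(runs):
    """Per job type, the fraction of runs that failed."""
    total, failed = Counter(), Counter()
    for run in runs:
        total[run.job_type] += 1
        if run.failure_type is not None:
            failed[run.job_type] += 1
    return {job_type: failed[job_type] / total[job_type] for job_type in total}


def failures_by_type(runs):
    """Count failures keyed by (job type, failure type)."""
    return Counter(
        (run.job_type, run.failure_type)
        for run in runs
        if run.failure_type is not None
    )


runs = [
    JobRun("retention", 0.5),
    JobRun("retention", 12.0, failure_type="lock_timeout"),
    JobRun("compression", 75.0),
    JobRun("compression", 700.0, failure_type="deadlock"),
]
hist = duration_histogram(runs)
rates = failure_rate(runs)
by_type = failures_by_type(runs)
```

Keeping the failure type as a free-form label matches the intent of not over-specifying it: new failure cases show up as new keys without any schema change.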

Logs

We want to log into a (hyper)table, but transactional control may prevent us from extracting data from transactions that were rolled back. To work around that, we can try:

  • Use an FDW pointed at the logging table
  • Come up with some other C/Rust-based trickery
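
The rollback problem and the side-channel idea can be illustrated with a small, self-contained Python sketch. It uses SQLite from the standard library purely as a stand-in for PostgreSQL, and a second connection as a stand-in for the FDW loopback; the table name `job_log` and the messages are made up. Because SQLite allows only one writer per database file, the side channel writes to its own file here, whereas an FDW would point back at the logging table itself.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# isolation_level=None puts both connections in autocommit mode, so the only
# transaction boundaries are the explicit BEGIN/ROLLBACK statements below.
job = sqlite3.connect(os.path.join(workdir, "main.db"), isolation_level=None)
side = sqlite3.connect(os.path.join(workdir, "log.db"), isolation_level=None)

job.execute("CREATE TABLE job_log (msg TEXT)")
side.execute("CREATE TABLE job_log (msg TEXT)")

# The maintenance job logs inside its own transaction, then fails.
job.execute("BEGIN")
job.execute("INSERT INTO job_log VALUES ('in-transaction: starting chunk drop')")

# The same event is also written through the independent connection,
# which commits immediately and is unaffected by the job's transaction.
side.execute("INSERT INTO job_log VALUES ('side channel: starting chunk drop')")

job.execute("ROLLBACK")  # simulate the job failing

in_txn_rows = job.execute("SELECT count(*) FROM job_log").fetchone()[0]
side_rows = side.execute("SELECT count(*) FROM job_log").fetchone()[0]
print(in_txn_rows, side_rows)  # the in-transaction row is gone, the side-channel row remains
```

The log row written inside the failed transaction disappears with the rollback, while the row written over the separate connection survives, which is the property either workaround needs to provide.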

Tasks

  • Pick a single simple-to-implement metric and add it all the way from the job description in SQL to a colorful panel in Grafana and a corresponding alert (if applicable)
  • Implement support for collecting data from rolled-back transactions
  • Add every other metric in the list above

Logging improvements (optional)

TBD after "Benchmark shmem vs FDW approaches" is done. We can either leave logging as is or use one of the transaction-transcending data collection approaches to implement logging into a (hyper)table.

One thing that I would find interesting to see would be something like per-metric stats on the last start time, last successful complete time, and last duration of the maintenance job processing of a specific metric (i.e. how much time was spent in _prom_catalog.drop_metric_chunks for a specific metric).
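
A rough Python sketch of what such per-metric bookkeeping could look like is below. The `MetricJobStats` record, the `track` helper, and the injectable clock are hypothetical names chosen for illustration; in the extension this state would presumably live in a table and be subject to the same rollback concerns discussed above.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical per-metric record; timestamps are seconds since the epoch.
@dataclass
class MetricJobStats:
    last_start: Optional[float] = None
    last_success: Optional[float] = None
    last_duration_s: Optional[float] = None


stats: Dict[str, MetricJobStats] = {}


@contextmanager
def track(metric_name, clock=time.time):
    """Record start time, duration, and (on success only) completion time."""
    entry = stats.setdefault(metric_name, MetricJobStats())
    started = clock()
    entry.last_start = started
    try:
        yield entry
    except Exception:
        # A failed run still updates the duration, but not last_success.
        entry.last_duration_s = clock() - started
        raise
    else:
        finished = clock()
        entry.last_duration_s = finished - started
        entry.last_success = finished


# Usage with a fake clock so the recorded values are deterministic.
ticks = iter([100.0, 103.5, 200.0, 201.0])


def fake_clock():
    return next(ticks)


with track("cpu_usage", clock=fake_clock):
    pass  # stand-in for the per-metric maintenance work

try:
    with track("cpu_usage", clock=fake_clock):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

after_failure = stats["cpu_usage"]
```

After the second, failing run, `last_start` and `last_duration_s` reflect the failed attempt while `last_success` still points at the earlier successful one, which is what makes a stalled metric visible.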

Another idea: the number of chunks we need to freeze

timescale/timescaledb#4678 landed in the TimescaleDB