Icinga/icinga-notifications

RuntimeConfig: Immense Mutex Locking Times


oxzi commented

The config.RuntimeConfig struct ensures data integrity through the use of its mu sync.RWMutex.
When creating lots of events in bulk, e.g., after loading multiple incidents on a fresh Icinga configuration, this mutex results in long locking times and high CPU usage.
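For illustration, here is a minimal, self-contained Go sketch (not code from icinga-notifications; goroutine counts and sleep durations are made up) of the contention pattern: while readers hold the RWMutex for reading, a writer's Lock() has to wait, and once a writer is waiting, new RLock() calls queue up behind it as well, so a single long-held read lock can stall both the writer and all subsequent readers.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	// Simulate bulk event processing: readers that hold the read lock for a while.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 5; j++ {
				mu.RLock()
				time.Sleep(50 * time.Millisecond) // work done under the read lock
				mu.RUnlock()
			}
		}()
	}

	// Simulate applyPending: a periodic writer that needs the exclusive lock.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for j := 0; j < 3; j++ {
			start := time.Now()
			mu.Lock()
			fmt.Printf("writer waited %v for Lock\n", time.Since(start))
			mu.Unlock()
			time.Sleep(100 * time.Millisecond)
		}
	}()

	wg.Wait()
}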

I experienced this bug after starting with a fresh Icinga 2 configuration including lots of firing checks.
After configuring a new notification channel, it still had not appeared minutes later. Taking a look at the process showed huge CPU utilization and, after enabling debug logs, missing configuration reloads.

root@9d9171726bb8:/go/src/github.com/Icinga/icinga-notifications# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  998  3.0 3752980 1002368 ?     Ssl  09:00  78:16 /usr/local/bin/icinga-notifications-daemon --config /etc/icinga-notifications/config.yml
root        2519  0.0  0.0 1842300 22912 ?       Sl   09:05   0:00 /usr/libexec/icinga-notifications/channel/email
root        2525  0.3  0.0   4188  3328 pts/0    Ss   09:08   0:00 /bin/bash
root        2533  0.0  0.0   8088  4096 pts/0    R+   09:08   0:00 ps aux
:g/runtime-updates.*\(fetched configuration from database\|applied pending configuration\)
   73 2024-01-29T09:01:11.728Z  DEBUG runtime-updates fetched configuration from database {"took": "14.037396ms"}
   75 2024-01-29T09:01:11.728Z  DEBUG runtime-updates applied pending configuration {"took": "16.432µs"}
   96 2024-01-29T09:01:13.338Z  DEBUG runtime-updates fetched configuration from database {"took": "547.094244ms"}
  126 2024-01-29T09:01:34.164Z  DEBUG runtime-updates applied pending configuration {"took": "11.953µs"}
  144 2024-01-29T09:01:41.130Z  DEBUG runtime-updates fetched configuration from database {"took": "6.964389945s"}
  220 2024-01-29T09:05:52.842Z  DEBUG runtime-updates applied pending configuration {"took": "90.671µs"}
  248 2024-01-29T09:05:55.517Z  DEBUG runtime-updates fetched configuration from database {"took": "2.674626508s"}
  332 2024-01-29T09:09:47.432Z  DEBUG runtime-updates applied pending configuration {"took": "550.435µs"}
  357 2024-01-29T09:09:47.577Z  DEBUG runtime-updates fetched configuration from database {"took": "144.995215ms"}
  433 2024-01-29T09:09:52.194Z  DEBUG runtime-updates applied pending configuration {"took": "292.178µs"}
  459 2024-01-29T09:09:52.202Z  DEBUG runtime-updates fetched configuration from database {"took": "8.296999ms"}
  462 2024-01-29T09:09:52.203Z  DEBUG runtime-updates applied pending configuration {"took": "183.127µs"}

There is a delay of over four minutes between lines 144 and 220 for a job that should be executed every second. Between those log messages, only locking of the RuntimeConfig's mutex mu takes place.
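For context, a simplified sketch of how that once-per-second job gets stalled. This is not the actual implementation; it assumes UpdateFromDatabase is the method wrapping fetchFromDatabase and applyPending, as the stack-trace origins further below suggest.

func (r *RuntimeConfig) PeriodicUpdates(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// If applyPending blocks for minutes on r.mu.Lock(), this call does not
			// return, pending ticks are dropped, and the "every second" update
			// degrades to the multi-minute gaps visible in the log above.
			r.UpdateFromDatabase(ctx)
		case <-ctx.Done():
			return
		}
	}
}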

A small patch was applied to log mutex usage, the call origin, and the time taken.

Logging Patch
diff --git a/internal/config/runtime.go b/internal/config/runtime.go
index ad6dd2e..af70651 100644
--- a/internal/config/runtime.go
+++ b/internal/config/runtime.go
@@ -13,6 +13,7 @@ import (
 	"github.com/jmoiron/sqlx"
 	"go.uber.org/zap"
 	"golang.org/x/crypto/bcrypt"
+	"runtime/debug"
 	"strconv"
 	"strings"
 	"sync"
@@ -86,13 +87,19 @@ func (r *RuntimeConfig) PeriodicUpdates(ctx context.Context, interval time.Durat
 }
 
 // RLock locks the config for reading.
-func (r *RuntimeConfig) RLock() {
+func (r *RuntimeConfig) RLock(start time.Time) {
 	r.mu.RLock()
+	r.logger.Debugw("RLock",
+		zap.String("trace", string(debug.Stack())),
+		zap.Duration("took", time.Since(start)))
 }
 
 // RUnlock releases a lock obtained by RLock().
-func (r *RuntimeConfig) RUnlock() {
+func (r *RuntimeConfig) RUnlock(start time.Time) {
 	r.mu.RUnlock()
+	r.logger.Debugw("RUnlock",
+		zap.String("trace", string(debug.Stack())),
+		zap.Duration("took", time.Since(start)))
 }
 
 func (r *RuntimeConfig) GetRecipient(k recipient.Key) recipient.Recipient {
@@ -147,8 +154,9 @@ func (r *RuntimeConfig) GetContact(username string) *recipient.Contact {
 // This method returns either a *Source or a nil pointer and logs the cause to the given logger. This is in almost all
 // cases a debug logging message, except when something server-side is wrong, e.g., the hash is invalid.
 func (r *RuntimeConfig) GetSourceFromCredentials(user, pass string, logger *logging.Logger) *Source {
-	r.RLock()
-	defer r.RUnlock()
+	start := time.Now()
+	r.RLock(start)
+	defer r.RUnlock(start)
 
 	sourceIdRaw, sourceIdOk := strings.CutPrefix(user, "source-")
 	if !sourceIdOk {
@@ -220,8 +228,14 @@ func (r *RuntimeConfig) fetchFromDatabase(ctx context.Context) error {
 }
 
 func (r *RuntimeConfig) applyPending() {
+	lockStart := time.Now()
 	r.mu.Lock()
-	defer r.mu.Unlock()
+	r.logger.Debugw("Lock", zap.Duration("took", time.Since(lockStart)))
+
+	defer func() {
+		r.mu.Unlock()
+		r.logger.Debugw("Unlock", zap.Duration("took", time.Since(lockStart)))
+	}()
 
 	r.logger.Debug("applying pending configuration")
 	start := time.Now()
diff --git a/internal/config/verify.go b/internal/config/verify.go
index d6b7bf4..446ba5d 100644
--- a/internal/config/verify.go
+++ b/internal/config/verify.go
@@ -7,12 +7,14 @@ import (
 	"github.com/icinga/icinga-notifications/internal/recipient"
 	"github.com/icinga/icinga-notifications/internal/rule"
 	"github.com/icinga/icinga-notifications/internal/timeperiod"
+	"time"
 )
 
 // debugVerify performs a set of config validity/consistency checks that can be used for debugging.
 func (r *RuntimeConfig) debugVerify() error {
-	r.RLock()
-	defer r.RUnlock()
+	start := time.Now()
+	r.RLock(start)
+	defer r.RUnlock(start)
 
 	if r.Channels == nil {
 		return errors.New("RuntimeConfig.Channels is nil")
diff --git a/internal/incident/incident.go b/internal/incident/incident.go
index b0d60aa..a55f515 100644
--- a/internal/incident/incident.go
+++ b/internal/incident/incident.go
@@ -107,8 +107,9 @@ func (i *Incident) ProcessEvent(ctx context.Context, ev *event.Event, created bo
 	i.Lock()
 	defer i.Unlock()
 
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	tx, err := i.db.BeginTxx(ctx, nil)
 	if err != nil {
@@ -187,8 +188,9 @@ func (i *Incident) RetriggerEscalations(ev *event.Event) {
 	i.Lock()
 	defer i.Unlock()
 
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	if !i.RecoveredAt.IsZero() {
 		// Incident is recovered in the meantime.
@@ -636,8 +638,9 @@ func (i *Incident) processAcknowledgementEvent(ctx context.Context, tx *sqlx.Tx,
 
 // RestoreEscalationStateRules restores this incident's rules based on the given escalation states.
 func (i *Incident) RestoreEscalationStateRules(states []*EscalationState) {
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	for _, state := range states {
 		escalation := i.runtimeConfig.GetRuleEscalation(state.RuleEscalationID)
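A note on the instrumentation: the callers take start := time.Now() before calling RLock(start), so the "took" logged by RLock/Lock is the time spent waiting to acquire the mutex, while the value logged at RUnlock/Unlock additionally includes the time the lock was held. A hypothetical leaner variant would keep the original zero-argument signatures and measure the wait inside the method itself, logging only when it exceeds a threshold:

// Hypothetical variant, not part of the patch above: measure the acquisition wait
// inside RLock itself and only log slow acquisitions.
func (r *RuntimeConfig) RLock() {
	start := time.Now()
	r.mu.RLock()
	if wait := time.Since(start); wait > 100*time.Millisecond {
		r.logger.Debugw("RLock acquired after long wait", zap.Duration("wait", wait))
	}
}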

The logs can be downloaded zipped here.

A first inspection of the new logs shows locking times of over four minutes.
(Plot lock-time.png: "Lock Waiting, Seconds" per log line, as generated by the script below.)

Script
awk '
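  # Requires GNU awk (gawk): the three-argument match() and strtonum() are gawk extensions.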
  BEGIN {
    print "set logscale y"
    print "set terminal png size 640, 480"
    print "set output \"lock-time.png\""
    print "plot \"-\" with lines title \"Lock Waiting, Seconds\""
  }

  function duration(ts,    sum, m) {
    sum = 0
    if (match(ts, /([0-9]+)m[0-9]+/, m))
      sum += strtonum(m[1]) * 60
    if (match(ts, /([0-9]+\.[0-9]+)s.*/, m))
      sum += strtonum(m[1])
    if (match(ts, /([0-9]+\.[0-9]+)ms.*/, m))
      sum += strtonum(m[1]) / 1000
    return sum
  }

  /runtime-updates[\t ]+(R|)Lock/ {
    match($0, /"took": "(.*)"/, m)
    print NR, duration(m[1])
  }
  ' runtime-log3 | \
  gnuplot -p

Taking a look at the (R|)Lock origins reveals only three places:

    271 UpdateFromDatabase->applyPending
    271 UpdateFromDatabase->debugVerify
   3734 Listener->ProcessEvent
Script
awk '
/runtime-updates.*Lock/ {
  if ($4 == "Lock") {
    print "UpdateFromDatabase->applyPending"
  } else if (index($0, "debugVerify")) {
    print "UpdateFromDatabase->debugVerify"
  } else if (index($0, "ProcessEvent")) {
    print "Listener->ProcessEvent"
  } else {
    print $0
  }
}' runtime-log3 | sort | uniq -c | sort -n

The identical count for both UpdateFromDatabase origins confirms that applyPending and debugVerify are called one after the other from UpdateFromDatabase, as stated above.
However, most locks originate in the Listener. A second look reveals two different locking points there:

   2695  github.com/icinga/icinga-notifications/internal/config.(*RuntimeConfig).GetSourceFromCredentials(?, {?, ?}, {?, ?}, ?)
   1039  github.com/icinga/icinga-notifications/internal/incident.(*Incident).ProcessEvent(?, {?, ?}, ?, ?)
Script
awk '
/runtime-updates.*RLock.*ProcessEvent/ {
  match($0, /"trace": "(.*)",/, m)

  trace = ""
  split(m[1], p, "\\\\n")
  for (i = 6; i < length(p); i++) {
    if (index(p[i], "\\t"))
      continue
    if (index(p[i], "(*Listener).ProcessEvent"))
      break
    trace = trace " " p[i]
  }

  gsub(/0x[0-9a-f]+/, "?", trace)
  print trace
}' runtime-log3 | sort | uniq -c
  • The (*RuntimeConfig).GetSourceFromCredentials method holds an RLock unnecessarily long, over the whole method call including the bcrypt verification. This might be narrowed down (see the sketch after this list).
  • (*Incident).ProcessEvent holds its RLock even longer, over a SQL statement and sub-methods.
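As a rough illustration of the first point, a hedged sketch of how the critical section in GetSourceFromCredentials could be narrowed: look up the source and copy the credential hash under the read lock, then run the deliberately slow bcrypt comparison without holding it. The field names Sources and ListenerPasswordHash are assumptions for illustration only, the sketch uses the original, unpatched RLock()/RUnlock() signatures, and the same idea would apply to the SQL work in (*Incident).ProcessEvent.

func (r *RuntimeConfig) GetSourceFromCredentials(user, pass string, logger *logging.Logger) *Source {
	sourceIdRaw, sourceIdOk := strings.CutPrefix(user, "source-")
	if !sourceIdOk {
		logger.Debugw("username does not start with the expected source- prefix", zap.String("user", user))
		return nil
	}
	sourceId, err := strconv.ParseInt(sourceIdRaw, 10, 64)
	if err != nil {
		logger.Debugw("cannot parse source id from username", zap.Error(err))
		return nil
	}

	// Hold the read lock only for the map lookup and the hash copy.
	r.RLock()
	source := r.Sources[sourceId] // assumed map field, for illustration
	var hash string
	if source != nil {
		hash = source.ListenerPasswordHash // assumed field, for illustration
	}
	r.RUnlock()

	if source == nil || hash == "" {
		return nil
	}

	// The expensive bcrypt check no longer blocks applyPending's mu.Lock() or other readers.
	// Returning the pointer after unlocking assumes *Source values are not mutated in place
	// once published in the runtime config.
	if err := bcrypt.CompareHashAndPassword([]byte(hash), []byte(pass)); err != nil {
		logger.Debugw("password mismatch for source", zap.Int64("source-id", sourceId), zap.Error(err))
		return nil
	}

	return source
}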

Taking the caller into account might reveal a pattern:

[...]
190 97.3µs Listener->GetSourceFromCredentials
191 95.009µs Listener->GetSourceFromCredentials
326 22.086814835s UpdateFromDatabase->applyPending
329 21.887824507s Listener->GetSourceFromCredentials
330 19.199263445s Listener->GetSourceFromCredentials
[...]
1905 388.92µs Listener->GetSourceFromCredentials
1906 108.036µs Listener->GetSourceFromCredentials
3544 4m50.195727767s UpdateFromDatabase->applyPending
3552 4m49.201059997s Listener->GetSourceFromCredentials
3553 4m49.211130819s Listener->GetSourceFromCredentials
[...]
4570 9.854047633s Listener->GetSourceFromCredentials
4572 4m34.851601492s Listener->GetSourceFromCredentials
5641 3m23.225567006s UpdateFromDatabase->applyPending
5647 85µs UpdateFromDatabase->debugVerify
5648 1m27.909018073s Listener->GetSourceFromCredentials
Script
awk '
/runtime-updates[\t ]+(R|)Lock/ {
  match($0, /"took": "(.*)"/, m)
  took = m[1]

  if ($4 == "Lock") {
    origin = "UpdateFromDatabase->applyPending"
  } else if (index($0, "debugVerify")) {
    origin = "UpdateFromDatabase->debugVerify"
  } else if (index($0, "GetSourceFromCredentials")) {
    origin = "Listener->GetSourceFromCredentials"
  } else if (index($0, "ProcessEvent")) {
    origin = "Listener->ProcessEvent"
  } else {
    origin = $0
  }

  print NR, took, origin
}' runtime-log3

At least the first big spikes are related to applyPending and not so much to GetSourceFromCredentials, which is all over the place anyway. (Edit: Comparing only the times at which a mutex lock was acquired does not make sense on its own.)