RuntimeConfig: Immense Mutex Locking Times
The config.RuntimeConfig struct ensures data integrity through the use of its mu sync.RWMutex.
When creating lots of events in bulk, e.g., after loading multiple incidents on a fresh Icinga configuration, this mutex results in long locking times and high CPU usage.
I experienced this bug after starting with a fresh Icinga 2 configuration including lots of firing checks.
After configuring a new notification channel, it still had not appeared minutes later. Taking a look at the process revealed huge CPU utilization and, after enabling debug logs, missing configuration reloads.
root@9d9171726bb8:/go/src/github.com/Icinga/icinga-notifications# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 998 3.0 3752980 1002368 ? Ssl 09:00 78:16 /usr/local/bin/icinga-notifications-daemon --config /etc/icinga-notifications/config.yml
root 2519 0.0 0.0 1842300 22912 ? Sl 09:05 0:00 /usr/libexec/icinga-notifications/channel/email
root 2525 0.3 0.0 4188 3328 pts/0 Ss 09:08 0:00 /bin/bash
root 2533 0.0 0.0 8088 4096 pts/0 R+ 09:08 0:00 ps aux
:g/runtime-updates.*\(fetched configuration from database\|applied pending configuration\)
73 2024-01-29T09:01:11.728Z DEBUG runtime-updates fetched configuration from database {"took": "14.037396ms"}
75 2024-01-29T09:01:11.728Z DEBUG runtime-updates applied pending configuration {"took": "16.432µs"}
96 2024-01-29T09:01:13.338Z DEBUG runtime-updates fetched configuration from database {"took": "547.094244ms"}
126 2024-01-29T09:01:34.164Z DEBUG runtime-updates applied pending configuration {"took": "11.953µs"}
144 2024-01-29T09:01:41.130Z DEBUG runtime-updates fetched configuration from database {"took": "6.964389945s"}
220 2024-01-29T09:05:52.842Z DEBUG runtime-updates applied pending configuration {"took": "90.671µs"}
248 2024-01-29T09:05:55.517Z DEBUG runtime-updates fetched configuration from database {"took": "2.674626508s"}
332 2024-01-29T09:09:47.432Z DEBUG runtime-updates applied pending configuration {"took": "550.435µs"}
357 2024-01-29T09:09:47.577Z DEBUG runtime-updates fetched configuration from database {"took": "144.995215ms"}
433 2024-01-29T09:09:52.194Z DEBUG runtime-updates applied pending configuration {"took": "292.178µs"}
459 2024-01-29T09:09:52.202Z DEBUG runtime-updates fetched configuration from database {"took": "8.296999ms"}
462 2024-01-29T09:09:52.203Z DEBUG runtime-updates applied pending configuration {"took": "183.127µs"}
There is a delay of over four minutes between log lines 144 and 220 for a job which should be executed every second. Between those two log messages, only Locking the RuntimeConfig's mutex mu takes place.
A small patch was applied to log mutex usage together with the call origin and wait time.
Logging Patch
diff --git a/internal/config/runtime.go b/internal/config/runtime.go
index ad6dd2e..af70651 100644
--- a/internal/config/runtime.go
+++ b/internal/config/runtime.go
@@ -13,6 +13,7 @@ import (
"github.com/jmoiron/sqlx"
"go.uber.org/zap"
"golang.org/x/crypto/bcrypt"
+ "runtime/debug"
"strconv"
"strings"
"sync"
@@ -86,13 +87,19 @@ func (r *RuntimeConfig) PeriodicUpdates(ctx context.Context, interval time.Durat
}
// RLock locks the config for reading.
-func (r *RuntimeConfig) RLock() {
+func (r *RuntimeConfig) RLock(start time.Time) {
r.mu.RLock()
+ r.logger.Debugw("RLock",
+ zap.String("trace", string(debug.Stack())),
+ zap.Duration("took", time.Since(start)))
}
// RUnlock releases a lock obtained by RLock().
-func (r *RuntimeConfig) RUnlock() {
+func (r *RuntimeConfig) RUnlock(start time.Time) {
r.mu.RUnlock()
+ r.logger.Debugw("RUnlock",
+ zap.String("trace", string(debug.Stack())),
+ zap.Duration("took", time.Since(start)))
}
func (r *RuntimeConfig) GetRecipient(k recipient.Key) recipient.Recipient {
@@ -147,8 +154,9 @@ func (r *RuntimeConfig) GetContact(username string) *recipient.Contact {
// This method returns either a *Source or a nil pointer and logs the cause to the given logger. This is in almost all
// cases a debug logging message, except when something server-side is wrong, e.g., the hash is invalid.
func (r *RuntimeConfig) GetSourceFromCredentials(user, pass string, logger *logging.Logger) *Source {
- r.RLock()
- defer r.RUnlock()
+ start := time.Now()
+ r.RLock(start)
+ defer r.RUnlock(start)
sourceIdRaw, sourceIdOk := strings.CutPrefix(user, "source-")
if !sourceIdOk {
@@ -220,8 +228,14 @@ func (r *RuntimeConfig) fetchFromDatabase(ctx context.Context) error {
}
func (r *RuntimeConfig) applyPending() {
+ lockStart := time.Now()
r.mu.Lock()
- defer r.mu.Unlock()
+ r.logger.Debugw("Lock", zap.Duration("took", time.Since(lockStart)))
+
+ defer func() {
+ r.mu.Unlock()
+ r.logger.Debugw("Unlock", zap.Duration("took", time.Since(lockStart)))
+ }()
r.logger.Debug("applying pending configuration")
start := time.Now()
diff --git a/internal/config/verify.go b/internal/config/verify.go
index d6b7bf4..446ba5d 100644
--- a/internal/config/verify.go
+++ b/internal/config/verify.go
@@ -7,12 +7,14 @@ import (
"github.com/icinga/icinga-notifications/internal/recipient"
"github.com/icinga/icinga-notifications/internal/rule"
"github.com/icinga/icinga-notifications/internal/timeperiod"
+ "time"
)
// debugVerify performs a set of config validity/consistency checks that can be used for debugging.
func (r *RuntimeConfig) debugVerify() error {
- r.RLock()
- defer r.RUnlock()
+ start := time.Now()
+ r.RLock(start)
+ defer r.RUnlock(start)
if r.Channels == nil {
return errors.New("RuntimeConfig.Channels is nil")
diff --git a/internal/incident/incident.go b/internal/incident/incident.go
index b0d60aa..a55f515 100644
--- a/internal/incident/incident.go
+++ b/internal/incident/incident.go
@@ -107,8 +107,9 @@ func (i *Incident) ProcessEvent(ctx context.Context, ev *event.Event, created bo
i.Lock()
defer i.Unlock()
- i.runtimeConfig.RLock()
- defer i.runtimeConfig.RUnlock()
+ start := time.Now()
+ i.runtimeConfig.RLock(start)
+ defer i.runtimeConfig.RUnlock(start)
tx, err := i.db.BeginTxx(ctx, nil)
if err != nil {
@@ -187,8 +188,9 @@ func (i *Incident) RetriggerEscalations(ev *event.Event) {
i.Lock()
defer i.Unlock()
- i.runtimeConfig.RLock()
- defer i.runtimeConfig.RUnlock()
+ start := time.Now()
+ i.runtimeConfig.RLock(start)
+ defer i.runtimeConfig.RUnlock(start)
if !i.RecoveredAt.IsZero() {
// Incident is recovered in the meantime.
@@ -636,8 +638,9 @@ func (i *Incident) processAcknowledgementEvent(ctx context.Context, tx *sqlx.Tx,
// RestoreEscalationStateRules restores this incident's rules based on the given escalation states.
func (i *Incident) RestoreEscalationStateRules(states []*EscalationState) {
- i.runtimeConfig.RLock()
- defer i.runtimeConfig.RUnlock()
+ start := time.Now()
+ i.runtimeConfig.RLock(start)
+ defer i.runtimeConfig.RUnlock(start)
for _, state := range states {
escalation := i.runtimeConfig.GetRuleEscalation(state.RuleEscalationID)
The zipped logs can be downloaded here.
A first inspection of the new logs shows locking times of over four minutes.
Script
awk '
BEGIN {
print "set logscale y"
print "set terminal png size 640, 480"
print "set output \"lock-time.png\""
print "plot \"-\" with lines title \"Lock Waiting, Seconds\""
}
function duration(ts, sum, m) {
sum = 0
if (match(ts, /([0-9]+)m[0-9]+/, m))
sum += strtonum(m[1]) * 60
if (match(ts, /([0-9]+\.[0-9]+)s.*/, m))
sum += strtonum(m[1])
if (match(ts, /([0-9]+\.[0-9]+)ms.*/, m))
sum += strtonum(m[1]) / 1000
return sum
}
/runtime-updates[\t ]+(R|)Lock/ {
match($0, /"took": "(.*)"/, m)
print NR, duration(m[1])
}
' runtime-log3 | \
gnuplot -p
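As an aside, the awk duration() helper above re-implements the parsing of Go duration strings by hand; in Go itself, time.ParseDuration handles the logged values directly. A small sketch:

```go
package main

import (
	"fmt"
	"time"
)

// seconds converts a zap-formatted duration string from the debug logs
// (e.g. "4m50.195727767s" or "547.094244ms") into seconds, using the
// standard library instead of hand-rolled regex parsing.
func seconds(s string) float64 {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0 // malformed input, treat as zero like the awk helper
	}
	return d.Seconds()
}

func main() {
	fmt.Println(seconds("4m50.195727767s"))
	fmt.Println(seconds("16.432µs"))
}
```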
Taking a look at the (R|)Lock origins reveals only three places:
271 UpdateFromDatabase->applyPending
271 UpdateFromDatabase->debugVerify
3734 Listener->ProcessEvent
Script
awk '
/runtime-updates.*Lock/ {
if ($4 == "Lock") {
print "UpdateFromDatabase->applyPending"
} else if (index($0, "debugVerify")) {
print "UpdateFromDatabase->debugVerify"
} else if (index($0, "ProcessEvent")) {
print "Listener->ProcessEvent"
} else {
print $0
}
}' runtime-log3 | sort | uniq -c | sort -n
The identical count for UpdateFromDatabase reflects the fact that applyPending and debugVerify are called one after another from UpdateFromDatabase, as stated above.
However, most locks originate in the Listener. A second look reveals that there are two different locking points:
2695 github.com/icinga/icinga-notifications/internal/config.(*RuntimeConfig).GetSourceFromCredentials(?, {?, ?}, {?, ?}, ?)
1039 github.com/icinga/icinga-notifications/internal/incident.(*Incident).ProcessEvent(?, {?, ?}, ?, ?)
Script
awk '
/runtime-updates.*RLock.*ProcessEvent/ {
match($0, /"trace": "(.*)",/, m)
trace = ""
split(m[1], p, "\\\\n")
for (i = 6; i < length(p); i++) {
if (index(p[i], "\\t"))
continue
if (index(p[i], "(*Listener).ProcessEvent"))
break
trace = trace " " p[i]
}
gsub(/0x[0-9a-f]+/, "?", trace)
print trace
}' runtime-log3 | sort | uniq -c
- The (*RuntimeConfig).GetSourceFromCredentials method holds a RLock unnecessarily long, over the whole method call including the bcrypt verification. This might be narrowed down.
- (*Incident).ProcessEvent holds its RLock even longer, over a SQL statement and sub-methods.
Taking the caller into account might unveil a pattern:
[...]
190 97.3µs Listener->GetSourceFromCredentials
191 95.009µs Listener->GetSourceFromCredentials
326 22.086814835s UpdateFromDatabase->applyPending
329 21.887824507s Listener->GetSourceFromCredentials
330 19.199263445s Listener->GetSourceFromCredentials
[...]
1905 388.92µs Listener->GetSourceFromCredentials
1906 108.036µs Listener->GetSourceFromCredentials
3544 4m50.195727767s UpdateFromDatabase->applyPending
3552 4m49.201059997s Listener->GetSourceFromCredentials
3553 4m49.211130819s Listener->GetSourceFromCredentials
[...]
4570 9.854047633s Listener->GetSourceFromCredentials
4572 4m34.851601492s Listener->GetSourceFromCredentials
5641 3m23.225567006s UpdateFromDatabase->applyPending
5647 85µs UpdateFromDatabase->debugVerify
5648 1m27.909018073s Listener->GetSourceFromCredentials
Script
awk '
/runtime-updates[\t ]+(R|)Lock/ {
match($0, /"took": "(.*)"/, m)
took = m[1]
if ($4 == "Lock") {
origin = "UpdateFromDatabase->applyPending"
} else if (index($0, "debugVerify")) {
origin = "UpdateFromDatabase->debugVerify"
} else if (index($0, "GetSourceFromCredentials")) {
origin = "Listener->GetSourceFromCredentials"
} else if (index($0, "ProcessEvent")) {
origin = "Listener->ProcessEvent"
} else {
origin = $0
}
print NR, took, origin
}' runtime-log3
At least the first big spikes are related to applyPending and not so much to GetSourceFromCredentials, which is all over the place anyway. (Edit: Comparing only the times when a mutex lock was acquired does not make sense on its own.)
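A possible mitigation for the GetSourceFromCredentials case is to narrow the critical section so that the expensive bcrypt verification runs outside the read lock. The sketch below uses simplified stand-in types (runtimeConfig, source, and slowVerify are hypothetical, not the project's API) and assumes config objects are immutable once published, so using the looked-up pointer after RUnlock is safe:

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"sync"
	"time"
)

// source is a stand-in for config.Source, reduced to what the sketch needs.
type source struct {
	ID           int64
	PasswordHash string
}

// runtimeConfig is a stand-in for config.RuntimeConfig.
type runtimeConfig struct {
	mu      sync.RWMutex
	sources map[int64]*source
}

// slowVerify stands in for bcrypt.CompareHashAndPassword, which is
// deliberately expensive; the sleep mimics that cost.
func slowVerify(hash, pass string) bool {
	time.Sleep(50 * time.Millisecond)
	return subtle.ConstantTimeCompare([]byte(hash), []byte(pass)) == 1
}

// getSourceFromCredentials narrows the critical section: only the map
// lookup happens under RLock; the expensive verification runs unlocked.
func (r *runtimeConfig) getSourceFromCredentials(id int64, pass string) *source {
	r.mu.RLock()
	src := r.sources[id]
	var hash string
	if src != nil {
		hash = src.PasswordHash
	}
	r.mu.RUnlock() // release before the slow part

	if src == nil || !slowVerify(hash, pass) {
		return nil
	}
	return src
}

func main() {
	cfg := &runtimeConfig{sources: map[int64]*source{
		1: {ID: 1, PasswordHash: "secret"},
	}}
	fmt.Println(cfg.getSourceFromCredentials(1, "secret") != nil) // true
	fmt.Println(cfg.getSourceFromCredentials(1, "wrong") != nil)  // false
}
```

With this shape, a slow credential check no longer extends the read-lock hold time, which should in turn keep applyPending's writer Lock from stalling every other reader.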