offen/docker-volume-backup

Support structured logging

Closed this issue · 8 comments

pixxon commented

Is your feature request related to a problem? Please describe.
Currently, every piece of information is dumped as a single line of text, using fmt.Sprintf to embed variables. This is hard to parse for third-party applications such as Grafana/Loki.

Describe the solution you'd like
Utilize the standard logfmt format, which is easy to read and parse. This is already supported by slog. Even better if it can be configured to use JSON in case somebody wants that (slog has that option too).
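
As a minimal sketch of what this could look like (the LOG_FORMAT switch is hypothetical, not an existing option of docker-volume-backup; it only shows how slog supports both formats out of the box):

package main

import (
    "log/slog"
    "os"
)

func main() {
    // Pick between logfmt-style text and JSON output at startup.
    var handler slog.Handler
    if os.Getenv("LOG_FORMAT") == "json" {
        handler = slog.NewJSONHandler(os.Stdout, nil)
    } else {
        // TextHandler emits key=value pairs, i.e. logfmt-style output.
        handler = slog.NewTextHandler(os.Stdout, nil)
    }
    logger := slog.New(handler)
    logger.Info("Created backup", "file", "/archive/backup.tar.gz")
}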

Describe alternatives you've considered
slog seems to be the straightforward choice: it's already used and supports this feature. The only thing that needs to change is the approach to the logs.

Additional context
It would be nice to provide some additional context where possible. For example, when using the foreground feature, the id of the cron job could be added to the context by default.
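
Again only a sketch: slog's Logger.With could attach such context once, so every subsequent line of a run carries it (the attribute names here are made up):

package main

import (
    "log/slog"
    "os"
)

func main() {
    logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
    // A derived logger carries the job's identity on every record it emits.
    jobLogger := logger.With("job_id", "a1b2c3", "cron", "0 2 * * *")
    jobLogger.Info("Starting backup run")
    // => time=... level=INFO msg="Starting backup run" job_id=a1b2c3 cron="0 2 * * *"
}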

m90 commented

This sounds a bit as if you are trying to recreate something that is already there. During each backup run, statistics are collected and stored in a Stats object:

// Stats global stats regarding script execution
type Stats struct {
    sync.Mutex
    StartTime  time.Time
    EndTime    time.Time
    TookTime   time.Duration
    LockedTime time.Duration
    LogOutput  *bytes.Buffer
    Containers ContainersStats
    Services   ServicesStats
    BackupFile BackupFileStats
    Storages   map[string]StorageStats
}

After being populated, this is currently only made available to notification templates: https://offen.github.io/docker-volume-backup/how-tos/set-up-notifications.html#notification-templates-reference

I'm not sure it makes sense to start emitting metrics in a different manner now and also keep supporting both. I also have to admit that I like log lines that are written mostly for humans, e.g. it helps with getting users up to speed when the log message tells you what's going wrong. If you're looking to get such data into Loki or other log aggregation systems, I would propose adding an option that dumps a serialized form (JSON or logfmt) of the Stats object to stdout after the script has run. This could then be ingested by everything that reads logs.
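
A rough sketch of what that option could look like, reusing the Stats type quoted above; the dumpStats helper and the idea of emitting exactly one JSON line are illustrative, not existing code:

// dumpStats serializes the already-populated Stats object (see above)
// to a single JSON line on stdout, where any log shipper can pick it
// up. Requires "encoding/json", "fmt" and "os".
func dumpStats(s *Stats) {
    b, err := json.Marshal(s)
    if err != nil {
        fmt.Fprintf(os.Stderr, "could not serialize stats: %v\n", err)
        return
    }
    fmt.Println(string(b))
}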

Also related is the idea of making this available for Prometheus: #64

pixxon commented

I think it's easier if I detail the setup I have running. To make monitoring my home stack (and especially finding bugs) easier, I have Grafana and Loki running.

In the stack there are 4 components:

  • The services being monitored. These all run in Docker containers and write their logs to stdout, which is then exposed by the Docker daemon.
  • Promtail, a monitoring agent that discovers the containers and collects their logs. (Discovery is done via Docker labels.)
  • Loki, which collects the logs coming from the different services and exposes them to Grafana for visualization.
  • Grafana, the dashboard used to display this information. For ease of use, filtering can be done via labels: {container="grafana"} |= `` | logfmt | level = debug

The output of the above filter would be something like:
logger=ngalert.sender.router t=2024-02-02T19:13:51.691945136Z level=debug msg="Finish of admin configuration sync"

In this setup, it is much easier to have the logs fed into Loki automatically, without any need for configuration. I would still keep the logs in the same human-readable format, just add some labels and annotations to them. (The current logs are actually already in this logfmt format, but everything is inside the "msg" tag.)
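
To illustrate why logfmt is convenient on the consuming side, here is a small sketch that pulls the level out of a line like the Grafana example above; the third-party github.com/go-logfmt/logfmt package is my choice for the example, not something the project depends on:

package main

import (
    "fmt"
    "strings"

    "github.com/go-logfmt/logfmt"
)

func main() {
    line := `logger=ngalert.sender.router t=2024-02-02T19:13:51.691945136Z level=debug msg="Finish of admin configuration sync"`
    dec := logfmt.NewDecoder(strings.NewReader(line))
    for dec.ScanRecord() { // one record per line
        for dec.ScanKeyval() { // one key=value pair at a time
            if string(dec.Key()) == "level" {
                fmt.Println("level:", string(dec.Value()))
            }
        }
    }
}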

m90 commented

The thing I am worried about with this approach is: each time the logger is called, a decision has to be made about which information to include in a structured way and which not. Obviously, this also happens when writing logs that are meant for humans, but there it's a bit more lenient. When people start consuming logs like this, I would expect requests like "oh, I would like to know about X at position Y, can you add it?" to surface, and that's not really sustainable. With the Stats object, this is well defined, and adding/changing properties has to be done only once instead of having to revisit every log call.

I have zero idea how complicated it would be to make your setup pick up a single line instead of all of them.

My question would therefore be: would it be possible / helpful for you to reuse the Stats object here and append its current state to every log call if requested (i.e. enabled using a setting)? Would that give you what you need?
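
One possible shape for that, purely as a sketch: a wrapping slog.Handler that snapshots Stats onto every record whenever the (hypothetical) setting is enabled. A complete version would also need to override WithAttrs/WithGroup so derived loggers keep the wrapper:

// statsHandler decorates an existing slog.Handler and attaches the
// current Stats snapshot to each record it handles.
// Requires "context" and "log/slog".
type statsHandler struct {
    slog.Handler
    stats *Stats
}

func (h statsHandler) Handle(ctx context.Context, r slog.Record) error {
    r.AddAttrs(slog.Any("stats", h.stats))
    return h.Handler.Handle(ctx, r)
}

// Usage:
// logger := slog.New(statsHandler{Handler: slog.NewTextHandler(os.Stdout, nil), stats: stats})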

m90 commented

Also, when you start emitting logs in a structured way, you start creating an interface for other consumers, which means this will need to stay stable. I currently do not consider log output a stable interface and actually do change it from time to time. Stats already has stability guarantees so that users can safely render their templates. This would also be a good reason to reuse what's already there.

pixxon commented

I don't think such verbose information is really needed for the logging. I don't want to build metrics out of it, just create potential alerts with some details.

Currently, the only thing I would like to be changed is to output the error in its own tag instead of embedding it inside the msg. Additionally, it could be useful to assign a unique identifier to individual backup runs and, in debug mode, provide the filename/line number if that makes sense.
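
Both of those asks map directly onto stock slog features; a minimal sketch (the message and error are stand-ins, not actual output of the tool):

package main

import (
    "errors"
    "log/slog"
    "os"
)

func main() {
    opts := &slog.HandlerOptions{
        AddSource: true, // adds source=/path/file.go:123 to each record
        Level:     slog.LevelDebug,
    }
    logger := slog.New(slog.NewTextHandler(os.Stdout, opts))
    err := errors.New("stand-in error")
    // The error travels in its own attribute instead of inside msg.
    logger.Error("Executing the script encountered a panic", "error", err)
}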

I personally think there could be some leeway in the logs interface; I have not seen it being very strict in other applications either. (They serve as a secondary line after the metrics themselves.)

m90 commented

Ok, I guess adding the error when logging at ERROR level makes sense. From what I remember, this only happens in a single place anyway, which is here:

defer func() {
    if pArg := recover(); pArg != nil {
        if err, ok := pArg.(error); ok {
            s.logger.Error(
                fmt.Sprintf("Executing the script encountered a panic: %v", err),
            )
            if hookErr := s.runHooks(err); hookErr != nil {
                s.logger.Error(
                    fmt.Sprintf("An error occurred calling the registered hooks: %s", hookErr),
                )
            }
            os.Exit(1)
        }
        panic(pArg)
    }
    if err := s.runHooks(nil); err != nil {
        s.logger.Error(
            fmt.Sprintf(
                "Backup procedure ran successfully, but an error occurred calling the registered hooks: %v",
                err,
            ),
        )
        os.Exit(1)
    }
    s.logger.Info("Finished running backup tasks.")
}()

I'd just really like to avoid duplicating data in the logs that is already available in Stats.

pixxon commented

There are a few more instances where it applies, but I do not want to introduce statistics or any complex data into the logging.

Yesterday, while running the test suite, I also found out that some data already appears in the logs: the storage context is present in the storage label. So I would really just add metadata that can be detected automatically to the logs. What I think will be useful for the new scheduler is the id of the job that printed a given log line.

m90 commented

Closing this, as I also improved error wrapping and logging in #360.

Every logger.Error call will now also emit the underlying error while presenting something a bit more digestible to the user.
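
As a sketch of what that pattern can look like with slog (the message and error chain here are made up for illustration, not taken from #360):

package main

import (
    "errors"
    "fmt"
    "log/slog"
    "os"
)

func main() {
    logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
    cause := errors.New("connection refused")
    // fmt.Errorf with %w keeps the chain intact for errors.Is/As.
    err := fmt.Errorf("copying archive to storage backend: %w", cause)
    // msg stays digestible; the full error chain gets its own attribute.
    logger.Error("Running the backup failed.", "error", err)
}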