Alert example fires for maintained snapshots
v4rakh opened this issue · 15 comments
Hi,
maybe it's just my use case, but I found it confusing to get alerts for old snapshots which I want to keep, e.g. for more than 15 days as part of my retention. This happens with the example alert provided in the README file because restic-exporter reports a timestamp for each snapshot, and some of them will legitimately be old. In my case the alert should only fire if the latest snapshot exceeds a certain age, i.e. a backup has potentially been missed.
Maybe we'd like to add something like this to the README?
# for one day
(time() - (max without (snapshot_hash) (restic_backup_timestamp))) > (1 * 86400)
# for 15 days as currently outlined in the README
(time() - (max without (snapshot_hash) (restic_backup_timestamp))) > (15 * 86400)
Alerts are optional and they are provided just as a reference.
The common use case is to automate backups with a cron task or another scheduler. I'm doing incremental backups every day. I have configured the 2 alerts in the readme:
- check alert => checks if the backup repository has errors / is corrupted
- old backup alert => checks if some device is not running backups on schedule; it could be due to network issues, an error in the scheduled task...
In your case, you can keep the first alert and create custom alerts for specific backups. Could you post the response of the /metrics endpoint? I would like to understand the issue better.
Thanks as always for your quick reply.
I'm not sure whether it's an issue rather than expected behavior of the metric: each snapshot hash gets its own gauge exported. So if a lot of snapshots are retained, they all end up in the metrics. Or is that unexpected behavior?
Here's an example of one of my backups, copied output of the endpoint of the exporter:
# HELP restic_check_success Result of restic check operation in the repository
# TYPE restic_check_success gauge
restic_check_success 2.0
# HELP restic_snapshots_total Total number of snapshots in the repository
# TYPE restic_snapshots_total counter
restic_snapshots_total 24.0
# HELP restic_backup_timestamp Timestamp of the last backup
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="428a4022933f2a1e162cbfa6685055afb27fbaefb20b784c63fbefc33a25d49e",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="2ec546bf3f53ecef07491a6536fe1e889b9e2d3a230d26cd3d0b189fd9325bb3",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="dba9a400b7961289953865932ed0c142ed218bcc0d736ca8bb92af2141340160",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="13a9064b8f2f0c6e176b6997dcd52668660d5f2e8dbcf77ed047f4a26e20ed6b",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="57a5b3ab8dfb3d770ff484bf52a75ec94569d8a8e2da68dd932b193675f99629",snapshot_tag=""} 1.679279404e+09
EDIT: I have multiple exporters running at a time, by the way. They are all exposed as different instances and scraped separately. Not sure if that's helpful.
Interestingly enough, I've just run restic snapshots manually and the result somehow differs:
repository 2663d695 opened (version 1)
ID Time Host Tags Paths
-----------------------------------------------------------------------
c72f769a 2023-03-17 01:00:03 mantell /home/data/stripped
5afb6fb9 2023-03-20 01:00:15 mantell /home/data/stripped
599983ef 2023-03-22 01:00:14 mantell /home/data/stripped
The hashes don't seem to match; this is the first time I've taken a closer look, though.
The label "snapshot_hash" in the exporter is not the snapshot hash in Restic. The hash in Restic changes frequently when you do maintenance operations or full backups.
The exporter hash is calculated with the hostname and the path =>
restic-exporter/restic-exporter.py
Line 279 in f2fe3af
In your case, you should have just 1 line in the exporter because the hostname and path match. 🤔
Update: Could you run this command?
- restic snapshots --json --latest 1
Cleaned up my setup by assigning different networks to each of the exporters in my docker-compose file, but I guess the root cause is something else: that solved it for one exporter, but now the other one reports more entries.
Correct one:
# HELP restic_backup_timestamp Timestamp of the last backup
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="8bc201180ab9e700369b659f3f0a75c99dd1a3a72fbbd9e8ad24389d3917cead",snapshot_tag=""} 1.679443214e+09
[{"time":"2023-03-22T01:00:14.29359609+01:00","parent":"5afb6fb92919f038a09f8f53fd6181b556e16ed0d1a7b5c31d3666696c11ef10","tree":"2c9a7be5967957a22deb5e28b5fc89ebdce29ffe70465bba46e145f1d074d8f2","paths":["/home/data/music"],"hostname":"mantell","username":"root","id":"599983efe95ad60dcf93a8ef05a23e6ab38f9b11156849c91fe3a57d8ce80c7e","short_id":"599983ef"}]
Sorry, closed too early.
I now checked another exporter and its underlying repository with the command you asked for, and it actually returned an array with more than one entry.
[
{
"time": "2022-10-31T03:30:11.085083219+01:00",
"parent": "19e95feb063cab22ed2281fbfed5aa16c53c03ab62ed312b06e55a91e5bb2244",
"tree": "7fc6f808fd4165c3736d85f517420d8cdee14c59c406df69600c8a1683a15599",
"paths":
[
"/etc",
],
"hostname": "mantell",
"username": "root",
"id": "c7ef0b0cc48bdc723b8822589ea5988fd4c3671c08f581aa5d79fab26d1c9690",
"short_id": "c7ef0b0c",
},
{
"time": "2023-01-16T03:30:03.544488116+01:00",
"parent": "4f5b4ef0d38223ccabff879b075e2361f7b9c373ea8b05c622236c0cc160b2b7",
"tree": "302e423e9321b826044973b8a7591099346fdef73f6c1cd1e5c77c2d3d450906",
"paths":
[
"/etc",
],
"hostname": "mantell",
"username": "root",
"id": "a3381e1fecced88ec099b447355fc22f10fdb4b7b24a33805fdf4c024eb31924",
"short_id": "a3381e1f",
},
{
"time": "2023-01-23T03:30:02.336435099+01:00",
"tree": "ad4ba7054b03c3325b62a3f6e724a2160ba13cbecfe266523ef5ea4a640ff4ed",
"paths":
[
"/etc",
],
"hostname": "mantell",
"username": "root",
"id": "18907418d224b7cb3dbaa2f9a1e80bed02a15b4de8d0c06d7276c2636655fab3",
"short_id": "18907418",
},
{
"time": "2023-01-30T03:30:02.926379261+01:00",
"parent": "9176548f5fe7369dda9f923b229e8bb8fd19a2bd9a0e1bc2ca462a7d157d7608",
"tree": "66223f24e4b670ebb836404a9cbc403627c30a455de7c789f96768c9949f22c5",
"paths":
[
"/etc",
],
"hostname": "mantell",
"username": "root",
"id": "486b28989cfffac24a34b037b14cacdaa3343849e499c3f41351f1b80eb7967a",
"short_id": "486b2898",
},
{
"time": "2023-03-20T03:30:04.545656117+01:00",
"parent": "8c45d75b8a8c08984556e352397850a1b0646359280a1a561463c04835d93c88",
"tree": "db41d0846694a609f9b9a651ca91dceb371792b9442c294947d1387e7f91a839",
"paths":
[
"/etc",
],
"hostname": "mantell",
"username": "root",
"id": "4d1d3f09f6f056a905d6f295375cea23d28e83dfe10da2331307b906c4e64959",
"short_id": "4d1d3f09",
},
]
The exporter reports the following:
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="428a4022933f2a1e162cbfa6685055afb27fbaefb20b784c63fbefc33a25d49e",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="2ec546bf3f53ecef07491a6536fe1e889b9e2d3a230d26cd3d0b189fd9325bb3",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="dba9a400b7961289953865932ed0c142ed218bcc0d736ca8bb92af2141340160",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="13a9064b8f2f0c6e176b6997dcd52668660d5f2e8dbcf77ed047f4a26e20ed6b",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="57a5b3ab8dfb3d770ff484bf52a75ec94569d8a8e2da68dd932b193675f99629",snapshot_tag=""} 1.679279404e+09
Same result. By the way, I have always had that and thought it was expected behavior that every snapshot gets its own gauge. :-)
So maybe it wasn't the network setup after all. That would have been weird. But anyway, the output of snapshots --json --latest 1 also seems to report more than one entry.
Not sure if it helps, but the restic version on the host which creates the snapshots is 0.15.1.
The results you posted in #11 (comment) don't make sense. I calculated the hash and it's impossible that the JSON with 5 snapshots produces those 5 metrics.
Could you double check you are getting the json and the metrics from the same repository? I have some ideas to improve the code but I have to reproduce the issue first.
So this is from one repository: #11 (comment)
This is from the other (the second half of the post): #11 (comment)
I'll double check later.
Thanks for providing this. I tested it by building the Docker image locally and I get the very same results.
# HELP restic_backup_timestamp Timestamp of the last backup
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="20c494f14bbb7e5188a4b36702a1dcce59baa4c516f34268106f92f494eba783",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="734c259855b1ad6067777f85598521cab79a4d0fd5a149b4698d8081de33ca88",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="f5247872f6ead56c1d7add82e8cbe2d873f47f894fdacd93f22b4b5140273a3b",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="8571d9536f0d1616c6e99ad3a1f68d94af547e2616a343e29e97e3c4a2ed557f",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="14fb46e68138c0033e32be3a67ac6aa20f6c95f6f4f339e1ee83e1cec4ad8d93",snapshot_tag=""} 1.679279404e+09
@v4rakh When I run the code with your dump I see 5 metrics:
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="428a4022933f2a1e162cbfa6685055afb27fbaefb20b784c63fbefc33a25d49e",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="2ec546bf3f53ecef07491a6536fe1e889b9e2d3a230d26cd3d0b189fd9325bb3",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="dba9a400b7961289953865932ed0c142ed218bcc0d736ca8bb92af2141340160",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="13a9064b8f2f0c6e176b6997dcd52668660d5f2e8dbcf77ed047f4a26e20ed6b",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="57a5b3ab8dfb3d770ff484bf52a75ec94569d8a8e2da68dd932b193675f99629",snapshot_tag=""} 1.679279404e+09
That is totally fine because each of your backups covers a different set of folders. For example, you don't have embyserver in the other snapshots.
"paths": [
"/etc",
"/home/admin",
"/home/musicstreamer",
"/opt/embyserver/config",
"/opt/jellyfin/config",
"/opt/portainer",
"/opt/unifi",
"/root",
"/tmp/package_list.txt",
"/tmp/package_list_aur.txt",
"/usr/local/bin"
],
"hostname": "mantell",
"username": "root",
"paths": [
"/etc",
"/home/admin",
"/home/musicstreamer",
"/opt/jellyfin/config",
"/opt/nodered_data",
"/opt/portainer",
"/opt/prometheus_config",
"/opt/unifi",
"/root",
"/tmp/package_list.txt",
"/tmp/package_list_aur.txt",
"/usr/local/bin"
],
"hostname": "mantell",
"username": "root",
The function that calculates the hash takes into account the username, hostname and paths. If you have different paths, they are considered different backups. That makes sense to me if you want to track the number of files or the size over time; you cannot compare different things.
restic-exporter/restic-exporter.py
Line 288 in 1c55ffe
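In pseudo-code, the idea is roughly this (a minimal sketch, not the exact code from the line above; the SHA-256 and the concatenation format are just assumptions for illustration):

import hashlib

def calc_snapshot_hash(snapshot: dict) -> str:
    # Identify a "backup" by who/where/what is backed up, not by the restic snapshot id.
    text = snapshot["username"] + snapshot["hostname"] + ",".join(snapshot["paths"])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two snapshots that differ only in their path list produce different hashes
# and therefore show up as separate restic_backup_timestamp series:
a = {"username": "root", "hostname": "mantell", "paths": ["/etc", "/opt/embyserver/config"]}
b = {"username": "root", "hostname": "mantell", "paths": ["/etc"]}
print(calc_snapshot_hash(a) == calc_snapshot_hash(b))  # False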
Since most of your backups have the same folders, I would recommend including all folders in just one backup and deleting all previous backups. I'm closing this since I cannot fix what is not broken.
I agree. Changing source directories can be a requirement, though: not everyone will create a new restic repository or wipe all existing snapshots when adding new paths or changing existing ones, e.g. for a newly installed application. In my example I could have used /opt instead of listing the directories individually, or worked with excludes, though I prefer to include them explicitly.
Just an idea here: would it be an option to add an env var that controls whether the hash calculation includes paths or not, with including them enabled by default?
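Roughly something like this (the env var name and the default are purely hypothetical, just to illustrate the idea):

import hashlib
import os

# Hypothetical toggle; not an existing option of the exporter.
INCLUDE_PATHS_IN_HASH = os.environ.get("INCLUDE_PATHS_IN_HASH", "1") == "1"

def calc_snapshot_hash(snapshot: dict) -> str:
    parts = [snapshot["username"], snapshot["hostname"]]
    if INCLUDE_PATHS_IN_HASH:
        parts.extend(snapshot["paths"])
    return hashlib.sha256(",".join(parts).encode("utf-8")).hexdigest()

With the toggle off, all snapshots from the same host and user would collapse into a single restic_backup_timestamp series, so the "old backup" alert would only ever look at the most recent one.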
Also, documenting the different alerts mentioned above could help people with a similar setup. Dismissing this as not being a use case is probably not right, since source directories won't stay the same forever. Your call; if you'd like to keep the current behavior, I would still propose documenting how the hash is actually calculated and what impact it has on the underlying metrics.
Thanks for looking into it in depth.