performancecopilot/pcp

pmlogger and clock change

GerryHickman opened this issue · 10 comments

Each night around midnight, my pmlogger archive files get rotated.
Each host has an archive file that looks like this
20240331.23.10.0.xz
20240331.23.10.index
20240331.23.10.meta.xz
after midnight it looks like this
20240331.0.xz
20240331.index
20240331.meta.xz
BUT, this went wrong a few days ago 2024-03-31 when the clocks went forward (daylight saving time).
pmlogger complained that the log files already existed, and therefore could not compress and rename.

@GerryHickman at what time did daylight savings change for you? (the answer is not the same around the world).

And before the files go away, could you please run the attached shell script (disregard the .txt suffix, that's a github constraint) and add the output to this issue?
diag-1936.txt

The clock change was 2024-03-30, 2024-03-31 moving from GMT to BST-1.
The output of the diag is here, with the hostname and domain altered slightly.

cat diag.txt
Found archive dir PCP_ARCHIVE_DIR/hostdv01 in /etc/pcp/pmlogger/control.d/gh_logger.control ...
-> /data/pcp/pmlogger/hostdv01
=== 20240327.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Wed Mar 27 00:10:26.555070321 2024
    ending     Thu Mar 28 00:10:24.693325139 2024
Archive timezone: GMT
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240328.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Thu Mar 28 00:10:24.878480016 2024
    ending     Fri Mar 29 00:10:24.636319874 2024
Archive timezone: GMT
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240329.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Fri Mar 29 00:10:24.820330103 2024
    ending     Sat Mar 30 00:10:22.620045365 2024
Archive timezone: GMT
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240330.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Sat Mar 30 00:10:22.805737163 2024
    ending     Sun Mar 31 00:10:21.550997969 2024
Archive timezone: GMT
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240331.23.10.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Mon Apr  1 00:10:20.930108857 2024
    ending     Tue Apr  2 00:10:20.181392992 2024
Archive timezone: BST-1
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240331.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Sun Mar 31 00:10:21.739990690 2024
    ending     Sun Mar 31 23:10:20.744411851 2024
Archive timezone: GMT
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240401.23.10.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Tue Apr  2 00:10:20.363168643 2024
    ending     Tue Apr  2 11:10:03.641798710 2024
Archive timezone: BST-1
Archive zoneinfo: :Europe/London
PID for pmlogger: 5068
=== 20240402.11.10.meta.xz ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Tue Apr  2 11:10:25.335829950 2024
    ending     Wed Apr  3 00:10:19.109618462 2024
Archive timezone: BST-1
Archive zoneinfo: :Europe/London
PID for pmlogger: 3242649
=== 20240403.00.10.meta ===
Note: timezone set to local timezone of host "hostdv01.blah.local" from archive

Log Label (Log Format Version 3)
Performance metrics from host hostdv01.blah.local
    commencing Wed Apr  3 00:10:19.290139149 2024
    ending     Wed Apr  3 09:20:19.310532387 2024
Archive timezone: BST-1
Archive zoneinfo: :Europe/London
PID for pmlogger: 3242649

@GerryHickman That's helpful, but I am still missing some pieces of the puzzle ... at what TIME did the DST change occur in the region where this all happened. I am guessing it is NOT the UK, because in the UK the change is supposed to happen at 1pm and the time should go FORWARDS then which should not cause any issues.

Something very odd happened at 23:10 GMT on Sun Mar 31 (which is the SAME system time as 00:10 BST-1 on Mon Apr 1 ... if you ignore the DST rules and simply apply the offset from GMT).

And just to eliminate one variable, is pmlogger_daily driven by systemd on this system or cron? The former is most likely these days.

@GerryHickman One more thing, I'd be interested to know how many records are in each of the archives and their "health".
Output from $ pmlogcheck -vwz <arch> for each archive in the /var/log/pcp/pmlogger/... directory would be helpful.

Regarding the exact time of the change here in the uk, here's a quote

When local standard time was about to reach
Sunday, 31 March 2024, 01:00:00 clocks were turned forward 1 hour to
Sunday, 31 March 2024, 02:00:00 local daylight time instead.

the systemd pmlogger_daily.service runs at 00:10:00 each day (ten minutes past midnight)

[I had a very long analysis in the comment buffer here, but a fat fingers event meant lost it!]

However the analysis has helped me understand the problem ... this is a real bug.

It is the same pmlogger process (pid 5068) that was running from (before?) Mar 27 right thru to Apr 2. The pmlogger_daily "rexec" is not an exec(), it is a SIGUSR2 signal. pmlogger has TWO timezones that it uses:

  1. to translate the archive basename %Y%m%d.%H.%M into something like 20240331.23.10, and
  2. fetched from pmcd via the pmcd.timezone metric and used to set up the archive's label record

The first is only being done once, so pmlogger kept naming the archives incorrectly once DST had clicked over (this is the bug) ... it needs to be redone after a SIGUSR2 is received and before the next %Y%m%d.%H.%M translation is done.

The second is done when each new archive is started, so this is correct.

The fix will be simple (sort of), testing it will be a nightmare!

I still have one small concern, I'm not sure where the data for the last hour of Sun Mar 31 (strictly from 23:10 that day to 00:10 on the next) went ... it is not obviously in any of the archives.

Regarding the lost data, I think it's because one of the processes crashed (or exited) when it couldn't rename the files. I didn't notice the problem right away. Once I noticed the problem, I restarted the services, and then it started working from that point forwards.

But no one else has reported this, so I have to wonder what happened with every other pmlogger archive file in the rest of the world on that day...

Time for some updates.

First I have a fix ... I've emulated the UK DST timezone change more than 20+ times now, and was able to reproduce failures with similar (but not identical) signatures to the ones @GerryHickman reported, more analysis and several failed fix attempts before finding the silver bullet and the fix now seems solid ... unfortunately the steps needed to reproduce/test this are so convoluted that we cannot build an automated QA test, so the real proof will come in 12 months time, if the new code makes it to Gerry's machine by then.

Second, a correction, pmlogger does call exec() in this scenario when the archive is "rotated" by pmlogger_daily. But the new pmlogger inherits the environment of the old pmlogger and I suspect that's how the old timezone was being handed on.

Third, I don't think there is any lost data. pmlogger is very robust when it comes to archive naming, and when used with pmlogger_check et al, it will try a -NN addition to the archive name to avoid name collisions, and this is is repeated if necessary until a unique archive name is established. The message that Gerry initially mentioned "pmlogger complained that the log files already existed, and therefore could not compress and rename" comes not from pmlogger but from pmlogger_check and/or pmlogger_daily ... this indicates the archive housekeeping has failed, but not that pmlogger has failed. The archives are there, they just have a slightly unusual name with the wrong date (the fix corrects this).

Finally, some reasons why this is not widely reported, and may not in fact be happening that often ...

  • the DST change has to move the clocks forward to expose the bug, so those of us in the southern hemisphere are not impacted this time, e.g. last night in Melbourne our clocks went back by an hour, so no issues
  • as you move towards the equator it is less, and less, likely that any DST is in play
  • the bug is sensitive to (a) whether $TZ is set in pmlogger's environment or not, and (b) when it is set, the specific format that it has, i.e , or : or ABCDEF...
  • the bug is probably sensitive to the localtime() and tzet() implementation in libc and possibly even the version of libc
  • the bug emerges the two nights after the night when the DST change happens, making the connection to the DST change less obvious
  • the error message comes from pmlogger_check and/or pmlogger_daily and is salted away in the multiple files matching /var/log/pcp/pmlogger/pmlogger_checklog and /var/log/pcp/pmlogger/pmlogger_dailylog and I suspect not too many people are looking there.

Commit incoming.

Thanks for the updates. It's good news that Ken was able to confirm it really was a bug, and improve the code for next year.