balena-os/meta-balena

tests: os: Should restart chronyd when system time skew detected fails sometimes

rcooke-warwick opened this issue · 13 comments

jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:19:18.415Z][worker-os]             ok 1 - Blocking NTP by stopping dnsmasq.service # time=9120.753ms
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:19:18.415Z][worker-os] 
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:19:19.446Z][worker-os]             ok 2 - Should apply time skew of -2 minutes
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.955Z][worker-os]             not ok 3 - Should restart chronyd when system time skew detected
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.956Z][worker-os]               ---
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.956Z][worker-os]               at:
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 line: 168
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 column: 18
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 file: /data/suite/tests/chrony/index.js
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 type: global
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]               found:
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 !error
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 name: Error
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                 message: |-
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.957Z][worker-os]                   Condition async () => {
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.958Z][worker-os]                   							const regex = /NTP time lost synchronization - restarting chronyd/;
jenkins-leviathan-v2-template-3843-client-1  | [2022-08-12T18:42:09.95

form the journal logs for this particular test run:

Aug 12 18:19:15 healthdog[1921]: 2022-08-12T18:19:15Z chronyd exiting
Aug 12 18:19:15 sshd[2165]: Received disconnect from 172.126.60.3 port 57180:11:
Aug 12 18:19:15 sshd[2165]: Disconnected from user root 172.126.60.3 port 57180
Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z chronyd version 4.0 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -NTS -SECHASH +IPV6 -DEBUG)
Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z Wrong permissions on /run/chrony
Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z Disabled command socket /run/chrony/chronyd.sock
Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z Frequency -1.272 +/- 9.880 ppm read from /var/lib/chrony/drift
Aug 12 18:19:15 healthdog[2185]: [chrony-healthcheck][INFO] No online NTP sources - forcing poll
Aug 12 18:19:15 healthdog[2185]: [chrony-healthcheck][ERROR] Failed to trigger NTP sync

When the tests passes (from journal logs of a successful test:

Aug 13 17:20:17 healthdog[1816]: 2022-08-13T17:20:17Z chronyd exiting
Aug 13 17:20:17 healthdog[1840]: try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:20:17 sshd[1947]: Received disconnect from 172.126.62.3 port 42066:11:
Aug 13 17:20:17 sshd[1947]: Disconnected from user root 172.126.62.3 port 42066
Aug 13 17:20:17 systemd[1]: /lib/systemd/system/sshd@.service:15: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
Aug 13 17:20:17 healthdog[1952]: 2022-08-13T17:20:17Z chronyd version 4.0 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -NTS -SECHASH +IPV6 -DEBUG)
Aug 13 17:20:17 healthdog[1952]: 2022-08-13T17:20:17Z Frequency 0.000 +/- 1000000.000 ppm read from /var/lib/chrony/drift
Aug 13 17:20:17 sshd[1968]: error: Unable to load host key: /etc/ssh/hostkeys/ssh_host_dsa_key
Aug 13 17:20:17 healthdog[1962]: [chrony-healthcheck][INFO] No online NTP sources - forcing poll
Aug 13 17:20:17 healthdog[1952]: 2022-08-13T17:20:17Z System clock was stepped by 0.000000 seconds
Aug 13 17:19:16 healthdog[1976]: try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 4, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 5, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 6, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 7, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 8, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 9, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 10, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 11, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 12, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 13, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 14, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 15, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 16, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 17, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 18, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 19, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 20, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 21, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 22, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 23, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 24, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 25, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 26, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 27, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 28, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 29, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 30, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 31, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 32, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 33, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 34, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 35, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 36, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 37, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 38, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 39, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 40, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 41, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 42, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 43, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 44, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 45, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 46, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 47, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 48, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 49, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 50, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 51, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 52, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 53, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 54, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 55, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 56, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 57, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 58, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 59, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1976]: try: 60, refid: 00000000, correction: 0.000000000, skew: 0.000
Aug 13 17:19:16 healthdog[1962]: [chrony-healthcheck][ERROR] NTP time lost synchronization - restarting chronyd

The bottom part of the second set of logs never happens in the failed (aarch64 case)

Looks to me like Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z Wrong permissions on /run/chrony is stopping chrony from doing what it is supposed to do?

cc @alexgg @klutchell @jakogut

Also confirmed it can happen on the generic x86 images: https://jenkins.product-os.io/job/leviathan-v2-template/3857/

Yeah, I think the disabled command socket is a problem

Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z Wrong permissions on /run/chrony
Aug 12 18:19:15 healthdog[2172]: 2022-08-12T18:19:15Z Disabled command socket /run/chrony/chronyd.sock

Must be some kind of race condition when starting chronyd?

Can we compare the owners and permissions of that directory between passing and failing tests by adding some debug? Also a test to see if the command socket was created or not. Then we can see what can be done at the service level to fix the permissions.

Output from one of my devices

root@pihole:~# ls -al /run/chrony             
total 4
drwxr-x---  2 root root  80 Aug  1 12:51 .
drwxr-xr-x 21 root root 640 Aug  1 12:51 ..
-rw-r--r--  1 root root   5 Aug  1 12:51 chronyd.pid
srwxr-xr-x  1 root root   0 Aug  1 12:51 chronyd.sock

I've seen this happen now with the rockpi


-- Logs begin at Wed 2022-09-14 07:52:25 UTC, end at Wed 2022-09-14 08:13:18 UTC. --
Sep 14 07:54:02 b0105db healthdog[2676]: 2022-09-14T07:54:02Z chronyd exiting
Sep 14 07:54:02 b0105db healthdog[2949]: 2022-09-14T07:54:02Z chronyd version 4.2 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS ->
Sep 14 07:54:02 b0105db healthdog[2949]: 2022-09-14T07:54:02Z Wrong permissions on /run/chrony
Sep 14 07:54:02 b0105db healthdog[2949]: 2022-09-14T07:54:02Z Disabled command socket /run/chrony/chronyd.sock
Sep 14 07:54:02 b0105db healthdog[2949]: 2022-09-14T07:54:02Z Running with root privileges
Sep 14 07:54:02 b0105db healthdog[2949]: 2022-09-14T07:54:02Z Frequency 0.000 +/- 1000000.000 ppm read from /var/lib/chrony/drift
Sep 14 07:54:02 b0105db healthdog[2954]: [chrony-healthcheck][INFO] No online NTP sources - forcing poll
Sep 14 07:54:02 b0105db healthdog[2954]: [chrony-healthcheck][ERROR] Failed to trigger NTP sync
Sep 14 07:52:30 b0105db healthdog[2949]: 2022-09-14T07:52:30Z Backward time jump detected!
Sep 14 07:54:02 b0105db healthdog[3292]: [chrony-healthcheck][INFO] No online NTP sources - forcing poll
Sep 14 07:54:02 b0105db healthdog[3292]: [chrony-healthcheck][ERROR] Failed to trigger NTP sync
Sep 14 07:56:02 b0105db systemd[1]: chronyd.service: Watchdog timeout (limit 4min)!
Sep 14 07:56:02 b0105db systemd[1]: chronyd.service: Killing process 2949 (chronyd) with signal SIGABRT.
Sep 14 07:56:02 b0105db systemd[1]: chronyd.service: Killing process 2952 (exe) with signal SIGABRT.
Sep 14 07:56:02 b0105db systemd[1]: chronyd.service: Main process exited, code=killed, status=6/ABRT
Sep 14 07:56:02 b0105db systemd[1]: chronyd.service: Failed with result 'watchdog'.

Test hangs forever while looking for the NTP time lost synchronization - restarting chronyd which never appears

@rcooke-warwick do you have access to that device?

 18   if ! /usr/bin/chronyc 'burst 4/4' > /dev/null ||                              
   19      ! /usr/bin/chronyc makestep > /dev/null; then                              
   20     error "Failed to trigger NTP sync"                                          
   21     exit 1 

Could you manually try to run /usr/bin/chronyc 'burst 4/4' and /usr/bin/chronyc makestep?

@rcooke-warwick I found at least a couple of instances where the directory can be created with the wrong permissions and fixed them in #2791. Is there an easy way to verify whether this problem is fixed? I am thinking maybe using the Jenkins build for the RockPi from that PR and running just that single test continuosly.

I also see it with the NUC today too. It's a good idea @alexgg, maybe I can create a script to test this locally:

  1. use the image from that PR to provision a NUC
  2. run the test
  3. reboot
  4. run the test
  5. ....... X times

With just the test. From what I've seen, there's a chance that the wrong permissions issue can pop up each time the device is booted. So by rebooting and checking each time, we can see if the problem is fixed?

Seen this more times over the last 2 weeks on test runs running latest meta balena (on the NUC). If you reboot the DUT during this test, you can get chrony to start with the right permissions, then the test works

may end up being resolved by #2791