cloudfoundry/bosh-agent

Is it better to sync time asynchronously?

Closed this issue · 4 comments

// Make a best effort to sync time now but don't error
_, _, _, _ = p.cmdRunner.RunCommand("sync-time")
return

Would it be better if bosh-agent synced time asynchronously and continued with the rest of its setup?
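To make the question concrete, here is a minimal sketch of what an asynchronous variant might look like. It is not the agent's code: it uses `os/exec` directly instead of the agent's `cmdRunner`, and the helper name `syncTimeAsync` is hypothetical.

```go
package main

import (
	"os/exec"
)

// syncTimeAsync is a hypothetical helper, not part of bosh-agent.
// It starts the given command in a goroutine and returns immediately.
// The returned channel receives the command's error (nil on success)
// once it finishes; a best-effort caller can simply ignore the channel.
func syncTimeAsync(name string, args ...string) <-chan error {
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		done <- exec.Command(name, args...).Run()
	}()
	return done
}
```

A caller could write `_ = syncTimeAsync("sync-time")` and carry on with setup, or read from the channel later if it wants to log the outcome.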

Historically, a primary rationale for this behavior is to try and ensure the system is starting from a known and shared point in time.

A long while ago, there were some instances where VMs started with incorrect times from their hosts, which then caused issues with director and TLS communications. Best-effort synchronous behavior helps avoid that, and also helps avoid time changes occurring during or after job processes have started.

Are you seeing any particular issues with this approach that we can help with?

Thanks @dpb587-pivotal . From my understanding, best effort synchronous behavior means that bosh-agent doesn't care if sync-time fails. If there are issues with director and TLS communications when time is incorrect, then sync-time should not be a best effort behavior. Please forgive me if my understanding is not correct.

Here is the issue I hit:
Background:
If the VM is associated with an Azure Standard Load Balancer, a load balancing rule for UDP (any port) is needed to trigger the UDP SNAT programming. (See: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections#preallocatedports)

Example 1 (ubuntu-trusty):
Without the LB rule, sync-time (ntpdate on ubuntu-trusty) fails to connect to the NTP server, so VMs may start with incorrect times. Because ntpdate doesn't retry, the failure doesn't block bosh-agent, which continues with the rest of its setup. That behavior is fine for me.

Example 2 (ubuntu-xenial):
Without the LB rule, sync-time (chronyc on ubuntu-xenial) also fails to connect to the NTP server. But chronyc retries 10 times, which costs 100 seconds and slows down bosh-agent booting. If bosh-agent synced time asynchronously, it could continue with the rest of its setup in the meantime.
Of course, adding LB rules for UDP makes sync-time work and avoids the 100s delay. But those who don't want to add these rules, or forget to, will suffer an extra 100s of bosh-agent boot time. That's why I raised this issue. From your comments, it seems that synchronous behavior is preferred and the LB rules should be added so that sync-time works and potential issues are avoided.

I see - thank you for the additional details.

In theory, it might be nice to sync time only if we detect problems or can guess there may be time issues. But in practice, it is difficult to catch all possible errors in the various components where they can show up. So, currently, it is best effort.

I agree the 100 second delay is not optimal. If possible, it would be preferable to ensure UDP traffic from the VM can be used. It sounds like load balancer rules can be adjusted somewhere to make that possible?

Yes, load balancer rules for UDP can make that possible.
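
For readers who cannot add the UDP rules, one possible middle ground between the synchronous and asynchronous options discussed above is to keep the call synchronous but cap how long it may block. The sketch below is only an illustration under assumptions: `runWithTimeout` and `defaultSyncLimit` are hypothetical names, the 10 second value is arbitrary, and it uses `os/exec` rather than the agent's `cmdRunner`.

```go
package main

import (
	"context"
	"os/exec"
	"time"
)

// defaultSyncLimit is an arbitrary illustrative cap, not a value used by bosh-agent.
const defaultSyncLimit = 10 * time.Second

// runWithTimeout is a hypothetical helper, not part of bosh-agent.
// It runs the command synchronously and kills it if it has not finished
// within limit. The error is returned so a best-effort caller can ignore it.
func runWithTimeout(limit time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel() // release the timer even when the command finishes early
	return exec.CommandContext(ctx, name, args...).Run()
}
```

With something like `_ = runWithTimeout(defaultSyncLimit, "sync-time")`, setup would still wait for a successful sync in the common case, but a blocked NTP path would delay boot by at most the chosen limit.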