Concordium/Testnet4-Challenges

Node will not realize full sync. Chases final blocks

NerdsLogic opened this issue · 5 comments

Bug Description
The node named "MouseHouse" will not reach full sync. It remains between 2 and 15 blocks behind, even after several restarts.

Steps to Reproduce
Deployed node 0.4.11 on Windows 10 (i9, 8 GB, NVMe storage). Docker was originally 3.x; I downgraded to 2.5.0.1 because of earlier bootstrapping errors that occurred when restarting after a stalled sync.

Expected Result
Full sync

Actual Result
Does not reach sync. Sends catch-up messages, fails to finalize blocks.

Versions

  • Software Version - Node 0.4.11, Docker 2.5.0.1 (downgraded from 3.x)
  • OS - Windows 10
  • Browser - not applicable (brave/chrome/opera)
  • Mobile device - not applicable (Android)

This is an interesting issue. Could you share logs of the node?

One thing that could explain this is if your system clock is slightly out of sync.

Could you compare the time reported on, e.g., https://time.is/ with the output of this command: docker exec -t concordium-client date

Note that the docker command will output the time in UTC, not in your local time zone.
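For anyone who wants to script this check, here is a minimal Python sketch that compares the container clock with the host clock. It assumes the `date` binary inside the container supports `-u +%s` (epoch seconds), which is not confirmed in this thread; the helper names are made up for illustration. It only measures container-versus-host drift, so the host clock itself still needs to be checked against a reference such as time.is.

```python
import subprocess
import time


def parse_epoch(raw: str) -> int:
    """Parse the output of `date -u +%s`, ignoring trailing CR/LF characters."""
    return int(raw.strip())


def clock_skew_seconds(container_epoch: int, host_epoch: float) -> float:
    """Positive: container clock is ahead of the host. Negative: it is behind."""
    return container_epoch - host_epoch


def read_container_epoch(container: str = "concordium-client") -> int:
    # Assumption: the image ships a `date` that understands `-u +%s`.
    result = subprocess.run(
        ["docker", "exec", container, "date", "-u", "+%s"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_epoch(result.stdout)


def main() -> None:
    # Requires a running container; call main() manually to use it.
    skew = clock_skew_seconds(read_container_epoch(), time.time())
    print(f"container clock minus host clock: {skew:+.1f} s")
```

Calling `main()` on the host prints the drift in seconds; a value near zero means the container follows the host clock.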

My clock was about 1:30 out of sync, as reported by time.is. I updated the time so that time.is reports "Your time is exact!" and restarted the node, which quickly resumed the same undesirable behavior.

I then queried the time from Docker, which returned the same out-of-sync time as before I updated the system time. I closed and reopened Docker, restarted the node, and queried the time from Docker again. It still returned the old out-of-sync time, which does not match the now-synced system time.

I will restart the machine in its entirety and report results. Please see the attached log in the meantime.

concordium-testnet-system-report.log

The clock being out of sync completely explains the behaviour you are seeing. The node will reject blocks that are too far in the future.
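To illustrate why a lagging clock produces this symptom, here is a small Python sketch of a "too far in the future" check. It is not Concordium's actual implementation: the function name, the 30-second tolerance, and the timestamps are all hypothetical, chosen only to show the effect.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance, for illustration only; not the node's real setting.
FUTURE_TOLERANCE = timedelta(seconds=30)


def is_too_far_in_future(block_time, local_now, tolerance=FUTURE_TOLERANCE):
    """Return True if the block's timestamp is ahead of the local clock by more than the tolerance."""
    return block_time > local_now + tolerance


# Arbitrary example instant; the block carries the correct current time.
true_now = datetime(2020, 11, 10, 12, 0, 0, tzinfo=timezone.utc)
fresh_block = true_now

accurate_clock = true_now
lagging_clock = true_now - timedelta(seconds=90)  # arbitrary 90-second lag

print(is_too_far_in_future(fresh_block, accurate_clock))  # False
print(is_too_far_in_future(fresh_block, lagging_clock))   # True
```

Under these assumptions, a node whose clock lags sees the newest blocks as coming from the future and rejects them until its own clock catches up. That would be consistent with a node that stays a few blocks behind the chain head.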

Unfortunately, there is an issue with Docker on Windows where the time inside Docker can get out of sync. Restarting the computer usually fixes the problem and syncs it with the system time. It might also suffice to just restart the Docker service.

After configuring pool.ntp.org as an external time sync source and restarting once again, I was able to start the node with a delta of less than 30 seconds between time.is and system time, and the node MouseHouse very quickly came into sync.

Restarting the Docker service was not sufficient to make Docker sync its time with the Windows host. In my case, I had to restart Windows.