Nordix/xcluster

The root fs can get corrupted on a reset


The root fs is ext3, a journaling fs that should be resilient to corruption. But it can still be corrupted on a system reset somehow.

This problem can be addressed in different ways, for example:

  • Update to ext4
  • Investigate more closely why the fs gets corrupted even though it shouldn't
  • Do a "sync" before reset

It is not at all certain that ext4 solves the problem! On the contrary, a quick test (patched "diskim") shows that some files become zero-size after reset, and possibly a bunch of other problems.

It would be interesting to know why the fs gets corrupted. Possibly just some setup is missing, or the journal is on a RAM fs. I suspect the same problem causes ext4 to fail (badly).

The easy fix that works is to do a "sync". A reset doesn't come out of the blue; it's initiated by a test case. So it's simple to just "sync" before the reset. This is the way I'll be going at first.
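A minimal sketch of the "sync before reset" idea, not the actual xcluster change. `sync_then_reset`, `RUN_ON_VM` and `RESET_VM` are hypothetical names introduced here for illustration; they are assumed to hold whatever commands the test environment uses to run a command on a VM and to reset it.

```shell
#!/bin/sh
# Hypothetical helper: flush the guest's dirty pages to disk, then reset it.
# RUN_ON_VM and RESET_VM are placeholders (assumptions), not xcluster commands:
#   $RUN_ON_VM <vm> <cmd...>   runs a command inside the VM
#   $RESET_VM <vm>             performs the hard reset
sync_then_reset() {
    vm="$1"
    # Run "sync" inside the guest first; skip the reset if the sync fails.
    $RUN_ON_VM "$vm" sync || return 1
    $RESET_VM "$vm"
}
```

Skipping the reset when the sync fails is a design choice of this sketch; a test that wants the reset regardless could ignore the sync's exit status instead.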

Reproduce

Found in ovl/k8s-haproxy, and can be reproduced there:

./k8s-ha.sh test start > $log; ./k8s-ha.sh reset_vm 191
# on vm-191
less /etc/haproxy/haproxy.cfg-foobar

The file has more than 80 NUL characters appended and the servers are missing. The reset must be done immediately, otherwise the fs may be synced automatically.
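Instead of inspecting the file by eye in "less", the NUL bytes can be counted. This is only a sketch using standard `tr` and `wc`; the function name `count_nul_bytes` is made up for this example.

```shell
#!/bin/sh
# Count NUL bytes in a file: delete every byte that is not NUL,
# then count the bytes that remain.
count_nul_bytes() {
    # The trailing tr strips the padding some wc implementations add.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}
```

On vm-191 it could be run against the file from the reproduction, e.g. `count_nul_bytes /etc/haproxy/haproxy.cfg-foobar`; a non-zero count suggests the corruption described above.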

BTW

A reboot from within a VM works without problems, since it does "sync" automatically.

Merged #59 now. You were too quick for me to even think :)

Honestly, I can't see any way to totally avoid corruption on a sudden reset, you can only minimize the risk. I think the "sync" before reset is the best we can do. So, closing...