soabase/exhibitor

Cleanup task occasionally reports failure to delete phantom txn logs and snapshots

srikanth-viswanathan opened this issue · 9 comments

Could be a ZooKeeper issue, but I wanted to check here first.

We use Exhibitor (1.5.5) to run ZooKeeper (3.4.6). Occasionally we'll see a "Failed to remove" line in the logs:

Fri Aug 18 16:02:17 UTC 2017 INFO Cleanup task completed
Fri Aug 18 16:02:17 UTC 2017 INFO Cleanup task completed
Fri Aug 18 16:02:17 UTC 2017 INFO Cleanup Task: Removing file: Aug 18, 2017 3:52:58 PM /mnt/data1/prod/stage-nj1/exhibitor.services/zookeeper_snapshots/version-2/snapshot.2948da6bab
Fri Aug 18 16:02:17 UTC 2017 INFO Cleanup Task: Removing file: Aug 18, 2017 3:54:21 PM /mnt/data1/prod/stage-nj1/exhibitor.services/zookeeper_snapshots/version-2/snapshot.2948dc9562
Fri Aug 18 16:02:17 UTC 2017 INFO Cleanup Task: Removing file: Aug 18, 2017 3:53:45 PM /mnt/data1/prod/stage-nj1/exhibitor.services/zookeeper_snapshots/version-2/snapshot.2948dba4e6
Fri Aug 18 16:02:17 UTC 2017 ERROR Cleanup Task: Failed to remove /mnt/data1/prod/stage-nj1/exhibitor.services/zookeeper_transactions/version-2/log.2948d9175a
Fri Aug 18 16:02:17 UTC 2017 INFO Cleanup Task: Removing file: Jan 1, 1970 12:00:00 AM /mnt/data1/prod/stage-nj1/exhibitor.services/zookeeper_transactions/version-2/log.2948d9175a

The timestamp logged for the file suggests that .lastModified() here is returning zero (the Unix epoch), which points to a missing file or some other I/O error.
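
For what it's worth, that matches the documented behavior of java.io.File.lastModified(), which returns 0L when the file does not exist or an I/O error occurs. A minimal sketch (the path and class name below are just for illustration, not Exhibitor's code):

import java.io.File;
import java.util.Date;

public class LastModifiedDemo {
    public static void main(String[] args) {
        // Hypothetical path standing in for a txn log that has vanished.
        File missing = new File("/tmp/no-such-dir/log.2948d9175a");
        long ts = missing.lastModified(); // 0L: file missing or I/O error
        System.out.println(ts + " -> " + new Date(ts)); // prints the Jan 1, 1970 epoch date
    }
}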

I'm also intrigued by the simultaneous "Cleanup task completed" log events. Are multiple cleanup jobs being launched by Exhibitor, causing a race condition?

I'm not actually familiar with that part of Exhibitor, but I can take a look. Will let you know.

Are you able to reproduce the duplicate "Cleanup task completed" situation consistently?
If so, this could be a bug. As far as I can tell, if Exhibitor executed the cleanup task twice, it would run this command twice:
java -cp zookeeper.jar:log4j.jar:conf org.apache.zookeeper.server.PurgeTxnLog <dataDir> <snapDir> -n <count>
I think there are a number of possible causes:

  1. Some kind of race condition when reading/writing the cleanup period based on the value read from the configs (see the sketch after this list).
  2. The system time got changed at a bad moment and caused Exhibitor to misinterpret the cleanup period. I say this because I know we call System.currentTimeMillis() in 1.5.5.
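
To illustrate cause 1 with plain java.io.File semantics (a hypothetical sketch, not Exhibitor's actual cleanup code): if two overlapping cleanup passes list the same txn log, the slower pass would see delete() return false (the "Failed to remove" ERROR) and lastModified() return 0 (the Jan 1, 1970 timestamp), which is exactly the logged pair of lines.

import java.io.File;
import java.util.Date;

public class DoubleCleanupSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for a single txn log that two overlapping passes both listed.
        File log = File.createTempFile("log.", ".txn");

        System.out.println("pass 1 removed: " + log.delete()); // true
        // The second pass, working from its now-stale listing, tries again:
        System.out.println("pass 2 removed: " + log.delete()); // false -> "Failed to remove"
        System.out.println("timestamp now: " + new Date(log.lastModified())); // Jan 1, 1970
    }
}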

FYI, ZK 3.5.x (not sure about 3.4.x) includes its own log cleanup task.
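
If I remember correctly that's the autopurge feature, configured in zoo.cfg roughly like this (I believe it's available in the 3.4.x line as well, though it's disabled by default):

# keep only the 3 most recent snapshots and their txn logs
autopurge.snapRetainCount=3
# run the purge task every 24 hours (0, the default, disables it)
autopurge.purgeInterval=24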

Will try to see if I can reproduce this locally.
Would it be possible for you to post your exhibitor.properties file and the command-line arguments you use to launch Exhibitor?

Sure. Here's the exhibitor config file we use:

log-index-directory : /mnt/data1/prod/prod-ny1/exhibitor.services/zookeeper_transactions_log_index
java-environment : export JVMFLAGS=\"-Xmx6144m\"
connect-port : 5555
backup-max-store-ms : 259200000
zookeeper-log-directory : /mnt/data1/prod/prod-ny1/exhibitor.services/zookeeper_transactions
cleanup-max-files : 10
auto-manage-instances : 1
cleanup-period-ms : 30000
backup-period-ms : 60000
client-port : 4444
zoo-cfg-extra : tickTime\=2000&initLimit\=10&syncLimit\=5&maxClientCnxns\=200
zookeeper-install-directory : /mnt/data1/prod/prod-ny1/exhibitor.services/install/zookeeper
auto-manage-instances-settling-period-ms : 60000
check-ms : 30000
auto-manage-instances-fixed-ensemble-size : 5
observer-threshold : 0
backup-extra : directory\=/mnt/data1/prod/prod-ny1/exhibitor.services/zookeeper_backup
zookeeper-data-directory : /mnt/data1/prod/prod-ny1/exhibitor.services/zookeeper_snapshots
election-port : 6666

And here's the command-line for exhibitor:
java -Xmx1024m -Dlog4j.configuration=file:///var/lib/mesos/work_dir/slaves/6ba5ce01-66f9-4d74-8a89-75c0a4abbf91-S36/frameworks/20160121-220626-3614211594-5050-52843-0000/executors/prod.prod-ny1.exhibitor.services.95fd837f-716e-11e7-b76b-a0481cb9ec84/runs/982a7b95-9cb8-4e6f-9b37-b2c17d38437f/local-exhibitor-log4j.properties -jar exhibitor-1.5.5.jar --zkconfigconnect 1.2.3.4:2181,1.2.3.5:2181,1.2.3.6:2181,1.2.3.7:2181,1.2.3.8:2181 --zkconfigpollms 100 --hostname 1.2.3.9 --configtype zookeeper --zkconfigexhibitorport 8080 --zkconfigretry 1000:3 --filesystembackup true --port 31335 --zkconfigzpath /prod/prod-ny1/exhibitor.services/config --defaultconfig exhibitor.conf

The IPs are fake, but I've pasted everything else as-is.

Thanks for looking into this! I also want to raise the possibility that the repeated "Cleanup task completed" lines are simply a red herring, and that there's a problem elsewhere in the cleanup logic causing the target files to go missing.

@xiaochuanyu Were you able to reproduce this issue? Thanks!

No, I wasn't able to repro it, actually.