awslabs/mountpoint-s3

unable to remove invalid block


Mountpoint for Amazon S3 version

1.8.0

AWS Region

us-east-1

Describe the running environment

Running on an EC2 instance with Rocky Linux 8.10

Runs as a systemd service

Mountpoint options

/usr/bin/mount-s3 --read-only --allow-other --file-mode 0555 --dir-mode 0555 --part-size 134217728 --metadata-ttl 300 --cache /opt/mountpoint/cache/<bucket name> --max-cache-size 1024 <bucket name> --prefix .fuse/references_nosymlinks/
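
For reference, since this runs as a systemd service, a minimal sketch of how the command above might be wrapped in a unit file. The unit layout, the mount directory /mnt/<bucket name>, and the Type=forking choice are assumptions for illustration; the actual service file was not shared, and the mount directory is omitted from the command above.

# Hypothetical unit file, e.g. /etc/systemd/system/mount-s3-<bucket name>.service
[Unit]
Description=Mountpoint for Amazon S3 (<bucket name>)
Wants=network-online.target
After=network-online.target

[Service]
# mount-s3 daemonizes by default, hence Type=forking.
# /mnt/<bucket name> is a placeholder; the report does not include the mount directory.
Type=forking
ExecStart=/usr/bin/mount-s3 --read-only --allow-other --file-mode 0555 --dir-mode 0555 --part-size 134217728 --metadata-ttl 300 --cache /opt/mountpoint/cache/<bucket name> --max-cache-size 1024 <bucket name> /mnt/<bucket name> --prefix .fuse/references_nosymlinks/
ExecStop=/usr/bin/fusermount -u /mnt/<bucket name>

[Install]
WantedBy=multi-user.target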

What happened?

Mountpoint had been running cleanly for months and then hard-failed.

A fusermount -zu was required before we could remount.
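
Roughly, the recovery looked like this (the mount directory and service name are placeholders here):

# Lazy-unmount the dead FUSE mount, then restart the wrapping service
fusermount -zu /mnt/<bucket name>
systemctl restart <service name>.service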

The logs below appeared for many different object keys before the crash, all in the same directory in the bucket.

Relevant log output

Sep 29 00:38:10 <host name> mount-s3[15471]: [WARN] mountpoint_s3::prefetch::caching_stream: error reading block from cache cache_key=ObjectId { inner: InnerObjectId { key: "<object key>", etag: ETag("\"51fbfbec40872e0057cd626920cb58e7-24\"") } } block_index=30 range=18874368..85983232 out of 3101804844 error=IoFailure(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
Sep 29 00:38:10 <host name> mount-s3[15471]: [WARN] mountpoint_s3::data_cache::disk_data_cache: unable to remove invalid block: Os { code: 2, kind: NotFound, message: "No such file or directory" }
Sep 29 00:38:10 <host name> mount-s3[15471]: [WARN] mountpoint_s3::prefetch::caching_stream: error reading block from cache cache_key=ObjectId { inner: InnerObjectId { key: "<object key>", etag: ETag("\"51fbfbec40872e0057cd626920cb58e7-24\"") } } block_index=30 range=29360128..230686720 out of 3101804844 error=IoFailure(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
Sep 29 00:38:27 <host name> mount-s3[15471]: [WARN] mountpoint_s3::data_cache::disk_data_cache: block could not be deserialized: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
Sep 29 00:38:27 <host name> mount-s3[15471]: [WARN] mountpoint_s3::prefetch::caching_stream: error reading block from cache cache_key=ObjectId { inner: InnerObjectId { key: "<object key>", etag: ETag("\"51fbfbec40872e0057cd626920cb58e7-24\"") } } block_index=586 range=614465536..2761949184 out of 3101804844 error=InvalidBlockContent
Sep 29 00:38:59 <host name> mount-s3[15471]: [WARN] mountpoint_s3::data_cache::disk_data_cache: unable to remove invalid block: Os { code: 2, kind: NotFound, message: "No such file or directory" }
Sep 29 00:39:15 <host name> systemd[1]: <service name>.service: Main process exited, code=killed, status=6/ABRT
Sep 29 00:39:15 <host name> systemd[1]: <service name>.service: Failed with result 'signal'.

Hi @daltschu22, the warnings indicate that Mountpoint cannot retrieve the data stored in the local cache. Could there be another process modifying or deleting files in the cache directory /opt/mountpoint/cache/<bucket name>? We recommend avoiding that since Mountpoint will automatically manage the files in the cache to respect the specified max-cache-size.
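
If it helps to rule that out, one way to check whether anything else touches that directory is to watch it with auditd or look for open file handles; a sketch, assuming auditd and lsof are available on the host:

# Watch the cache directory for writes and attribute changes, tagged with a key
auditctl -w /opt/mountpoint/cache/<bucket name> -p wa -k mountpoint-cache
# Later, list any recorded accesses (look for processes other than mount-s3)
ausearch -k mountpoint-cache
# Or check for processes currently holding files under the cache open
lsof +D /opt/mountpoint/cache/<bucket name>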

That said, it is not at all clear whether these issues are related to Mountpoint crashing. Are you able to reproduce the crash? If so, could you enable debug logging (--debug, see docs) and provide more details on what is happening before the crash?
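
For example, re-running with --debug added (and optionally --log-directory, so the logs go to files rather than syslog) would look roughly like this, based on your command above; the log directory path is just an example:

/usr/bin/mount-s3 --debug --log-directory /var/log/mount-s3/<bucket name> --read-only --allow-other --file-mode 0555 --dir-mode 0555 --part-size 134217728 --metadata-ttl 300 --cache /opt/mountpoint/cache/<bucket name> --max-cache-size 1024 <bucket name> --prefix .fuse/references_nosymlinks/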

Thanks @passaro !

I can't see a reason why anything else would have been modifying data in that cache directory. The permissions don't allow normal users on the machine to modify anything in there.

Unfortunately these machines are used by numerous scientists for various purposes, so it's hard to say specifically what the mount was being used for at the time of the crash. I will say we have 5 Mountpoint mounts on the system and none of the others experienced any issues. We also didn't see any other indications of the machine itself having problems, only the Mountpoint service for that bucket specifically.

I will be happy to update if we do see the problem again, but I'm not sure we want to run in debug mode all the time until it happens.

I figured these logs wouldn't point to any smoking guns, but I wanted to document them either way. Happy to close this out and reopen if it comes back up; up to you.

Thank you!

Before closing, can I ask if you already checked whether this is another occurrence of out-of-memory, like #674?
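
In case it helps to double-check outside of your monitoring, one way is to search the kernel log around the crash time; a sketch, with placeholder timestamps:

# Look for OOM-killer activity in the kernel log around the crash window
journalctl -k --since "<crash date> 00:30" --until "<crash date> 00:45" | grep -iE 'out of memory|oom'
# Or, if journald history is short, check dmesg directly
dmesg -T | grep -iE 'out of memory|oom'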

We didn't see any out-of-memory events in our monitoring. The machine seemed fine otherwise.

Thanks @daltschu22. I'll close this for now. But please do reopen if the issue occurs again and/or you have more information.