crate/crate-operator

Cluster restart due to update led to broken snapshots

turbo-ele opened this issue · 2 comments

One of our clusters was updated from 4.2.2 to 4.2.3 and was (possibly) restarted in the middle of a snapshot. This led to all subsequent (incremental) snapshots being broken:

image

Excerpt from failures column:

Array[210]
1: SnapshotShardFailure{shardId=[.partitioned.raw.04732d9p68rjgd1g60o30c1g][1], reason='[.partitioned.raw.04732d9p68rjgd1g60o30c1g/IzAqPzKJSBSx03f4kiWzVQ][[.partitioned.raw.04732d9p68rjgd1g60o30c1g][1]] IndexShardSnapshotFailedException[java.nio.file.NoSuchFileException: Blob object [index-rz6hd7k5Sc6Y98g2onKbqQ] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 718995D51323DB94; S3 Extended Request ID: MNDADspuM6okpk3M2OKGuTxLiLplDzaRH4P1sCoybkJ86+rhp4eGCvbKVJDbGuswe5PSX4Yg2qs=; Proxy: null)]; nested: NoSuchFileException[Blob object [index-rz6hd7k5Sc6Y98g2onKbqQ] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 718995D51323DB94; S3 Extended Request ID: MNDADspuM6okpk3M2OKGuTxLiLplDzaRH4P1sCoybkJ86+rhp4eGCvbKVJDbGuswe5PSX4Yg2qs=; Proxy: null)];
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$27(BlobStoreRepository.java:950)
    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1128)
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337)
    at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.nio.file.NoSuchFileException: Blob object [index-rz6hd7k5Sc6Y98g2onKbqQ] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 718995D51323DB94; S3 Extended Request ID: MNDADspuM6okpk3M2OKGuTxLiLplDzaRH4P1sCoybkJ86+rhp4eGCvbKVJDbGuswe5PSX4Yg2qs=; Proxy: null)
    at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:100)
    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:136)
    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:115)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:1391)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:967)
    ... 5 more
', nodeId='I1bgTGttT1Kgom36qjzbBQ', status=INTERNAL_SERVER_ERROR}
...

If that is what happened, the operator should take ongoing snapshots into account when restarting a cluster.

In the end, creating a new repository resolved the issue.

Thanks for the report, @turbo-ele. This is heavily dependent on #46.

The operator is now snapshot-aware and will not perform update operations while a snapshot is in progress.
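
For anyone who wants a similar guard in their own tooling, here is a minimal sketch of the idea. It is not the operator's actual implementation. It assumes a DB-API style cursor connected to the cluster and relies on the `state` column of CrateDB's `sys.snapshots` table reporting `IN_PROGRESS` for running snapshots. The function names, the timeout, and the poll interval are hypothetical.

```python
import time

# Counts snapshots that are still being written, across all repositories.
IN_PROGRESS_QUERY = (
    "SELECT count(*) FROM sys.snapshots WHERE state = 'IN_PROGRESS'"
)


def snapshot_in_progress(cursor) -> bool:
    """Return True if at least one snapshot is currently running.

    `cursor` is assumed to be a DB-API cursor (execute/fetchone).
    """
    cursor.execute(IN_PROGRESS_QUERY)
    row = cursor.fetchone()
    return bool(row and row[0] > 0)


def wait_until_no_snapshot(
    cursor,
    timeout: float = 600.0,
    poll_interval: float = 5.0,
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Poll until no snapshot is running or the timeout expires.

    Returns True when it is safe to proceed with a restart, and False
    if a snapshot was still running once `timeout` seconds had passed.
    `sleep` and `clock` are injectable so the loop can be tested.
    """
    deadline = clock() + timeout
    while True:
        if not snapshot_in_progress(cursor):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A caller would run `wait_until_no_snapshot(cursor)` before triggering the rolling restart and abort or retry later if it returns `False`. Note that a snapshot could still start between the check and the restart, so this narrows the window rather than closing it.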