Retry after failing data transfer
Hi @lehins!
We're using cache-s3 in a hybrid infrastructure: we have workers both in the cloud and on-premises. cache-s3 is working great, but sometimes uploading the .cache file to S3 fails:
# Saving the compilation cache...
$ cache-s3 --prefix=macos save --relative-path ./ccache
[cache-s3][Info ][Mon, 2 Sep 2019 07:02:04 UTC]: Caching: ./ccache
[cache-s3][Info ][Mon, 2 Sep 2019 07:02:17 UTC]: <cache-s3/macos/master.cache> - No previously stored cache was found.
[cache-s3][Info ][Mon, 2 Sep 2019 07:02:17 UTC]: <cache-s3/macos/master.cache> - Data change detected, caching 65.6 MiB with sha256: sKL7Z16jyBiNDy+nZl+NSxyLVJtwM8aR0wvJYB+ZRKE=
[cache-s3][Error][Mon, 2 Sep 2019 07:03:29 UTC]: <cache-s3/macos/master.cache> - Critical HTTPException
The resulting cache file is only 65 MB, but the compilation took more than an hour to finish (because we started from an empty cache). Since the HTTP error fails our build, we have to restart the whole build.
To solve this issue I propose implementing a retry mechanism for the data transfer. I was looking at the `retry` package, which could be used to recover from an `HTTPException`.
I'm not sure why this exception occurs, so I'd probably start by fixing #18.
What do you think? If you agree with this direction I'd be happy to submit a PR.
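Just to illustrate the idea, here is a minimal sketch of what a `retry`-based wrapper could look like, assuming the upload is an `IO` action and that the failure surfaces as `HttpException` from http-client; `retryingUpload` and the delay/attempt values are hypothetical, not existing cache-s3 code:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad.Catch (Handler (..))
import Control.Retry (constantDelay, limitRetries, recovering)
import Network.HTTP.Client (HttpException)

-- Hypothetical wrapper: rerun the upload action when it throws HttpException,
-- waiting one second between attempts, for at most 3 retries.
retryingUpload :: IO a -> IO a
retryingUpload upload =
  recovering
    (constantDelay 1000000 <> limitRetries 3)                   -- delay is in microseconds
    [\_status -> Handler (\(_ :: HttpException) -> pure True)]  -- retry only on HTTP errors
    (\_status -> upload)
```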
@wagdav I made a long overdue upgrade of the lts snapshot as well as improved exception reporting. So if you want to give that version a shot, it is on the `switch-to-lts-14` branch.
With regards to a retry mechanism, I would not be opposed to it. Having an optional flag `--retry-upload=<n>`, where the default value for `n` is something sensible like 3 or 5 tries, would be good. At the same time, using the `retry` package is overkill, I think. Adding a counter with a `threadDelay` to `runLoggingAWS` that recursively calls itself until the tries are exhausted would be a much easier approach; see the sketch below. If you decide to submit a PR, start off with the `switch-to-lts-14` branch, since I'll be merging it to master soon.
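For illustration, a minimal standalone sketch of that simpler counter-plus-`threadDelay` idea, assuming a fixed one-second delay and that the failure surfaces as `HttpException`; `withRetries` is a hypothetical helper, not the actual `runLoggingAWS` from cache-s3:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Concurrent (threadDelay)
import Control.Exception (throwIO, try)
import Network.HTTP.Client (HttpException)

-- Hypothetical helper: run the action, and on HttpException wait a bit and
-- recurse with a decremented counter until the attempts are exhausted.
withRetries :: Int -> IO a -> IO a
withRetries n action = do
  result <- try action
  case result of
    Right a -> pure a
    Left (e :: HttpException)
      | n > 1 -> do
          threadDelay 1000000        -- one second, in microseconds
          withRetries (n - 1) action
      | otherwise -> throwIO e       -- give up and rethrow the last exception
```

Wired into the CLI, the proposed `--retry-upload=<n>` flag would supply the `n` argument.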
It was indeed
So, @wagdav, because of #26 we have two possible versions:
- v0.1.9 - concurrent uploads, but no retry
- v0.1.10 - retry functionality, but no concurrent uploads.
Hopefully next version will not have to make any compromises :)
Even though we use `cache-s3` internally, unfortunately neither of those features is critical for us, so I can't spend much more time right now on getting those issues resolved. Naturally, PRs are always welcome and, of course, I can always spend more time and take care of it myself; FP Complete is always happy to work with new customers ;)