I created this tool to restore a huge number of objects stored in the Glacier tier of an AWS S3 bucket.
AWS offers a storage tier called Glacier. It costs a fraction of S3 Standard and is primarily used for long-term backups, but it has its specifics:
- Restores take up to 48 hours before files become available for download
- You need to request retrieval for every single object (file) in the bucket (see the sketch below)
- You don't know which files are already ready for download, so you have to check them all again and again
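Under the hood, each retrieval request is one S3 `RestoreObject` call per object. A minimal sketch of what a single request looks like with boto3 (an illustration, not the tool's actual code):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def request_restore(bucket, key, days=10, tier="Standard"):
    """Ask S3 to restore one Glacier object for `days` days."""
    try:
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={
                "Days": days,
                "GlacierJobParameters": {"Tier": tier},
            },
        )
    except ClientError as e:
        # A restore for this object may already be in flight; that is fine.
        if e.response["Error"]["Code"] != "RestoreAlreadyInProgress":
            raise
```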
This tool:
- Can generate a list of all objects in the bucket if you need to restore the whole bucket, or you can supply your own list of objects you need to retrieve
- Saves its progress and the list of files it has already requested for retrieval, so it saves you time if you need to run it multiple times
- Is multithreaded, so it can request multiple files at once! (this was a huge boost, from days to hours!)
- Lets you simply check the status of requested files, and keeps a list of files that are already ready for download so it does not check them again
You always need to specify the `--bucket` parameter (the name of the bucket).
You can specify the `--aws-profile` parameter to use a specific profile from `~/.aws/config`.
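For reference, passing a named profile is roughly equivalent to building a boto3 session like this (the profile name here is a placeholder):

```python
import boto3

# Equivalent of --aws-profile my-profile (hypothetical profile name):
session = boto3.Session(profile_name="my-profile")
s3 = session.client("s3")
```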
`generate-object-list` generates an object list into the `<bucket>.objects` file (all objects in the bucket).
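Listing every object in a large bucket requires paginating the results; a minimal sketch of the idea with boto3 (an illustration, reusing the tool's `<bucket>.objects` naming convention):

```python
import boto3

def generate_object_list(bucket):
    """Write every key in the bucket to <bucket>.objects, one per line."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    with open(f"{bucket}.objects", "w") as out:
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                out.write(obj["Key"] + "\n")
```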
`request-objects-restore` uses the `<bucket>.objects` object list and saves the names of already-requested objects to the `<bucket>.progress` file; a threaded sketch follows the parameter list below.
Parameters:
- `--retain-for`: number of days you want to keep the objects restored [Required]
- `--retrieval-tier`: Standard, Bulk, or Expedited (default: Standard) [Optional]
- `--thread-count`: number of worker threads (default: number of CPUs on your machine) [Optional]
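A minimal sketch of how such a threaded request pass can work, reusing the `request_restore` helper sketched earlier (an illustration, not the tool's actual code):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def request_all(bucket, keys, days=10, tier="Standard", threads=None):
    """Request restores over a thread pool and record each requested key
    in <bucket>.progress so a later run can skip it."""
    threads = threads or os.cpu_count()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # request_restore is the single-object helper sketched earlier
        results = pool.map(
            lambda key: (key, request_restore(bucket, key, days, tier)), keys
        )
        with open(f"{bucket}.progress", "a") as progress:
            for key, _ in results:
                progress.write(key + "\n")
```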
`check-objects-status` uses the `<bucket>.objects` object list, compares it to `<bucket>.available`, and checks only files that are not already ready for download (see the sketch after the parameter list).
Parameters:
- `--thread-count`: number of worker threads (default: number of CPUs on your machine) [Optional]
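Checking whether an object is ready boils down to reading the `Restore` header that S3 returns from a `HeadObject` call; a minimal sketch:

```python
import boto3

s3 = boto3.client("s3")

def is_restored(bucket, key):
    """True once the restore has finished and the object is downloadable."""
    head = s3.head_object(Bucket=bucket, Key=key)
    # The header looks like: 'ongoing-request="false", expiry-date="..."';
    # it is absent if no restore was ever requested for this object.
    return 'ongoing-request="false"' in head.get("Restore", "")
```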
This command will traverse the whole bucket and list all the objects inside:

```
./s3_restore.py --bucket <your_bucket> generate-object-list
```
If you want to retrieve only some objects, you have to generate the list yourself: just paste your object paths into a file named `<your_bucket>.objects` in this format:

```
/object1.jpg
/object2.jpg
/some/other/object.jpg
... etc
```

and run the `request-objects-restore` subcommand:
```
./s3_restore.py --bucket <your_bucket> request-objects-restore --retain-for 10
```

To check the status of the requested objects:

```
./s3_restore.py --bucket <your_bucket> check-objects-status
```
I did not create any benchmarks, but in my case requesting retrieval of 700,000 files took around 6-8 hours (doing it the naive way would have taken several days!).
For best performance I recommend creating an EC2 instance (preferably in the same region as your bucket, and with many cores to make use of the multithreading) and running the tool from there; the latency is much lower, and as a result the request rate is much higher :)
AWS S3 handles roughly 3,500 PUT/COPY/POST/DELETE requests per second per prefix, so be aware that there are parallelization limits and set your thread count accordingly (if I remember correctly, the request rate for a single thread is around 5-10 req/s).
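As a rough back-of-the-envelope check (the per-thread rate is the estimate from above, not a measured value):

```python
# Rough sizing: stay under the ~3,500 req/s per-prefix S3 limit.
S3_LIMIT_RPS = 3500
PER_THREAD_RPS = 7  # rough estimate from the 5-10 req/s range above

max_useful_threads = S3_LIMIT_RPS // PER_THREAD_RPS
print(max_useful_threads)  # -> 500
```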