WORK IN PROGRESS: outstanding work is tracked against the v1.0.0 milestone.
Copy each file in a list from one bucket to another. Makes use of AWS Lambda to run many copy operations concurrently and increase processing speed. Uses the AWS S3 copy command to keep the cost of operations low.
Created out of a need to move ~10 million objects.
The main endpoint accepts a list of sources and destinations. It asynchronously sends each source/destination pair to another Lambda function, which processes that pair. Once every pair has been processed, the main function returns the status of each operation and a summary of successes and failures.
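As a rough local illustration of this fan-out pattern (not the project's actual code), the sketch below runs one copy per pair concurrently and records either an ETag or an error for each. The names `copy_all` and `copy_one` are hypothetical, and a thread pool stands in for the asynchronous invocation of a second Lambda function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source URI, destination URI)


def copy_all(pairs: List[Pair], copy_one: Callable[[str, str], str],
             max_workers: int = 8) -> List[Dict[str, str]]:
    """Run copy_one for every pair concurrently and collect one result per pair.

    copy_one is a hypothetical callable that copies a single object and
    returns its ETag, raising an exception on failure.
    """

    def attempt(pair: Pair) -> Dict[str, str]:
        source, destination = pair
        try:
            etag = copy_one(source, destination)
            return {"source": source, "destination": destination, "etag": etag}
        except Exception as exc:
            # Record the failure instead of aborting the whole batch.
            return {"source": source, "destination": destination, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input pairs.
        return list(pool.map(attempt, pairs))
```

Because a failed pair is captured as a result rather than raised, one bad source file does not stop the remaining copies, and the success and failure counts can be derived from the returned list.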
This function relies on the Serverless framework to
operate, so install that with `npm install -g serverless`.
Then deploy these aws-s3cp-bulk functions to AWS Lambda with `serverless deploy`.
The function expects a CSV which contains a list of source and destination files for the copy command. How you create this list will depend on your specific circumstances, but this file must be uploaded to S3 and made available to the function. To do this, upload the file to your 'operational bucket'.
Deployment of the functions will create an 'operational bucket' in your AWS
account, which will be used for the input files and output reports. This bucket
is named using a format of `aws-s3cp-bulk-{AWS::Region}-{AWS::AccountId}`. Add a
CSV file to this bucket, containing a list of the files which you want to copy.
This CSV should have 2 columns:
source file path | destination file path |
---|---|
s3://bucket-name/file/path.ext | s3://another-bucket-name/file/path_new.ext |
Your file should NOT have a header row; the table above only shows the format of the file and an example.
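To make the expected input concrete, here is a minimal sketch of reading such a file and splitting each URI into bucket and key. It assumes the columns are comma-separated; `S3Location`, `parse_s3_uri` and `read_pairs` are hypothetical names used only for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple, Tuple


class S3Location(NamedTuple):
    bucket: str
    key: str


def parse_s3_uri(uri: str) -> S3Location:
    """Split 's3://bucket/key' into its bucket and key parts."""
    uri = uri.strip()
    if not uri.startswith("s3://"):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI must contain a bucket and a key: {uri!r}")
    return S3Location(bucket, key)


def read_pairs(csv_text: str) -> Iterator[Tuple[S3Location, S3Location]]:
    """Yield (source, destination) for each row of a header-less CSV."""
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(row)}")
        yield parse_s3_uri(row[0]), parse_s3_uri(row[1])
```

Validating every row before starting is cheap compared with discovering a malformed line partway through millions of copy operations.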
Before you run the function, you will need to ensure that you grant the function
appropriate access to the source files and destination bucket. The function will
operate using a new IAM role, named along the lines of
`aws-s3cp-bulk-development-{AWS::Region}-lambdaRole`.
Add permissions to that role to allow read from the source files and write to
the destination bucket.
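As a starting point, the permissions described above might look like the policy document generated below. This is a sketch under assumptions: the helper `build_copy_policy` and the bucket names are hypothetical, and only the two core actions an S3 copy relies on (`s3:GetObject` on the source, `s3:PutObject` on the destination) are included. Buckets that use KMS encryption, object tags or ACLs may need further actions.

```python
import json
from typing import Dict


def build_copy_policy(source_bucket: str, destination_bucket: str) -> Dict:
    """Build an IAM policy document allowing reads from one bucket and writes to another."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSourceObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Object-level actions apply to the objects, hence the trailing /*.
                "Resource": f"arn:aws:s3:::{source_bucket}/*",
            },
            {
                "Sid": "WriteDestinationObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{destination_bucket}/*",
            },
        ],
    }


# Example bucket names taken from the CSV example; replace with your own.
print(json.dumps(build_copy_policy("bucket-name", "another-bucket-name"), indent=2))
```

The printed JSON can be attached to the role as an inline policy through the IAM console or your usual tooling.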
You can now invoke the function using
`serverless invoke --function bulk-cp --data {LIST_OF_FILES_TO_COPY.CSV}`
Results will be generated and written to your 'operational bucket' alongside
your input file using the naming scheme of
`./reports/{INPUT_NAME}_report_{YYYY-MM-DDTHH:mm:ss}.json`.
Inside this file you will find a summary object and a detailed list of each file copy command's result.
The summary reports the total number of operations, the number of successes and the number of failures. It also reports some runtime parameters: the time the process started and the CSV input file used.
The detail will report the source/destination file pair, along with either the ETag of the file on success, or the error message if an error occurred.
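The sketch below shows one way a report with this shape and name could be assembled. The field names (`total`, `successes`, `failures`, `started`, `input`, `etag`, `error`) are assumptions rather than the project's actual schema, and `{INPUT_NAME}` is assumed to be the input file's base name without its extension.

```python
import posixpath
from datetime import datetime, timezone
from typing import Dict, List


def report_key(input_key: str, started_at: datetime) -> str:
    """Build the report's object key following the naming scheme above."""
    # Assumption: INPUT_NAME is the base name of the input file, extension removed.
    input_name = posixpath.splitext(posixpath.basename(input_key))[0]
    timestamp = started_at.strftime("%Y-%m-%dT%H:%M:%S")
    return f"reports/{input_name}_report_{timestamp}.json"


def build_report(details: List[Dict[str, str]], input_key: str,
                 started_at: datetime) -> Dict:
    """Combine per-copy results into a summary plus detail document."""
    # A detail entry carries either an 'etag' (success) or an 'error' (failure).
    failures = sum(1 for entry in details if "error" in entry)
    return {
        "summary": {
            "total": len(details),
            "successes": len(details) - failures,
            "failures": failures,
            "started": started_at.isoformat(),
            "input": input_key,
        },
        "details": details,
    }
```

For example, `report_key("files.csv", datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc))` returns `reports/files_report_2020-01-02T03:04:05.json`.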
The command needs to be fed large amounts of data: around 10 million source filenames, each with a destination. This volume of data is too large to pass to the command directly, which is why the list is supplied as a CSV file. Writing the results to a report file also means the response does not have to arrive within the usual request-response timeouts.
With smaller numbers of files, a JSON POST body may suffice, but there is as yet no API Gateway to accept this kind of request-response mode of operation. Feel free to contribute if you want to add this interface.