This project demonstrates an approach that combines S3 multipart upload and Step Functions (Distributed Map) to concurrently download a large file, up to the S3 object size limit of 5 TB (at most 10,000 parts of up to 5 GB each), from any given URL (the server must support HTTP Range requests) and upload it to an S3 bucket.
This project also demonstrates a way to host the source code in CodeCommit and deploy it via CDK Pipelines.
- Partitioner: A Python Lambda that takes `URL` and `SingleTaskSize` as input and fetches the total size of the file at the given URL. Based on the given single task size, it splits the upload into smaller tasks and passes them to the next state.
- Uploader: A Python Lambda, triggered by Step Functions, that uses HTTP Range requests to download a portion of the file and upload it to S3 using multipart upload.
- Step Functions: A state machine that handles task validation, fan-out, retry and error handling, and also creates, completes and aborts the S3 multipart upload.
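As a rough illustration of how the two Lambdas fit together, the sketch below shows a partition step that derives byte-range tasks from the file's `Content-Length` and an upload step that streams one range into a multipart upload part. The function names, the use of `urllib`/`boto3`, and the task shape are assumptions for illustration, not the project's actual handler code.

```python
# Sketch only: hypothetical helpers approximating the Partitioner and Uploader logic.
import urllib.request
import boto3

s3 = boto3.client("s3")


def partition(url: str, single_task_size: int) -> list[dict]:
    """Split the remote file into byte-range tasks (Partitioner-style logic)."""
    # HEAD request to learn the total file size; the server must support Range requests.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        total_size = int(resp.headers["Content-Length"])

    tasks = []
    part_number = 1
    for start in range(0, total_size, single_task_size):
        end = min(start + single_task_size, total_size) - 1  # Range header is inclusive
        tasks.append({"URL": url, "Start": start, "End": end, "PartNumber": part_number})
        part_number += 1
    return tasks


def upload_range(task: dict, bucket: str, key: str, upload_id: str) -> dict:
    """Download one byte range and upload it as a multipart part (Uploader-style logic)."""
    req = urllib.request.Request(
        task["URL"], headers={"Range": f"bytes={task['Start']}-{task['End']}"}
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()

    part = s3.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=task["PartNumber"],
        UploadId=upload_id,
        Body=body,
    )
    return {"PartNumber": task["PartNumber"], "ETag": part["ETag"]}
```

In the actual project, the state machine creates the multipart upload before fanning out the tasks and completes (or aborts) it afterwards, as described above.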
Simply run `make test` to run lint and the unit tests for the Partitioner and Uploader.
- An AWS IAM user with sufficient permissions to deploy:
  - CodeCommit
  - CodeBuild
  - CodePipeline
  - Step Functions
  - Lambda
  - S3
- Set up `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION` and `CDK_DEFAULT_ACCOUNT` in a `.env` file.
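A minimal `.env` might look like the following; the values are placeholders, not real credentials or account IDs.

```
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_DEFAULT_REGION=us-east-1
CDK_DEFAULT_ACCOUNT=123456789012
```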
This project uses AWS CodeCommit to host the source code and CDK Pipelines to deploy. Simply run `make ci-deploy` to run
lint and build, create a new repository in CodeCommit, push the source code, and deploy the project's CDK Pipeline.
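For context, a self-mutating CDK Pipeline sourced from CodeCommit typically looks something like the sketch below. The stack name, repository name, branch, and synth commands are illustrative assumptions, not this project's actual pipeline definition.

```python
# Sketch only: a hypothetical CDK Pipelines stack sourced from CodeCommit (CDK v2, Python).
from aws_cdk import Stack, pipelines
from aws_cdk import aws_codecommit as codecommit
from constructs import Construct


class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Assumes the repository created by `make ci-deploy`; the name is a placeholder.
        repo = codecommit.Repository.from_repository_name(
            self, "Repo", "s3-multipart-uploader"
        )

        pipelines.CodePipeline(
            self,
            "Pipeline",
            synth=pipelines.ShellStep(
                "Synth",
                input=pipelines.CodePipelineSource.code_commit(repo, "main"),
                commands=[
                    "npm install -g aws-cdk",
                    "pip install -r requirements.txt",
                    "cdk synth",
                ],
            ),
        )
```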
Below is an example Step Functions payload that uploads the AWS CLI installer file to S3.
```json
{
  "URL": "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip",
  "SingleTaskSize": 6000000
}
```
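One way to start an execution with this payload is via boto3, as in the sketch below; the state machine ARN is a placeholder and should be replaced with the one deployed by this project.

```python
# Sketch only: start a state machine execution with the example payload.
import json
import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    # Placeholder ARN; use the state machine deployed by this project.
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:MultipartDownloadUploader",
    input=json.dumps(
        {
            "URL": "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip",
            "SingleTaskSize": 6000000,
        }
    ),
)
print(response["executionArn"])
```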