Only upload changed files to (a different) s3 bucket

We have a PR build that uploads the generated (html) output to a public s3 bucket, so you can check the results before merging. This is useful, but the output has grown over time, and is now ~6GB; so the job takes a long time to run, and uploads a lot of unnecessary files.

I recently switched the trunk build to use sync from the AWS CLI (rather than s3cmd), which was noticeably faster; so I thought I’d try using the --dry-run feature, to generate a diff against the production bucket.

docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 sync output/ s3://foo-prod --dryrun --size-only

Unfortunately, there’s no machine readable output options for that command, so we need to get our awk on. My first attempt was to generate a cp command, for each line:

docker run ... | awk '{sub(/output\//, ""); sub(/&/, "\\\\&"); print "docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 cp output/"$3" s3://foo-pr/"ENVIRON["GIT_COMMIT"]"/"$3}'

Once you’re satisfied the incantation looks correct, you can pipe the whole lot to bash:

docker run ... | awk ... | bash

With this working locally, it seemed simple to just run that command as a pipeline step. It was not. Trying to escape the combination of quotes in groovy proved fruitless, and in the end I just threw in a bash script, and called that from the Jenkinsfile.

While this solved one problem:

The build now took nearly twice as long to run, presumably due to copying files one at a time. I was considering using the SDK, when I realised I could just copy the needed files locally, and sync that folder instead.
mkdir changed
docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 sync output/ s3://foo-prod --dryrun --size-only | awk '{sub(/&/, "\\\\&"); print "cp "$3" changed/"}' | bash
docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 sync changed/ s3://foo-pr/$GIT_COMMIT --size-only

Finally, a build that is both quick(er), and uploads only the changed files!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s