Only upload changed files to (a different) S3 bucket

We have a PR build that uploads the generated (HTML) output to a public S3 bucket, so you can check the results before merging. This is useful, but the output has grown over time and is now ~6GB, so the job takes a long time to run and uploads a lot of files that haven't actually changed.

I recently switched the trunk build to use sync from the AWS CLI (rather than s3cmd), which was noticeably faster, so I thought I'd try the --dryrun flag, to generate a diff against the production bucket:

docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 sync output/ s3://foo-prod --dryrun --size-only

Unfortunately, there are no machine-readable output options for that command, so we need to get our awk on. My first attempt was to generate a cp command for each line:

docker run ... | awk '{sub(/output\//, ""); sub(/&/, "\\\\&"); print "docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 cp output/"$3" s3://foo-pr/"ENVIRON["GIT_COMMIT"]"/"$3}'

Once you’re satisfied the incantation looks correct, you can pipe the whole lot to bash:

docker run ... | awk ... | bash

With this working locally, it seemed simple to just run that command as a pipeline step. It was not. Trying to escape the combination of quotes in Groovy proved fruitless, and in the end I just threw the pipeline into a bash script, and called that from the Jenkinsfile with an sh step.
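
I haven't reproduced the exact script here, but it was essentially the pipeline above dropped into a file; something like this (upload-changed.sh is an arbitrary name, and GIT_COMMIT is provided by Jenkins):

#!/bin/bash
set -euo pipefail

# Diff against the prod bucket, turn each changed file into a cp command, and run them.
docker run --rm -v $PWD:/app -w /app \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  amazon/aws-cli s3 sync output/ s3://foo-prod --dryrun --size-only \
  | awk '{sub(/output\//, ""); sub(/&/, "\\\\&"); print "docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 cp output/"$3" s3://foo-pr/"ENVIRON["GIT_COMMIT"]"/"$3}' \
  | bash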

While this solved one problem, it created another: the build now took nearly twice as long to run, presumably due to copying the files one at a time. I was considering using the SDK, when I realised I could just copy the changed files locally, and sync that folder instead:
mkdir changed
docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 sync output/ s3://foo-prod --dryrun --size-only | awk '{sub(/&/, "\\\\&"); print "cp "$3" changed/"}' | bash
docker run --rm -v $PWD:/app -w /app -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY amazon/aws-cli s3 sync changed/ s3://foo-pr/$GIT_COMMIT --size-only

Finally: a build that is quicker, and only uploads the changed files!

RDS PostgreSQL WALWriteLock

We recently had a service degradation/outage, which manifested as a spike in WALWriteLock waits in Performance Insights.

The direct cause was autovacuum on a large (heavily updated) table, but it had run ~1 hour earlier, without any issues.

Our short-term solution was to raise the av threshold and kill the running process; but it would have to run again at some point, or we'd be in real trouble.

We checked the usual suspects for av slowdown, but couldn’t find any transaction older than the av process itself, or any abandoned replication slots.
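
For reference, the sort of checks we mean boil down to looking at pg_stat_activity and pg_replication_slots; something like this with psql (connection details are placeholders):

# anything older than the autovacuum worker, holding back the xmin horizon?
psql -h foo.bar.region.rds.amazonaws.com -U user db -c "SELECT pid, xact_start, state, query FROM pg_stat_activity ORDER BY xact_start NULLS LAST LIMIT 10;"

# any abandoned replication slots pinning WAL?
psql -h foo.bar.region.rds.amazonaws.com -U user db -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"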

We don’t currently have a replica (although we are using Multi-AZ); but this prompted us to realise that we still had wal_level set to logical, after using DMS to upgrade from pg 10 to 11. That level generates considerably more WAL than the next one down (replica).
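
On RDS, wal_level is controlled by the rds.logical_replication parameter (a static parameter, so it needs a reboot/failover to apply); turning it off is a parameter group change along these lines, with the group name here just a placeholder:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli rds modify-db-parameter-group --db-parameter-group-name foo-pg11 --parameters "ParameterName=rds.logical_replication,ParameterValue=0,ApplyMethod=pending-reboot"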

After turning that off (and failing over), we triggered AV again, but were still seeing high WALWriteLock contention. Eventually, we found this 10-year-old breadcrumb on the pgsql-admin mailing list:

Is it vacuuming a table which was bulk loaded at some time in the past? If so, this can happen any time later (usually during busy periods when many transactions numbers are being assigned)

https://www.postgresql.org/message-id/4DBFF5AE020000250003D1D9%40gw.wicourts.gov

So it seems this was a little treat left for us by DMS which, combined with the extra WAL from logical, was enough to push us over the edge at a busy time.

Once that particular AV had managed to complete, the next one was back to normal.

Triggering a cron lambda

Once you have a lambda ready to run, you need an EventBridge rule to trigger it:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli events put-rule --name foo --schedule-expression 'cron(0 4 * * ? *)'

You can either run it at a regular rate, or at a specific time.
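
The cron expression above is the specific-time flavour (04:00 UTC, daily); a rate expression looks like this instead (the interval is just an example):

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli events put-rule --name foo --schedule-expression 'rate(5 minutes)'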

And your lambda needs the right permissions:

aws-cli lambda add-permission --function-name foo --statement-id foo --action 'lambda:InvokeFunction' --principal events.amazonaws.com --source-arn arn:aws:events:region:account:rule/foo

Finally, you need a targets file:

[{
    "Id": "1",
    "Arn": "arn:aws:lambda:region:account:function:foo"
}]

to add to the rule:

aws-cli events put-targets --rule foo --targets file://targets.json
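
To sanity-check the wiring, listing the targets back out should show the lambda ARN:

aws-cli events list-targets-by-rule --rule foo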

Cron lambda (Python)

For a simple task in Redshift, such as refreshing a materialized view, you can use a scheduled query; but sometimes you really want a proper scripting language, rather than SQL.

You can use a docker image as a lambda now, but I still find uploading a zip easier. And while it’s possible to set up the db creds as env vars, it’s better to use temp creds:

import boto3
import psycopg2

def handler(event, context):
    client = boto3.client('redshift')

    cluster_credentials = client.get_cluster_credentials(
        DbUser='user',
        DbName='db',
        ClusterIdentifier='cluster',
    )

    conn = psycopg2.connect(
        host="foo.bar.region.redshift.amazonaws.com",
        port="5439",
        dbname="db",
        user=cluster_credentials["DbUser"],
        password=cluster_credentials["DbPassword"],
    )

    with conn.cursor() as cursor:
        ...
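        # e.g. to refresh the materialized view mentioned above; the view
        # name here is hypothetical, not from the original code
        cursor.execute("REFRESH MATERIALIZED VIEW mv_foo")

    # psycopg2 runs this inside a transaction, so commit before returning
    conn.commit()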

Once you have the bundle ready:

pip install -r requirements.txt -t ./package
cd package && zip -r ../foo.zip . && cd ..
zip -g foo.zip app.py

You need a trust policy, to allow lambda to assume the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Effect": "Allow",
            "Sid": ""
        }
    ]
}

And a policy for the redshift creds:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "GetClusterCredsStatement",
        "Effect": "Allow",
        "Action": [
            "redshift:GetClusterCredentials"
        ],
        "Resource": [
            "arn:aws:redshift:region:account:dbuser:cluster/db",
            "arn:aws:redshift:region:account:dbname:cluster/db"
        ]
    }]
}

To create the IAM role, and attach the various policies:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam create-role --role-name role --assume-role-policy-document file://trust-policy.json
docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam attach-role-policy --role-name role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam attach-role-policy --role-name role --policy-arn arn:aws:iam::aws:policy/service-role/AWSXRayDaemonWriteAccess
docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam put-role-policy --role-name role --policy-name GetClusterCredentials --policy-document file://get-cluster-credentials.json

And, finally, the lambda itself:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli lambda create-function --function-name foo --runtime python3.7 --zip-file fileb://foo.zip --handler app.handler --role arn:aws:iam::account:role/role --timeout 900

If you need to update the code afterwards:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli lambda update-function-code --function-name foo --zip-file fileb://foo.zip

You can test the lambda in the console. Next time, we’ll look at how to trigger it, using EventBridge.
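
In the meantime, if you'd rather test from the CLI than the console, a plain invoke does the job (out.json just captures the response payload):

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli lambda invoke --function-name foo out.json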

No module named ‘psycopg2._psycopg’

I was trying to set up a python lambda, and fell at the first hurdle: the import error in the title.

What made it confusing was that I had copied an existing lambda that was working fine. I checked a few of the things that were different (the python version, 3.7; even the name of the module/function), with no effect.

I was using psycopg2-binary, and the zip file structure looked right. Eventually, I found a Stack Overflow answer suggesting it could be architecture-related, at which point I realised that I had run pip install using docker, rather than a venv.

I have no idea why that mattered (uname showed the same arch from python:3.7 as my laptop), but onwards to the next problem! 🤷

Terminating TLS at an ALB

While there are definite benefits to having a zero trust network, it’s also convenient to outsource all the certificate management.

First you need to create a cert with ACM, either by importing it, or letting them do it (managed renewal ftw!):

  Certificate:
    Type: AWS::CertificateManager::Certificate
    Properties:
      DomainName: !Ref 'DomainName'
      ValidationMethod: DNS

That in hand, you can create the LB, Listener & Target Group:

  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: foo
      Subnets:
        - !Ref PublicSubnet1
        - ...
      SecurityGroups:
        - !Ref SecurityGroup
  LoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref LoadBalancer
      Port: 443
      Protocol: HTTPS
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref DefaultTargetGroup
      SslPolicy: ELBSecurityPolicy-TLS-1-2-2017-01
      Certificates:
        - CertificateArn: !Ref Certificate
  DefaultTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: foo
      VpcId: !Ref Vpc
      Port: 80
      Protocol: HTTP
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref Vpc
      GroupDescription: Enable HTTPS access for LB
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '443'
          ToPort: '443'
          CidrIp: '0.0.0.0/0'

Make sure you use a public subnet, or you won’t be able to reach the LB!

Ansible, AWS, and bastion hosts, oh my!

There’s some useful info available about using a jump host with Ansible, and AWS dynamic inventory; but either the world has changed since those were written, or my scenario is slightly different.

I defined my inventory first:

plugin: aws_ec2
regions:
  - eu-west-2
keyed_groups:
  - key: tags.Type
    separator: ''
compose:
  ansible_host: private_ip_address

(that last bit is important, otherwise the ssh config below won’t match the hosts). At this point you should be able to list (or graph) the instances you want to connect to:

$ ansible-inventory -i inventories/eu-west-2.aws_ec2.yml --list
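
or, as a tree grouped using the keyed_groups defined above:

$ ansible-inventory -i inventories/eu-west-2.aws_ec2.yml --graph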

Next you need some ssh config:

Host 10.0.*.*
    ProxyCommand ssh -W %h:%p admin@52.56.111.199

I kept it pretty minimal. The IP mask needs to match whatever you used for the subnet(s) the instances are attached to (obvs). And the default login user may vary, depending on the image you used.

You can then use this config when running an ad-hoc command, or your playbook:

ANSIBLE_SSH_ARGS="-F lon_ssh_config" ansible AppServer -i inventories/eu-west-2.aws_ec2.yml -u admin -m ping
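
The same arguments work for an actual playbook run; site.yml here is a stand-in for whatever playbook you have:

ANSIBLE_SSH_ARGS="-F lon_ssh_config" ansible-playbook -i inventories/eu-west-2.aws_ec2.yml -u admin site.yml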

The IP address for the jump host is hard-coded in the ssh config, which isn’t ideal. We may use a DNS record, and update that instead if it changes; but there doesn’t seem to be any easy way to either get the address from the inventory, or update the CNAME automatically.

AutoScalingGroup with a LaunchTemplate

There are plenty of examples for creating an ASG using a CloudFormation template, but those I found all used a "launch configuration".

According to the docs, using a launch template is the new hotness, so I foolishly assumed it would be simple to adapt one to the other.

Some time later, I had a working example:

  AutoScalingGroup:
    Type: 'AWS::AutoScaling::AutoScalingGroup'
    Properties:
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - Fn::ImportValue:
            Fn::Sub: '${Network1StackName}-PublicSubnetId'
        - Fn::ImportValue:
            Fn::Sub: '${Network2StackName}-PublicSubnetId'
      MinSize: 1
      MaxSize: 1
  LaunchTemplate:
    Type: 'AWS::EC2::LaunchTemplate'
    Properties:
      LaunchTemplateData:
        ImageId: "..."
        InstanceType: "..."
        SecurityGroupIds:
          - Fn::ImportValue:
              Fn::Sub: '${SecurityGroupsStackName}-SshIngressSecurityGroupId'
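
For completeness, a deploy along these lines should stand the stack up; the template filename, stack name, and parameter values are all placeholders:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli cloudformation deploy --template-file asg.yml --stack-name foo-asg --parameter-overrides Network1StackName=network-1 Network2StackName=network-2 SecurityGroupsStackName=security-groups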