March 2022 – Look on my works, ye Mighty, and despair!

Triggering a cron lambda

March 31, 2022 by Graham Hay

Once you have a lambda ready to run, you need an EventBridge rule to trigger it:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli events put-rule --name foo --schedule-expression 'cron(0 4 * * ? *)'

The schedule expression can either be a regular rate, e.g. rate(1 day), or a specific time, like the cron expression above, which fires at 04:00 UTC every day.
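
If you are generating these expressions from a script, a rough sketch of the two forms might look like this. The helper names are made up for illustration (they are not part of any AWS library), and the cron helper only covers the "once a day" case:

```python
def rate_expression(value: int, unit: str) -> str:
    """Build a rate() schedule expression, e.g. rate(5 minutes)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge wants the singular unit for 1, and the plural otherwise
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def daily_cron_expression(hour: int, minute: int = 0) -> str:
    """Build a cron() expression that fires once a day (times are UTC)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour/minute out of range")
    # Fields: minutes hours day-of-month month day-of-week year.
    # One of day-of-month / day-of-week has to be "?".
    return f"cron({minute} {hour} * * ? *)"


print(rate_expression(1, "day"))   # rate(1 day)
print(daily_cron_expression(4))    # cron(0 4 * * ? *)
```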

And your lambda needs a permission that allows the rule to invoke it:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli lambda add-permission --function-name foo --statement-id foo --action 'lambda:InvokeFunction' --principal events.amazonaws.com --source-arn arn:aws:events:region:account:rule/foo

Finally, you need a targets file (targets.json):

[{
    "Id": "1",
    "Arn": "arn:aws:lambda:region:account:function:foo"
}]

to add to the rule:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli events put-targets --rule foo --targets file://targets.json
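
The same three steps can also be done from Python with boto3. This is only a sketch: schedule_lambda is a made-up helper, and it takes the clients as arguments so you can pass in whatever session/profile you normally use:

```python
def schedule_lambda(events_client, lambda_client, rule_name, schedule,
                    function_name, function_arn):
    """Create the rule, allow it to invoke the lambda, and attach the target."""
    # Same as `events put-rule`
    rule = events_client.put_rule(Name=rule_name, ScheduleExpression=schedule)

    # Same as `lambda add-permission`
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=rule_name,
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )

    # Same as `events put-targets`, with the targets inline instead of a file
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule["RuleArn"]


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   schedule_lambda(
#       boto3.client("events"), boto3.client("lambda"),
#       "foo", "cron(0 4 * * ? *)",
#       "foo", "arn:aws:lambda:region:account:function:foo",
#   )
```

Using the RuleArn returned by put_rule as the SourceArn saves having to build the rule ARN by hand.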

Cron lambda (Python)

March 30, 2022 (updated March 31, 2022) by Graham Hay

For a simple task in Redshift, such as refreshing a materialized view, you can use a scheduled query; but sometimes you really want a proper scripting language, rather than SQL.

You can use a docker image as a lambda now, but I still find uploading a zip easier. And while it’s possible to set up the db creds as env vars, it’s better to use temp creds:

import boto3
import psycopg2

def handler(event, context):
    client = boto3.client('redshift')

    # Ask Redshift for a temporary password, rather than storing one
    cluster_credentials = client.get_cluster_credentials(
        DbUser='user',
        DbName='db',
        ClusterIdentifier='cluster',
    )

    conn = psycopg2.connect(
        host="foo.bar.region.redshift.amazonaws.com",
        port=5439,
        dbname="db",
        # The returned DbUser is prefixed with "IAM:", so use it as-is
        user=cluster_credentials["DbUser"],
        password=cluster_credentials["DbPassword"],
    )

    with conn.cursor() as cursor:
        ...

Once the code is ready, build the bundle:

pip install -r requirements.txt -t ./package
cd package && zip -r ../foo.zip . && cd ..
zip -g foo.zip app.py
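
If zip isn't available (or you want the bundling step in a script), roughly the same thing can be done with Python's zipfile module. A minimal sketch, assuming the dependencies have already been pip installed into ./package; build_bundle is a made-up name:

```python
import zipfile
from pathlib import Path


def build_bundle(package_dir, handler_file, zip_path):
    """Zip the installed dependencies and the handler into one archive."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Dependencies go at the root of the zip, not under package/
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                bundle.write(path, path.relative_to(package_dir).as_posix())
        # The handler module sits alongside them
        bundle.write(handler_file, Path(handler_file).name)
    return zip_path


# Usage:
#   build_bundle("package", "app.py", "foo.zip")
```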

You need a trust policy (trust-policy.json), to allow lambda to assume the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Effect": "Allow",
            "Sid": ""
        }
    ]
}

And a policy for the redshift creds (get-cluster-credentials.json):

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "GetClusterCredsStatement",
        "Effect": "Allow",
        "Action": [
            "redshift:GetClusterCredentials"
        ],
        "Resource": [
            "arn:aws:redshift:region:account:dbuser:cluster/user",
            "arn:aws:redshift:region:account:dbname:cluster/db"
        ]
    }]
}
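
If you have more than one cluster or user, it may be easier to generate this policy than to edit it by hand. A small sketch, using the same placeholder values as the rest of this post; cluster_credentials_policy is a made-up helper:

```python
import json


def cluster_credentials_policy(region, account, cluster, db_user, db_name):
    """Build the GetClusterCredentials policy document as a dict."""
    prefix = f"arn:aws:redshift:{region}:{account}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": ["redshift:GetClusterCredentials"],
            "Resource": [
                # dbuser ARNs end in cluster/user, dbname ARNs in cluster/database
                f"{prefix}:dbuser:{cluster}/{db_user}",
                f"{prefix}:dbname:{cluster}/{db_name}",
            ],
        }],
    }


policy = cluster_credentials_policy("region", "account", "cluster", "user", "db")
print(json.dumps(policy, indent=4))
```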

Then create an IAM role, and attach the policies:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam create-role --role-name role --assume-role-policy-document file://trust-policy.json
docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam attach-role-policy --role-name role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam attach-role-policy --role-name role --policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess
docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli iam put-role-policy --role-name role --policy-name GetClusterCredentials --policy-document file://get-cluster-credentials.json

And, finally, the lambda itself:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli lambda create-function --function-name foo --runtime python3.7 --zip-file fileb://foo.zip --handler app.handler --role arn:aws:iam::account:role/role --timeout 900

If you need to update the code afterwards:

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli lambda update-function-code --function-name foo --zip-file fileb://foo.zip

You can test the lambda in the console. Next time, we’ll look at how to trigger it, using EventBridge.

No module named ‘psycopg2._psycopg’

March 23, 2022 by Graham Hay

I was trying to set up a python lambda, and fell at the first hurdle, with the import error in the title.

What made it confusing was that I had copied an existing lambda that was working fine. I checked a few things that were different, like the python version (3.7) and even the name of the module/function, but none of them had any effect.

I was using psycopg2-binary, and the zip file structure looked right. Eventually, I found a Stack Overflow answer suggesting it could be architecture related, at which point I realised that I had pip installed using docker, rather than venv.

I have no idea why that mattered (uname showed the same arch from python:3.7 as my laptop), but onwards to the next problem! 🤷
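
One thing that may be worth checking in this situation is which compiled extension actually ended up in the zip. The filename of the _psycopg module usually includes the Python version and platform it was built for (something like cpython-37m-x86_64-linux-gnu), so a mismatch with the lambda runtime is at least visible. This is only a diagnostic sketch, not a confirmed explanation of the problem above, and compiled_extensions is a made-up helper:

```python
import zipfile


def compiled_extensions(zip_path, package="psycopg2"):
    """List the compiled extension modules bundled for a package."""
    with zipfile.ZipFile(zip_path) as bundle:
        return sorted(
            name for name in bundle.namelist()
            if name.startswith(package + "/")
            # .so on Linux/macOS, .pyd on Windows
            and name.endswith((".so", ".pyd"))
        )


# Usage:
#   for name in compiled_extensions("foo.zip"):
#       print(name)
```

If the list is empty, or the tag doesn't match the runtime, the import of psycopg2._psycopg is likely to fail.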

Array projection using jq

March 16, 2022 by Graham Hay

jq is an easy way to manipulate json, for example an API response:

curl "https://something.com/api" -d '{...}' | jq '.'

If you have an array of objects:

[{
    "id": 1,
    ...
}, {
    "id": 2,
    ...
}]

and want to extract a single field, you can use map:

curl ... | jq 'map(.id)'
[
    1,
    2
]
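
For comparison, the same projection in Python is a list comprehension. The sample response here is made up, with a "name" field standing in for the elided ones:

```python
import json

# Made-up sample data, standing in for the API response
response = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

items = json.loads(response)
ids = [item["id"] for item in items]  # same projection as jq's map(.id)
print(ids)  # [1, 2]
```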

Column aliases are not supported (Redshift)

March 4, 2022 by Graham Hay

I was trying to create a materialized view recently, and got this error:

WARNING:  An incrementally maintained materialized view could not be created, 
reason: Column aliases are not supported. The materialized view created, ***, 
will be recomputed from scratch for every REFRESH.

My view definition did include some column aliases:

CREATE MATERIALIZED VIEW foo
AUTO REFRESH YES
AS
    SELECT
        trunc(date_col) date,
        platform,
        operator,
        category,
        game_name,
        count(*) as count1,
        sum(bar) as sum1,
        count(distinct baz) as count2
    FROM xxx
    GROUP BY 1, 2, 3, 4, 5;

so that did seem believable (although a little unreasonable, and not covered in the documented limitations).

I decided to split my view up, so I didn’t have multiple aggregations of the same type, and I could use the generated col names (e.g. count). I could then have a non-materialized “super” view to join them all back together again.

At this point, thanks to some incompetent copy pasta, I discovered that redshift would quite happily create an auto refresh view with a column alias.

Eventually, I realised that the real problem was the count(distinct), which makes much more sense. You can’t incrementally update it, without tracking all the existing values.
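
A toy illustration of the difference, in Python. This is not how Redshift implements incremental refresh, just the general idea: count and sum can be updated from the old result plus the new rows, while a distinct count needs the full set of values seen so far.

```python
def merge_count(old_count, new_rows):
    """count(*) only needs the previous count."""
    return old_count + len(new_rows)


def merge_sum(old_sum, new_rows):
    """sum() only needs the previous sum."""
    return old_sum + sum(new_rows)


def merge_distinct(seen, new_rows):
    """count(distinct) needs every value seen so far, not just the old count."""
    return seen | set(new_rows)


existing = [3, 5, 5]
delta = [5, 7]

count = merge_count(len(existing), delta)
total = merge_sum(sum(existing), delta)

# Adding the distinct counts double counts the 5 that is in both batches
naive_distinct = len(set(existing)) + len(set(delta))
seen = merge_distinct(set(existing), delta)

print(count, total, naive_distinct, len(seen))  # 5 25 4 3
```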

Side note: it is also possible to use APPROXIMATE COUNT(DISTINCT …), with some caveats.