Monotonic sequence in TLA+

I wanted to declare an invariant: that a sequence was monotonic, i.e. always increasing. <<1, 2, 3>> would pass, as would <<1, 3, 5>>; but <<1, 2, 1>> would not.

I thought I might be able to use forall:

MonotonicSeq ==
    IF Len(seq) > 1 THEN
        \A e \in Tail(seq):
            e > Head(seq)
    ELSE TRUE

but was quickly disillusioned:

Error: Evaluating invariant MonotonicSeq failed.
TLC encountered a non-enumerable quantifier bound
<<2>>.

Apparently that syntax only works with sets: a sequence is a function (from 1..Len(seq) to values), not a set that TLC can enumerate.

Eventually I worked out how to write a recursive operator:

RECURSIVE Monotonic(_)

Monotonic(seq) ==
    IF Len(seq) > 1 THEN
        \* no element after the head may be smaller than it
        /\ Len(SelectSeq(seq, LAMBDA X : X < Head(seq))) = 0
        /\ Monotonic(Tail(seq))
    ELSE TRUE

Multi-stage docker with nodejs

Multi-stage builds are mostly popular with compiled languages and heavy toolchains. For a node app, you would expect to need just the js files and any dependencies.

I was therefore surprised to find that our images were weighing in at over 500MB, when the base node:18-alpine image is a mere 174MB. After some judicious commenting out, it seemed like the culprit was npm (shocker).

Moving the npm (or yarn) install into a discarded builder stage could save more than 50%:

FROM node:18-alpine AS builder
WORKDIR /build
COPY package.json yarn.lock ./
RUN --mount=type=ssh yarn install

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /build/node_modules ./node_modules
...
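One caveat: the --mount=type=ssh in the builder stage needs BuildKit, with an ssh agent forwarded at build time; something like this (the image tag here is just a placeholder):

DOCKER_BUILDKIT=1 docker build --ssh default -t my-app .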

Do you even lift?

As well as the built-in scenarios, pgbench allows you to run a custom “transaction script”, making it a viable alternative (for some scenarios) to driving load testing via your application, e.g. using a locust swarm.

It’s not a full-blown programming language (or even a DSL, really), but you can, for example, insert a row and then update it:

\set uid random(1, 100000 * :scale)

BEGIN;
INSERT INTO foo (user_id, ...) VALUES (:uid, ...) RETURNING id \gset g_
COMMIT;

\sleep 1s

BEGIN;
UPDATE foo SET bar = 1 WHERE id = :g_id;
COMMIT;

You can then point pgbench at your DB, and this script, with say 20 clients for 10 mins:

docker run -it --rm --network=host -v $PWD:/app -w /app -e PGPASSWORD=postgres postgres:15 pgbench -h ... -U ... -d ... -f foo.sql  -c 20 -T 600 -j 8
...

...
number of transactions actually processed: 81752
number of failed transactions: 0 (0.000%)
latency average = 146.670 ms
initial connection time = 560.452 ms
tps = 136.360106 (without initial connection time)

The transaction count here is actually the number of times the script ran (even if you have multiple transactions inside it).
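If you end up with more than one script, pgbench can also run them as a weighted mix (the @ suffix sets the weight); bar.sql here is just a hypothetical second script:

pgbench -h ... -U ... -d ... -f foo.sql@9 -f bar.sql@1 -c 20 -T 600 -j 8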

Replicating sequences with pglogical

Another advantage of using pglogical, over other solutions (cough, DMS), is that it allows you to replicate the current value of a sequence, rather than having to update all those values post-failover.

SELECT pglogical.replication_set_add_sequence('default', 'foo.foo_id_seq');

It is important to understand how it is implemented though:

We periodically capture the current state of the sequence and send an update to the replication queue with some additional buffer. This means that in a normal situation the sequence values on the subscriber will be ahead of those on the provider

This can be a bit of a surprise when you check if the replication is working!
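A quick way to see it, assuming you can reach both instances with psql ($provider and $subscriber are placeholders), is to compare last_value on each side:

psql -h $provider -d ... -c 'SELECT last_value FROM foo.foo_id_seq;'
psql -h $subscriber -d ... -c 'SELECT last_value FROM foo.foo_id_seq;'

The subscriber’s value should be comfortably larger, not equal.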

Converting an AWS Lambda to an ECS Task

I have a python script (removing duplicates from a Redshift database) that currently runs as a Lambda, triggered by EventBridge. Unfortunately, it has been known to take longer than 15 mins, the max timeout for a lambda function.

The alternative seems to be running it as an ECS Task (on Fargate). I was uploading the lambda function as a zip, but it’s pretty easy to convert that to a Docker image, and push it to ECR:

FROM python:3

COPY requirements.txt  .
RUN  pip3 install -r requirements.txt

COPY app.py .

CMD [ "python", "-c", "import app; app.foo(None, None)" ]

I took the easy way out, and just pointed the CMD above at the existing lambda handler. You then need an ECS cluster (or you may be able to use the default cluster):

docker run --rm -it -v ~/.aws:/root/.aws -v $PWD:/data -w /data -e AWS_PROFILE amazon/aws-cli ecs create-cluster --cluster-name foo

And an IAM role to execute the task:

...  iam create-role --role-name FooExecution --assume-role-policy-document file://aws/ecs/TrustPolicy.json

With a trust policy for ECS to assume the role:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": "sts:AssumeRole",
        "Principal": {
            "Service": "ecs-tasks.amazonaws.com"
        },
        "Effect": "Allow"
    }]
}

And permissions to pull the image from ECR:

... iam put-role-policy --role-name FooExecution --policy-name ECR --policy-document file://aws/iam/ECR.json
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:BatchCheckLayerAvailability",
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer"
        ],
        "Resource": [
            "arn:aws:ecr:$region:$account:repository/$repo"
        ]
    }, {
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken"
        ],
        "Resource": [
            "*"
        ]
    }]
}

If you want logs in CloudWatch, you also need a policy allowing that:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents"
        ],
        "Resource": [
            "*"
        ]
    }]
}
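That also needs attaching to the execution role; the Logs.json path here just follows the layout used above, so treat it as an assumption:

... iam put-role-policy --role-name FooExecution --policy-name Logs --policy-document file://aws/iam/Logs.json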

That’s enough to run the task, but if you need to use any AWS API in the task (e.g. via boto3), then you also need a task role, with whatever permissions the script needs:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "redshift:GetClusterCredentials"
        ],
        "Resource": [
            "arn:aws:redshift:$region:$account:dbuser:$cluster/$user",
            "arn:aws:redshift:$region:$account:dbname:$cluster/$db"
        ]
    }]
}
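As before, that means a create-role (with the same ECS trust policy) and a put-role-policy; the FooTask name matches the ARN used in the EventBridge permissions below, while the Redshift.json path is just an assumption:

... iam create-role --role-name FooTask --assume-role-policy-document file://aws/ecs/TrustPolicy.json
... iam put-role-policy --role-name FooTask --policy-name Redshift --policy-document file://aws/iam/Redshift.json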

If you want logs, you need a log group:

... logs create-log-group --log-group-name /aws/fargate/foo
... logs put-retention-policy --log-group-name /aws/fargate/foo --retention-in-days 7

Phooooo… nearly there. Now you need a task definition:

... ecs register-task-definition --family foo --cpu 256 --memory 512 --network-mode awsvpc --requires-compatibilities FARGATE --execution-role-arn arn:aws:iam::$account:role/$executionRole --task-role-arn arn:aws:iam::$account:role/$taskRole --container-definitions "[{\"name\":\"foo\",\"image\":\"$image:latest\",\"logConfiguration\":{\"logDriver\":\"awslogs\",\"options\":{\"awslogs-region\":\"$region\",\"awslogs-group\":\"/aws/fargate/foo\",\"awslogs-stream-prefix\":\"foo\"}}}]"

And you should be in a position to test that the task can run, without the cron trigger:

... ecs run-task --launch-type FARGATE --cluster foo --task-definition foo:1 --network-configuration "awsvpcConfiguration={subnets=['foo',...],securityGroups=['sg-...'],assignPublicIp='ENABLED'}" --count 1

You need a public IP to pull the ECR image (unless you want to jump through some hoops).
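To see whether it actually did anything, you can tail the log group (aws CLI v2, which the amazon/aws-cli image provides, has logs tail):

... logs tail /aws/fargate/foo --follow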

If that went well, you can proceed to set up the cron trigger (EventBridge):

... events put-rule --name foo --schedule-expression 'cron(0 4 * * ? *)'

You need yet another role, for EventBridge to run the task (and pass the execution & task roles to it):

...iam create-role --role-name FooEvents --assume-role-policy-document file://aws/events/TrustPolicy.json
...iam put-role-policy --role-name FooEvents --policy-name Ecs --policy-document file://aws/iam/Ecs.json

This time the trust policy needs to be for EventBridge:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": "sts:AssumeRole",
        "Principal": {
            "Service": "events.amazonaws.com"
        },
        "Effect": "Allow"
    }]
}

And the permissions for the target:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "iam:PassRole"
        ],
        "Resource": [
            "arn:aws:iam::$account:role/FooExecution",
            "arn:aws:iam::$account:role/FooTask"
        ]
    }, {
        "Effect": "Allow",
        "Action": [
            "ecs:RunTask"
        ],
        "Resource": "*",
        "Condition": {
            "ArnEquals": {
                "ecs:cluster": "arn:aws:ecs:$region:$account:cluster/foo"
            }
        }
    }]
}

And finally, you need a target for the rule:

...events put-targets --rule foo --targets file://aws/events/Targets.json
[{
    "Id": "1",
    "Arn": "arn:aws:ecs:$region:$account:cluster/foo",
    "RoleArn": "arn:aws:iam::$account:role/FooEvents",
    "EcsParameters": {
        "TaskDefinitionArn": "arn:aws:ecs:$region:$account:task-definition/foo",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
            "awsvpcConfiguration": {
                "Subnets": ["subnet-***",...],
                "SecurityGroups": ["sg-***"],
                "AssignPublicIp": "ENABLED"
            }
        }
    }
}]

I included all 3 subnets from the default VPC, and the default SG.

That was a lot! But hopefully you now have a working cron job, that can take as long as it wants to complete.

Test driving pglogical

We have been using DMS for two different tasks: upgrading to a new major pg version (when the db is too big to just click the button), and removing some old data during failover (sadly we don’t have some partitions we can just unlink).

Each time we have used it, it has been a “success”, i.e. we haven’t had any major disasters, or had to bail out of the failover; but it has never been a comfortable ride. Sometimes that was self-inflicted, e.g. not running analyze/freeze after the import; but just setting up stable replication was a struggle, and there’s very little insight when it fails.

As we don’t need any of the more exotic features (e.g. moving from an on-prem Oracle DB to PG RDS), it felt like there must be an easier way. Postgres has had logical replication since v10, which would solve one of our problems. But it doesn’t (yet) provide a way to only replicate a subset of data.

There are a number of people selling alternative solutions, but one option that regularly crops up is pglogical. The situation isn’t entirely clear, as 2ndQuadrant are now part of EDB, which is offering v3 (there is no obvious way to download it, so I assume it is part of their consultancy package). But afaict v3 is only supported as part of BDR, and the v2 repo still seems to be active.

If you’re willing to accept that risk, it’s pretty easy to do some local testing, using docker:

FROM postgres:11-bullseye

RUN apt-get update \
    && apt-get install -y postgresql-11-pglogical

...

Create an image with the (right version of the) extension installed, and whatever scripts you have to create a shell db (you’re using migrations, right?), and spin up two instances with compose:

version: '3'
services:
  db1:
    image: pg11
    volumes:
      - "./postgresql-11.conf:/var/lib/postgresql/data/postgresql.conf"
    environment:
      POSTGRES_PASSWORD: postgres
  db2:
    image: pg11
    volumes:
      - "./postgresql-11.conf:/var/lib/postgresql/data/postgresql.conf"
    environment:
      POSTGRES_PASSWORD: postgres

You can get the pg config template from the base image, and make the necessary changes: at a minimum wal_level = logical, shared_preload_libraries = 'pglogical', and enough worker processes, replication slots and wal senders. You need to set a password for the user, because reasons.
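For example (a sketch; the numbers are the usual pglogical prerequisites, and listen_addresses is needed because we are replacing the whole config):

docker run --rm postgres:11-bullseye cat /usr/share/postgresql/postgresql.conf.sample > postgresql-11.conf

cat >> postgresql-11.conf <<EOF
listen_addresses = '*'
wal_level = logical
shared_preload_libraries = 'pglogical'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
EOF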

Once this is up:

docker-compose up

You can connect to the publisher:

docker-compose exec db1 psql -U postgres -d foodb

And create a node:

CREATE EXTENSION pglogical;

SELECT pglogical.create_node(node_name := 'db1', dsn := 'host=db1 port=5432 dbname=foodb user=postgres password=postgres');

Then set up the tables you want to replicate:

SELECT pglogical.replication_set_add_table('default', 'foo.bar', true, null, null);

Or with a row filter:

SELECT pglogical.replication_set_add_table('default', 'foo.bar', true, null, E'started_at > \'2022-1-1\'');

Finally, connect to the other instance, and create the subscription:

CREATE EXTENSION pglogical;

SELECT pglogical.create_node(node_name := 'db2', dsn := 'host=db2 port=5432 dbname=foodb user=postgres password=postgres');

SELECT pglogical.create_subscription(subscription_name := 'subscription1', provider_dsn := 'host=db1 port=5432 dbname=foodb user=postgres password=postgres');

(this dsn is where the password is needed)

If you get bored of typing this in, you can pipe a script through:

cat publisher.sql | docker-compose exec -T db1 psql -U postgres -d foodb

You should now be able to insert a row in db1, and see it appear in db2 (or not, depending on the filter).
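If nothing arrives, pglogical can at least tell you whether the subscription thinks it is replicating:

docker-compose exec db2 psql -U postgres -d foodb -c 'SELECT * FROM pglogical.show_subscription_status();'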

You can also easily test replication from, e.g., 11 -> 14. The bigger questions are how long the initial import takes, and how stable replication is, but those will need to be answered in a production-like environment.

There is also some evidence online that RDS allows the extension, but I haven’t tried it yet.

Running script for changed lines in CSV

We have a large CSV, containing ref data for our data warehouse export process (enrichment). When the file is updated, a Jenkins job uploads the new version to an RDS instance, which affects any new data after that; but updating any existing data is currently a manual task.

The brute force option would be to simply run a script for every line in the (new) CSV, but that would probably take about 3 days, every time it is changed. Sounds like another job for an unholy bash script!

I had initially assumed that I could use the output from git diff, but it didn’t seem possible to get just the changed lines, without the diff decoration. So the next stop was comm. I could get the old version of the file pretty easily, with a sha:

git show 9f60693d4e1d1bd00025b51070efdd034b98448d:./foo.csv

so that can be used as one of the files to compare (piped to stdin):

git show ... | comm -13 --nocheck-order - foo.csv

That suppresses everything except lines unique to the new file (i.e. added or changed rows). Parsing a CSV is always a bit hairy, but I was willing to risk just using awk on each line, with a comma separator (this won’t work if you have any columns with escaped commas):

... | awk -F, '{printf "docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py %s %s\n", $1, $2}'

I’m using a python script to run a SQL update (because reasons), but you could probably generate a psql command instead.

And finally, you need to execute each command:

| xargs -o -I @ bash -c '@'

(the --open-tty is needed for docker)

The whole script can then be run using, e.g. the GIT_PREVIOUS_SUCCESSFUL_COMMIT var.
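Stitched together, it looks something like this (a sketch; the same caveats about commas in the CSV apply):

git show "$GIT_PREVIOUS_SUCCESSFUL_COMMIT:./foo.csv" \
    | comm -13 --nocheck-order - foo.csv \
    | awk -F, '{printf "docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py %s %s\n", $1, $2}' \
    | xargs -o -I @ bash -c '@'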

Script approval hashes

If you administer a Jenkins instance, you may be used to approving script usage. In theory, it is possible to turn the check off, but even behind an auth gateway, that may not be a good idea.

In the before time, best practice was to click the approve button under “manage jenkins” (In-process Script Approval).

But in a brave new CasC world, it is possible to instead provide a list of pre-approved scripts:

security:
  scriptApproval:
    approvedScriptHashes:
      - 046256fc8829d7af680424b819da55bdb7c660f4 # jobs-dsl/foo.groovy
     ...

However, maintaining this list is obviously a nightmare. First of all, you’d think it would be as simple as running:

shasum foo.groovy

But that doesn’t give the “right” answer. Eventually a colleague of mine worked out that you need to prepend a magic string:

(printf "groovy:" && cat foo.groovy) | shasum

With that in hand, you still need to run it over e.g. every Job DSL script you have, and ideally remove old hashes when a file changes. Behold my glory!

$ find -maxdepth 1 -name "*.groovy" | sort | xargs -I {} sh -c 'HASH=`(printf "groovy:" && cat {}) | shasum | awk '"'"'{print $1}'"'"'`; echo - $HASH \# jobs-dsl/{}'
- 98a015976c7392bde86f3e35d526054b684f605f # jobs-dsl/./foo.groovy
...

I didn’t think I could hate the shell more than I already did.

You should then be able to paste the output of this script into the CasC yaml, and get a reasonable diff (although it turns out that the macOS & Linux sort commands disagree on where special chars should be collated).
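For what it’s worth, here is a marginally less hateful version, which also pins the collation so macOS and Linux sort the same way (a sketch, producing the same output format):

find . -maxdepth 1 -name '*.groovy' | LC_ALL=C sort | while read -r f; do
  hash=$({ printf 'groovy:'; cat "$f"; } | shasum | awk '{print $1}')
  echo "- $hash # jobs-dsl/${f#./}"
done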