Jenkins as Code

Jenkins has come a long way in the past few years. You can now run it as a Docker image:

docker run --rm -p 8080:8080 -p 50000:50000 -v jenkins_home:/var/jenkins_home --name jenkins jenkins/jenkins:lts-jdk11

Or bake your own image, to pre-install plugins:

FROM jenkins/jenkins:lts-jdk11

COPY --chown=jenkins:jenkins plugins.txt /usr/share/jenkins/ref/plugins.txt
RUN jenkins-plugin-cli -f /usr/share/jenkins/ref/plugins.txt

providing a list of plugins, one per line, in plugins.txt.
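A minimal plugins.txt is just a plugin id per line, optionally pinned to a version (these ids are illustrative):

```
configuration-as-code:latest
git:latest
workflow-aggregator:latest
```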


And now you can even configure those plugins using CasC (the Configuration as Code plugin):

docker run --rm -p 8080:8080 -p 50000:50000 -v jenkins_home:/var/jenkins_home -e CASC_JENKINS_CONFIG=/var/jenkins_home/casc_configs -v $PWD/casc_configs:/var/jenkins_home/casc_configs --name jenkins my-jenkins
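A CasC config is plain yaml; a minimal sketch (values illustrative) dropped into casc_configs might look like:

```yaml
jenkins:
  systemMessage: "Configured as code; do not edit via the UI"
  numExecutors: 2
unclassified:
  location:
    url: "http://localhost:8080/"
```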

Export CSV from Redshift

When connected to a database using psql, I would normally use the \copy meta-command to export data; but Redshift doesn’t seem to have implemented that (however it actually works, under the hood).

It is possible to copy data to S3, but you need an available bucket, and all the necessary IAM nonsense.

The simplest thing I’ve found, so far, is to set the output format:

\pset format csv

and then the output file:

\o foo.csv

before running your query:

select * ...

or from the command line:

psql ... -c "select ..." -o 'foo.csv' --csv
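Putting the interactive version together (\o with no argument switches output back to stdout; table name illustrative):

```
\pset format csv
\o foo.csv
select * from foo;
\o
```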

function get_raw_page(unknown, integer) does not exist

I was looking into HOT updates, and trying to get the page info:

db=# SELECT * FROM heap_page_items(get_raw_page('foo', 0));
ERROR:  function get_raw_page(unknown, integer) does not exist
LINE 1: SELECT * FROM heap_page_items(get_raw_page('foo', 0));
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

The documentation seemed pretty clear, and no doubt the experienced DBAs have spotted the unforced error…

The pageinspect module provides functions…

It’s a module, and needs to be loaded first.
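The fix is a one-liner (run per database, typically as a superuser), after which the original query works:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM heap_page_items(get_raw_page('foo', 0));
```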


Locking with redis

We had been looking at using zookeeper (or consul) for locking, rather than postgres rowlocks; but once we realised we didn’t really fancy running a quorum, it seemed just as simple to use redis (which we already had as a session store), instead of a single node of either.

While there is a full-blown lock protocol, all you really need (for our use case) is a key, with a ttl. And some sort of CAS operation.

If you are using the latest redis version (i.e. 6, or 7), then the SET command has some extra options to make this simple:

    async function withLock(id, cb) {
        const key = `/lock/${id}`;
        const val = await redis.set(key, "true", {
            EX: 60,
            GET: true,
        });
        if (val !== null) {
            throw new Error("already locked");
        }
        try {
            const res = await cb();
            return res;
        } finally {
            await redis.del(key);
        }
    }
First, you attempt to acquire the lock (i.e. set some value for a specific key). If the key already exists, you bail; otherwise, you do your business and then free the lock. If the process dies, the lock will be released, when the ttl expires.
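In redis-cli terms (the GET option to SET needs Redis 6.2 or newer; key name illustrative), the exchange looks like this:

```
127.0.0.1:6379> SET /lock/abc true EX 60 GET
(nil)
127.0.0.1:6379> SET /lock/abc true EX 60 GET
"true"
127.0.0.1:6379> DEL /lock/abc
(integer) 1
```

The first SET returns (nil) because there was no previous value, i.e. the lock was acquired; the second returns the existing value, which is the "already locked" case.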

If you are using an older redis version (e.g. 4), and don’t fancy upgrading, it is still possible; you just need to set the TTL explicitly, after acquiring the lock:

    async function withLock(id, cb) {
        const key = `/lock/${id}`;
        const val = await redis.getset(key, "true");
        if (val !== null) {
            throw new Error("already locked");
        }
        await redis.expire(key, 5 * 60);
        try {
            return await cb();
        } finally {
            await redis.del(key);
        }
    }

Generating a histogram with R

If you have a csv with a list of values, and you want to see the distribution, R is an excellent choice.

First, you need to load the data:

data <- read.csv("data.csv", header=TRUE)

(faster methods are available, for large datasets)

You will also need to install ggplot2, if you haven’t already:

install.packages("ggplot2")
(you can install the entire tidyverse, but that seems to download most of the internet)

Finally, import the library, and generate the chart:

library(ggplot2)
ggplot(data, aes(x=num)) + geom_histogram(binwidth=.5)

Et voilà!

Updating PRs when deployed

I wanted to add a comment on all the PRs in a changeset, e.g. when a build is deployed to a specific environment.

Assuming you have a commit range, it’s relatively straightforward to get the list of commits between them:

const res = await octokit.request({
    method: 'GET',
    url: '/repos/{owner}/{repo}/compare/{basehead}',
    owner: 'foo',
    repo: 'bar',
    basehead: `${process.env.GIT_PREVIOUS_SUCCESSFUL_COMMIT}...${process.env.GIT_COMMIT}`
});

Then, for each commit, you can search for a PR containing it:

    const commits = res.data.commits.map(c => c.sha);
    const results = await Promise.all(commits.map(sha => {
        return octokit.request({
            method: 'GET',
            url: '/search/issues',
            // search qualifier (assumed): find the PR containing this sha
            q: `${sha} type:pr`
        });
    }));

Assuming you only want one comment per PR, you can use a Set to get a distinct list of numbers:

    const prs = new Set();
    results.forEach(r => {
        if (r.data.items.length > 0) {
            prs.add(r.data.items[0].number);
        }
    });
    const prList = Array.from(prs.keys());

And finally add a comment on each PR:

    await Promise.all(prList.map(pr => {
        return octokit.request({
            method: 'POST',
            url: '/repos/{owner}/{repo}/issues/{issue_number}/comments',
            owner: 'foo',
            repo: 'bar',
            issue_number: pr,
            body: `something something: ${process.env.BUILD_URL}`
        });
    }));
Or, if you’re feeling really brave, you can do the whole thing in one line of bash!

curl -s -u "$GH_USER:$GH_TOKEN" -H "Accept: application/vnd.github.v3+json" "https://api.github.com/repos/foo/bar/compare/$BASE_COMMIT...$HEAD_COMMIT" \
    | jq -r '.commits[] | .sha' \
    | xargs -I '{}' sh -c 'curl -s -u "$GH_USER:$GH_TOKEN" -H "Accept: application/vnd.github.v3+json" "https://api.github.com/search/issues?q={}+type:pr" \
    | jq '\''.items[0] | .number'\''' \
    | sort \
    | uniq \
    | xargs -I '{}' sh -c 'curl -s -u "$GH_USER:$GH_TOKEN" -H "Accept: application/vnd.github.v3+json" "https://api.github.com/repos/foo/bar/issues/{}/comments" -d "{\"body\":\"something something: $BUILD_URL\"}" > /dev/null'

Set operations on a list of ids

I’ve been doing some (small to medium) data work recently, and when trying to identify missing data it’s often necessary to resort to comparing two lists of rows.

SQL is an ideal tool for this, but if you can’t do cross database queries, then you need to resort to something more basic.

I was initially segmenting the data, and using a spreadsheet with conditional formatting to highlight the extra (or missing) rows:

=MATCH(A1, B:B, 0)

But once you’re comparing tens of thousands of rows, that loses appeal. At this point I discovered the comm utility:

$ comm -23 old.csv new.csv

This will print out the lines only present in the old csv (note that comm expects both files to be sorted).
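The three suppression flags between them cover all the usual set operations; a quick sketch with hypothetical ids:

```shell
# two small id lists, sorted (comm requires sorted input)
printf '1\n2\n3\n' | sort > old.csv
printf '2\n3\n4\n' | sort > new.csv

comm -23 old.csv new.csv   # suppress cols 2 and 3: lines only in old.csv
comm -13 old.csv new.csv   # lines only in new.csv
comm -12 old.csv new.csv   # lines common to both
```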

Re-processing failed firehose batches

If a batch of records fails during the transformation step, they will be dumped in a folder named /processing-failed/YYYY/MM/DD/HH/ in your s3 bucket.

The file (possibly gzipped) will contain a line for each record, in this format:

    "errorMessage":"The Lambda function was successfully invoked but it returned an error result.",

The error message isn’t particularly informative, so you’ll need to check the lambda logs. Once you’ve fixed the lambda (or removed the offending record), there doesn’t seem to be any one click way to re-process the batch.

But it’s relatively straightforward to script it, using python (or any other available sdk):

import base64
import boto3
import json
import sys

filename = sys.argv[1]

with open(filename) as f:
    all_records = list(map(lambda l: { "Data": base64.b64decode(json.loads(l)["rawData"]) }, f.readlines()))

batch_size = 100
batches = [all_records[i:i + batch_size] for i in range(0, len(all_records), batch_size)]

client = boto3.client('firehose')

for batch in batches:
    response = client.put_record_batch(
        DeliveryStreamName="my-stream",  # your delivery stream name
        Records=batch,
    )
    if response["FailedPutCount"] > 0:
        raise Exception("Bad batch")

If your records are quite small, you can probably increase the batch size (max 500, or 4MB).

Doing gitops without Kubernetes

gitops is a hot topic with the “cloud native” crowd, but is it still relevant if you aren’t using k8s? In a word, yes.

While most examples are of kubernetes yaml (or helm charts), there’s nothing stopping you applying the same principles to your homegrown CD pipeline, and reaping the same benefits.

We are in the process of moving from systemd services to docker, running on our own hosts (ec2 instances). The easy thing to do would be to just pull latest every time, but then you have no control over which version of the software is installed where.

Instead, we have a git repo containing fragments of yaml, one for each service:

foo:
  name: foo
  image: >-
    <ecr-registry>/foo:123
  gitCommit: somesha
  buildNumber: '123'
  env_vars:
    BAR: baz

The staging branch of these are updated by the CI build for each service, triggered by a push to main, once the new image has been pushed to ECR.

To deploy, we run an ansible playbook on each docker host. First, we load up all the yaml:

- name: Get service list
  git:
    repo: "{{ docker_services_repo_url }}"  # the config repo described above
    dest: "{{ docker_services_folder }}"
    version: "{{ docker_services_env | default('staging') }}"
  delegate_to: localhost
  run_once: yes
  register: docker_services_repo

We then generate an env file for each service:

- name: "Create env files"
  template: src=env_file dest=/etc/someorg/{{ docker_services[item].name }}
  become: yes
    app_name: "{{ docker_services[item].name }}"
    env_vars: "{{ docker_services[item].env_vars | default([]) }}"
  loop: "{{ docker_services.keys() }}"
  register: env_files

And finally run each container:

- name: "Pull & Start Services"
    name: "{{ docker_services[item].name }}"
    image: "{{ docker_services[item].image }}"
    state: "started"
    restart_policy: "always"
    recreate: "{{ env_files.results | selectattr('item', 'equalto', item) | map(attribute='changed') | first }}"
    pull: true
    init: true
    output_logs: true
    log_driver: "syslog"
      tag: someorg
      syslog-facility: local0
    env_file: "/etc/someorg/{{ docker_services[item].name }}"
    network_mode: "host"
  loop: "{{ docker_services.keys() }}"
  become: yes

If the env vars have changed, the container needs to be recreated. Otherwise, only the images that have changed will be restarted (we still remove the node from the LB first).

This gives us an audit trail of which image has been deployed, and makes rollbacks easy (revert the commit).

If the staging deploy is successful (after some smoke tests run), another job creates a PR to merge the changes onto the production branch. When that is merged (after any necessary inspection), the same process repeats on the prod hosts.

Removing (almost) duplicates in Redshift

The AWS Firehose guarantees “at least once” delivery, and Redshift doesn’t enforce uniqueness; which can result in duplicate rows. Or, if you are using an impure transform step (e.g. spot fx rates), with “almost duplicate” rows.

The consensus seems to be to use a temp table, removing all the duplicate rows, and inserting them back just once. Which is very effective. But if you have “almost duplicates”, you need something slightly different (using DISTINCT will result in all the rows being added to the temp table).
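For exact duplicates, that temp-table dance is a sketch like this (assuming a single-column id key; DISTINCT collapses the copies):

```sql
CREATE TEMP TABLE duplicated_foo AS
SELECT DISTINCT *
FROM foo
WHERE id IN (SELECT id FROM foo GROUP BY id HAVING COUNT(*) > 1);

DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id;

INSERT INTO foo
SELECT * FROM duplicated_foo;
```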

CREATE TEMP TABLE duplicated_foo(LIKE foo);
ALTER TABLE duplicated_foo ADD COLUMN row_number integer;

You need an extra column in the temp table, for the row number.

INSERT INTO duplicated_foo
WITH dupes AS (
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
), matches AS (
    SELECT foo.*, row_number() over (partition by foo.id, foo.region)
    FROM foo
    JOIN dupes ON dupes.id = foo.id
        AND dupes.region = foo.region
)
SELECT *
FROM matches
WHERE row_number = 1;

We have a composite key, which complicates things further. This is taking the first row, that matches on both columns.

ALTER TABLE duplicated_foo DROP COLUMN row_number;

You can then drop the extra column from the temp table.

DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

-- Insert back in the single copies
INSERT INTO foo
SELECT *
FROM duplicated_foo;

Remove all duplicate rows (whatever that means to you), and copy back in the valid data.