No module named ‘psycopg2._psycopg’

I was trying to set up a python lambda, and fell at the first hurdle: the import error above.

What made it confusing was that I had copied an existing lambda that was working fine. I checked a few things that were different: the python version (3.7), even the name of the module/function. No effect.

I was using psycopg2-binary, and the zip file structure looked right. Eventually, I found an SO answer suggesting it could be architecture-related, at which point I realised that I had run pip install inside docker, rather than in a venv.

I have no idea why that mattered (uname showed the same architecture in python:3.7 as on my laptop), but onwards to the next problem! 🤷
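
If you do need to pin the build to what Lambda actually runs, pip can be told which platform to resolve wheels for. A minimal sketch, assuming an x86_64 Lambda runtime and a handler file called lambda_function.py (both placeholders):

# fetch wheels for the Lambda platform, regardless of the local machine
pip install \
    --platform manylinux2014_x86_64 \
    --implementation cp \
    --python-version 3.7 \
    --only-binary=:all: \
    --target ./package \
    psycopg2-binary

# zip the dependencies, then add the handler
(cd package && zip -r ../lambda.zip .)
zip -g lambda.zip lambda_function.py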

Column aliases are not supported (Redshift)

I was trying to create a materialized view recently, and got this error:

WARNING:  An incrementally maintained materialized view could not be created, 
reason: Column aliases are not supported. The materialized view created, ***, 
will be recomputed from scratch for every REFRESH.

My view definition did include some column aliases:

CREATE MATERIALIZED VIEW foo
AUTO REFRESH YES
AS
    SELECT
        trunc(date_col) date,
        platform,
        operator,
        category,
        game_name,
        count(*) as count1,
        sum(bar) as sum1,
        count(distinct baz) as count2
    FROM xxx
    GROUP BY 1, 2, 3, 4, 5;

so that did seem believable (although a little unreasonable, and not covered in the documented limitations).

I decided to split my view up, so that I didn’t have multiple aggregations of the same type and could use the generated column names (e.g. count). I could then have a non-materialized “super” view to join them all back together again.

At this point, thanks to some incompetent copy pasta, I discovered that Redshift would quite happily create an auto refresh view with a column alias.

Eventually, I realised that the real problem was the count(distinct), which makes much more sense: you can’t incrementally update it without tracking all the existing values.
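
In other words, something like this (a sketch of the same view, minus the distinct count) can be maintained incrementally:

CREATE MATERIALIZED VIEW foo
AUTO REFRESH YES
AS
    SELECT
        trunc(date_col) date,
        platform,
        operator,
        category,
        game_name,
        count(*) as count1,
        sum(bar) as sum1
    FROM xxx
    GROUP BY 1, 2, 3, 4, 5;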

Side note: it is also possible to use `APPROXIMATE COUNT(DISTINCT …)`, with some caveats.

Jenkins seed job

In the brave new world of Jenkins as Code, you can use CasC to specify an initial job (using the Job DSL):

jobs:
  - script: >
      pipelineJob('jenkins-job-dsl') {
        definition {
          cpsScm{
            scm {
              gitSCM {
                userRemoteConfigs {
                  browser {
                    githubWeb {
                      repoUrl('https://github.com/foo/bar')
                    }
                  }
                  gitTool("github")
                  userRemoteConfig {
                    credentialsId("github-creds")
                    name("")
                    refspec("")
                    url("git@github.com:foo/bar.git")
                  }
                }
                branches {
                  branchSpec { name("main") }
                }
              }
            }
            scriptPath("Jenkinsfile.seed")
          }
        }
        properties {
          pipelineTriggers {
            triggers {
              cron { spec('@daily') }
              githubPush()
            }
          }
        }
      }

using a Jenkinsfile to call the Job DSL again:

pipeline {
    agent any

    options {
        timestamps ()
        disableConcurrentBuilds()
    }

    stages {
        stage('Clean') {
            steps {
                deleteDir()
            }
        }

        stage('Checkout') {
            steps {
                checkout scm
            }
        }

        stage('Job DSL') {
            steps {
                jobDsl(
                    targets: """
                        jobs/*.groovy
                        views/*.groovy
                    """
                )
            }
        }
    }
}

and create all the jobs/views from that repo (each of which points at another Jenkinsfile).
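
Each of those Job DSL scripts might look something like this (a hypothetical jobs/example.groovy; the job name, repo, and credentials id are placeholders):

// one pipeline job, built from its own repo's Jenkinsfile
pipelineJob('example-service') {
    definition {
        cpsScm {
            scm {
                git {
                    remote {
                        url('git@github.com:foo/example-service.git')
                        credentials('github-creds')
                    }
                    branch('main')
                }
            }
            scriptPath('Jenkinsfile')
        }
    }
}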

This should allow you to recreate your Jenkins instance without any manual fiddling, and provide an audit trail of any changes.

Jenkins as Code

Jenkins has come a long way in the past few years. You can now run it as a docker image:

docker run --rm \
    -p 8080:8080 -p 50000:50000 \
    -v jenkins_home:/var/jenkins_home \
    --name jenkins jenkins/jenkins:lts-jdk11

Or bake your own image, to pre-install plugins:

FROM jenkins/jenkins:lts-jdk11

COPY --chown=jenkins:jenkins plugins.txt /usr/share/jenkins/ref/plugins.txt
RUN jenkins-plugin-cli -f /usr/share/jenkins/ref/plugins.txt

providing a list of plugins:

antisamy-markup-formatter:latest
build-discarder:latest
configuration-as-code:latest
copyartifact:latest
credentials-binding:latest
envinject:latest
ghprb:latest
git:latest
github:latest
job-dsl:latest
matrix-auth:latest
nodejs:latest
timestamper:latest
workflow-aggregator:latest
ws-cleanup:latest

and now you can even configure those plugins using CasC:

docker run --rm \
    -p 8080:8080 -p 50000:50000 \
    -v jenkins_home:/var/jenkins_home \
    -e CASC_JENKINS_CONFIG=/var/jenkins_home/casc_configs \
    -v $PWD/casc_configs:/var/jenkins_home/casc_configs \
    --name jenkins my-jenkins
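
A config file in that directory might be as simple as this (a minimal sketch, just to prove the wiring works; the values are placeholders):

jenkins:
  systemMessage: "Configured by CasC"
  numExecutors: 2

unclassified:
  location:
    url: http://localhost:8080/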

Export CSV from Redshift

When connected to a database using psql, I would normally use the \copy meta-command to export data; but Redshift doesn’t seem to have implemented that (however it actually works, under the hood).

It is possible to copy data to S3, but you need an available bucket, and all the necessary IAM nonsense.

The simplest thing I’ve found, so far, is to set the output format:

\pset format csv

and then the output file:

\o foo.csv

before running your query:

select * ...
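
Put together, an interactive session looks something like this (the table name is a placeholder; the final bare \o switches output back to stdout):

\pset format csv
\o foo.csv
select * from some_table;
\o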

or from the cmd line:

psql ... -c "select ..." -o 'foo.csv' --csv

function get_raw_page(unknown, integer) does not exist

I was looking into HOT updates, and trying to get the page info:

db=# SELECT * FROM heap_page_items(get_raw_page('foo', 0));
ERROR:  function get_raw_page(unknown, integer) does not exist
LINE 1: SELECT * FROM heap_page_items(get_raw_page('foo', 0));
                                      ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

The documentation seemed pretty clear, and no doubt the experienced DBAs have spotted the unforced error…

The pageinspect module provides functions…

It’s a module, and needs to be loaded first:

CREATE EXTENSION pageinspect;

Locking with redis

We had been looking at using zookeeper (or consul) for locking, rather than postgres row locks; but once we realised we didn’t really fancy running a quorum, it seemed just as simple to use redis (which we already had as a session store) instead of a single node of either.

While there is a full-blown lock protocol, all you really need (for our use case) is a key with a ttl, and some sort of CAS operation.

If you are using the latest redis version (i.e. 6, or 7), then the SET command has some extra options to make this simple:

async function withLock(id, cb) {
    const key = `/lock/${id}`;
    // NX: only set the key if it doesn't already exist (the CAS part),
    // EX: expire after 60s, so the lock is freed even if the process dies
    const val = await redis.set(key, "true", {
        NX: true,
        EX: 60,
    });
    if (val === null) {
        // SET ... NX returns null when the key was already there
        throw new Error("already locked");
    }
    try {
        const res = await cb();
        return res;
    } finally {
        await redis.del(key);
    }
}

First, you attempt to acquire the lock (i.e. set some value for a specific key). If the key already exists, you bail; otherwise, you do your business and then free the lock. If the process dies, the lock will be released when the ttl expires.
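
The underlying command is easy enough to poke at with redis-cli (the key name here is just an example):

127.0.0.1:6379> SET /lock/123 "true" NX EX 60
OK
127.0.0.1:6379> SET /lock/123 "true" NX EX 60
(nil)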

If you are using an older redis version (e.g. 4) and don’t fancy upgrading, it is still possible; you just need to set the TTL explicitly, after acquiring the lock:

async function withLock(id, cb) {
    const key = `/lock/${id}`;
    // getSet returns the previous value: null means nobody held the lock
    const val = await redis.getSet(key, "true");
    if (val !== null) {
        throw new Error("already locked");
    }
    await redis.expire(key, 5 * 60);
    try {
        const res = await cb();
        return res;
    } finally {
        await redis.del(key);
    }
}

Generating a histogram with R

If you have a csv with a list of values, and you want to see the distribution, R is an excellent choice.

First, you need to load the data:

data <- read.csv("data.csv", header=TRUE)

(faster methods are available, for large datasets)
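
For example, data.table’s fread handles large files much faster (assuming you’re happy to install another package):

# a faster reader, for large files
install.packages("data.table")
data <- data.table::fread("data.csv")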

You will also need to install ggplot2, if you haven’t already:

install.packages("ggplot2")

(you can install the entire tidyverse, but that seems to download most of the internet)

Finally, import the library, and generate the chart:

library(ggplot2)

ggplot(data, aes(x=num)) + geom_histogram(binwidth=.5)

Et voila!

Updating PRs when deployed

I wanted to add a comment on all the PRs in a changeset, e.g. when deployed to a specific environment.

Assuming you have a pair of commits, it’s relatively straightforward to get the list of commits between them:

    // assuming an authenticated Octokit client, e.g.
    // const { Octokit } = require("@octokit/core");
    // const octokit = new Octokit({ auth: process.env.GH_TOKEN });
    const res = await octokit.request({
        method: 'GET',
        url: '/repos/{owner}/{repo}/compare/{basehead}',
        owner: 'foo',
        repo: 'bar',
        basehead: `${process.env.GIT_PREVIOUS_SUCCESSFUL_COMMIT}...${process.env.GIT_COMMIT}`
    });

Then, for each commit, you can search for a PR containing it:

    const commits = res.data.commits.map(c => c.sha);
    const results = await Promise.all(commits.map(q => {
        return octokit.request({
            method: 'GET',
            url: '/search/issues',
            owner: 'foo',
            repo: 'bar',
            q,
        })
    }));

Assuming you only want one comment per PR, you can use a Set to get a distinct list of numbers:

    const prs = new Set();
    results.forEach(r => {
        if (r.data.items.length) {
            prs.add(r.data.items[0].number);
        }
    });
    const prList = Array.from(prs.keys());

And finally add a comment on each PR:

    await Promise.all(prList.map(pr => {
        return octokit.request({
            method: 'POST',
            url: '/repos/{owner}/{repo}/issues/{issue_number}/comments',
            owner: 'foo',
            repo: 'bar',
            issue_number: pr,
            body: `something something: ${process.env.BUILD_URL}`
        });
    }));

Or, if you’re feeling really brave, you can do the whole thing in one line of bash!

curl -s -u "$GH_USER:$GH_TOKEN" -H "Accept: application/vnd.github.v3+json" "https://api.github.com/repos/foo/bar/compare/$BASE_COMMIT...$HEAD_COMMIT" \
    | jq '.commits[] | .sha' \
    | xargs -I '{}' sh -c 'curl -s -u "$GH_USER:$GH_TOKEN" -H "Accept: application/vnd.github.v3+json" "https://api.github.com/search/issues?q={}" \
    | jq '\''.items[0] | .number'\''' \
    | sort \
    | uniq \
    | xargs -I '{}' sh -c 'curl -s -u "$GH_USER:$GH_TOKEN" -H "Accept: application/vnd.github.v3+json" "https://api.github.com/repos/foo/bar/issues/{}/comments" -d "{\"body\":\"something something: $BUILD_URL\"}" > /dev/null'