Redshift – Look on my works, ye Mighty, and despair!

Column aliases are not supported (Redshift)

March 4, 2022 | Graham Hay

I was trying to create a materialized view recently, and got this warning:

WARNING:  An incrementally maintained materialized view could not be created, 
reason: Column aliases are not supported. The materialized view created, ***, 
will be recomputed from scratch for every REFRESH.

My view definition did include some column aliases:

CREATE MATERIALIZED VIEW foo
AUTO REFRESH YES
AS
    SELECT
        trunc(date_col) date,
        platform,
        operator,
        category,
        game_name,
        count(*) as count1,
        sum(bar) as sum1,
        count(distinct baz) as count2
    FROM xxx
    GROUP BY 1, 2, 3, 4, 5;

so that did seem believable (although a little unreasonable, and not covered in the documented limitations).

I decided to split my view up, so that I didn’t have multiple aggregations of the same type and could use the generated column names (e.g. count). I could then have a non-materialized “super” view to join them all back together again.

At this point, thanks to some incompetent copy pasta, I discovered that Redshift would quite happily create an auto refresh view with a column alias.

Eventually, I realised that the real problem was the count(distinct), which makes much more sense: you can’t update it incrementally without tracking all the existing values.
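As a rough sketch of that conclusion, this is the same view with only the count(distinct) column left out, aliases and all. The name foo_incremental is made up for illustration, and whether Redshift actually maintains it incrementally will depend on the rest of the query and the cluster version. The second statement assumes the STV_MV_INFO system table, whose state column should show whether a view is refreshed incrementally or recomputed.

```sql
-- Hypothetical name; same query as above, minus the count(distinct) column.
CREATE MATERIALIZED VIEW foo_incremental
AUTO REFRESH YES
AS
    SELECT
        trunc(date_col) AS date,
        platform,
        operator,
        category,
        game_name,
        count(*) AS count1,
        sum(bar) AS sum1
    FROM xxx
    GROUP BY 1, 2, 3, 4, 5;

-- Check how Redshift decided to maintain it
-- (assumption: state 1 means incremental, 0 means full recompute).
SELECT name, state, autorefresh
FROM stv_mv_info
WHERE name = 'foo_incremental';
```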

Side note: it is also possible to use APPROXIMATE COUNT(DISTINCT …), with some caveats.

Export CSV from Redshift

February 21, 2022 | Graham Hay

When connected to a database using psql, I would normally use the \copy meta-command to export data; but Redshift doesn’t seem to have implemented that (however it actually works, under the hood).

It is possible to copy data to S3, but you need an available bucket, and all the necessary IAM nonsense.
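For comparison, a minimal sketch of the S3 route using UNLOAD. The bucket, prefix, and role ARN below are placeholders rather than real resources, and the role would need write access to the bucket.

```sql
-- Placeholder bucket and IAM role; substitute your own.
UNLOAD ('select * from foo')
TO 's3://my-example-bucket/exports/foo_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS CSV
HEADER
PARALLEL OFF;  -- write serially instead of one file per slice
```

You would still have to download the result from S3 afterwards, which is why the psql approach is less hassle for a one-off export.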

The simplest thing I’ve found, so far, is to set the output format:

\pset format csv

and then the output file:

\o foo.csv

before running your query:

select * ...

or from the cmd line:

psql ... -c "select ..." -o 'foo.csv' --csv

Removing (almost) duplicates in Redshift

March 30, 2021 | Graham Hay

AWS Firehose guarantees “at least once” delivery, and Redshift doesn’t enforce uniqueness, which can result in duplicate rows. Or, if you are using an impure transform step (e.g. spot FX rates), in “almost duplicate” rows.

The consensus seems to be to use a temp table: copy one instance of each duplicated row into it, delete all the duplicates from the original table, and insert the single copies back. This is very effective. But if you have “almost duplicates”, you need something slightly different (using DISTINCT would result in all the rows being added to the temp table).

CREATE TEMP TABLE duplicated_foo(LIKE foo);
ALTER TABLE duplicated_foo ADD COLUMN row_number integer;

You need an extra column in the temp table, for the row number.

INSERT INTO duplicated_foo
WITH dupes AS (
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
), matches AS (
    SELECT foo.*, row_number() OVER (PARTITION BY foo.id, foo.region) AS row_number
    FROM foo
    JOIN dupes ON dupes.id = foo.id
        AND dupes.region = foo.region
)
SELECT *
FROM matches
WHERE row_number = 1;

We have a composite key, which complicates things further. This takes the first row that matches on both columns.
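Without an ORDER BY in the window, which row counts as “first” is up to Redshift. If it matters which of the almost-duplicates survives, the window can be given an ordering. This is only a sketch: received_at is a hypothetical column, so substitute whatever distinguishes the rows in your table.

```sql
-- Variant of the "matches" query that keeps the most recent row per key.
-- received_at is a made-up column name, used here for illustration only.
WITH dupes AS (
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
)
SELECT foo.*,
       row_number() OVER (
           PARTITION BY foo.id, foo.region
           ORDER BY foo.received_at DESC
       ) AS row_number
FROM foo
JOIN dupes ON dupes.id = foo.id
    AND dupes.region = foo.region;
```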

ALTER TABLE duplicated_foo DROP COLUMN row_number;

You can then drop the extra column from the temp table.

DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

-- Insert back in the single copies
INSERT INTO foo
SELECT *
FROM duplicated_foo;

Remove all duplicate rows (whatever that means to you), and copy back in the valid data.
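A quick sanity check afterwards, reusing the same grouping as the dupes query: if the clean-up worked, this should return no rows.

```sql
-- Any (id, region) pair still appearing more than once is a leftover duplicate.
SELECT id, region, COUNT(*) AS copies
FROM foo
GROUP BY id, region
HAVING COUNT(*) > 1;
```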

Grouping by time period in Redshift

November 28, 2018 | December 4, 2018 | Graham Hay

We use Redshift for reporting (or, more accurately, the tool we use for reporting uses Redshift).

It’s pretty simple to calculate daily rollups, using date_trunc:

select date_trunc('day', foo) date, sum(bar) bar
from [baz]
where ...
group by date

Or just by casting to a date:

select [foo:date] date, sum(bar) bar
from [baz]
where ...
group by date

But what if you want to slice into smaller time periods?

The easiest way I’ve found is to use to_char, and format the timestamp appropriately, e.g.

select to_char(foo, 'YYYY-MM-DD HH24:MI') date, sum(bar) bar
from [baz]
where ...
group by date
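If you would rather keep a timestamp than a string (for sorting, say, or further date arithmetic), date_trunc also accepts smaller units. A sketch in the same style as the queries above, with the where clause left out and period as an arbitrary alias:

```sql
-- Per-minute rollup; swap 'minute' for 'hour' to get hourly buckets.
select date_trunc('minute', foo) period, sum(bar) bar
from [baz]
group by 1
order by 1
```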