Removing (almost) duplicates in Redshift

The AWS Firehose guarantees “at least once” delivery, and Redshift doesn’t enforce uniqueness; which can result in duplicate rows. Or, if you are using an impure transform step (e.g. spot fx rates), with “almost duplicate” rows.

The consensus seems to be to use a temp table, removing all the duplicate rows, and inserting them back just once. Which is very effective. But if you have “almost duplicates”, you need something slightly different (using DISTINCT will result in all the rows being added to the temp table).

CREATE TEMP TABLE duplicated_foo(LIKE foo);
ALTER TABLE duplicated_foo ADD COLUMN row_number integer;

You need an extra column in the temp table, for the row number.

INSERT INTO duplicated_foo
WITH dupes AS (
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
), matches AS (
    SELECT foo.*, row_number() over (partition by foo.id, foo.region)
    FROM foo
    JOIN dupes ON dupes.id = foo.id
        AND dupes.region = foo.region
)
SELECT *
FROM matches
WHERE row_number = 1;

We have a composite key, which complicates things further. This is taking the first row, that matches on both columns.

ALTER TABLE duplicated_foo DROP COLUMN row_number;

You can then drop the extra column from the temp table.

DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

-- Insert back in the single copies
INSERT INTO foo
SELECT *
FROM duplicated_foo;

Remove all duplicate rows (whatever that means to you), and copy back in the valid data.

Grouping by time period in Redshift

We use Redshift for reporting (or, more accurately, the tool we use for reporting uses Redshift).

It’s pretty simple to calculate daily rollups, using datetrunc:

select date_trunc('day', foo) date, sum(bar) bar
from [baz]
where ...
group by date

Or just by casting to a date:

select [foo:date] date, sum(bar) bar
from [baz]
where ...
group by date

But what if you want to slice into smaller time periods?

The easiest way I’ve found, is to use to_char, and format the timestamp appropriately. e.g.

select to_char(foo, 'YYYY-MM-DD HH24:MI') date, sum(bar) bar
from [baz]
where ...
group by date