Removing (almost) duplicates in Redshift

AWS Firehose guarantees “at least once” delivery, and Redshift doesn’t enforce uniqueness, which can result in duplicate rows. Or, if you are using an impure transform step (e.g. applying spot FX rates), in “almost duplicate” rows: rows that share a key but differ in one or more other columns.
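You can check how many keys are affected with a quick aggregate (a sketch, using the foo table and its composite (id, region) key from the examples below):

SELECT id, region, COUNT(*) AS copies
FROM foo
GROUP BY id, region
HAVING COUNT(*) > 1;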

The consensus seems to be to use a temp table: copy one instance of each duplicated row into it, delete all the copies from the main table, and insert the rows back just once. Which is very effective. But if you have “almost duplicates” you need something slightly different: DISTINCT compares entire rows, and “almost duplicate” rows all differ somewhere, so they would all end up in the temp table.
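For reference, the exact-duplicate version looks something like this (a sketch; deduped_foo is a hypothetical name, and foo with its (id, region) key is the example schema used throughout this post):

CREATE TEMP TABLE deduped_foo (LIKE foo);

-- DISTINCT collapses fully-identical copies into one row
INSERT INTO deduped_foo
SELECT DISTINCT foo.*
FROM foo
JOIN (
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
) dupes
    ON dupes.id = foo.id
    AND dupes.region = foo.region;

-- then DELETE those keys from foo and INSERT the contents of
-- deduped_foo back, exactly as at the end of this post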

CREATE TEMP TABLE duplicated_foo(LIKE foo);
ALTER TABLE duplicated_foo ADD COLUMN row_number integer;

You need an extra column in the temp table, for the row number.

INSERT INTO duplicated_foo
WITH dupes AS (
    -- Keys that appear more than once
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
), matches AS (
    -- Number each copy within its key; the alias matches the
    -- row_number column added to the temp table above
    SELECT foo.*,
        ROW_NUMBER() OVER (PARTITION BY foo.id, foo.region) AS row_number
    FROM foo
    JOIN dupes ON dupes.id = foo.id
        AND dupes.region = foo.region
)
SELECT *
FROM matches
WHERE row_number = 1;

We have a composite key (id and region), which complicates things further. This takes the first row that matches on both columns; with no ORDER BY in the window, which copy counts as “first” is arbitrary, which is fine here since any one of the almost-duplicates will do.
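If you do care which copy survives, add an ORDER BY to the window. For example, assuming a hypothetical loaded_at timestamp column on foo (not part of the schema above), this version of the matches CTE keeps the newest copy:

SELECT foo.*,
    ROW_NUMBER() OVER (
        PARTITION BY foo.id, foo.region
        ORDER BY foo.loaded_at DESC  -- hypothetical column: newest copy wins
    ) AS row_number
FROM foo
JOIN dupes ON dupes.id = foo.id
    AND dupes.region = foo.region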

ALTER TABLE duplicated_foo DROP COLUMN row_number;

You can then drop the extra column from the temp table, so its columns line up with foo again for the final INSERT (which matches columns by position).

-- Remove every copy of the affected keys
DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

-- Insert back in the single copies
INSERT INTO foo
SELECT *
FROM duplicated_foo;

Remove all the duplicate rows (whatever “duplicate” means to you), and copy the valid data back in.
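One caveat: between the DELETE and the final INSERT, the affected rows are briefly missing from foo, so concurrent readers could see a gap. Wrapping those two statements in a transaction avoids that (a sketch; the temp-table preparation can stay outside it):

BEGIN;

DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

INSERT INTO foo
SELECT *
FROM duplicated_foo;

END;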
