Removing (almost) duplicates in Redshift

AWS Firehose guarantees “at least once” delivery, and Redshift doesn’t enforce uniqueness, which can result in duplicate rows. Or, if you are using an impure transform step (e.g. applying spot FX rates), in “almost duplicate” rows: rows that share a key but differ in one or more other columns.
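You can check how many keys are affected with a quick aggregate (a sketch, using the foo table and its composite (id, region) key from the examples below):

SELECT id, region, COUNT(*) AS copies
FROM foo
GROUP BY id, region
HAVING COUNT(*) > 1;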

The consensus seems to be to use a temp table: copy one instance of each duplicated row into it, delete all the copies from the main table, and insert the rows back just once. Which is very effective. But if you have “almost duplicates” you need something slightly different: DISTINCT compares entire rows, and “almost duplicate” rows all differ somewhere, so they would all end up in the temp table.
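For reference, the exact-duplicate version looks something like this (a sketch; deduped_foo is a hypothetical name, and foo with its (id, region) key is the example schema used throughout this post):

CREATE TEMP TABLE deduped_foo (LIKE foo);

-- DISTINCT collapses fully-identical copies into one row
INSERT INTO deduped_foo
SELECT DISTINCT foo.*
FROM foo
JOIN (
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
) dupes
    ON dupes.id = foo.id
    AND dupes.region = foo.region;

-- then DELETE those keys from foo and INSERT the contents of
-- deduped_foo back, exactly as at the end of this post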

CREATE TEMP TABLE duplicated_foo(LIKE foo);
ALTER TABLE duplicated_foo ADD COLUMN row_number integer;

You need an extra column in the temp table, for the row number.

INSERT INTO duplicated_foo
WITH dupes AS (
    -- Keys that appear more than once
    SELECT id, region
    FROM foo
    GROUP BY id, region
    HAVING COUNT(*) > 1
), matches AS (
    -- Number each copy within its key; the alias matches the
    -- row_number column added to the temp table above
    SELECT foo.*,
        ROW_NUMBER() OVER (PARTITION BY foo.id, foo.region) AS row_number
    FROM foo
    JOIN dupes ON dupes.id = foo.id
        AND dupes.region = foo.region
)
SELECT *
FROM matches
WHERE row_number = 1;

We have a composite key (id and region), which complicates things further. This takes the first row that matches on both columns; with no ORDER BY in the window, which copy counts as “first” is arbitrary, which is fine here since any one of the almost-duplicates will do.
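If you do care which copy survives, add an ORDER BY to the window. For example, assuming a hypothetical loaded_at timestamp column on foo (not part of the schema above), this version of the matches CTE keeps the newest copy:

SELECT foo.*,
    ROW_NUMBER() OVER (
        PARTITION BY foo.id, foo.region
        ORDER BY foo.loaded_at DESC  -- hypothetical column: newest copy wins
    ) AS row_number
FROM foo
JOIN dupes ON dupes.id = foo.id
    AND dupes.region = foo.region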

ALTER TABLE duplicated_foo DROP COLUMN row_number;

You can then drop the extra column from the temp table, so its columns line up with foo again for the final INSERT (which matches columns by position).

-- Remove every copy of the affected keys
DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

-- Insert back in the single copies
INSERT INTO foo
SELECT *
FROM duplicated_foo;

Remove all the duplicate rows (whatever “duplicate” means to you), and copy the valid data back in.
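One caveat: between the DELETE and the final INSERT, the affected rows are briefly missing from foo, so concurrent readers could see a gap. Wrapping those two statements in a transaction avoids that (a sketch; the temp-table preparation can stay outside it):

BEGIN;

DELETE FROM foo
USING duplicated_foo
WHERE foo.id = duplicated_foo.id
    AND foo.region = duplicated_foo.region;

INSERT INTO foo
SELECT *
FROM duplicated_foo;

END;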
