The AWS Firehose guarantees “at least once” delivery, and Redshift doesn’t enforce uniqueness; which can result in duplicate rows. Or, if you are using an impure transform step (e.g. spot fx rates), with “almost duplicate” rows.
The consensus seems to be to use a temp table, removing all the duplicate rows, and inserting them back just once. Which is very effective. But if you have “almost duplicates”, you need something slightly different (using
DISTINCT will result in all the rows being added to the temp table).
CREATE TEMP TABLE duplicated_foo(LIKE foo); ALTER TABLE duplicated_foo ADD COLUMN row_number integer;
You need an extra column in the temp table, for the row number.
INSERT INTO duplicated_foo WITH dupes AS ( SELECT id, region FROM foo GROUP BY id, region HAVING COUNT(*) > 1 ), matches AS ( SELECT foo.*, row_number() over (partition by foo.id, foo.region) FROM foo JOIN dupes ON dupes.id = foo.id AND dupes.region = foo.region ) SELECT * FROM matches WHERE row_number = 1;
We have a composite key, which complicates things further. This is taking the first row, that matches on both columns.
ALTER TABLE duplicated_foo DROP COLUMN row_number;
You can then drop the extra column from the temp table.
DELETE FROM foo USING duplicated_foo WHERE foo.id = duplicated_foo.id AND foo.region = duplicated_foo.region; -- Insert back in the single copies INSERT INTO foo SELECT * FROM duplicated_foo;
Remove all duplicate rows (whatever that means to you), and copy back in the valid data.