We have some cron jobs that remove old data; recently, as the amount of data has grown, they have started causing I/O spikes. The internet suggested the problem was deleting everything in one transaction:
DELETE FROM foo
WHERE <condition>;
We found an example of chunking deletes in T-SQL, but porting the loop to PL/pgSQL proved… problematic.
It would be nice to simply write:
DELETE FROM foo
WHERE <condition>
LIMIT 100;
But that syntax doesn’t exist in PostgreSQL. The easiest workaround seems to be a subquery that finds the ids of the rows to delete:
DELETE FROM foo
WHERE id = any(array(SELECT id FROM foo WHERE <condition> LIMIT 100));
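To see the pattern end to end without a Postgres instance, here’s a minimal sketch using Python’s built-in sqlite3 module (the table `foo`, the `value < 150` condition, and the data are all made up for illustration; SQLite has no `= ANY(ARRAY(...))`, so a plain `IN (...)` subquery stands in, but the idea — delete only the ids picked by a LIMITed subquery — is the same):

```python
import sqlite3

# In-memory database standing in for Postgres; the table and the
# condition (value < 150) are illustrative placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO foo (value) VALUES (?)",
                 [(i,) for i in range(200)])

# Delete at most 100 matching rows: the inner subquery picks the ids,
# the outer DELETE removes only those.
cur = conn.execute(
    "DELETE FROM foo WHERE id IN "
    "(SELECT id FROM foo WHERE value < 150 LIMIT 100)"
)
conn.commit()
print(cur.rowcount)  # rows deleted in this batch → 100
print(conn.execute("SELECT count(*) FROM foo").fetchone()[0])  # → 100 left
```

150 rows match the condition, but only 100 go in the first batch; the rest wait for the next iteration.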
And of course it’s useful to know how many rows were actually deleted (rather than the hopeful batch size):
WITH deleted AS (
DELETE FROM foo
WHERE id = any(array(SELECT id FROM foo WHERE <condition> LIMIT 100))
RETURNING id
)
SELECT count(*) FROM deleted;
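That count is exactly what drives the loop: keep issuing batches until one reports zero rows deleted. A sketch of the loop, again against SQLite (the driver’s `rowcount` stands in for the `DELETE … RETURNING` CTE; table, condition, and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO foo (value) VALUES (?)",
                 [(i,) for i in range(250)])

BATCH = 100
counts = []
while True:
    # Each iteration deletes at most BATCH matching rows.
    cur = conn.execute(
        "DELETE FROM foo WHERE id IN "
        "(SELECT id FROM foo WHERE value < 150 LIMIT ?)", (BATCH,))
    conn.commit()
    counts.append(cur.rowcount)
    if cur.rowcount == 0:  # nothing left to delete
        break

print(counts)  # → [100, 50, 0]: full batch, partial batch, done
```

With 150 matching rows, the last non-empty batch comes up short of the hopeful batch size, and the following zero terminates the loop.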
It’s easier to do the loop in bash, provided you can persuade psql to return just that count:
while :
do
    # -q quiet, -t tuples only, -A unaligned: print just the count
    VALUE=$(psql -qtA -d "$DBNAME" -c "WITH deleted AS...")
    echo "deleted $VALUE"
    if [ "$VALUE" -eq 0 ]
    then
        break
    fi
    sleep 1
done
There are two parameters to play with: the batch size and the delay between loops. It’s pretty straightforward to find how fast you can delete without incurring a noticeable I/O penalty.
Just remember that you won’t necessarily get the disk space back: the space is just made available for re-use by the table, unless you perform some compaction (e.g. VACUUM FULL, or a tool like pg_repack).