We have some cron jobs to remove old data; but recently, as the amount of data has increased, they have started causing IO spikes. The internet suggested the problem was caused by deleting everything in one transaction:
DELETE FROM foo
WHERE <condition>;
We found an example of chunking deletes in T-SQL, but porting the loop to PL/pgSQL proved… problematic.
It would be nice to simply write:
DELETE FROM foo
WHERE <condition>
LIMIT 100;
But that syntax doesn’t exist. The easiest alternative seems to be a subquery, to find the ids of the rows to delete:
DELETE FROM foo
WHERE id = any(array(SELECT id FROM foo WHERE <condition> LIMIT 100));
And of course it’s useful to know how many rows were actually deleted (rather than the hopeful batch size):
WITH deleted AS (
    DELETE FROM foo
    WHERE id = any(array(SELECT id FROM foo WHERE <condition> LIMIT 100))
    RETURNING id
)
SELECT count(*) FROM deleted;
It’s easier to do the loop in bash, if you can persuade psql to return that info:
while true; do
    VALUE=$(psql -qtA -d "$DBNAME" -c "WITH deleted AS ...")
    echo "deleted $VALUE"
    [ "$VALUE" -eq 0 ] && break
    sleep 1
done
There are two parameters to play with: the batch size, and the delay between loops. It’s pretty straightforward to identify how fast you can delete, without incurring a noticeable IO penalty.
Just remember that you won’t necessarily get that disk space back: it’s just made available for re-use, unless you perform some compaction (e.g. VACUUM FULL, or pg_repack).
We use pgbouncer as a connection pooler, and in one of our production environments (after a recent migration) we were seeing some portion of connection attempts fail with ETIMEDOUT.
Our first assumption was that it was due to some limitation of our service provider’s internal network, but they assured us that they couldn’t see any failures; and when we looked at the other end, we couldn’t either.
So it seemed to be some limitation on the client host (e.g. hitting the file descriptor limit). We had a look at some netstat data, and added some datadog tcp metrics, but nothing stood out.
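For example, counting sockets by state will show whether connections are piling up in TIME_WAIT (a sketch of the kind of check involved, not our exact commands):
# count sockets by state; thousands in TIME_WAIT to one host
# would suggest ephemeral port exhaustion
netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn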
At this point, there seemed to be no other option than to use tcpdump and see if we could find a reason that the connection was rejected. We fired it up:
sudo tcpdump -i eth2 -w tcpdump.log
downloaded the output, and opened it up in Wireshark. Following some helpful instructions, we identified some likely packets.
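A display filter along these lines (an assumption on my part, not taken from those instructions) narrows things down to resets and retransmissions:
tcp.flags.reset == 1 || tcp.analysis.retransmission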
At this point it was starting to look like we were suffering from ephemeral port exhaustion, so we decided to experiment with running pgbouncer on the app server instead, as that would reduce the number of open sockets between the hosts.
A resounding success! I’m sure it would also be possible to tune some Linux TCP options to the same effect, but this was acceptable for us (there’s only one app server in that env).
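For reference, these are the sort of knobs involved (untested assumptions, not settings we actually applied):
# widen the ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# allow TIME_WAIT sockets to be re-used for outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1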
I’m not entirely sure why we were getting a timeout, rather than EADDRNOTAVAIL, but that may be down to the client library we are using to connect.
It’s pretty simple to update the next value generated by a sequence, but what if you want to update them for every table?
In our case, we had been using AWS DMS to import data, but none of the sequences were updated afterwards; so any attempt to insert a new row was doomed to failure.
To update one sequence you can call:
SELECT setval('foo.bar_id_seq', (select max(id) from foo.bar), true);
and you can get a list of tables pretty easily (e.g. from pg_tables):
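SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');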
But how do you put them together? My first attempt was some vim fu (qq@q), until I realised I’d need a regex to capture the table name; and then I found some sequences that weren’t using the same name as the table anyway (consistency über alles). Fortunately, pg_get_serial_sequence can do the lookup for us:
select t.schemaname, t.tablename,
       pg_get_serial_sequence(t.schemaname || '.' || t.tablename, c.column_name)
from pg_tables t
join information_schema.columns c
  on c.table_schema = t.schemaname and c.table_name = t.tablename
where t.schemaname not in ('pg_catalog', 'information_schema')
  and pg_get_serial_sequence(t.schemaname || '.' || t.tablename, c.column_name) is not null;
This returns the schema, table name, and sequence name for every (non-system) table; which should be “trivial” to convert into a script updating the sequences (I considered doing that in SQL, but dynamic table names aren’t easy there).
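For example, a bash sketch in the same vein as the delete loop above (assuming $DBNAME is set, and adding the column name to the query so we can take its max):
psql -qtA -d "$DBNAME" -c "
    select t.schemaname, t.tablename, c.column_name,
           pg_get_serial_sequence(t.schemaname || '.' || t.tablename, c.column_name)
    from pg_tables t
    join information_schema.columns c
      on c.table_schema = t.schemaname and c.table_name = t.tablename
    where t.schemaname not in ('pg_catalog', 'information_schema')
      and pg_get_serial_sequence(t.schemaname || '.' || t.tablename, c.column_name) is not null
" | while IFS='|' read -r SCHEMA TABLE COLUMN SEQ; do
    # coalesce handles empty tables (setval with NULL would be an error)
    psql -qtA -d "$DBNAME" -c \
        "SELECT setval('$SEQ', coalesce((SELECT max($COLUMN) FROM $SCHEMA.$TABLE), 1), true);"
done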
Locust is a programmer-friendly load testing tool (certainly compared with JMeter!). Traditionally, once you needed to generate more load than a single host could easily support, you would set up a swarm. However, if you’re willing to live without the web UI, there is another option.
Once you have a containerised version of your scripts, you can go “serverless”, and run them as a task on AWS Fargate. You can use the wizard to set up a cluster &c, or define them with CloudFormation:
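Something along these lines (a minimal sketch rather than a battle-tested template; the names, sizes, image, and target host are all placeholders):
Resources:
  LoadTestCluster:
    Type: AWS::ECS::Cluster
  LocustTask:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "1024"
      Memory: "2048"
      # (log configuration and execution role omitted for brevity)
      ContainerDefinitions:
        - Name: locust
          Image: <your-locust-image>
          # headless: 100 users, for an hour, against the chosen host
          Command: ["locust", "--headless", "-u", "100", "-t", "1h", "-H", "https://target.example.com"]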
(You can use a public subnet and security group from the default VPC.) That will spawn 100 VUs, for an hour, against your chosen target. Any logs from locust will be available in the AWS console. And you can just keep adding more.
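“Adding more” just means launching more copies of the task, e.g. (the cluster, subnet, and security group ids are placeholders):
aws ecs run-task --cluster load-test --launch-type FARGATE \
    --task-definition locust --count 1 \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}'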
I kept the ssh config for the jump host pretty minimal. The IP mask needs to match whatever you used for the subnet(s) the instances are attached to (obvs). And the login may vary depending on the image you used, if you are using the defaults.
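A sketch of the shape of it (the addresses, user, and filename are invented):
# ssh.cfg
Host jump
    HostName 203.0.113.10
    User ec2-user

Host 10.0.1.*
    User ec2-user
    ProxyJump jump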
You can then use this config when running your playbook:
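One way is via the ANSIBLE_SSH_ARGS environment variable (the inventory and playbook names are placeholders):
ANSIBLE_SSH_ARGS="-F ssh.cfg" ansible-playbook -i inventory site.yml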
The IP address for the jump host is hard-coded in the ssh config, which isn’t ideal. We may use a DNS record, and update that instead, if it changes; but there doesn’t seem to be any easy way either to get that address from the inventory, or to update the CNAME automatically.