Set operations on a list of ids

I’ve been doing some (small to medium) data work recently, and when trying to identify missing data it’s often necessary to resort to comparing two lists of rows.

SQL is an ideal tool for this, but if you can’t do cross database queries, then you need to resort to something more basic.

I was initially segmenting the data, and using a spreadsheet with conditional formatting to highlight the extra (or missing) rows:

=MATCH(A1, B1, 0)

But once you’re comparing tens of thousands of rows, that loses appeal. At this point I discovered the comm utility:

$ comm -23 old.csv new.csv

This will print out those lines only in the old csv.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s