I’ve been doing some (small to medium) data work recently, and when trying to identify missing data it’s often necessary to resort to comparing two lists of rows.
SQL is an ideal tool for this, but if you can’t do cross database queries, then you need to resort to something more basic.
I was initially segmenting the data, and using a spreadsheet with conditional formatting to highlight the extra (or missing) rows:
=MATCH(A1, B1, 0)
But once you’re comparing tens of thousands of rows, that loses appeal. At this point I discovered the comm utility:
$ comm -23 old.csv new.csv
This will print out those lines only in the old csv.