Running a script for changed lines in a CSV

We have a large CSV containing ref data for our data warehouse export process (enrichment). When the file is updated, a Jenkins job uploads the new version to an RDS instance, so anything exported after that point picks up the change; but updating any existing data is currently a manual task.

The brute force option would be to simply run a script for every line in the (new) CSV, but that would probably take about three days every time the file changed. Sounds like another job for an unholy bash script!

I had initially assumed that I could use the output from git diff, but it didn’t seem possible to get just the changed lines, without any diff decoration. So the next stop was comm. I could get the old version of the file pretty easily, given a sha:

git show 9f60693d4e1d1bd00025b51070efdd034b98448d:./foo.csv

so that can be used as one of the files to compare (piped to stdin):

git show ... | comm -13 --nocheck-order - foo.csv

This suppresses everything except the lines unique to the new file. Parsing a CSV is always a bit hairy, but I was willing to risk just using awk on each line, with a comma separator (this won’t work if any columns contain quoted/embedded commas):

... | awk -F, '{printf "docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py %s %s\n", $1, $2}'
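For a row like some_key,some_value (made-up values, just to illustrate), that emits a line along the lines of:

docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py some_key some_value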

I’m using a Python script to run a SQL update (because reasons), but you could probably generate a psql command instead.
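If you did go the psql route, the awk step could emit SQL directly and pipe it into psql, skipping the per-row docker invocation. A rough sketch (the ref_data table and its key/value columns are invented here for illustration; psql picks up the same PG* env vars):

... | awk -F, '{printf "UPDATE ref_data SET value = '\''%s'\'' WHERE key = '\''%s'\'';\n", $2, $1}' | psql

The same caveat as above applies: a value containing a single quote (or an embedded comma) would break it.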

And finally, you need to execute each command:

| xargs -o -I @ bash -c '@'

(the --open-tty is needed because the docker command is run with -it)

The whole script can then be run from the Jenkins job, using e.g. the GIT_PREVIOUS_SUCCESSFUL_COMMIT var to pick the old revision of the file.
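Putting it all together, it might look something like this (an untested sketch; it assumes the job runs from the repo root, with the new foo.csv checked out):

# Diff the checked-out foo.csv against the version from the last successful
# build, and run update.py for each row that was added or changed.
git show "${GIT_PREVIOUS_SUCCESSFUL_COMMIT}:./foo.csv" \
  | comm -13 --nocheck-order - foo.csv \
  | awk -F, '{printf "docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py %s %s\n", $1, $2}' \
  | xargs -o -I @ bash -c '@'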
