We have a large CSV containing reference data for our data warehouse export process (enrichment). When the file is updated, a Jenkins job uploads the new version to an RDS instance, so any new data from that point on picks up the changes; updating the existing data, however, is currently a manual task.
The brute-force option would be to simply run a script for every line in the (new) CSV, but that would probably take about 3 days each time the file changed. Sounds like another job for an unholy bash script!
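For reference, the brute-force version would be just a loop over every row (a sketch, assuming the update.py script used further down, which takes the two CSV columns as arguments):

# run the update for every row, whether it changed or not
while IFS=, read -r key value; do
  python update.py "$key" "$value"
done < foo.csv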
I had initially assumed I could use the output from git diff, but it didn’t seem possible to get bare output (just the changed lines, with no diff markers). So the next stop was comm. I could get the old version of the file pretty easily, given a SHA:
git show 9f60693d4e1d1bd00025b51070efdd034b98448d:./foo.csv
so that can be used as one of the files to compare, by piping it to comm on stdin (the - argument):
git show ... | comm -13 --nocheck-order - foo.csv
This suppresses everything except lines unique to the new file (-1 drops lines only in the old file, -3 drops lines common to both). Parsing a CSV properly is always a bit hairy, but I was willing to risk just splitting each line on commas with awk (this won’t work if you have any columns with escaped commas):
... | awk -F, '{printf "docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py %s %s\n", $1, $2}'
I’m using a Python script to run a SQL update (because reasons), but you could probably generate a psql command instead.
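For example, awk could emit the SQL itself and pipe the lot into a single psql session; a rough sketch, assuming a made-up ref_data table with id and value columns, and the same PGHOST/PGPORT/PGUSER/PGPASSWORD environment variables:

... | awk -F, -v q="'" '{printf "UPDATE ref_data SET value = %s%s%s WHERE id = %s%s%s;\n", q, $2, q, q, $1, q}' | psql

The same escaped-comma caveat applies (and values containing single quotes would break the quoting); since psql reads the statements from stdin, you also wouldn’t need the per-command execution step below.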
And finally, you need to execute each command:
| xargs -o -I @ bash -c '@'
(the --open-tty is needed for docker run -it, which expects a TTY)
The whole script can then be run from the Jenkins job, using e.g. the GIT_PREVIOUS_SUCCESSFUL_COMMIT env var as the SHA to diff against.
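Stitched together, with the CSV assumed to live at foo.csv in the repo root, the whole thing comes out roughly as:

# old file (from the last successful build) vs the current checkout: keep lines
# unique to the new file, turn each into a docker command, then run them one by one
git show "$GIT_PREVIOUS_SUCCESSFUL_COMMIT:./foo.csv" \
  | comm -13 --nocheck-order - foo.csv \
  | awk -F, '{printf "docker run -it --rm -v $PWD:/app -w /app -e PGHOST -e PGPORT -e PGUSER -e PGPASSWORD reports-db python update.py %s %s\n", $1, $2}' \
  | xargs -o -I @ bash -c '@'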