Filtering a CSV in R

I have a 2M line CSV (exported from Redshift), and needed to do some sanity checking. In SQL I would have written something like this:

select count(*)
from foo
where date >= '2020-10-1' and date < '2020-11-1'
and foo = 'bar'

So what’s the equivalent in R?

The recommendation seems to be to use data.table

install.packages('data.table')
library(data.table)
data <- fread("foo.csv")
str(data)

Filtering by value is easy:

foo <- data[foo == 'bar']

But a date range is a little trickier. R seems to know that the strings are a date format:

POSIXct, format: "2020-05-21 14:16:24" "2020-05-21 14:16:28" ...

I imagine it’s possible to truncate those values, but the easiest thing for me was to add a new col:

foo$date <- as.Date(totk$started_at)

and then use that with subset:

> nrow(subset(foo, date >= "2020-10-1" & date < "2020-11-1"))
[1] 73594

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s