I have a 2M line CSV (exported from Redshift), and needed to do some sanity checking. In SQL I would have written something like this:
select count(*)
from foo
where date >= '2020-10-1' and date < '2020-11-1'
and foo = 'bar'
So what’s the equivalent in R?
The recommendation seems to be to use data.table
install.packages('data.table')
library(data.table)
data <- fread("foo.csv")
str(data)
Filtering by value is easy:
foo <- data[foo == 'bar']
But a date range is a little trickier. R seems to know that the strings are a date format:
POSIXct, format: "2020-05-21 14:16:24" "2020-05-21 14:16:28" ...
I imagine it’s possible to truncate those values, but the easiest thing for me was to add a new col:
foo$date <- as.Date(totk$started_at)
and then use that with subset:
> nrow(subset(foo, date >= "2020-10-1" & date < "2020-11-1"))
[1] 73594
