R – Look on my works, ye Mighty, and despair!

Filtering a CSV in R

I have a 2M line CSV (exported from Redshift), and needed to do some sanity checking. In SQL I would have written something like this:

select count(*)
from foo
where date >= '2020-10-1' and date < '2020-11-1'
and foo = 'bar'

So what’s the equivalent in R?

The recommendation seems to be to use data.table

install.packages('data.table')
library(data.table)
data <- fread("foo.csv")
str(data)

Filtering by value is easy:

foo <- data[foo == 'bar']

But a date range is a little trickier. R seems to know that the strings are a date format:

POSIXct, format: "2020-05-21 14:16:24" "2020-05-21 14:16:28" ...

I imagine it’s possible to truncate those values, but the easiest thing for me was to add a new col:

foo$date <- as.Date(totk$started_at)

and then use that with subset:

> nrow(subset(foo, date >= "2020-10-1" & date < "2020-11-1"))
[1] 73594

If you have a csv with a list of values, and you want to see the distribution, R is an excellent choice.

First, you need to load the data:

data <- read.csv("data.csv", header=TRUE)

(faster methods are available, for large datasets)

You will also need to install ggplot2, if you haven’t already:

install.packages("ggplot2")

(you can install the entire tidyverse, but that seems to download most of the internet)

Finally, import the library, and generate the chart:

library(ggplot2)

ggplot(data, aes(x=num)) + geom_histogram(binwidth=.5)

Et voila!