Joins in Data.Table
Although I’ve used the data.table package regularly for almost two years now, I still find myself converting my objects into data frames for joins. Part of this is definitely just laziness, so I figured writing about joins in data.table would help me understand and remember the syntax better moving forward (I’ll keep you updated on if that actually happens).
I’m going to rely on the mtcars dataset and a made-up data set of people’s cars to demonstrate the different types of joins. Here’s how I’m building my data for those who want to follow along:
cars <- mtcars %>% tibble::rownames_to_column(var="car_type") %>% select(car_type, mpg) %>% as.data.table()
head(cars)
## car_type mpg
## 1: Mazda RX4 21.0
## 2: Mazda RX4 Wag 21.0
## 3: Datsun 710 22.8
## 4: Hornet 4 Drive 21.4
## 5: Hornet Sportabout 18.7
## 6: Valiant 18.1
friends <- c("Nicholas", "Drake", "JP", "Lily")
cars_models <- c("Ferrari Dino", "Honda Civic", "Fiat X1-9", "Smart Car")
friends_cars <- data.frame(friends,cars_models) %>%
mutate_at(vars(friends, cars_models),
funs(as.character)) %>% as.data.table()
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
head(friends_cars)
## friends cars_models
## 1: Nicholas Ferrari Dino
## 2: Drake Honda Civic
## 3: JP Fiat X1-9
## 4: Lily Smart Car
# Types of Joins
Data.table joins are based on the same premise as SQL joins. The syntax of data.table joins is very simple:
Join Type | Syntax Left Join | Y[X] Right Join | X[Y,] Inner Join | X[Y, nomatch=0]
Set Key
Unlike joining using dplyr or merge, data.table joins determined the on/using clause by setting the keys of the table using the setkey() function. This honestly is the part that turns me off using data.table joins regularly because I always forgot to do this. But, it is a quick and simple step.
For our example, we are going to join on the car_type in the cars data.table and the cars_models in the friends_cars table. So, we set the key for these fields (unsure if that’s the technical language but hopefully y’all follow me):
setkey(cars,car_type)
setkey(friends_cars, cars_models)
# left join
cars[friends_cars,]
## car_type mpg friends
## 1: Ferrari Dino 19.7 Nicholas
## 2: Fiat X1-9 27.3 JP
## 3: Honda Civic 30.4 Drake
## 4: Smart Car NA Lily
# right join
friends_cars[cars,]
## friends cars_models mpg
## 1: <NA> AMC Javelin 15.2
## 2: <NA> Cadillac Fleetwood 10.4
## 3: <NA> Camaro Z28 13.3
## 4: <NA> Chrysler Imperial 14.7
## 5: <NA> Datsun 710 22.8
## 6: <NA> Dodge Challenger 15.5
## 7: <NA> Duster 360 14.3
## 8: Nicholas Ferrari Dino 19.7
## 9: <NA> Fiat 128 32.4
## 10: JP Fiat X1-9 27.3
## 11: <NA> Ford Pantera L 15.8
## 12: Drake Honda Civic 30.4
## 13: <NA> Hornet 4 Drive 21.4
## 14: <NA> Hornet Sportabout 18.7
## 15: <NA> Lincoln Continental 10.4
## 16: <NA> Lotus Europa 30.4
## 17: <NA> Maserati Bora 15.0
## 18: <NA> Mazda RX4 21.0
## 19: <NA> Mazda RX4 Wag 21.0
## 20: <NA> Merc 230 22.8
## 21: <NA> Merc 240D 24.4
## 22: <NA> Merc 280 19.2
## 23: <NA> Merc 280C 17.8
## 24: <NA> Merc 450SE 16.4
## 25: <NA> Merc 450SL 17.3
## 26: <NA> Merc 450SLC 15.2
## 27: <NA> Pontiac Firebird 19.2
## 28: <NA> Porsche 914-2 26.0
## 29: <NA> Toyota Corolla 33.9
## 30: <NA> Toyota Corona 21.5
## 31: <NA> Valiant 18.1
## 32: <NA> Volvo 142E 21.4
## friends cars_models mpg
That left join was so simple that honestly if you’re reading through the script you may not even know it’s a join. The only syntax difference that stands out to me besides no “on” parameter is that in SQL and dplyr, the first table in a left join is listed first whereas here, the first table is listed second, inside the brackets.
Now, let’s see how the nomatch parameter impacts the joins. Once again, I find this detail hard to follow/identify as a join but once you get the hang of it, it is super useful. My understanding of the nomatch parameter is that when you set nomatch=0, the returned output from the join is limited to the records that match on the set key. So, in our example above, we would only expect 3 records to return on our left join since Smart Car is not in the cars data.table (we know that because in our example above, the mpg field’s value for Smart Car is NA).
If you’ve used joins in other programming languages or in R with merge or dplyr, you’ll realize quickly that what I have just described, only returning the matched records, is actually just an inner join.
# left join with nomatch (inner join)
cars[friends_cars,nomatch=0]
## car_type mpg friends
## 1: Ferrari Dino 19.7 Nicholas
## 2: Fiat X1-9 27.3 JP
## 3: Honda Civic 30.4 Drake
friends_cars[cars,nomatch=0]
## friends cars_models mpg
## 1: Nicholas Ferrari Dino 19.7
## 2: JP Fiat X1-9 27.3
## 3: Drake Honda Civic 30.4
The only difference between the two joins above is simply the order of the output. This is something I’ve noticed that honestly I don’t like about data.table joins. But also if we’re being truthful I’m constantly re-ordering my data just to suit my own personal preferences even when it has zero impact on the analysis. Just my personal pet peeve!
Great source that I relied on/has more information : https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html