Group by and Mutate Fun
I wanted to write this post simply because I use this trick a lot and it’s not super complicated but for new programmers, this is often one of the first challenging tasks that can be made simple with dplyr and data.table. As someone who uses both packages, this post will give examples of how to use group by and mutate together in both packages.
In this post, we’re going to use the following data and focus on ranking the ages of each student based on their location.
data<-data.table(Location=c("Los Angeles","San Francisco","Los Angeles","New York","New York","New York","Los Angeles","New York","New York","Los Angeles"),
Age=c(13,24,13,24,25,18,17,4,29,13))
print(data)
## Location Age
## 1: Los Angeles 13
## 2: San Francisco 24
## 3: Los Angeles 13
## 4: New York 24
## 5: New York 25
## 6: New York 18
## 7: Los Angeles 17
## 8: New York 4
## 9: New York 29
## 10: Los Angeles 13
So first, let’s do this in dplyr:
data %>%
as.data.frame() %>%
group_by(Location) %>%
mutate(Location_Rank = 1:n()) %>%
arrange(Location, Location_Rank)
## # A tibble: 10 x 3
## # Groups: Location [3]
## Location Age Location_Rank
## <chr> <dbl> <int>
## 1 Los Angeles 13 1
## 2 Los Angeles 13 2
## 3 Los Angeles 17 3
## 4 Los Angeles 13 4
## 5 New York 24 1
## 6 New York 25 2
## 7 New York 18 3
## 8 New York 4 4
## 9 New York 29 5
## 10 San Francisco 24 1
This isn’t revolutionary but for the first couple weeks of consistently using the group by and summarize functions together to analysis/play around with data, it seemed crazy that you could use other functions with group by. So when I discovered that you could use mutate with group by it was really a gamer changer.
As is common in my R blog posts, you can also do this in data.table!!!. For example:
data <- as.data.table(data)
data[, Location_Rank := seq_len(.N), by = Location]
data[, Location_Rank := rowid(Location)]
print(data)
## Location Age Location_Rank
## 1: Los Angeles 13 1
## 2: San Francisco 24 1
## 3: Los Angeles 13 2
## 4: New York 24 1
## 5: New York 25 2
## 6: New York 18 3
## 7: Los Angeles 17 3
## 8: New York 4 4
## 9: New York 29 5
## 10: Los Angeles 13 4
Other useful functions to use alongside group by and mutate are lag, lead and cumsum.