How to perform a group by on multiple columns in R data frame? By using the group_by() function from the dplyr package we can perform a group by on multiple columns or variables (two or more columns) and summarise on multiple columns for aggregations. Later, I will also explain how to apply summarise() on all columns and finally use multiple aggregation functions together.
1. Quick Examples of Grouping by Multiple Columns
Following are quick examples of grouping dataframe on multiple columns.
# Group by on multiple columns
agg_tbl <- df %>% group_by(department, state) %>%
summarise(total_salary=sum(salary))
# Summarise on multiple columns
df2<- df[,c("department","state","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(c(salary, bonus), sum))
# Apply multiple summaries
df2<- df[,c("department","state","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(c(salary, bonus), list(mean = mean, sum = sum)))
# Summarise all columns except grouping columns
df2<- df[,c("department","state","age","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
Let’s create a DataFrame by reading a CSV file.
# Read CSV file into DataFrame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df
Yields below output.
2. Group By Multiple Columns in R
Use the group_by() function in R to group the rows in DataFrame by multiple columns (two or more), to use this function, you have to install dplyr first using install.packages(‘dplyr’) and load it using library(dplyr)
.
All functions in the dplyr package take data.frame
as a first argument. When we use dplyr
package, we mostly use the infix operator %>%
from magrittr
, it passes the left-hand side of the operator to the first argument of the operator’s right-hand side. For example, x %>% f(y)
converted into f(x, y)
so the result from the left-hand side is then “piped” into the right-hand side.
I will use the infix operator %>%
across all our examples as the output of the group_by() function is input to summarise() function.
# Load dplyr
library(dplyr)
# Group by on multiple columns
agg_tbl <- df %>% group_by(department, state) %>%
summarise(total_salary=sum(salary))
agg_tbl
Yields below output. This example does the group by on department
and state
columns and summarises on salary
column and applies the sum
function on each summarised column.
Note that the output of group_by() and summarise() is tibble hence, to convert it to data.frame use as.data.frame()
function.
# Convert to data frame
df2 <- agg_tbl %>% as.data.frame()
3. Grop By & Summarise on Multiple Columns
To perform summarising on multiple columns, create a vector with the column names and use it with the across() function.
# Summarise on multiple columns
agg_tbl <- df %>% group_by(department, state) %>%
summarise(across(c(salary, bonus), sum))
This example does the group by on department
and state
columns, summarises on salary
& bonus
columns, and apply the sum
function on each summarised column.
Yields below output.
4. Apply Multiple Summarise Functions
Similarly, you can also perform multiple aggregation functions on all summarise columns in R.
# Apply multiple summaries
agg_tbl <- df %>% group_by(department, state) %>%
summarise(across(c(salary, bonus), list(mean = mean, sum = sum)))
This example does the group by on department
and state
columns, summarises on salary
and bonus
columns, and apply the sum
& mean
functions on each summarised column
Yields below output.
5. Summarise All Columns Except Groupings
Finally, let’s see how to apply the aggregate functions on all columns of the datafFrame except grouping columns. While doing this make sure your dataframe has only numeric columns plus grouping columns. Having non-numeric on summarise returns an error.
# Summarise all columns except grouping columns
df2<- df[,c("department","state","age","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
agg_tbl
This example does the group by on department
and state
columns, summarises on all columns except grouping columns, and apply the sum
& mean
functions on all summarised columns.
Yields below output.
4. Conclusion
In this article, I have explained how to perform group by dataframe on multiple columns and apply different summarising types to get aggregation on grouped data. Since the output of group_by() and summarise() is tibble, use as.data.frame()
function to convert it to data.frame.
Related Articles
- Different Ways to Create a DataFrame in R
- Reorder Columns of DataFrame in R
- R – Join Two or Multiple DataFrames
- How to Read Multiple CSV Files in R
- R Join (Merge) on Multiple Columns
- R Join Multiple Data Frames
- R Write Multiple Lines to Text File