The group_by() function in R is from the dplyr package that is used to group rows by column values in the DataFrame, It is similar to the GROUP BY clause in SQL. R dplyr groupby is used to collect identical data into groups on DataFrame and perform aggregate functions on the grouped data.
In general Group by operation involves splitting the data, applying some functions, and finally aggregating the results.
So it is a two-step process, first, you need to perform grouping and then run the aggregate for each group. In this article, I will explain the group_by() function syntax and usage on a data frame with R programming examples.
1. Syntax of group_by() Function
Following is the syntax of the dplyr group_by() function.
# Syntax group by
group_by(.data, ..., add = FALSE)
Parameters
.data
– tbl...
– Variables or Columns to group by.add
– Defaults to FALSE
Following are the types of tbl (tables) it supports.
- data.frame
- data.table
- SQLite
- PostgreSQL
- MySQL
Create a DataFrame by loading from a CSV file.
# Read CSV file into DataFrame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df
Yields below output.
2. Dplyr group_by Function Example
To group rows in a data frame by specific columns in R, use the group_by()
function. Before you can use this function, you need to install the dplyr
package with install.packages(‘dplyr’) and load it into your environment with library(dplyr)
.
All dplyr
functions require a data frame as the first argument. The dplyr
package commonly uses the infix operator %>%
from the magrittr
package, which allows you to chain operations. This operator pipes the output from the left side into the first argument of the function on the right side. For instance, x %>% f(y)
translates to f(x, y)
, so the result on the left is passed as an argument to the function on the right.
The group_by() function doesn’t change the data frame data how it looks and it just returns the grouped tbl (tibble table) where we can perform summarising on. Let’s perform the group by on the column department
and summarize to get the sum of salary for each group.
# Load dplyr
library(dplyr)
# group_by() on department
grp_tbl <- df %>% group_by(department)
grp_tbl
# summarise on groupped data.
agg_tbl <- grp_tbl %>% summarise(sum(salary))
agg_tbl
Yields below output.
Note that the output of group_by() and summarise() is tibble hence, to convert it to data.frame use as.data.frame()
function.
# Convert tibble to DataFrame
df2 <- agg_tbl %>% as.data.frame()
3. Assign Name to the Summarize Column
If you notice above the second output, the summarise column name has sum(salary) which is not user-friendly, let’s see how to add a custom user-friendly name to it. Also, I will rewrite the above 2 statements into a single statement using dplyr piping.
# Assign column Name to the aggregated column
# Group by on multiple columns
agg_tbl <- df %>% group_by(department) %>%
summarise(total_salary=sum(salary))
agg_tbl
Yields below output.
4. Using dplyr group_by() on Multiple Columns
The group_by() and summarise() also support group by on multiple columns and summarise on multiple columns.
# Group by on multiple columns
# & multiple aggregations
agg_tbl <- df %>% group_by(department, state) %>%
summarise(total_salary=sum(salary),
total_bonus = sum(bonus),
min_salary = min(salary),
max_salary = max(salary),
.groups = 'drop'
)
agg_tbl
Yields below output.
5. Apply List of Summarise Functions
This example performs a group-by-operation on the department
and state
columns, aggregates the salary
and bonus
columns, and applies the sum
and mean
functions to each of these aggregated columns.
# Apply multiple summaries
df2<- df[,c("department","state","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(c(salary, bonus), list(mean = mean, sum = sum)))
Yields below output.
6. Summarise All Columns Except Grouping Columns
This example groups the data by the department
and state
columns, then aggregates the remaining columns by applying the sum
and mean
functions to them.
# Summarise all columns except grouping columns
df2<- df[,c("department","state","age","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
agg_tbl
Yields below output.
7. Complete Example
Following is a complete example of an R group by function.
# Create Data Frame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df
# Load dplyr
library(dplyr)
# group_by() on department
grp_tbl <- df %>% group_by(department)
grp_tbl
# summarise on groupped data.
agg_tbl <- grp_tbl %>% summarise(sum(salary))
agg_tbl
# Assign column Name to the aggregated column
agg_tbl <- df %>% group_by(department) %>%
summarise(total_salary=sum(salary))
agg_tbl
# Group by on multiple columns
# & multiple aggregations
agg_tbl <- df %>% group_by(department, state) %>%
summarise(total_salary=sum(salary),
total_bonus = sum(bonus),
min_salary = min(salary),
max_salary = max(salary),
.groups = 'drop'
)
agg_tbl
# Apply multiple summaries
df2<- df[,c("department","state","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(c(salary, bonus), list(mean = mean, sum = sum)))
# Summarise all columns except grouping columns
df2<- df[,c("department","state","age","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
agg_tbl
8. Conclusion
In this article, you have learned the syntax of group_by() function in R from the dplyr package and how to use this to group the rows in DataFrame and apply the summarise.
Related Articles
- R Summarise on Group By in Dplyr
- R Group by Mean With Examples
- R Group by Sum With Examples
- R Group by Count With Examples
- R Group by Multiple Columns or Variables
- R solve() Equation with Examples
- Pipe in R with Examples