You can perform a group by sum in R, by using the aggregate()
function from the base R package. Alternatively, you can use the group_by()
function along with summarise()
from the dplyr package. Both methods allow grouping a data frame based on a particular column and calculating the sum of a numeric variable within each group.
First, I will demonstrate an efficient approach using the groupby() function from the dplyr package. Then, I will show how to use the aggregate() function from the R base to group by sum on single or multiple columns.
Quick Examples
Below are brief examples assuming how to execute group-by-summation.
# Create Data Frame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df
# Load dplyr
library(dplyr)
# Group by sum using dplyr
# returns tibble table
agg_tbl <- df %>% group_by(department) %>%
summarise(sum_salary=sum(salary),
.groups = 'drop')
# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>%
summarise(sum_salary=sum(salary),
sum_bonus= sum(bonus),
.groups = 'drop') %>%
as.data.frame()
# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>%
summarise(across(c(salary, bonus),sum),
.groups = 'drop') %>%
as.data.frame()
# Sum on all columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>%
summarise(across(everything(), sum),
.groups = 'drop') %>%
as.data.frame()
# Group by sum using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=sum)
# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=sum)
Let’s generate a data frame by importing data from a CSV file.
Yields below output.
Get Group By Sum of Specified Column of R
When we apply the combination of group_by()
function and the summarise()
function of the dplyr package on an R data frame by a single column, it returns a sum of grouped data.
Before utilizing these functions, you need to first install dplyr by running install.packages(‘dplyr’) and then load it with the library(dplyr). Throughout our examples, I can use the dplyr infix operator %>% since the output of the group_by()
function serves as the input for the summarise()
function.
# Load dplyr
library(dplyr)
# Group by sum using dplyr
agg_tbl <- df %>% group_by(department) %>%
summarise(sum_salary = sum(salary),
.groups = 'drop')
agg_tbl
# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
df2
Yields below output. The above code returns a groupby object based on department
column and then computes the sum of salary
for each department.
Note: The pair of above functions return tibble if you want to get your result in data frame format, you can go with as.data.frame()
. This will make your result as you expect.
Get Group By Sum of Multiple Columns in R
You can also apply the above combination on multiple columns of a data frame to get the sum of multiple numeric columns based on each combination of grouped data.
# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>%
summarise(sum_salary=sum(salary),
sum_bonus= sum(bonus),
.groups = 'drop') %>%
as.data.frame()
df2
The above code takes a data frame df
, groups it by department
and state
, and then calculates the sum of salary
and bonus
within each group, creating a new data frame df2
with these summary statistics.
Yields below output.
Alternatively, you can use the across()
function with the vector of elements you want to apply to summarise on.
# Group by sum of multiple columns using across()
df2 <- df %>% group_by(department,state) %>%
summarise(across(c(salary, bonus),sum),
.groups = 'drop') %>%
as.data.frame()
df2
Get the Sum of NoN-Grouping Columns
Finally, let’s explore how to use the aggregate function sum
on all columns of a data frame, excluding the grouping columns. Ensure that your data frame contains only numeric columns in addition to the grouping columns, as including non-numeric columns in the summarization will result in an error.
In this example, we will group by the department
and state
columns, summarize all columns except the grouping columns, and apply the sum
function to all summarized columns.
# Sum on all columns except group by columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>%
summarise(across(everything(), sum),
.groups = 'drop') %>%
as.data.frame()
df2
Get Group By Sum using aggregate()
So far, we have learned examples of groupby sum using the dplyr package. Now in this example, we will learn how to get groupby sum based on single/multiple columns of the data frame using R base aggregate()
function.
Let’s pass specified columns one for summing and another for grouping along with FUN
parameters into this function, it will perform grouping and summing finally, return the sum of the grouped object.
# Group by sum using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=sum)
agg_df
Yields below output.
Perform Groupby Sum of Multiple Columns using R Base
You can utilize the base R function aggregate()
to perform a groupby sum operation on multiple columns of a data frame. To do this, specify the list of columns for grouping using the by
parameter and include another numeric column along with the FUN
parameter when calling the function.
# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=sum)
agg_df
Yields below output.
Conclusion
In this article, I have detailed how to perform a group by sum in R using the group_by()
function from the dplyr package and the aggregate()
function from base R. The dplyr functions are particularly efficient for handling large datasets, making them the preferable option.
Related Articles
- R Group by Multiple Columns or Variables
- R Group by Count With Examples
- R Group by Mean With Examples
- R Summarise on Group By in Dplyr
- R lm() Function – Fitting Linear Models
- R select() Function from dplyr – Usage with Examples
- R Filter DataFrame by Column Value
References
- https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/grouped_df
- https://www.w3schools.com/sql/sql_groupby.asp