You are currently viewing R Group by Sum With Examples

You can perform a group by sum in R, by using the aggregate() function from the base R package. Alternatively, you can use the group_by() function along with summarise() from the dplyr package. Both methods allow grouping a data frame based on a particular column and calculating the sum of a numeric variable within each group.

Advertisements

First, I will demonstrate an efficient approach using the groupby() function from the dplyr package. Then, I will show how to use the aggregate() function from the R base to group by sum on single or multiple columns.

Quick Examples

Below are brief examples assuming how to execute group-by-summation.


# Create Data Frame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

# Load dplyr
library(dplyr)

# Group by sum using dplyr
# returns tibble table
agg_tbl <- df %>% group_by(department) %>% 
  summarise(sum_salary=sum(salary),
            .groups = 'drop')

# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()

# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(sum_salary=sum(salary),
            sum_bonus= sum(bonus),
            .groups = 'drop') %>%
  as.data.frame()

# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),sum),
            .groups = 'drop') %>%
  as.data.frame()

# Sum on all columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), sum),
            .groups = 'drop')  %>%
  as.data.frame()

# Group by sum using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=sum)

# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=sum)

Let’s generate a data frame by importing data from a CSV file.

Yields below output.

r group by sum

Get Group By Sum of Specified Column of R

When we apply the combination of group_by() function and the summarise() function of the dplyr package on an R data frame by a single column, it returns a sum of grouped data.

Before utilizing these functions, you need to first install dplyr by running install.packages(‘dplyr’) and then load it with the library(dplyr). Throughout our examples, I can use the dplyr infix operator %>% since the output of the group_by() function serves as the input for the summarise() function.


# Load dplyr
library(dplyr)

# Group by sum using dplyr
agg_tbl <- df %>% group_by(department) %>% 
  summarise(sum_salary = sum(salary),
            .groups = 'drop')
agg_tbl

# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
df2

Yields below output. The above code returns a groupby object based on department column and then computes the sum of salary for each department.

Note: The pair of above functions return tibble if you want to get your result in data frame format, you can go with as.data.frame(). This will make your result as you expect.

r group by sum

Get Group By Sum of Multiple Columns in R

You can also apply the above combination on multiple columns of a data frame to get the sum of multiple numeric columns based on each combination of grouped data.


# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(sum_salary=sum(salary),
            sum_bonus= sum(bonus),
            .groups = 'drop') %>%
  as.data.frame()
df2

The above code takes a data frame df, groups it by department and state, and then calculates the sum of salary and bonus within each group, creating a new data frame df2 with these summary statistics.

Yields below output.

r group by sum multiple columns

Alternatively, you can use the across() function with the vector of elements you want to apply to summarise on.


# Group by sum of multiple columns using across()
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),sum),
            .groups = 'drop') %>%
  as.data.frame()
df2

Get the Sum of NoN-Grouping Columns

Finally, let’s explore how to use the aggregate function sum on all columns of a data frame, excluding the grouping columns. Ensure that your data frame contains only numeric columns in addition to the grouping columns, as including non-numeric columns in the summarization will result in an error.

In this example, we will group by the department and state columns, summarize all columns except the grouping columns, and apply the sum function to all summarized columns.


# Sum on all columns except group by columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), sum),
            .groups = 'drop')  %>%
  as.data.frame()
df2
on all columns

Get Group By Sum using aggregate()

So far, we have learned examples of groupby sum using the dplyr package. Now in this example, we will learn how to get groupby sum based on single/multiple columns of the data frame using R base aggregate() function.

Let’s pass specified columns one for summing and another for grouping along with FUN parameters into this function, it will perform grouping and summing finally, return the sum of the grouped object.


# Group by sum using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=sum)
agg_df

Yields below output.

r aggregate sum

Perform Groupby Sum of Multiple Columns using R Base

You can utilize the base R function aggregate() to perform a groupby sum operation on multiple columns of a data frame. To do this, specify the list of columns for grouping using the by parameter and include another numeric column along with the FUN parameter when calling the function.


# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=sum)
agg_df

Yields below output.

multiple columns

Conclusion

In this article, I have detailed how to perform a group by sum in R using the group_by() function from the dplyr package and the aggregate() function from base R. The dplyr functions are particularly efficient for handling large datasets, making them the preferable option.

References