You are currently viewing R Group by Sum With Examples

To perform a group by sum in R, you can use the aggregate() function from the base R package. Alternatively, you can use the group_by() function along with summarise() from the dplyr package. Both methods allow grouping a dataframe based on a specific column and calculating the sum of a numeric variable within each group.

Advertisements

First, I will demonstrate an efficient approach using the groupby() function from the dplyr package. Then, I will show how to use the aggregate() function from the R base to group by sum on single or multiple columns.

Quick Examples

Below are brief examples assuming how to execute group-by-summation.


# Create Data Frame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

# Load dplyr
library(dplyr)

# Group by sum using dplyr
# returns tibble table
agg_tbl <- df %>% group_by(department) %>% 
  summarise(sum_salary=sum(salary),
            .groups = 'drop')

# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()

# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(sum_salary=sum(salary),
            sum_bonus= sum(bonus),
            .groups = 'drop') %>%
  as.data.frame()

# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),sum),
            .groups = 'drop') %>%
  as.data.frame()

# Sum on all columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), sum),
            .groups = 'drop')  %>%
  as.data.frame()

# Group by sum using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=sum)

# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=sum)

Let’s generate a data frame by importing data from a CSV file.

Yields below output.

r group by sum

Get Group By Sum of Single Column in R

When we apply the combination of group_by() function and the summarise() function of the dplyr package on an R data frame based on a single column, it returns a sum of grouped data.

Before utilizing these functions, you need to first install dplyr by running install.packages(‘dplyr’) and then load it with the library(dplyr). Throughout our examples, I can use the dplyr infix operator %>% since the output of the group_by() function serves as the input for the summarise() function.


# Load dplyr
library(dplyr)

# Group by sum using dplyr
agg_tbl <- df %>% group_by(department) %>% 
  summarise(sum_salary = sum(salary),
            .groups = 'drop')
agg_tbl

# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
df2

Yields below output. The above code returns a groupby object based on department column and then calculates the sum of salary for each department.

Note: The pair of above functions return tibble if you want to get your result in data frame format, you can go with as.data.frame(). This will make your result as you expect.

r group by sum

Get Group By Sum of Multiple Columns in R

Alternatively, you can also apply the above combination on multiple columns of a data frame to get the sum of multiple numeric columns based on each combination of grouped data.


# Group by sum of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(sum_salary=sum(salary),
            sum_bonus= sum(bonus),
            .groups = 'drop') %>%
  as.data.frame()
df2

The above code takes a dataframe df, groups it by department and state, and then calculates the sum of salary and bonus within each group, creating a new dataframe df2 with these summary statistics.

Yields below output.

r group by sum multiple columns

You can also use across() with the vector of elements you want to apply to summarise on.


# Group by sum of multiple columns using across()
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),sum),
            .groups = 'drop') %>%
  as.data.frame()
df2

Get Sum on All Columns Except Group by Columns

Finally, let’s see how to apply the aggregate function sum on all columns of the DataFrame except grouping columns. While doing this make sure your dataframe has only numeric columns plus grouping columns. Having non-numeric on summarise returns an error.

This example does the group by on department and state columns, summarises on all columns except grouping columns, and apply the sum function on all summarised columns.


# Sum on all columns except group by columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), sum),
            .groups = 'drop')  %>%
  as.data.frame()
df2
on all columns

Get Group By Sum using R base aggregate()

So far, we have learned examples of groupby sum using the dplyr package. Now in this example, we will learn how to get groupby sum based on single/multiple columns of the data frame using R base aggregate() function.

Let’s pass specified columns one for summing and another for grouping along with FUN parameter into this function, it will perform grouping and summing finally, return the sum of grouped object.


# Group by sum using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=sum)
agg_df

Yields below output.

r aggregate sum

Perform Groupby Sum of Multiple Columns using R Base

You can utilize the base R function aggregate() to perform a groupby sum operation on multiple columns of a data frame. To do this, specify the list of columns for grouping using the by parameter and include another numeric column along with the FUN parameter when calling the function.


# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=sum)
agg_df

Yields below output.

multiple columns

Conclusion

In this article, I have explained the process of performing a group by sum in R, using the group_by() function from the dplyr package and the aggregate() function from the R base. The dplyr functions are especially efficient when dealing with large datasets, making them the superior choice among these options.

References