R Group by Mean With Examples

To group by mean in R, you can use either the aggregate() function from base R or the group_by() and summarise() functions from the dplyr package. These methods allow you to group data in a data frame by a specific column and then compute the mean for each group in another column. The mean is determined by dividing the sum of all values in a column by the total number of values. It is also known as the average.
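As a quick check of that definition, the following minimal sketch computes the mean by hand and compares it with the built-in mean() function. The salary values are invented for illustration and do not come from the article's data.

```r
# Hypothetical salaries, invented for illustration
salary <- c(9000, 8000, 7000, 5000, 6000)

# Mean by definition: sum of the values divided by how many there are
manual_mean <- sum(salary) / length(salary)

# The built-in mean() computes the same quantity
builtin_mean <- mean(salary)

manual_mean
builtin_mean
```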


The group_by() function from the dplyr package is a concise and generally efficient way to group data, so I will explain it first. Then, I will move on to the aggregate() function from base R to demonstrate how to compute the mean by group on both single and multiple columns.

1. Quick Examples

The following are quick examples of how to compute the mean (average) by group.


# Load dplyr
library(dplyr)

# Group by mean using dplyr
agg_tbl <- df %>% group_by(department) %>% 
  summarise(mean_salary=mean(salary),
            .groups = 'drop')

# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()

# Group by mean of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(mean_salary=mean(salary),
            mean_bonus= mean(bonus),
            .groups = 'drop') %>%
  as.data.frame()

# Group by mean of multiple columns using across()
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),mean),
            .groups = 'drop') %>%
  as.data.frame()

# Mean on all non-grouping columns
num_df <- df[, c("department", "state", "age", "salary", "bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), mean),
            .groups = 'drop')  %>%
  as.data.frame()

# Group by mean using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=mean)

# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=mean)

Let’s create a data frame by reading a CSV file.


# Read CSV file into a data frame
df <- read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

This yields the output below.

[Image: r groupby mean]
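The emp.csv file is not included with this article. If you do not have it, you can build a small stand-in data frame to run the examples. The column names below are the ones the examples use; the rows and values are invented for illustration and are not the contents of the original file.

```r
# Hypothetical stand-in for emp.csv: column names match the examples,
# but every value here is made up for illustration.
df <- data.frame(
  department = c("Finance", "Finance", "Sales", "Sales", "Sales"),
  state      = c("NY", "CA", "NY", "NY", "CA"),
  age        = c(30, 45, 28, 36, 50),
  salary     = c(9000, 8000, 7000, 5000, 6000),
  bonus      = c(900, 800, 700, 500, 600)
)

# Inspect the structure before grouping
str(df)
```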

2. Perform Group By Mean on a Single Column in R

To calculate the mean by group in an R data frame, use the group_by() function together with summarise() from the dplyr package. The group_by() function creates a grouped data frame based on one or more columns. You can then apply summarise() to the grouped data to calculate the mean for each group. The mean is the average of a sample or data set: the sum of the observations in a column divided by the number of observations.


Before using these functions, install the dplyr package with install.packages('dplyr') (see https://sparkbyexamples.com/r-programming/install-and-update-r-packages/), then load it into your R environment using library(dplyr). In all the examples, I use the pipe operator %>%, which dplyr re-exports from the magrittr package, to pass the result of group_by() to summarise().


# Load dplyr
library(dplyr)

# Group by mean using dplyr
agg_tbl <- df %>% group_by(department) %>% 
  summarise(mean_salary=mean(salary),
            .groups = 'drop')
agg_tbl

# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
df2

This yields the output below. The code groups the data by the department column using group_by(), then calculates the average salary for each department using summarise().

Keep in mind that group_by() and summarise() return a tibble. If you need a plain data frame, convert the tibble with as.data.frame().

[Image: r groupby mean]
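To see the difference between the tibble and the converted data frame, you can inspect the class of each object. This is a minimal sketch that assumes dplyr is installed; the data frame is a hypothetical stand-in with invented values, not the article's emp.csv.

```r
library(dplyr)

# Hypothetical data, invented for illustration
df <- data.frame(
  department = c("Finance", "Finance", "Sales", "Sales", "Sales"),
  salary     = c(9000, 8000, 7000, 5000, 6000)
)

# summarise() returns a tibble
agg_tbl <- df %>%
  group_by(department) %>%
  summarise(mean_salary = mean(salary), .groups = "drop")
class(agg_tbl)

# as.data.frame() drops the tibble classes
df2 <- as.data.frame(agg_tbl)
class(df2)
```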

3. Perform Group By Mean on Multiple Columns in R

You can also group by multiple columns using group_by() and summarise(). Pass several columns to group_by() and it returns an object grouped by those columns. Then apply summarise() (summarize() is an alias) to the grouped data to get the mean for every unique combination of the grouping columns.


# Group by mean of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(mean_salary=mean(salary),
            mean_bonus= mean(bonus),
            .groups = 'drop') %>%
  as.data.frame()
df2

This yields the output below.

[Image: r group by mean multiple columns]

You can also use across() inside summarise() to apply the same function to a set of columns. In this form the result columns keep their original names, salary and bonus.


# Group by mean of multiple columns using across()
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),mean),
            .groups = 'drop') %>%
  as.data.frame()
df2
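If a column contains missing values, mean() returns NA for any group that includes one. The sketch below shows one way to handle that with across(): it passes a function that sets na.rm = TRUE and uses the .names argument to prefix the result columns. It assumes dplyr 1.0.0 or later, and the data, including the NA, is invented for illustration.

```r
library(dplyr)

# Hypothetical data with one missing bonus, invented for illustration
df <- data.frame(
  department = c("Finance", "Finance", "Sales", "Sales", "Sales"),
  state      = c("NY", "CA", "NY", "NY", "CA"),
  salary     = c(9000, 8000, 7000, 5000, 6000),
  bonus      = c(900, 800, 700, NA, 600)
)

# Ignore NA values and name the outputs mean_salary and mean_bonus
df2 <- df %>%
  group_by(department, state) %>%
  summarise(across(c(salary, bonus),
                   function(x) mean(x, na.rm = TRUE),
                   .names = "mean_{.col}"),
            .groups = "drop") %>%
  as.data.frame()
df2
```

Note that a group whose values are all NA would produce NaN with na.rm = TRUE, so it is worth checking for that case in real data.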

4. Perform Mean on Non Grouping Columns


Let’s explore how to use group_by() and summarise() to get the mean of every column in a data frame except the grouping columns. Make sure the data frame contains only numeric columns and grouping columns. For a non-numeric column, mean() returns NA with a warning instead of a useful result.


# Mean on all non-grouping columns
num_df <- df[, c("department", "state", "age", "salary", "bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), mean),
            .groups = 'drop')  %>%
  as.data.frame()
df2

In the above code, the data frame is grouped by the department and state columns, and across(everything(), mean) applies mean() to every remaining column. The grouping columns are excluded automatically.

This yields the output below.

[Image: r group by mean all columns]
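Instead of subsetting the data frame by hand, you could let across() pick the numeric columns with where(is.numeric). This is a sketch assuming dplyr 1.0.0 or later; the name column and all values are hypothetical and only serve to show that non-numeric, non-grouping columns are left out of the result.

```r
library(dplyr)

# Hypothetical data with a non-numeric "name" column, invented for illustration
df <- data.frame(
  name       = c("Ann", "Bob", "Cid", "Dee", "Eve"),
  department = c("Finance", "Finance", "Sales", "Sales", "Sales"),
  state      = c("NY", "CA", "NY", "NY", "CA"),
  age        = c(30, 45, 28, 36, 50),
  salary     = c(9000, 8000, 7000, 5000, 6000),
  bonus      = c(900, 800, 700, 500, 600)
)

# Average only the numeric columns; "name" is dropped from the result
df2 <- df %>%
  group_by(department, state) %>%
  summarise(across(where(is.numeric), mean), .groups = "drop") %>%
  as.data.frame()
df2
```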

5. Group By Mean using R base aggregate()

So far, we have learned how to get the mean/average of grouped data using the dplyr package functions. Now we will see how to calculate the mean of grouped data using the R base aggregate() function. This function allows you to group a data frame by specific columns and calculate the mean of those specific columns.


# Group by mean using R Base aggregate()
agg_df <- aggregate(df$salary, by=list(df$department), FUN=mean)
agg_df

This yields the output below.

[Image: r group by mean]
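When called with by = list(...) on unnamed vectors, aggregate() labels the result columns Group.1 and x. One alternative is the formula interface, which keeps the original column names. The sketch below uses hypothetical data with invented values. Note that the formula interface drops rows containing NA by default.

```r
# Hypothetical data, invented for illustration
df <- data.frame(
  department = c("Finance", "Finance", "Sales", "Sales", "Sales"),
  salary     = c(9000, 8000, 7000, 5000, 6000)
)

# Formula interface: mean of salary for each department
agg_df <- aggregate(salary ~ department, data = df, FUN = mean)
agg_df

# The list interface also accepts names for the grouping columns
agg_named <- aggregate(df$salary, by = list(department = df$department), FUN = mean)
agg_named
```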

6. R Base aggregate() on Multiple Columns

You can also pass several grouping columns to aggregate() through the by list. The example below groups the data by department and state and calculates the mean salary for each combination.


# R Base aggregate() on multiple columns
agg_df <- aggregate(df$salary, by=list(df$department,df$state), FUN=mean)
agg_df

This yields the output below.

[Image: r aggregate mean]
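To average more than one value column in a single call, you could combine them with cbind() on the left side of a formula. This is a minimal sketch on hypothetical data with invented values.

```r
# Hypothetical data, invented for illustration
df <- data.frame(
  department = c("Finance", "Finance", "Sales", "Sales", "Sales"),
  state      = c("NY", "CA", "NY", "NY", "CA"),
  salary     = c(9000, 8000, 7000, 5000, 6000),
  bonus      = c(900, 800, 700, 500, 600)
)

# Mean of salary and bonus for each department/state combination
agg_df <- aggregate(cbind(salary, bonus) ~ department + state,
                    data = df, FUN = mean)
agg_df
```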

Conclusion

In this article, I have explained how to calculate the mean by group for single or multiple columns in an R data frame, using group_by() and summarise() from the dplyr package and the aggregate() function from base R. On larger datasets, the dplyr approach is often faster than aggregate(), though the difference depends on the data and is worth measuring on your own workload.
