R Summarise on Group By in Dplyr

How to use summarise on group by DataFrame in R? The summarise() or summarize() functions performs the aggregations on grouped data, so in order to use these functions first, you need to use group_by() to get grouped dataframe. All these functions are from dplyr package.

Key Points –

  • summarise() is used to get aggregation results on specified columns for each group.
  • For empty grouping columns/variables, it returns a single row summarising all rows/observations in the input.
  • Both summarise() & summarize() functions works exactly same as they are synonyms.
  • These function returns tibble and you need to use as.data.frame() tp convert to DataFrame.

1. Syntax of Summarise()

Following is the syntax of summarise() or summarize() functions.


# Syntax of summarise & summarize functions
summarise(.data, ..., .groups = NULL)
summarize(.data, ..., .groups = NULL)

Arguments of summarise() function.

  • .data – tibble or dataframe
  • ... – columns/variables to perform aggregations on along with aggregation/summarise functions.
  • .groups

Let’s create a DataFrame by reading a CSV file. I will use this dataframe to group on certain columns and summarize on numeric columns like salary, age, and bonus.


# Read CSV file into DataFrame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

Yields below output.

r group by summarise

2. Group By Summarise R Example

The summarise() or summarize() function takes the grouped dataframe/table as input and performs the summarize functions. To get the dropped dataframe use group_by() function.

To use group_by() and summarize() functions, you have to install dplyr first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 


# Load dplyr
library(dplyr)

# Group by mean using dplyr
agg_tbl <- df %>% group_by(department) %>% 
  summarise(mean_age=mean(age),
            .groups = 'drop')
agg_tbl

Note that the group_by() takes DataFrame as input and summarise() function takes the tibble/dataframe as input and returns the tibble table, so to convert the tibble to dataframe use as.data.frame(), let’s rewrite the above statement using this function.


# Group by mean using dplyr
df2 <- df %>% group_by(department) %>% 
  summarise(mean_age=mean(age),.groups = 'drop') %>%
  as.data.frame()
df2 

Yields below output

3. Group By Summarise() Functions in R

There are several aggregation functions you can use with summarise(). All these functions are used to calculate aggregations on grouped data.

Summarize GroupSummarise FunctionDescription
Countn()Get the count of values
n_distinct()Get the count of distinct values
Aggsum()Computes sum
mean()Generic function for the (trimmed) arithmetic mean.
median()Computes the sample median.
Rangemin()Computes minimum of input
max()Computes maximum of input
quantile()Produces sample quantiles
Positionfirst()Get the first value
last()Get the last value
nth()Get the nth value
Spreadsd()Computes the standard deviation
iqr()Computes interquartile range
madCompute the median absolute deviation
Logicalany()any
all()all
Group By Summarise Functions

In the rest of the article, I will explain different examples of using summarise() on a group by data and then will cover examples for each above functions.

5. Summarise on Multiple Columns in R

You can also call summarise on multiple columns at a time and also apply either same or different summarise function for each column. The below example perform group on department and state columns (multiple columns) and get the mean of salary and bonus for each department & state combination.


# Group by mean of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(mean_salary=mean(salary),
            mean_bonus= mean(bonus),
            .groups = 'drop') %>%
  as.data.frame()
df2

Yields below output.

r summarise multiple columns

You can also use across() with the vector of elements you wanted to apply summarise on.


# Group by mean of multiple columns
df2 <- df %>% group_by(department,state) %>% 
  summarise(across(c(salary, bonus),mean),
            .groups = 'drop') %>%
  as.data.frame()
df2

6. Summarise All Columns Except Group by Columns

Let’s see how to apply the groupby and aggregate function mean on all columns of the DataFrame except grouping columns. While doing this make sure your dataframe has only numeric columns plus grouping columns. Having non-numeric on summarise returns an error.

This example does the group by on department and state columns, summarises on all columns except grouping columns, and apply the mean function on all summarised columns.


# Mean on all columns
num_df<- df[,c("department","state","age","salary","bonus")]
df2 <- num_df %>% group_by(department, state) %>% 
  summarise(across(everything(), mean),
            .groups = 'drop')  %>%
  as.data.frame()
df2

Yields below output.

r group by mean all columns

Conclusion

In this article, I have explained how to group by mean or average in R by using group_by() function from the dplyr package and aggregate() function from the R base. Between these two, dplyr functions perform efficiently when you are dealing with larger datasets.

Related Articles

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing R Summarise on Group By in Dplyr