You are currently viewing R group_by() Function from Dplyr

The group_by() function in R is from dplyr package that is used to group rows by column values in the DataFrame, It is similar to GROUP BY clause in SQL. R dplyr groupby is used to collect identical data into groups on DataFrame and perform aggregate functions on the grouped data.

In general Group by operation involves splitting the data, applying some functions, and finally aggregating the results.

So it is a two-step process, first, you need to perform grouping and then run the aggregate for each group. In this article, I will explain group_by() function syntax and usage on DataFrame with R programming examples.

1.Syntax of group_by() Function

Following is the syntax of the dplyr group_by() function.


# Syntax group by
group_by(.data, ..., add = FALSE)

Parameters

  • .data – tbl
  • ... – Variables or Columns to group by.
  • add – Defaults to FALSE

Following are the types of tbl (tables) it supports.

  • data.frame
  • data.table
  • SQLite
  • PostgreSQL
  • MySQL

First, let’s create a DataFrame by reading from a CSV file.


# Read CSV file into DataFrame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

Yields below output.

r dplyr group by

2. Dplyr group_by Function Example

Use group_by() function in R to group the rows in DataFrame by columns, to use this function, you have to install dplyr first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 

The group_by() function doesn’t change the dataframe data how it looks and it just returns the grouped tbl (tibble table) where we can perform summarise on. Let’s perform the group by on the column department and summarize to get the sum of salary for each group.


# Load dplyr
library(dplyr)

# group_by() on department
grp_tbl <- df %>% group_by(department)
grp_tbl

# summarise on groupped data.
agg_tbl <- grp_tbl %>% summarise(sum(salary))
agg_tbl

r dplyr group by

Note that the output of group_by() and summarise() is tibble hence, to convert it to data.frame use as.data.frame() function.


# Convert tibble to DataFrame
df2 <- agg_tbl %>% as.data.frame()

3. Assign Name to the Summarize Column

If you notice above the second output, the summarise column name has sum(salary) which is not user-friendly, let’s see how to add a custom user-friendly name to it. Also, I will rewrite the above 2 statements into a single statement using dplyr piping.


# Assign column Name to the aggregated column
# Group by on multiple columns
agg_tbl <- df %>% group_by(department) %>% 
               summarise(total_salary=sum(salary))
agg_tbl

Yields below output.

4. Using dplyr group_by() on Multiple Columns

The group_by() and summarise() also support group by on multiple columns and summarise on multiple columns.


# Group by on multiple columns
# & multiple aggregations
agg_tbl <- df %>% group_by(department, state) %>%
            summarise(total_salary=sum(salary),
                      total_bonus = sum(bonus),
                      min_salary = min(salary),
                      max_salary = max(salary),
                     .groups = 'drop'
                    )
agg_tbl

Yields below output.

r dataframe group by

5. Apply List of Summarise Functions

This example does the group by on department and state columns, summarises on salary and bonus columns, and apply the sum & mean functions on each summarised column.


# Apply multiple summaries
df2<- df[,c("department","state","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>% 
  summarise(across(c(salary, bonus), list(mean = mean, sum = sum)))

Yields below output.

6. Summarise All Columns Except Grouping Columns

This example does the group by on department and state columns, summarises on all columns except grouping columns, and apply the sum & mean functions on all summarised columns.


# Summarise all columns except grouping columns
df2<- df[,c("department","state","age","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))
agg_tbl

Yields below output.

r group by dataframe

7. Complete Example

Following is a complete example of an R group by function.


# Create Data Frame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

# Load dplyr
library(dplyr)

# group_by() on department
grp_tbl <- df %>% group_by(department)
grp_tbl

# summarise on groupped data.
agg_tbl <- grp_tbl %>% summarise(sum(salary))
agg_tbl


# Assign column Name to the aggregated column
agg_tbl <- df %>% group_by(department) %>% 
               summarise(total_salary=sum(salary))
agg_tbl

# Group by on multiple columns
# & multiple aggregations
agg_tbl <- df %>% group_by(department, state) %>%
            summarise(total_salary=sum(salary),
                      total_bonus = sum(bonus),
                      min_salary = min(salary),
                      max_salary = max(salary),
                     .groups = 'drop'
                    )
agg_tbl

# Apply multiple summaries
df2<- df[,c("department","state","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>% 
  summarise(across(c(salary, bonus), list(mean = mean, sum = sum)))

# Summarise all columns except grouping columns
df2<- df[,c("department","state","age","salary","bonus")]
agg_tbl <- df2 %>% group_by(department, state) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))
agg_tbl

8. Conclusion

In this article, you have learned the syntax of group_by() function in R from the dplyr package and how to use this to group the rows in DataFrame and apply the summarise.

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium