To perform a group-by operation that counts occurrences in R, you can use either the aggregate() function from base R or a combination of group_by() and summarise() from the dplyr package. Both approaches group the rows of a data frame by a specific column and then count the number of rows in each group.
First, I will cover the group_by() function from the dplyr package, which is usually the more efficient of the two approaches. Then, I will demonstrate the aggregate() function from base R.
Quick Examples
Here are some simple examples demonstrating how to perform a group-by-count.
# Load dplyr
library(dplyr)
# Group by count using dplyr
agg_tbl <- df %>%
  group_by(department) %>%
  summarise(total_count = n(), .groups = 'drop')
# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
# Group by count of multiple columns
df2 <- df %>%
  group_by(department, state) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  as.data.frame()
# Group by count using R Base aggregate()
agg_df <- aggregate(df$state, by=list(df$department), FUN=length)
# R Base aggregate() on multiple columns
agg_df <- aggregate(df$state, by=list(df$department,df$state), FUN=length)
Let’s build a data frame by loading a CSV file.
# Read CSV file into a data frame
df <- read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df
Printing df displays the loaded data frame. The examples that follow rely on its department and state columns.
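The CSV file itself is not included here, so to run the examples without it you can build a small stand-in by hand. This is only a sketch: the rows and the employee column are hypothetical, and only the department and state column names are taken from the code in this article.

```r
# Hypothetical stand-in for emp.csv; the real file's contents are not shown here.
df <- data.frame(
  employee   = c("A", "B", "C", "D", "E", "F"),   # made-up identifiers
  department = c("Sales", "Sales", "Finance", "Finance", "Finance", "IT"),
  state      = c("NY", "CA", "NY", "NY", "CA", "CA"),
  stringsAsFactors = FALSE
)

# Inspect the column types and the first few values
str(df)
```

Any data frame with department and state columns should work with the rest of the examples.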
Grouping and Counting in R with dplyr
To perform group-by operations on R data frames, you can use group_by() from the dplyr package, followed by summarise() to get counts for each group. The group_by() function returns a grouped data frame, and applying summarise() with n() to it computes the count for each group.
Before using these functions, make sure to install dplyr with install.packages('dplyr') and load it using library(dplyr). In the examples, I will use the pipe operator %>% (provided by magrittr and re-exported by dplyr) to chain functions, so that the output of group_by() is passed as the input to summarise().
# Load dplyr
library(dplyr)
# Group by count using dplyr
agg_tbl <- df %>%
  group_by(department) %>%
  summarise(total_count = n(), .groups = 'drop')
agg_tbl
# Convert tibble to df
df2 <- agg_tbl %>% as.data.frame()
df2
The code snippet above groups the data by the department column and counts the number of entries for each department.
Keep in mind that group_by() and summarise() return a tibble. If you prefer a plain data frame, you can convert the tibble using as.data.frame().
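As an aside that the article does not cover, dplyr also has a count() function that performs the grouping and counting in one step. The sketch below uses a hypothetical stand-in data frame (the values are made up) and checks that count() gives the same numbers as the group_by() and summarise() pipeline.

```r
library(dplyr)

# Hypothetical stand-in data; only the column names come from the article.
df <- data.frame(
  department = c("Sales", "Sales", "Finance", "Finance", "Finance", "IT"),
  state      = c("NY", "CA", "NY", "NY", "CA", "CA"),
  stringsAsFactors = FALSE
)

# count() groups by the given column, counts the rows, and returns an ungrouped result
cnt <- df %>% count(department, name = "total_count")

# The pipeline used in this article, for comparison
ref <- df %>%
  group_by(department) %>%
  summarise(total_count = n(), .groups = "drop")

# TRUE when both approaches produce the same counts
identical(cnt$total_count, ref$total_count)
```

The explicit pipeline is still the better fit when you want to compute other summaries alongside the count.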
Counting Rows Based on Grouped Columns in R
This example groups the data by the department and state columns and then finds the count of occurrences for each unique department and state combination.
# Group by count of multiple columns
df2 <- df %>%
  group_by(department, state) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  as.data.frame()
df2
The result has one row for each department and state combination, with the count in the total_count column.
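The examples pass .groups = 'drop' to summarise() without saying what it does. The sketch below, on hypothetical stand-in data, contrasts it with .groups = "drop_last": the latter leaves the result grouped by department, while 'drop' returns an ungrouped result.

```r
library(dplyr)

# Hypothetical stand-in data; only the column names come from the article.
df <- data.frame(
  department = c("Sales", "Sales", "Finance", "Finance", "Finance", "IT"),
  state      = c("NY", "CA", "NY", "NY", "CA", "CA"),
  stringsAsFactors = FALSE
)

# Drop only the last grouping level: the result stays grouped by department
kept <- df %>%
  group_by(department, state) %>%
  summarise(total_count = n(), .groups = "drop_last")

# Drop all grouping: the result is an ordinary, ungrouped tibble
dropped <- df %>%
  group_by(department, state) %>%
  summarise(total_count = n(), .groups = "drop")

group_vars(kept)     # grouping columns that remain on the first result
group_vars(dropped)  # no grouping columns remain on the second result
```

Dropping the groups avoids surprises when later steps, such as mutate(), would otherwise operate per department.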
Grouping and Counting using R base aggregate()
Base R provides the aggregate() function (in the stats package, which is loaded by default) to group data in a data frame. You can apply it to a data frame to group the data by a specific column and calculate the count for each unique value of that column. Let's use it to group by the department column and get the count for each department.
# Group by count using R Base aggregate()
agg_df <- aggregate(df$state, by=list(df$department), FUN=length)
agg_df
The result has one row per department, with the count in the column named x.
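Called this way, aggregate() labels its output columns Group.1 and x, which is not very descriptive. The sketch below shows two ways to get readable names, using hypothetical stand-in data: renaming afterwards, or passing named lists. The name total_count is borrowed from the dplyr examples.

```r
# Hypothetical stand-in data; only the column names come from the article.
df <- data.frame(
  department = c("Sales", "Sales", "Finance", "Finance", "Finance", "IT"),
  state      = c("NY", "CA", "NY", "NY", "CA", "CA"),
  stringsAsFactors = FALSE
)

# Option 1: aggregate as in the article, then rename the default columns
agg_df <- aggregate(df$state, by = list(df$department), FUN = length)
names(agg_df) <- c("department", "total_count")

# Option 2: pass named lists so the names carry through to the result
agg_named <- aggregate(
  list(total_count = df$state),
  by  = list(department = df$department),
  FUN = length
)

agg_df
agg_named
```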
Applying aggregate() to Multiple Columns
You can also use the aggregate() function to group the data by multiple columns, applying the length() function to the grouped data to get the count for each unique combination of those columns.
# R Base aggregate() on multiple columns
agg_df <- aggregate(df$state, by=list(df$department,df$state), FUN=length)
agg_df
The above code groups rows by the department and state columns, then uses the length() function to count the number of occurrences for each unique combination of department and state.
The result has one row per department and state combination, with the count in the column named x.
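As a sanity check, you can confirm that the base R and dplyr approaches agree. The sketch below runs both on hypothetical stand-in data, passes named lists to aggregate() for readable column names, and sorts the base R result before comparing because the row order of the two results can differ.

```r
library(dplyr)

# Hypothetical stand-in data; only the column names come from the article.
df <- data.frame(
  department = c("Sales", "Sales", "Finance", "Finance", "Finance", "IT"),
  state      = c("NY", "CA", "NY", "NY", "CA", "CA"),
  stringsAsFactors = FALSE
)

# dplyr: count rows per department/state combination
dplyr_counts <- df %>%
  group_by(department, state) %>%
  summarise(total_count = n(), .groups = "drop") %>%
  as.data.frame()

# Base R: the same count via aggregate() with named lists
base_counts <- aggregate(
  list(total_count = df$state),
  by  = list(department = df$department, state = df$state),
  FUN = length
)

# Sort by the grouping columns so the rows line up with the dplyr result
base_counts <- base_counts[order(base_counts$department, base_counts$state), ]
rownames(base_counts) <- NULL

# TRUE when both approaches produce the same counts
identical(dplyr_counts$total_count, base_counts$total_count)
```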
Conclusion
In this article, I have discussed how to perform a group-by count in R using the group_by() function from the dplyr package and the base R aggregate() function. When working with larger datasets, the dplyr functions are generally more efficient.