The tapply() function in R applies a specified function to each subset of a vector, where another vector defines the subsets. It is commonly used to compute summary statistics across levels of a factor. The tapply() function is similar to the apply() function but is specifically designed for grouped data. In this article, I will explain how to use tapply() in various ways with well-defined examples.
Key points:
- The tapply() function in R applies a specified function to each subset of a vector, where another vector defines the subsets.
- It is commonly used to compute summary statistics across levels of a variable.
- The tapply() function is similar to the apply() function but is specifically designed for grouped data.
- It returns an array by default, but can return a list if simplify is set to FALSE.
- It can be used to apply a function to a single variable grouped by another variable.
- Additional arguments such as na.rm can be used to handle missing values in the data.
- It can perform grouping operations based on multiple columns by specifying the grouping variables within a list.
The tapply() Function in R
The tapply()
function accepts inputs to create statistical summaries such as mean, sum, max, min, etc., by a group based on single or multiple columns of a data frame. It is used to apply a function over subsets of a vector within a dataset that can be broken up into groups via categorical variables (factors). The goal is to break the dataset into groups and apply a function to each group.
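The split-then-apply idea described above can be sketched with base R's split() and sapply(). The vectors scores and team are made-up illustration data, not part of the article's data frame.

```r
# Hypothetical example data (assumed for illustration only)
scores <- c(10, 20, 30, 40, 50)
team   <- c("a", "a", "b", "b", "b")

# Step 1: split the values into one vector per group
groups <- split(scores, team)

# Step 2: apply the function to each group and combine the results
by_hand <- sapply(groups, mean)

# tapply() performs both steps in a single call
by_tapply <- tapply(scores, team, mean)

print(by_hand)    # a = 15, b = 40
print(by_tapply)  # a = 15, b = 40
```

Both approaches give the same group means; tapply() is simply the more compact way to write it.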
Syntax of tapply() Function
Following is the syntax of the tapply() function.
# Syntax of tapply()
tapply(X, INDEX, FUN, ..., simplify = TRUE)
Parameters
- X: An atomic object, typically a vector.
- INDEX: A factor or a list of factors, each of the same length as X.
- FUN: The function to be applied. If FUN is NULL, tapply() returns a vector of group indices instead of applying a function.
- ...: Optional arguments passed to FUN.
- simplify: Logical; if FALSE, the result is a list; otherwise, the result is simplified to an array (if possible).
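As a small sketch of two of the parameters above, the following shows an extra argument being passed through ... to FUN, and what happens when FUN is left as NULL. The vectors values and group are made-up illustration data.

```r
# Hypothetical example data (assumed for illustration only)
values <- c(5, 1, 9, 3, 7, 2)
group  <- c("a", "b", "a", "b", "a", "b")

# Arguments after FUN are passed on to FUN: here collapse = "-" goes to paste()
joined <- tapply(values, group, paste, collapse = "-")
print(joined)       # a = "5-9-7", b = "1-3-2"

# With FUN omitted (NULL), tapply() returns the group index of each element
group_index <- tapply(values, group)
print(group_index)  # 1 2 1 2 1 2
```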
Return Value
It returns an array by default. If simplify is set to FALSE, it returns a list.
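The difference between the two return types can be checked directly. This is a minimal sketch on made-up vectors (values and group are assumptions for illustration).

```r
# Hypothetical example data (assumed for illustration only)
values <- c(1, 2, 3, 4)
group  <- c("x", "y", "x", "y")

simplified <- tapply(values, group, sum)                    # simplify = TRUE (default)
as_list    <- tapply(values, group, sum, simplify = FALSE)

print(is.array(simplified))    # TRUE: a numeric array
print(is.numeric(simplified))  # TRUE
print(is.list(as_list))        # TRUE: each group result is a list element
```

Note that the list returned with simplify = FALSE still carries a dim attribute, so is.array() is also TRUE for it.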
Apply tapply() Function to a Single Variable Grouped by Another Variable
Let’s create a data frame and use the tapply() function to calculate a summary statistic for one column, grouped by another column.
# Apply tapply() to single variable grouped by another column
# Create data frame
emp_df <- data.frame(
name = c('John', 'Jane', 'Doe', 'Smith', 'Emily', 'Chris'),
department = c('HR', 'Finance', 'HR', 'IT', 'Finance', 'IT'),
location = c('NY', 'NY', 'SF', 'SF', 'NY', 'SF'),
salary = c(50000, 60000, 55000, 70000, 65000, 75000)
)
print("Given data frame:")
print(emp_df)
# Calculate the total salary for each department
result <- tapply(emp_df$salary, emp_df$department, sum)
print("After applying tapply to single column:")
print(result)
print("Get the type of result:")
print(class(result))
In the above code, tapply() has calculated the total salary for each department by applying the sum function to the salary vector, grouped by the department vector.
This yields a total of 125000 for Finance, 105000 for HR, and 145000 for IT, and the class of the result is "array".
Calculate the Mean of Single Column Using tapply()
In this example, you can use the tapply() function to calculate the mean of one column grouped by another column. To do this, pass the numeric column, the grouping column, and the mean function to tapply(). It applies the function to each group and returns an array.
# Calculate the mean of single column using tapply()
result <- tapply(emp_df$salary, emp_df$department, mean)
print("After applying tapply to data frame:")
print(result)
print("Get the type of result:")
print(class(result))
In this example, tapply() has calculated the average salary for each department by applying the mean function to the salary vector, grouped by the department vector.
This yields a mean of 62500 for Finance, 52500 for HR, and 72500 for IT.
Access Specified Element of the Output
As shown above, this function returns an array. To access an element of the array, use its index within square brackets.
# Get an element of the output using square brackets
result[2]
# Output:
# HR
# 52500
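Because the array carries the group labels as names, elements can also be selected by name rather than by position. A minimal sketch, rebuilding the salary and department vectors from the data frame above so that it runs on its own:

```r
# Same values as the article's data frame, as standalone vectors
salary     <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

result <- tapply(salary, department, mean)

print(names(result))               # "Finance" "HR" "IT"
print(result["HR"])                # named element: HR 52500
print(result[["HR"]])              # bare value: 52500
print(result[c("Finance", "IT")])  # several groups at once
```

Selecting by name is usually safer than selecting by position, since the position depends on how the group labels sort.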
Modify Output Class to List
However, you can change the output class to a list by setting the simplify argument to FALSE.
# Get output as a list by setting simplify = FALSE
result <- tapply(emp_df$salary, emp_df$department, mean, simplify = FALSE)
print("After applying tapply to data frame:")
print(result)
# Output:
# [1] "After applying tapply to data frame:"
# $Finance
# [1] 62500
# $HR
# [1] 52500
# $IT
# [1] 72500
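A list result also appears when the applied function returns more than one value per group, even with the default simplify = TRUE. The sketch below uses range(), which returns the minimum and maximum; the vectors repeat the article's salary data so that the block runs on its own.

```r
# Same values as the article's data frame, as standalone vectors
salary     <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

# range() returns two values per group, so the result cannot be simplified
ranges <- tapply(salary, department, range)

print(is.list(ranges))  # TRUE
print(ranges[["HR"]])   # 50000 55000
```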
Add Additional Arguments: Ignore NA
If your data frame contains NA values, the summary for any group that includes an NA is also NA by default. You can pass additional arguments after the function, such as na.rm, to calculate the summary while ignoring the NA values.
# Add additional arguments: na.rm = TRUE
emp_df[4, 4] <- NA
print("Given data frame:")
emp_df
# Calculate the mean salary for each department
result <- tapply(emp_df$salary, emp_df$department, mean)
print("After applying tapply to data frame:")
print(result)
# Output:
# [1] "Given data frame:"
# name department location salary
# 1 John HR NY 50000
# 2 Jane Finance NY 60000
# 3 Doe HR SF 55000
# 4 Smith IT SF NA
# 5 Emily Finance NY 65000
# 6 Chris IT SF 75000
# [1] "After applying tapply to data frame:"
# Finance HR IT
# 62500 52500 NA
Now pass the additional argument after the FUN argument. The mean function accepts na.rm = TRUE to ignore NA values.
# Calculate the mean salary for each department, ignoring NA values
result <- tapply(emp_df$salary, emp_df$department, mean, na.rm = TRUE)
print("After applying tapply to data frame:")
print(result)
# Output:
# [1] "After applying tapply to data frame:"
# Finance HR IT
# 62500 52500 75000
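Before dropping missing values, it can help to see how many each group contains. The following sketch passes an anonymous function to tapply() for that purpose; the vectors mirror the article's data with the one NA salary, and the names n_missing and mean_salary are chosen here for illustration.

```r
# Article's salary data with Smith's (IT) salary set to NA
salary     <- c(50000, 60000, 55000, NA, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

# Count the missing values in each group with an anonymous function
n_missing <- tapply(salary, department, function(x) sum(is.na(x)))
print(n_missing)    # Finance 0, HR 0, IT 1

# Mean of the remaining values in each group
mean_salary <- tapply(salary, department, mean, na.rm = TRUE)
print(mean_salary)  # Finance 62500, HR 52500, IT 75000
```

If every value in a group were NA, mean() with na.rm = TRUE would return NaN for that group.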
Apply tapply Function to One Variable, Grouped by Multiple Variables
Finally, you can use the tapply() function to group by multiple columns. Pass the grouping variables as a list; tapply() applies the function to each combination of groups and returns an array (a matrix for two grouping variables). Combinations with no rows appear as NA.
# tapply with multiple variables
# Restore the salary that was set to NA in the previous section
emp_df[4, 4] <- 70000
# Calculate the mean salary for multiple grouping variables
result <- tapply(emp_df$salary, list(emp_df$department, emp_df$location), mean)
print("After applying tapply to multiple grouping columns:")
print(result)
print("Get the type of result:")
print(class(result))
# Output:
# [1] "After applying tapply to multiple grouping columns:"
# NY SF
# Finance 62500 NA
# HR 50000 55000
# IT NA 72500
# [1] "Get the type of result:"
# [1] "matrix" "array"
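With two grouping variables the result is a matrix, so individual cells can be read by row and column name. The sketch below also reshapes the matrix into a long data frame with as.table() and as.data.frame(), one possible way to drop the empty combinations. The vectors repeat the article's original data, and the column names given to long are chosen here for illustration.

```r
# Same values as the article's original data frame, as standalone vectors
salary     <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")
location   <- c("NY", "NY", "SF", "SF", "NY", "SF")

result <- tapply(salary, list(department, location), mean)

# Read a single cell by row (department) and column (location) name
print(result["HR", "SF"])  # 55000

# Reshape to one row per department/location pair and drop empty combinations
long <- as.data.frame(as.table(result))
names(long) <- c("department", "location", "mean_salary")
long <- long[!is.na(long$mean_salary), ]
print(long)
```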
Conclusion
In this article, I have explained the tapply() function, including its syntax and parameters, and shown how to use it to apply statistical functions to a column of a data frame grouped by one or more other columns.
Happy learning!!