Explain tapply() Function with examples

  • Post category: R Programming
  • Post last modified: August 16, 2024

The tapply() function in R applies a specified function to each subset of a vector, where another vector defines the subsets. It is commonly used to compute summary statistics across levels of a factor. The tapply() function is similar to the apply() function but is specifically designed for grouped data. In this article, I will explain how to use tapply() in various ways with well-defined examples.

Key points

  • The tapply() function in R applies a specified function to each subset of a vector, where another vector defines the subsets.
  • It is commonly used to compute summary statistics across levels of a variable.
  • The tapply() function is similar to the apply() function but is specifically designed for grouped data.
  • It returns an array by default, but can return a list if simplify is set to FALSE.
  • It can be used to apply a function to a single variable grouped by another variable.
  • Additional arguments such as na.rm can be used to handle missing values in the data.
  • It can perform grouping operations based on multiple columns by specifying the grouping variables within a list.

The tapply() Function in R

The tapply() function computes statistical summaries such as mean, sum, max, and min by group, where the groups come from one or more columns of a data frame. It applies a function over subsets of a vector, and those subsets are defined by categorical variables (factors). In other words, it breaks the vector into groups and applies the function to each group.
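
The split-then-apply idea described above can be sketched by hand with split() and sapply(). This is only an illustration on made-up vectors (scores and team are hypothetical names, not part of the article's data), and it is not meant to show how tapply() is implemented internally.

```r
# Hypothetical toy data, not the article's data frame
scores <- c(10, 20, 30, 40, 50)
team   <- c("a", "b", "a", "b", "a")

# Step 1: break the vector into groups defined by 'team'
groups <- split(scores, team)

# Step 2: apply the function to each group
by_hand <- sapply(groups, sum)

# tapply() performs both steps in a single call
with_tapply <- tapply(scores, team, sum)

print(by_hand)
print(with_tapply)
```

Both approaches give the same totals per group; tapply() simply saves the intermediate step.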

Syntax of tapply() Function

Following is the syntax of the tapply() function.


# Syntax of tapply()
tapply(X, INDEX, FUN, ..., simplify = TRUE)

Parameters

  • X: An atomic object, typically a vector.
  • INDEX: A factor or a list of factors, each of the same length as X.
  • FUN: The function to be applied. If FUN is NULL, tapply() returns a vector of group indices that can be used to subscript the array tapply() normally produces.
  • ...: Optional arguments passed to FUN.
  • simplify: Logical; if FALSE, the result is a list, otherwise, the result is simplified to an array (if possible).
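
As a small illustration of the FUN parameter described above, the sketch below leaves FUN as NULL. The vectors x and g are hypothetical and exist only for this example.

```r
# Hypothetical toy data for illustration only
x <- c(5, 8, 2, 9)
g <- c("low", "high", "low", "high")

# With FUN = NULL, tapply() does not summarise anything.
# It returns, for each element of x, the index of the group it belongs to.
# Groups are ordered by factor level, so "high" is group 1 and "low" is group 2.
idx <- tapply(x, g, FUN = NULL)
print(idx)
```

This form is rarely needed in everyday use, but it helps explain what the INDEX argument does: it assigns every element of X to exactly one group.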

Return Value

It returns an array by default. If simplify is set to FALSE, it returns a list.

Apply tapply() Function to a Single Variable Grouped by Another Variable

Let’s create a data frame and use the tapply() function to calculate a statistical summary of one column, grouped by the values of another column.


# Apply tapply() to single variable grouped by another column
# Create data frame
emp_df <- data.frame(
name = c('John', 'Jane', 'Doe', 'Smith', 'Emily', 'Chris'),
department = c('HR', 'Finance', 'HR', 'IT', 'Finance', 'IT'),
location = c('NY', 'NY', 'SF', 'SF', 'NY', 'SF'),
salary = c(50000, 60000, 55000, 70000, 65000, 75000)
)
print("Given data frame:")
print(emp_df)

# Calculate the total salary for each department
result <- tapply(emp_df$salary, emp_df$department, sum)
print("After applying tapply to single column:")
print(result)
print("Get the type of result:")
print(class(result))

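In the above code, tapply() calculates the total salary for each department by applying the sum function to the salary vector, grouped by the department vector. The result is 125000 for Finance, 105000 for HR, and 145000 for IT, with the groups listed in alphabetical order, and class(result) reports "array".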

Calculate the Mean of Single Column Using tapply()

You can also use the tapply() function to calculate the mean of one column grouped by another column. To do this, pass the numeric column, the grouping column, and the mean function to tapply(). It applies the function to each group and returns an array.


# Calculate the mean of single column using tapply()
result <- tapply(emp_df$salary, emp_df$department, mean)
print("After applying tapply to data frame:")
print(result)
print("Get the type of result:")
print(class(result))

In this example, tapply() calculates the average salary for each department by applying the mean function to the salary vector, grouped by the department vector. The result is 62500 for Finance, 52500 for HR, and 72500 for IT, and class(result) again reports "array".

Access Specified Element of the Output

As shown above, this function returns an array. To access a single element, put its position or its group name within square brackets.


# Get a single element of the output using square brackets
result[2]

# Output:
#    HR 
# 52500 
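
Besides position, elements can also be looked up by group name, which tends to be safer when the set of groups may change. The sketch below rebuilds the salary and department columns from the article as plain vectors so it can run on its own.

```r
# Same values as the article's salary and department columns
salary     <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

result <- tapply(salary, department, mean)

# Group labels, in the order used by the result
print(names(result))

# Single brackets keep the label; double brackets return the bare value
print(result["IT"])
print(result[["IT"]])

# Several groups at once
print(result[c("HR", "IT")])
```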

Modify Output Class to List

You can change the output to a list by setting the simplify argument to FALSE.


# Get output as a list by setting simplify = FALSE
result <- tapply(emp_df$salary, emp_df$department, mean, simplify = FALSE)
print("After applying tapply to data frame:")
print(result)

# Output:
# [1] "After applying tapply to data frame:"
# $Finance
# [1] 62500
# $HR
# [1] 52500
# $IT
# [1] 72500
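
A list result also appears without setting simplify when the function returns more than one value per group, because such results cannot be simplified to a plain array. The following sketch uses range() as an example of such a function; the vectors repeat the article's salary and department values.

```r
# Same values as the article's salary and department columns
salary     <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

# range() returns two numbers (min and max) per group,
# so tapply() keeps the result as a list
ranges <- tapply(salary, department, range)

print(is.list(ranges))

# Extract the min and max salary for one department
it_range <- ranges[["IT"]]
print(it_range)
```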

Add Additional Arguments: Ignore NA

If your data contains NA values, a function such as mean returns NA for every group that includes one. To avoid this, you can pass additional arguments after the function, such as na.rm = TRUE, to calculate the mean while ignoring the NA values.


# Set one salary to NA to see how missing values affect the result
emp_df[4, 4] <- NA
print("Given data frame:")
emp_df

# Calculate the mean salary for each department (without na.rm)
result <- tapply(emp_df$salary, emp_df$department, mean)
print("After applying tapply to data frame:")
print(result)

# Output:
# [1] "Given data frame:"
#    name department location salary
# 1  John         HR       NY  50000
# 2  Jane    Finance       NY  60000
# 3   Doe         HR       SF  55000
# 4 Smith         IT       SF     NA
# 5 Emily    Finance       NY  65000
# 6 Chris         IT       SF  75000

# [1] "After applying tapply to data frame:"
# Finance      HR      IT 
#   62500   52500      NA
 

Now pass the additional argument after the FUN argument. The mean function accepts na.rm = TRUE, which tells it to ignore NA values.


# Calculate the mean salary for each department, ignoring NA values
result <- tapply(emp_df$salary, emp_df$department, mean, na.rm = TRUE)
print("After applying tapply to data frame:")
print(result)

# Output:
# [1] "After applying tapply to data frame:"

# Finance      HR      IT 
#   62500   52500   75000
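
Extra arguments work with any function, and an anonymous function can be used when no built-in one fits. The sketch below counts the non-missing salaries per department and computes a median with na.rm = TRUE. The vectors repeat the article's data with the fourth salary set to NA, as in the example above.

```r
# Article's salary column with the fourth value missing
salary     <- c(50000, 60000, 55000, NA, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

# Anonymous function: number of non-missing salaries in each department
n_obs <- tapply(salary, department, function(x) sum(!is.na(x)))
print(n_obs)

# na.rm = TRUE is forwarded to median() in the same way as to mean()
med <- tapply(salary, department, median, na.rm = TRUE)
print(med)
```

Checking the count alongside the summary is a simple way to see how many values each statistic is actually based on.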

Apply tapply() Function to One Variable, Grouped by Multiple Variables

Finally, you can use the tapply() function to group by multiple columns. Put the grouping variables in a list and pass it as the INDEX argument. With two grouping variables, the function returns a matrix with one row per level of the first variable and one column per level of the second. Combinations that have no data are filled with NA.


# tapply with multiple grouping variables
# Restore the salary that was set to NA in the previous section
emp_df[4, 4] <- 70000
# Calculate the mean salary for each department and location
result <- tapply(emp_df$salary, list(emp_df$department, emp_df$location), mean)
print("After applying tapply to multiple grouping columns:")
print(result)
print("Get the type of result:")
print(class(result))

# Output:
# [1] "After applying tapply to multiple grouping columns:"

#            NY    SF
# Finance 62500    NA
# HR      50000 55000
# IT         NA 72500

# [1] "Get the type of result:"
# [1] "matrix" "array"
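
Individual cells of the two-way result can be read by row and column name. The sketch below also uses the default argument, which in R 3.4.0 and later sets the value used for empty group combinations instead of NA. The vectors repeat the article's salary, department, and location columns.

```r
# Same values as the article's columns
salary     <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")
location   <- c("NY", "NY", "SF", "SF", "NY", "SF")

# Mean salary by department and location
avg <- tapply(salary, list(department, location), mean)

# Read one cell by row name and column name
it_sf <- avg["IT", "SF"]
print(it_sf)

# Count employees per combination; empty combinations become 0 instead of NA
counts <- tapply(salary, list(department, location), length, default = 0)
print(counts)
```

Using default = 0 is mainly sensible for counts and sums; for a mean, NA remains the more honest value for a combination with no data.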

Conclusion

In this article, I have explained the tapply() function, including its syntax, parameters, and return value, and shown how to apply statistical functions to a column of a data frame grouped by one or multiple columns.

Happy learning!!
