The `tapply()` function in R applies a specified function to each subset of a vector, where another vector defines the subsets. It is commonly used to compute summary statistics across levels of a factor. It is related to `apply()`, but where `apply()` works over the rows or columns of a matrix, `tapply()` works over groups of a vector. In this article, I will explain how to use `tapply()` in various ways with worked examples.

Key points:

• The `tapply()` function in R applies a specified function to each subset of a vector, where another vector defines the subsets.
• It is commonly used to compute summary statistics across levels of a variable.
• The `tapply()` function is similar to the `apply()` function but is specifically designed for grouped data.
• Returns an array by default, but can return a list if `simplify` is set to `FALSE`.
• It can be used to apply a function to a single variable grouped by another variable.
• Additional arguments such as `na.rm` can be used to handle missing values in the data.
• It can perform grouping operations based on multiple columns by specifying the grouping variables within a list.

The tapply() Function in R

The `tapply()` function accepts inputs to create statistical summaries such as mean, sum, max, min, etc., by a group based on single or multiple columns of a data frame. It is used to apply a function over subsets of a vector within a dataset that can be broken up into groups via categorical variables (factors). The goal is to break the dataset into groups and apply a function to each group.
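The split-then-apply idea described above can be sketched by hand with `split()` and `sapply()`. This is only an illustration on small made-up vectors (`salary` and `department` here are hypothetical, not the data frame used later), and it is not a claim about how `tapply()` is implemented internally.

```r
# Hypothetical illustrative data
salary <- c(50000, 60000, 55000, 70000)
department <- c("HR", "Finance", "HR", "IT")

# Step 1: break the vector into groups defined by department
groups <- split(salary, department)

# Step 2: apply a function to each group
by_hand <- sapply(groups, sum)

# tapply() performs both steps in a single call
by_tapply <- tapply(salary, department, sum)

print(by_hand)
print(by_tapply)
```

Both results hold the same per-group totals; `tapply()` simply saves the intermediate step.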

Syntax of tapply() Function

Following is the syntax of the tapply() function.

``````
# Syntax of tapply()
tapply(X, INDEX, FUN, ..., simplify = TRUE)
``````

Parameters

• `X:` An atomic object, typically a vector.
• `INDEX:` A factor or a list of factors, each of the same length as `X`.
• `FUN:` The function to be applied. If `FUN` is `NULL`, `tapply()` returns a vector of group indices instead of applying a function.
• `...:` Optional arguments to `FUN`.
• `simplify:` Logical; if `FALSE`, the result is a list, otherwise, the result is simplified to an array (if possible).
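To make the `FUN = NULL` case concrete, here is a minimal sketch on made-up vectors (`x` and `g` are hypothetical names). When no function is supplied, `tapply()` returns, for each element of `X`, the index of the group it belongs to.

```r
# Hypothetical illustrative data
x <- c(10, 20, 30, 40)
g <- c("a", "b", "a", "b")

# No FUN supplied, so FUN is NULL and group indices are returned
idx <- tapply(x, g)
print(idx)
```

Here the groups are numbered in the order of the factor levels of `g`, so elements in group "a" get index 1 and elements in group "b" get index 2.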

Return Value

It returns an array by default. If `simplify` is set to `FALSE`, it returns a list.

Apply tapply() Function to a Single Variable Grouped by Another Variable

Let’s create a data frame and use the `tapply()` function to calculate a statistical summary of one column, grouped by another column.

``````
# Apply tapply() to single variable grouped by another column
# Create data frame
emp_df <- data.frame(
name = c('John', 'Jane', 'Doe', 'Smith', 'Emily', 'Chris'),
department = c('HR', 'Finance', 'HR', 'IT', 'Finance', 'IT'),
location = c('NY', 'NY', 'SF', 'SF', 'NY', 'SF'),
salary = c(50000, 60000, 55000, 70000, 65000, 75000)
)
print("Given data frame:")
print(emp_df)

# Calculate the total salary for each department
result <- tapply(emp_df$salary, emp_df$department, sum)
print("After applying tapply to single column:")
print(result)
print("Get the type of result:")
print(class(result))
``````

In the above code, `tapply()` calculates the total salary for each department by applying the `sum` function to the `salary` vector, grouped by the `department` vector.

The totals are 125000 for Finance, 105000 for HR, and 145000 for IT, and the class of the result is `"array"`.

Calculate the Mean of Single Column Using tapply()

In this example, you can use the `tapply()` function to calculate the mean of one column grouped by another column. To do this, pass the numeric column, the grouping column, and the `mean` function to `tapply()`. It applies the function to each group and returns an array.

``````
# Calculate the mean of single column using tapply()
result <- tapply(emp_df$salary, emp_df$department, mean)
print("After applying tapply to data frame:")
print(result)
print("Get the type of result:")
print(class(result))
``````

In this example, `tapply()` calculates the average salary for each department by applying the `mean` function to the `salary` vector, grouped by the `department` vector.

The means are 62500 for Finance, 52500 for HR, and 72500 for IT.

Access Specified Element of the Output

As shown above, this function returns an array. To access an element of the array, put its index within square brackets.

``````
# Get a single element of the output using square brackets
result[2]

# Output:
#    HR
# 52500
``````
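Because the result carries the group labels, elements can also be picked by name rather than by position. The sketch below rebuilds the salary and department values from the example data frame as plain vectors so that it runs on its own; `hr_mean` and `two_groups` are illustrative names.

```r
# Same values as the example data frame, as standalone vectors
salary <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

result <- tapply(salary, department, mean)

# Select one group by name; unname() drops the label and keeps the value
hr_mean <- unname(result["HR"])
print(hr_mean)

# Select several groups at once
two_groups <- result[c("Finance", "IT")]
print(two_groups)

# List the available group labels
print(names(result))
```

Indexing by name is usually more robust than indexing by position, since the position depends on the sorted order of the group labels.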

Modify Output Class to List

You can change the output to a list by setting the `simplify` argument to `FALSE`.

``````
# Get output as a list by setting simplify = FALSE
result <- tapply(emp_df$salary, emp_df$department, mean, simplify = FALSE)
print("After applying tapply to data frame:")
print(result)

# Output:
# [1] "After applying tapply to data frame:"
# $Finance
# [1] 62500
# $HR
# [1] 52500
# $IT
# [1] 72500
``````
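A list result also appears, even with the default `simplify = TRUE`, when the function returns more than one value per group, because such results cannot be simplified to a plain array. A minimal sketch using `range()`, with the example values rebuilt as standalone vectors:

```r
# Same values as the example data frame, as standalone vectors
salary <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

# range() returns two values (min and max) for each group
ranges <- tapply(salary, department, range)

# The result is list-like: one element per group
print(is.list(ranges))

# Extract the min and max for a single group with double brackets
it_range <- ranges[["IT"]]
print(it_range)
```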

If your data frame contains `NA` values, you can pass additional arguments after the function, such as `na.rm`, to calculate the result while ignoring the `NA` values. First, let's see what happens without it.

``````
emp_df[4, 4] <- NA
print("Given data frame:")
emp_df

# Calculate the mean salary for each department
result <- tapply(emp_df$salary, emp_df$department, mean)
print("After applying tapply to data frame:")
print(result)

# Output:
# [1] "Given data frame:"
#    name department location salary
# 1  John         HR       NY  50000
# 2  Jane    Finance       NY  60000
# 3   Doe         HR       SF  55000
# 4 Smith         IT       SF     NA
# 5 Emily    Finance       NY  65000
# 6 Chris         IT       SF  75000

# [1] "After applying tapply to data frame:"
# Finance      HR      IT
#   62500   52500      NA
``````

Let’s include the specified additional argument after the `FUN` argument. In this example, the `mean` function allows you to set `na.rm = TRUE` to ignore `NA` values.

``````
# Calculate the mean salary for each department, ignoring NA values
result <- tapply(emp_df$salary, emp_df$department, mean, na.rm = TRUE)
print("After applying tapply to data frame:")
print(result)

# Output:
# [1] "After applying tapply to data frame:"

# Finance      HR      IT
#   62500   52500   75000
``````
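Before dropping missing values, it can help to see how many each group contains. One way to sketch this is to pass an anonymous function as `FUN`; the vectors below repeat the example values with the same `NA` in place, and `na_count` and `n_valid` are illustrative names.

```r
# Example values with one missing salary, as standalone vectors
salary <- c(50000, 60000, 55000, NA, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")

# Count missing values per department
na_count <- tapply(salary, department, function(x) sum(is.na(x)))
print(na_count)

# Count non-missing values per department
n_valid <- tapply(salary, department, function(x) sum(!is.na(x)))
print(n_valid)
```

A group mean computed with `na.rm = TRUE` is based only on the non-missing values, so `n_valid` shows how many observations each mean actually rests on.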

Apply tapply Function to One Variable, Grouped by Multiple Variables

Finally, you can utilize the `tapply()` function to perform a grouping operation based on multiple columns. Specify the multiple grouping variables within a list and pass it to this function, which applies the function to the grouped object and returns an array.

``````
# tapply with multiple grouping variables
# Calculate the mean salary for each department/location combination
# (uses the original emp_df, before the NA was introduced above)
result <- tapply(emp_df$salary, list(emp_df$department, emp_df$location), mean)
print("After applying tapply to multiple grouping columns:")
print(result)
print("Get the type of result:")
print(class(result))

# Output:
# [1] "After applying tapply to multiple grouping columns:"

#            NY    SF
# Finance 62500    NA
# HR      50000 55000
# IT         NA 72500

# [1] "Get the type of result:"
# [1] "matrix" "array"
``````
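The `NA` cells in the output above mark combinations with no rows at all, such as Finance in SF. As a hedged sketch, assuming R 3.4.0 or later where `tapply()` accepts a `default` argument, those empty cells can be filled with another value, which is natural for totals. Naming the list elements also labels the two dimensions of the result.

```r
# Same values as the original example data frame, as standalone vectors
salary <- c(50000, 60000, 55000, 70000, 65000, 75000)
department <- c("HR", "Finance", "HR", "IT", "Finance", "IT")
location <- c("NY", "NY", "SF", "SF", "NY", "SF")

# Total salary per department and location; empty combinations become 0
# (the default argument is assumed to be available, i.e. R >= 3.4.0)
totals <- tapply(
  salary,
  list(department = department, location = location),
  sum,
  default = 0
)
print(totals)

# Look up a single cell by row and column name
print(totals["IT", "SF"])
```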

Conclusion

In this article, I have explained the `tapply()` function, its syntax and parameters, and how to use it to apply statistical functions to a column of a data frame grouped by one or more other columns.

Happy learning!!