Explain drop_na() Function in R with Examples

The drop_na() function from the tidyr package in R handles missing values by removing rows that contain NA. It is especially useful for cleaning datasets before analysis, so that calculations and visualizations can run without interruptions caused by missing data.

In this article, I will explore the drop_na() function in tidyr, covering its syntax, parameters, and practical use cases. With clear examples, I will demonstrate how to remove rows with NA values from an entire data frame or specific columns.

Key Points

  • The drop_na() function from the tidyr package is used to remove rows with missing (NA) values from a data frame.
  • By default, drop_na() removes rows with NA values in any column of the data frame.
  • You can specify one or more column names in the function to remove rows with NA values only in the specified columns.
  • When several columns are listed in the ... argument, a row is removed if it has an NA value in any of the listed columns.
  • When used with grouped data (e.g., with dplyr’s group_by()), drop_na() removes rows with NA values from every group and keeps the grouping on the result. The NA check is row-wise, so the same rows are removed as for ungrouped data.
  • If columns are specified in the ... argument, only rows with NA values in those columns are removed; NA values in other columns are left in place.
  • The function is ideal for cleaning datasets before performing calculations, visualizations, or other operations that require complete data.
  • Whether dealing with an entire data frame, specific columns, or grouped data, drop_na() provides a versatile and efficient way to handle missing values.

R drop_na() Function

The drop_na() function removes rows with missing values from a data frame, helping to tidy the data. It accepts two parameters: the first is the data frame, and the second (...) names the columns to check for NA values. By default, it removes rows with NA values in any column. If you specify one or more columns, a row is removed only when it has an NA value in one of those columns.

Syntax of drop_na() Function

Following is the syntax of the drop_na() function.


# Syntax of drop_na()
drop_na(data, ...)

Parameters

  • data: The input data frame or tibble.
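  • ...: Optional columns to check for NA values, given as tidy-select expressions (bare names, ranges, or selection helpers). If omitted, all columns are checked; if specified, only rows with NA values in these columns are removed.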
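
To make the ... argument more concrete, here is a minimal sketch of three ways to name the columns to check. The scores data frame and its values are hypothetical and used only for illustration; the selection forms (a bare name, a range with :, and all_of() around a character vector) are standard tidy-select syntax.

```r
library(tidyr)

# Hypothetical example data, used only for illustration
scores <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  Math    = c(75, NA, 85),
  Science = c(85, 80, NA),
  Art     = c(NA, 70, 90)
)

# Bare column name: check only Math
by_name <- drop_na(scores, Math)

# Range of adjacent columns: check Math and Science
by_range <- drop_na(scores, Math:Science)

# Character vector of column names, wrapped in all_of()
cols <- c("Math", "Art")
by_vector <- drop_na(scores, all_of(cols))

print(by_name$Student)    # "Geetha" "Sai"
print(by_range$Student)   # "Geetha"
print(by_vector$Student)  # "Sai"
```

The all_of() form is handy when the column names arrive as strings, for example from a configuration value or a function argument.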

Return Value

This function returns a data frame or tibble with rows containing NA values removed based on the specified criteria.

Drop Rows with Missing Values in R

You can use the drop_na() function to remove rows with missing values from a data frame. By default, it checks all columns. Let’s pass the data frame below to this function to remove every row that has an NA value in any column.


# Drop rows in a data frame
library(tidyr)
df <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, NA, 85),
  Science = c(85, NA, 90),
  Total = c(NA, 261, 253)
  )
print("Original Data frame:")
print(df)
cleaned_df <- drop_na(df)
print("Data After Dropping Rows with NA Values:")
print(cleaned_df)

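Yields below output.

# Output:
[1] "Original Data frame:"
  Student History Math Science Total
1  Geetha      89   75      85    NA
2     Ram      81   NA      NA   261
3     Sai      78   85      90   253
[1] "Data After Dropping Rows with NA Values:"
  Student History Math Science Total
1     Sai      78   85      90   253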

As you can see from the above, the original data frame has three rows, but after removing the rows with NA values, it now contains only one row.
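
If you want to see which rows will be dropped before calling drop_na(), a rough base R equivalent is complete.cases(). The sketch below is an illustration rather than a description of how tidyr works internally; it rebuilds the same data frame so that it runs on its own.

```r
library(tidyr)

df <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, NA, 85),
  Science = c(85, NA, 90),
  Total = c(NA, 261, 253)
)

# TRUE for rows that have no NA in any column
complete_rows <- complete.cases(df)
print(complete_rows)  # FALSE FALSE TRUE

# Number of rows that drop_na() removes
rows_removed <- nrow(df) - nrow(drop_na(df))
print(rows_removed)   # 2

# Base R subset that keeps the same rows (original row names are kept)
base_result <- df[complete_rows, ]
print(base_result$Student)  # "Sai"
```

Comparing nrow() before and after is a quick sanity check that the cleaning step did not remove more data than expected.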

Drop Rows with NA in Specific Columns in R

To remove rows with NA values in specific columns, you can use the drop_na() function. Pass the desired column name as an argument to the function, and it will remove rows containing NA values in the specified column.


# Drop rows with NA in specific columns
cleaned_df <- drop_na(df, Math)
print("After Dropping Rows with NA of a specific Column:")
print(cleaned_df)

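Yields below output.

# Output:
[1] "After Dropping Rows with NA of a specific Column:"
  Student History Math Science Total
1  Geetha      89   75      85    NA
2     Sai      78   85      90   253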
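
Note that only the Math column is checked here, so NA values in other columns survive. The following sketch verifies this with anyNA() and colSums(is.na()), both base R functions, on the same data frame.

```r
library(tidyr)

df <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, NA, 85),
  Science = c(85, NA, 90),
  Total = c(NA, 261, 253)
)

cleaned_df <- drop_na(df, Math)

# The checked column has no NA left
print(anyNA(cleaned_df$Math))   # FALSE

# A column that was not checked can still contain NA
print(anyNA(cleaned_df$Total))  # TRUE

# NA count per column after cleaning
na_counts <- colSums(is.na(cleaned_df))
print(na_counts)
```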

Handling Multiple Columns using R drop_na()

You can remove rows with NA values in multiple columns by specifying the relevant column names. Let’s see how passing multiple columns to the drop_na() function removes rows containing NA values in those specific columns.


# Drop rows with NA in multiple columns
cleaned_df <- drop_na(df, Math, Science)
print("After Dropping Rows with NA of multiple Columns:")
print(cleaned_df)

Yields below output.


# Output:
[1] "After Dropping Rows with NA of multiple Columns:"
  Student History Math Science Total
1  Geetha      89   75      85    NA
2     Sai      78   85      90   253
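
With several columns listed, drop_na() removes a row when any of them is NA. If you instead want to remove a row only when all of the listed columns are NA, you need a different filter. The sketch below contrasts the two rules on a hypothetical marks data frame; the "all NA" variant uses plain base R subsetting and is my own illustration, not a drop_na() feature.

```r
library(tidyr)

# Hypothetical example data, used only for illustration
marks <- data.frame(
  Student = c("Geetha", "Ram", "Sai", "Priya"),
  Math    = c(75, NA, 85, NA),
  Science = c(85, 80, 90, NA)
)

# drop_na(): a row is removed if Math OR Science is NA
any_rule <- drop_na(marks, Math, Science)
print(any_rule$Student)  # "Geetha" "Sai"

# Base R alternative: a row is removed only if Math AND Science are both NA
both_missing <- is.na(marks$Math) & is.na(marks$Science)
all_rule <- marks[!both_missing, ]
print(all_rule$Student)  # "Geetha" "Ram" "Sai"
```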

Using drop_na() with Grouped Data

So far, we have learned how to use the drop_na() function to remove rows with NA values from an entire data frame or specific columns. Next, we will explore how to use this function with grouped data. The drop_na() function checks each row for NA values regardless of its group and keeps the grouping on the result, so you can continue with per-group operations on the cleaned data.


# Remove rows within grouped data
# Load the required libraries
library(dplyr)
library(tidyr)

# Example data frame
df <- data.frame(
  Class = c("A", "A", "B", "B", "C", "C"),
  Student = c("Geetha", "Ram", "Sai", "Priya", "Arjun", "Maya"),
  Math = c(75, NA, 85, NA, 78, 92),
  Science = c(85, 80, 90, NA, 88, NA)
)
print("Original Data Frame:")
print(df)

# Group by 'Class' and drop rows with NA in 'Math'
grouped_cleaned_df <- df %>%
  group_by(Class) %>%
  drop_na(Math)

print("After Dropping Rows with NA in 'Math' within Groups:")
print(grouped_cleaned_df)

Yields below output.


# Output:
[1] "Original Data Frame:"

  Class Student Math Science
1     A  Geetha   75      85
2     A     Ram   NA      80
3     B     Sai   85      90
4     B   Priya   NA      NA
5     C   Arjun   78      88
6     C    Maya   92      NA

[1] "After Dropping Rows with NA in 'Math' within Groups:"
# A tibble: 4 × 4
# Groups:   Class [3]
  Class Student  Math Science
  <chr> <chr>   <dbl>   <dbl>
1 A     Geetha     75      85
2 B     Sai        85      90
3 C     Arjun      78      88
4 C     Maya       92      NA
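
To see what grouping does and does not change, the sketch below compares the grouped and ungrouped results on the same data and then runs a per-group summary. The mean_math column name is just an illustrative choice; group_vars() and summarise() are dplyr functions.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Class = c("A", "A", "B", "B", "C", "C"),
  Student = c("Geetha", "Ram", "Sai", "Priya", "Arjun", "Maya"),
  Math = c(75, NA, 85, NA, 78, 92),
  Science = c(85, 80, 90, NA, 88, NA)
)

grouped_result <- df %>%
  group_by(Class) %>%
  drop_na(Math)

ungrouped_result <- drop_na(df, Math)

# The same students remain with or without grouping
same_rows <- identical(grouped_result$Student, ungrouped_result$Student)
print(same_rows)  # TRUE

# The grouping is carried through to the result
print(group_vars(grouped_result))  # "Class"

# So a per-group summary can follow directly
class_means <- grouped_result %>%
  summarise(mean_math = mean(Math))
print(class_means)
```

Because the Math column no longer contains NA values, mean() does not need na.rm = TRUE here.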

Frequently Asked Questions on drop_na() Function

What is the purpose of the drop_na() function in R?

The drop_na() function removes rows with missing values from a data frame or tibble.

How can I remove rows with NA values in specific columns only?

You can specify the column names in the ... argument of the drop_na() function.

What happens if there are NA values in columns not specified?

NA values in columns you did not specify are ignored. A row is kept as long as the specified columns have no NA values.

How can drop_na() handle grouped data?

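When used with grouped data, it removes rows with NA values from every group and returns a result that is still grouped. The rows removed are the same as for ungrouped data.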

Conclusion

In this article, I have explained the drop_na() function from R’s tidyr package, a useful tool for data cleaning and preparation. It lets you remove rows with missing values from an entire data frame, from specific columns, or from grouped data. This helps get your dataset ready for analysis, so that calculations and visualizations are not interrupted by NA values. By learning drop_na(), you can handle missing values effectively and streamline your data preprocessing workflow.

Happy Learning!
