
How can you count the number of NA values in each column of an R data frame? You can use base R functions like `colSums()`, `sapply()`, and `lapply()` to find the number of missing values (NA) in a given data frame. In R, a missing value is represented by the constant `NA` (Not Available). `NaN` (Not a Number) marks an undefined numeric result and is also treated as missing by `is.na()`.
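Because every method in this article relies on `is.na()`, it is worth knowing how that function treats `NaN`. The following is a minimal sketch with a made-up vector `x` (not part of the sample data used later):

```r
# Hypothetical vector containing one NA and one NaN
x <- c(1, NA, NaN, 4)

is.na(x)   # TRUE for both the NA and the NaN element
is.nan(x)  # TRUE only for the NaN element

na_or_nan <- sum(is.na(x))   # counts NA and NaN together
nan_only  <- sum(is.nan(x))  # counts NaN only
```

So a count based on `is.na()` includes any `NaN` values in a numeric column.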

In this article, I will demonstrate several methods to count NA (missing) values in each column of a data frame.

Key points:

• You can use base R functions like `colSums()` and `lapply()` to count the number of NA values in each column of a data frame.
• When you apply summary functions such as `sum()` or `mean()` to data containing NA values, the result is itself NA. To ignore the NA values in the calculation, set `na.rm = TRUE`.
• The `lapply()` function, similar to `sapply()`, can be used to count NA values by applying a custom function to each column of a data frame, returning a list with counts.
• The `tidyverse` package includes tools for data manipulation and visualization. It provides a convenient way to count NA values using functions like `summarise_all()`.
• The `dplyr` package, part of the `tidyverse`, provides powerful tools for data manipulation. You can count NA values using functions like `summarise()`, `across()`, and `everything()`.
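To illustrate the `na.rm` point from the list above, here is a small sketch. The vector `scores` is a hypothetical example and is not used elsewhere in this article:

```r
# Hypothetical vector with one missing value
scores <- c(10, NA, 30)

total_with_na <- sum(scores)                 # NA, because one input is missing
total_without <- sum(scores, na.rm = TRUE)   # NA is dropped before summing
avg_without   <- mean(scores, na.rm = TRUE)  # mean of the two remaining values
```

Note that counting NA values with `sum(is.na(x))` does not need `na.rm`, because `is.na()` never returns NA itself.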

First, let’s create a sample data frame that contains NA values.

``````
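# Create a data frame with 5 rows and 3 columns
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Display the data frame
print("Create a data frame:")
print(df)

# Output:
# [1] "Create a data frame:"
#   id     name gender
# 1  2   sravan   <NA>
# 2  1     <NA>      m
# 3  3   chrisa   <NA>
# 4  4 shivgami      f
# 5 NA     <NA>   <NA>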
``````


## Count NA Values in Each Column using colSums()

To calculate the number of missing values in each column, you can combine `is.na()` with the `colSums()` function. First, call `is.na()` on the data frame. This returns a logical matrix of the same shape, with `TRUE` wherever a value is missing. Then pass this matrix to `colSums()`. Because `TRUE` counts as 1 and `FALSE` as 0, the column sums are the number of missing values in each column.

``````
# Get the count of NA values in each column using colSums()
na_count <- colSums(is.na(df))
print("Get the count of NA values in each column:")
print(na_count)

# Output:
# [1] "Get the count of NA values in each column:"
#     id   name gender
#      1      2      3
``````

The above code shows that the `id` column has 1 missing value, the `name` column has 2 missing values, and the `gender` column has 3 missing values.
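To see why this works, it can help to look at the intermediate result. The sketch below rebuilds the same sample data frame so that it runs on its own. The names `na_matrix` and `total_na` are introduced here for illustration only.

```r
# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Step 1: is.na() on a data frame gives a logical matrix of the same shape
na_matrix <- is.na(df)
print(na_matrix)

# Step 2: TRUE counts as 1 and FALSE as 0, so column sums are NA counts
na_count <- colSums(na_matrix)
print(na_count)

# Summing the whole matrix gives the total number of NA values
total_na <- sum(na_matrix)
print(total_na)
```

The same logical matrix can therefore answer both "how many per column" and "how many overall".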


## Count NA Values in Each Column using sapply()

Alternatively, you can use the base R `sapply()` function to count the NA values in each column of the data frame. `sapply()` applies a given function to each column and simplifies the result to a vector where possible. Here, the anonymous function first calls `is.na()` on a column, which returns a logical vector, and then calls `sum()` on that vector to count the `TRUE` values (i.e., the NA values) in the column.

``````
# Get the count of NA values in each column using sapply()
na_count <- sapply(df, function(x) sum(is.na(x)))
print("Get the count of NA values in each column:")
print(na_count)

# Output:
# [1] "Get the count of NA values in each column:"
#     id   name gender
#      1      2      3
``````
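As a rough illustration of what the anonymous function does for a single column, the sketch below applies the two steps by hand to the values of the `name` column. The variable names `missing_flags` and `name_na_count` are made up for this example.

```r
# Values of the name column from the sample data frame
name <- c('sravan', NA, 'chrisa', 'shivgami', NA)

# Step 1: flag the missing entries
missing_flags <- is.na(name)
print(missing_flags)

# Step 2: count the TRUE flags
name_na_count <- sum(missing_flags)
print(name_na_count)
```

`sapply()` repeats these two steps for every column and collects the results in a named vector.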

## Count NA Values in R using lapply() Function

Similarly, you can use another base R function, `lapply()`, to count the missing values (NA) in each column of the data frame. It applies an anonymous function to each column and returns a list in which each element holds the NA count for the corresponding column.

``````
# Get the count of NA values in each column using lapply()
print("Get the count of NA values in each column:")
na_count <- lapply(df, function(x) { length(which(is.na(x)))})
print(na_count)

# Output:
# [1] "Get the count of NA values in each column:"
# $id
# [1] 1
#
# $name
# [1] 2
#
# $gender
# [1] 3
``````
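The practical difference between `lapply()` and `sapply()` is the shape of the result. The following sketch, which rebuilds the sample data frame so it runs on its own, compares the two and flattens the list with `unlist()`. The names `na_list` and `na_vec` are illustrative.

```r
# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# lapply() always returns a list, one element per column
na_list <- lapply(df, function(x) sum(is.na(x)))
print(class(na_list))

# Individual counts can be pulled out by column name
print(na_list$gender)

# sapply() simplifies the same result to a named vector
na_vec <- sapply(df, function(x) sum(is.na(x)))

# unlist() flattens the list into a named vector as well
print(unlist(na_list))
```

A list is convenient when you want to look up one column at a time, while a vector is easier to sort or filter.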

## Count NA Values in R using tidyverse

You can also calculate the count of NA values using the `tidyverse` package, which is a collection of R packages designed for data science. It includes packages like `dplyr`, `ggplot2`, `tibble`, and others that provide tools for data manipulation, visualization, and more.

First, use the pipe operator `%>%` to pass the data frame to `summarise_all()`, a `dplyr` function that applies a function to every column of a data frame. The `~` creates an anonymous function in which `.` stands for the current column, so `~ sum(is.na(.))` returns the count of NA values in each column. Note that `summarise_all()` has been superseded in recent versions of `dplyr` by `across()`, which is covered in the next section, but it still works.

``````
# Get the count of NA values in each column using tidyverse
library(tidyverse)
print("Get the count of NA values in each column:")
df %>% summarise_all(~ sum(is.na(.)))

# Output:
# [1] "Get the count of NA values in each column:"
#   id name gender
# 1  1    2      3
``````

## Count NA Values Using dplyr

Finally, you can use the `dplyr` package on its own to count the NA values in each column of the data frame. `dplyr` is part of the `tidyverse` collection of R packages and provides a set of functions for data manipulation.

Let’s use a combination of the `summarise()`, `across()`, and `everything()` functions to apply an anonymous function to multiple columns and get the count of NA values using the `sum(is.na())` function.

``````
# Get the count of NA values in each column using dplyr
library(dplyr)
print("Get the count of NA values in each column:")
df %>% summarise(across(everything(), ~ sum(is.na(.))))

# Output:
# [1] "Get the count of NA values in each column:"
#   id name gender
# 1  1    2      3
``````

## Conclusion

In this article, I have explained several methods to count NA (missing) values in each column of a data frame in R. We used base R functions like `colSums()`, `sapply()`, and `lapply()`, as well as functions from the `tidyverse` and `dplyr` packages. These methods provide flexible and efficient ways to handle and analyze missing data in R.