  • Post category:R Programming
  • Post last modified:June 7, 2024

How can you count the number of NA values in each column of an R data frame? You can use base R functions like colSums(), sapply(), and lapply() to find the number of missing values (NA) in a given data frame. In R, missing values are represented by NA (Not Available); NaN (Not a Number) marks an undefined numeric result, and is.na() returns TRUE for both.

In this article, I will demonstrate several methods to count NA (missing) values in each column of a data frame.

Key points:

  • You can use base R functions like colSums() and lapply() to count the number of NA values in each column of a data frame.
  • When you apply functions such as sum() or mean() to data containing NA values, the result is a missing value (NA). To exclude NA values from the calculation, set na.rm = TRUE.
  • The lapply() function, similar to sapply(), can be used to count NA values by applying a custom function to each column of a data frame, returning a list with counts.
  • The tidyverse package includes tools for data manipulation and visualization. It provides a convenient way to count NA values using functions like summarise_all().
  • The dplyr package, part of the tidyverse, provides powerful tools for data manipulation. You can count NA values using functions like summarise(), across(), and everything().
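
The effect of na.rm mentioned in the key points can be sketched with a small numeric vector. The vector x below is a made-up example for illustration, not data used elsewhere in this article.

```r
# Hypothetical numeric vector with one missing value
x <- c(10, NA, 30)

# Without na.rm, the NA propagates into the result
total_with_na <- sum(x)
print(total_with_na)
# [1] NA

# With na.rm = TRUE, the NA is dropped before summing
total_without_na <- sum(x, na.rm = TRUE)
print(total_without_na)
# [1] 40

# The same argument works for mean()
mean_without_na <- mean(x, na.rm = TRUE)
print(mean_without_na)
# [1] 20
```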

First, let’s create a sample data frame that contains NA values.


# Create a data frame with 5 rows and 3 columns
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Display dataframe
print("Create a data frame:")
print(df)

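Yields the below output.

# Output:
# [1] "Create a data frame:"
#   id     name gender
# 1  2   sravan   <NA>
# 2  1     <NA>      m
# 3  3   chrisa   <NA>
# 4  4 shivgami      f
# 5 NA     <NA>   <NA>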

Count NA Values in Each Column using colSums()

To calculate the number of missing values in each column, you can combine is.na() with colSums(). First, is.na() checks the data frame for NA values and returns a logical matrix with TRUE for every missing value. Then colSums() sums each column of this matrix; because TRUE counts as 1 and FALSE as 0, the result is the number of missing values in each column.


# Get the count of NA values in each column using colSums()
na_count <- colSums(is.na(df))
print("Get the count of NA values in each column:")
print(na_count)

The above code shows that the id column has 1 missing value, the name column has 2 missing values, and the gender column has 3 missing values.

Yields the below output.

# Output:
# [1] "Get the count of NA values in each column:"
#     id   name gender 
#      1      2      3 
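
To make the two steps more concrete, here is a minimal sketch that stores the intermediate result of is.na() before summing it. The names na_matrix and total_na are introduced only for this illustration.

```r
# Same sample data frame as in this article
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# is.na() on a data frame returns a logical matrix of the same shape
na_matrix <- is.na(df)
print(dim(na_matrix))
# [1] 5 3

# colSums() treats TRUE as 1 and FALSE as 0, so it counts TRUE per column
na_count <- colSums(na_matrix)

# sum() over the whole matrix gives the total number of NA values
total_na <- sum(na_matrix)
print(total_na)
# [1] 6
```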

Count NA Values in Each Column using sapply() of R

Alternatively, you can use the base R sapply() function to count the NA values in each column of the data frame. sapply() applies a specified function to each column of a data frame. Here, an anonymous function calls is.na() on the column to get a logical vector, and then sum() counts its TRUE values (i.e., NA values). The result is a named vector with one count per column.


# Get the count of NA values in each column using sapply()
na_count <- sapply(df, function(x) sum(is.na(x)))
print("Get the count of NA values in each column:")
print(na_count)

# Output:
# [1] "Get the count of NA values in each column:"
#     id   name gender 
#      1      2      3 
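
Because sapply() returns a named vector here, the counts can be looked up or filtered by column name. The following is a small sketch under that assumption; the threshold of more than one NA and the names gender_na and cols_many_na are arbitrary choices for illustration.

```r
# Same sample data frame as in this article
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Named vector of NA counts, one element per column
na_count <- sapply(df, function(x) sum(is.na(x)))

# Look up the count for a single column by name
gender_na <- na_count[["gender"]]
print(gender_na)
# [1] 3

# Names of the columns that have more than one NA value
cols_many_na <- names(na_count)[na_count > 1]
print(cols_many_na)
# [1] "name"   "gender"
```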

Count NA Values in R using lapply() Function

Similarly, you can use another base R function, lapply(), to get the count of missing values (NA) in each column of the data frame. It applies an anonymous function to each column of the data frame and returns a list in which each element holds the count of NA values for one column.


# Get the count of NA values in each column using lapply()
print("Get the count of NA values in each column:")
na_count <- lapply(df, function(x) { length(which(is.na(x)))})
print(na_count)

# Output
# [1] "Get the count of NA values in each column:"
# $id
# [1] 1

# $name
# [1] 2

# $gender
# [1] 3
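
If a vector is easier to work with than a list, the lapply() result can be flattened with unlist(). This is a minimal sketch; it uses sum(is.na(x)) inside the anonymous function, which gives the same counts as length(which(is.na(x))), and the names na_list and na_vector are introduced only for this example.

```r
# Same sample data frame as in this article
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# lapply() returns a list with one count per column
na_list <- lapply(df, function(x) sum(is.na(x)))

# unlist() flattens the list into a named vector
na_vector <- unlist(na_list)
print(na_vector)
#     id   name gender 
#      1      2      3 
```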

Count NA Values in R using tidyverse

You can also calculate the count of NA values using the tidyverse package, which is a collection of R packages designed for data science. It includes packages like dplyr, ggplot2, tibble, and others that provide tools for data manipulation, visualization, and more.

First, use the pipe operator %>% to pass the data frame to summarise_all(), a dplyr function that applies a specified function to every column of a data frame. The ~ defines an anonymous function in which . stands for the current column, so ~ sum(is.na(.)) counts the NA values in each column. Note that summarise_all() has been superseded by across() since dplyr 1.0.0, although it still works.


# Get the count of NA values in each column using tidyverse
library(tidyverse)
print("Get the count of NA values in each column:")
df %>% summarise_all(~ sum(is.na(.)))

# Output:
# [1] "Get the count of NA values in each column:"
#   id name gender
# 1  1    2      3

Count NA Values Using dplyr

Finally, you can use the dplyr package on its own to get the count of NA values in each column of the data frame. dplyr is part of the tidyverse collection of R packages and provides a set of functions for data manipulation.

Let’s use a combination of the summarise(), across(), and everything() functions to apply an anonymous function to multiple columns and get the count of NA values using the sum(is.na()) function.


# Get the count of NA values in each column using dplyr
library(dplyr)
print("Get the count of NA values in each column:")
df %>% summarise(across(everything(), ~ sum(is.na(.))))

# Output:
# [1] "Get the count of NA values in each column:"
#   id name gender
# 1  1    2      3
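
The same pattern can be restricted to specific columns by replacing everything() with a column selection. Below is a minimal sketch, assuming dplyr 1.0.0 or later is installed; the choice of the id and name columns and the name na_selected are only for illustration.

```r
library(dplyr)

# Same sample data frame as in this article
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Count NA values only in the id and name columns
na_selected <- df %>% summarise(across(c(id, name), ~ sum(is.na(.x))))
print(na_selected)
#   id name
# 1  1    2
```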

Conclusion

In this article, I have explained several methods to count NA (missing) values in each column of a data frame in R. We used base R functions like colSums(), sapply(), and lapply(), as well as functions from the tidyverse and dplyr packages. These methods provide flexible and efficient ways to handle and analyze missing data in R.