How can you count the number of NA values in each column of an R data frame? You can use base R functions like colSums(), sapply(), and lapply() to find the number of missing values (NA) in a given data frame. In R, missing values are represented by NA (Not Available), while NaN (Not a Number) marks an undefined numeric result; the is.na() function reports both as missing.
In this article, I will demonstrate several methods to count NA (missing) values in each column of a data frame.
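Since both NA and NaN come up when talking about missing values, here is a minimal sketch of how base R reports them. The vector x is a hypothetical example and is not part of the article's data frame.

```r
# Hypothetical vector holding a regular value, an NA, and a NaN
x <- c(1, NA, NaN)

# is.na() flags both NA and NaN as missing
na_flags <- is.na(x)

# is.nan() flags only NaN
nan_flags <- is.nan(x)

# Summing a logical vector counts its TRUE values
na_total <- sum(na_flags)
nan_total <- sum(nan_flags)

print(na_total)   # 2
print(nan_total)  # 1
```

Because is.na() is the building block of every method in this article, the counts include NaN values as well as NA values.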
Key points:
- You can use base R functions like colSums() and lapply() to count the number of NA values in each column of a data frame.
- When we apply arithmetic functions to a data set containing NA values, the result will be a missing value (NA). To exclude NA values, we can simply set na.rm = TRUE.
- The lapply() function, similar to sapply(), can be used to count NA values by applying a custom function to each column of a data frame, returning a list with counts.
- The tidyverse package includes tools for data manipulation and visualization. It provides a convenient way to count NA values using functions like summarise_all().
- The dplyr package, part of the tidyverse, provides powerful tools for data manipulation. You can count NA values using functions like summarise(), across(), and everything().
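The point about arithmetic functions and na.rm = TRUE can be illustrated with a short sketch. The vector values is a hypothetical example that mirrors the id column used later in this article.

```r
# Hypothetical numeric vector with one missing value
values <- c(2, 1, 3, 4, NA)

# Without na.rm, a single NA makes the whole result NA
total_with_na <- sum(values)

# With na.rm = TRUE, NA values are dropped before the calculation
total_without_na <- sum(values, na.rm = TRUE)
mean_without_na <- mean(values, na.rm = TRUE)

print(total_with_na)     # NA
print(total_without_na)  # 10
print(mean_without_na)   # 2.5
```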
First, let’s create a sample data frame that contains NA values.
# Create a data frame with 5 rows and 3 columns
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))
# Display the data frame
print("Create a data frame:")
print(df)
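# Output:
# [1] "Create a data frame:"
#   id     name gender
# 1  2   sravan   <NA>
# 2  1     <NA>      m
# 3  3   chrisa   <NA>
# 4  4 shivgami      f
# 5 NA     <NA>   <NA>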
Count NA Values in Each Column using colSums()
To calculate the number of missing values in each column, you can use the colSums() function together with is.na(). First, is.na() checks every cell of the data frame and returns a logical matrix with TRUE wherever a value is missing. Then colSums() sums each column of that matrix; because TRUE counts as 1 and FALSE as 0, the result is the number of missing values in each column.
# Get the count of NA values in each column using colSums()
na_count <- colSums(is.na(df))
print("Get the count of NA values in each column:")
print(na_count)
# Output:
# [1] "Get the count of NA values in each column:"
# id name gender
# 1 2 3
The above code shows that the id column has 1 missing value, the name column has 2 missing values, and the gender column has 3 missing values.
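To make the two steps behind colSums(is.na(df)) visible, the following sketch stores the intermediate logical matrix before summing it. It rebuilds the same sample data frame so it can run on its own; the variable names na_matrix, na_per_column, and na_total are only illustrative.

```r
# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Step 1: is.na() on a data frame returns a logical matrix
# with one cell per value (5 rows by 3 columns here)
na_matrix <- is.na(df)
print(dim(na_matrix))

# Step 2: colSums() adds up the TRUE values in each column
na_per_column <- colSums(na_matrix)
print(na_per_column)

# Summing the whole matrix gives the total number of NA values
na_total <- sum(na_matrix)
print(na_total)  # 6
```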
Count NA Values in Each Column using sapply() in R
Alternatively, you can use the base R sapply() function to find the count of NA values in each column of the data frame. This function applies a specified function to each column of a data frame. Here, an anonymous function first calls is.na() on the column, which returns a logical vector, and then calls sum() on that vector to count its TRUE values (i.e., NA values). The result is a named vector with one count per column.
# Get the count of NA values in each column using sapply()
na_count <- sapply(df, function(x) sum(is.na(x)))
print("Get the count of NA values in each column:")
print(na_count)
# Output:
# [1] "Get the count of NA values in each column:"
# id name gender
# 1 2 3
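As a variation on the sapply() call above, here is a minimal sketch that uses vapply(), another base R function, which additionally checks that every column yields a single integer. This is an alternative I am swapping in for illustration, not the approach used in the article, and the variable names are hypothetical.

```r
# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# vapply() works like sapply() but requires the expected
# result type per column, here a single integer
na_count <- vapply(df, function(x) sum(is.na(x)), integer(1))

# The result is a named vector, so counts can be looked up by column name
gender_na <- na_count[["gender"]]
print(gender_na)  # 3

# Names of the columns that have more than one missing value
cols_with_many_na <- names(na_count)[na_count > 1]
print(cols_with_many_na)  # "name" "gender"
```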
Count NA Values in R using lapply() Function
Similarly, you can use another base R function, lapply(), to get the count of missing values (NA) in each column of the data frame. It applies an anonymous function to each column of the data frame and returns a list in which each element holds the count of NA values for one column.
# Get the count of NA values in each column using lapply()
print("Get the count of NA values in each column:")
na_count <- lapply(df, function(x) { length(which(is.na(x)))})
print(na_count)
# Output
# [1] "Get the count of NA values in each column:"
# $id
# [1] 1
# $name
# [1] 2
# $gender
# [1] 3
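Because lapply() returns a list, it is sometimes handy to flatten the result. The sketch below uses unlist() for that and then builds a small summary table; the names na_list, na_vector, and na_summary are hypothetical and chosen only for this example.

```r
# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# lapply() returns a list with one NA count per column
na_list <- lapply(df, function(x) sum(is.na(x)))

# unlist() flattens the list into a named vector
na_vector <- unlist(na_list)
print(na_vector)

# Optionally arrange the counts as a two-column summary table
na_summary <- data.frame(column = names(na_vector),
                         na_count = na_vector,
                         row.names = NULL)
print(na_summary)
```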
Count NA Values in R using tidyverse
You can also calculate the count of NA values using the tidyverse package, which is a collection of R packages designed for data science. It includes packages like dplyr, ggplot2, tibble, and others that provide tools for data manipulation, visualization, and more.
First, use the pipe operator %>% to pass the data frame to summarise_all(), a dplyr function that applies a specified function to all columns of a data frame. The ~ defines an anonymous function in which . stands for the current column, so ~ sum(is.na(.)) counts the NA values in each column. Note that summarise_all() is superseded in recent dplyr releases in favor of summarise() with across(), which is covered in the next section.
# Get the count of NA values in each column using tidyverse
library(tidyverse)
print("Get the count of NA values in each column:")
df %>% summarise_all(~ sum(is.na(.)))
# Output:
# [1] "Get the count of NA values in each column:"
#   id name gender
# 1  1    2      3
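To show what the ~ shorthand stands for, this sketch passes an ordinary named function to summarise_all() and compares it with the ~ version. It assumes the dplyr package is installed; count_na is a hypothetical helper defined only for this example.

```r
library(dplyr)

# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Hypothetical helper: counts the NA values in one column
count_na <- function(x) sum(is.na(x))

# The ~ shorthand and the named function are applied the same way
with_lambda <- df %>% summarise_all(~ sum(is.na(.)))
with_function <- df %>% summarise_all(count_na)

print(with_lambda)
print(with_function)
```

Both calls should produce a one-row data frame with the same counts, so the choice between them is mostly a matter of readability.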
Count NA Values Using dplyr
Finally, you can use the dplyr package directly to get the count of NA values in each column of the data frame. dplyr is part of the tidyverse collection of R packages and provides a set of functions that are useful for data manipulation.
Let’s use a combination of the summarise(), across(), and everything() functions: everything() selects all columns, across() applies the anonymous function ~ sum(is.na(.)) to each of them, and summarise() collapses the result into a single row of NA counts.
# Get the count of NA values in each column using dplyr
library(dplyr)
print("Get the count of NA values in each column:")
df %>% summarise(across(everything(), ~ sum(is.na(.))))
# Output:
# [1] "Get the count of NA values in each column:"
#   id name gender
# 1  1    2      3
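Since across() takes a column selection as its first argument, the same pattern can be narrowed to a subset of columns. The sketch below is an illustration that goes slightly beyond the article's example: it assumes dplyr 1.0.0 or later and uses where(is.character) in place of everything() to count NA values only in the character columns.

```r
library(dplyr)

# Same sample data frame as above
df <- data.frame(id = c(2, 1, 3, 4, NA),
                 name = c('sravan', NA, 'chrisa', 'shivgami', NA),
                 gender = c(NA, 'm', NA, 'f', NA))

# Count NA values only in character columns (name and gender);
# the numeric id column is left out of the result
char_na_count <- df %>%
  summarise(across(where(is.character), ~ sum(is.na(.))))

print(char_na_count)
```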
Conclusion
In this article, I have explained several methods to count NA (missing) values in each column of a data frame in R. We used base R functions like colSums(), sapply(), and lapply(), as well as functions from the tidyverse and dplyr packages. These methods provide flexible and efficient ways to handle and analyze missing data in R.
Happy Learning!!