  • Post category:R Programming
  • Post last modified:June 21, 2024

How do you select data frame columns by condition in R? You can use the select_if() function from the dplyr package to select columns based on their types or contents. Selecting columns by condition is a common step in data manipulation and analysis. In addition to dplyr, base R offers versatile ways to do the same.

In this article, I will explore different ways to select columns from a data frame based on various conditions using both dplyr and base R functions.

Key points:

  • Use select_if() with predicates such as is.numeric and is.character, or a lambda built on inherits(), to select data frame columns by type.
  • Select columns that contain missing values using a custom predicate function within select_if().
  • Utilize base R functions like grepl() and sapply() for flexible column selection without additional packages.
  • Combine logical conditions within select_if() for more complex selection criteria.
  • Use the pipe operator (%>%) to chain multiple dplyr functions for clean and readable code.
  • Select columns with df[] using a logical vector or index positions.

Let’s start by creating a sample data frame with columns of mixed data types.


# Create DataFrame
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  dob = as.Date(c('1990-10-02', '1981-3-24', '1987-6-14', '1985-8-16')),
  state = c('CA', 'NY', 'DE', NA),
  row.names = c('r1', 'r2', 'r3', 'r4')
)
print("Create data frame:")
print(df)

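# Output:
# [1] "Create data frame:"
#    id    name gender        dob state
# r1 10     sai      M 1990-10-02    CA
# r2 11     ram      M 1981-03-24    NY
# r3 12 deepika      F 1987-06-14    DE
# r4 13 sahithi      F 1985-08-16  <NA>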

Select Columns Based on Condition using dplyr

You can use the select_if() function from the dplyr package to select columns from a data frame based on specific conditions. The dplyr package is widely used for data manipulation and transformation in R and provides a set of intuitive, efficient functions for these tasks.

The select_if() function keeps the columns for which a predicate function returns TRUE, which lets you extract columns by data type or content. To use it, first install the dplyr package with install.packages("dplyr"), then load it into your R environment with library(dplyr).

Let’s explore how to use this function to select columns from a data frame in various ways.

Select Column Based on Numeric Type

To select the numeric columns of a data frame, pass is.numeric to select_if(). It applies is.numeric to every column and returns a new data frame containing only the columns for which the result is TRUE. In the sample data frame, that is the id column.


# Select numeric columns based on condition
library(dplyr)
sel_cols <- df %>% select_if(is.numeric)
print("Select columns based on conditions:")
print(sel_cols)

# Output:
# [1] "Select columns based on conditions:"
#    id
# r1 10
# r2 11
# r3 12
# r4 13
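In dplyr 1.0.0 and later, select_if() is marked as superseded in favor of select() combined with where(). The sketch below shows the equivalent call for the numeric case; it rebuilds a shortened copy of the sample data so it runs on its own, and it assumes a dplyr version that exports where().

```r
library(dplyr)

# Shortened copy of the sample data, rebuilt so the snippet is self-contained
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  dob = as.Date(c('1990-10-02', '1981-03-24', '1987-06-14', '1985-08-16')),
  row.names = c('r1', 'r2', 'r3', 'r4'),
  stringsAsFactors = FALSE
)

# where() wraps a predicate so that select() applies it to every column
num_cols <- df %>% select(where(is.numeric))
print(names(num_cols))
```

Both forms should select the same columns here; select_if() still works, so the rest of this article keeps using it.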

Select Column Based on Character Type

To select columns of character data type, you can pass the is.character function into the select_if() function. This will check each column in the data frame to see if it is of character data type, and then it will select all columns that meet this condition.


# Select character columns based on condition
library(dplyr)
sel_cols <- df %>% select_if(is.character)
print("Select columns based on conditions:")
print(sel_cols)

# Output:
# [1] "Select columns based on conditions:"
#       name gender state
# r1     sai      M    CA
# r2     ram      M    NY
# r3 deepika      F    DE
# r4 sahithi      F  <NA>

Select Column Based on Date Type

To select date columns, pass a lambda to select_if(). The expression ~ inherits(., "Date") is an anonymous function (or lambda function) that checks whether a column inherits from the "Date" class; the dot stands for the column being tested. Let’s use it to select all columns of the "Date" class from the data frame.


# Select columns based on condition using dplyr
library(dplyr)
sel_cols <- df %>% select_if(~ inherits(., "Date"))
print("Select columns based on conditions:")
print(sel_cols)

# Output:
# [1] "Select columns based on conditions:"
#           dob
# r1 1990-10-02
# r2 1981-03-24
# r3 1987-06-14
# r4 1985-08-16

The above code returns a new data frame containing only the columns of the "Date" class from the original data frame df.
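The formula shorthand can be easier to follow next to its long form. The sketch below compares the lambda with an ordinary named function; is_date is a hypothetical helper name introduced only for this illustration, and the data is a shortened copy of the sample data frame.

```r
library(dplyr)

# Shortened copy of the sample data, rebuilt so the snippet is self-contained
df <- data.frame(
  id = c(10, 11, 12, 13),
  dob = as.Date(c('1990-10-02', '1981-03-24', '1987-06-14', '1985-08-16')),
  state = c('CA', 'NY', 'DE', NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the long form of ~ inherits(., "Date")
is_date <- function(col) inherits(col, "Date")

by_lambda <- df %>% select_if(~ inherits(., "Date"))
by_function <- df %>% select_if(is_date)

# Both calls should keep the same single column
print(names(by_lambda))
print(names(by_function))
```

A named function may be preferable when the same check is reused in several places; the lambda keeps one-off conditions inline.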

Select Column Based on NA values

Finally, you can use select_if() to select the columns that contain NA values. The expression ~ any(is.na(.)) is an anonymous function (lambda function) that is applied to each column: is.na(.) tests every element of the column for NA, and any() returns TRUE if at least one element is NA, in which case the column is kept.


# Select columns having NA values
library(dplyr)
sel_cols <- df %>% select_if(~ any(is.na(.)))
print("Select columns based on condition:")
print(sel_cols)

# Output:
# [1] "Select columns based on condition:"
#    state
# r1    CA
# r2    NY
# r3    DE
# r4  <NA>
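Predicates can also be combined inside a single lambda with logical operators. The following is a minimal sketch, assuming the same sample data: one call keeps character columns that contain at least one NA, and another keeps only the columns with no NA at all. The variable names chr_with_na and complete_cols are illustrative.

```r
library(dplyr)

# Sample data rebuilt so the snippet is self-contained
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  dob = as.Date(c('1990-10-02', '1981-03-24', '1987-06-14', '1985-08-16')),
  state = c('CA', 'NY', 'DE', NA),
  stringsAsFactors = FALSE
)

# Character columns that also contain at least one NA
chr_with_na <- df %>% select_if(~ is.character(.) && any(is.na(.)))
print(names(chr_with_na))

# Columns without any NA (the negated condition)
complete_cols <- df %>% select_if(~ !any(is.na(.)))
print(names(complete_cols))
```

Each lambda must return a single TRUE or FALSE per column, which is why the scalar operator && is used rather than the vectorized &.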

Select Specific Column Based on Condition using df[]

So far, we have used the dplyr package to select data frame columns based on conditions. Now, let’s achieve the same with base R. The df[] notation selects both rows and columns of a data frame; to select columns, specify them after the comma within the square brackets ([]).

You can pass grepl("pattern", names(df)) into the df[] notation. grepl() treats its first argument as a regular expression and returns a logical vector indicating which column names contain a match, and df[] keeps only the columns where that vector is TRUE. Because only one column matches in the example below, df[] returns it as a plain vector rather than a data frame.


# Select columns based on condition using df[]
sel_cols <- df[, grepl("id", names(df))]
print("Select columns based on conditions:")
sel_cols

# Output:
# [1] "Select columns based on conditions:"
# [1] 10 11 12 13
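Two details of this approach are worth seeing in code: a single matching column is returned as a vector unless drop = FALSE is given, and grepl() matches substrings unless the pattern is anchored. The sketch below illustrates both on a shortened copy of the sample data; the variable names are illustrative.

```r
# Shortened copy of the sample data, rebuilt so the snippet is self-contained
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  state = c('CA', 'NY', 'DE', NA),
  row.names = c('r1', 'r2', 'r3', 'r4'),
  stringsAsFactors = FALSE
)

# One matching column: the result is dropped to a plain vector
as_vector <- df[, grepl("id", names(df))]
print(class(as_vector))

# drop = FALSE keeps the data frame; ^ and $ anchor the pattern
# so that it matches the column name "id" exactly
as_frame <- df[, grepl("^id$", names(df)), drop = FALSE]
print(class(as_frame))

# An unanchored pattern matches any column name containing it
matched <- names(df)[grepl("e", names(df))]
print(matched)
```

Anchoring matters once a data frame has names that share a fragment, since an unanchored "id" would also match a column name that merely contains those letters.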

Select Multiple Columns using df[] and sapply()

Alternatively, you can combine the df[] notation with the sapply() function to select multiple columns of the data frame by condition. sapply() applies is.character to each column and returns a logical vector indicating which columns are of type character. Passing this logical vector into df[] returns a subset of the data frame that contains only the character columns.


# Select columns based on condition using df[] and sapply() 
sel_cols <- df[, sapply(df, is.character)]
print("Select columns based on condition:")
print(sel_cols)

# Output:
# [1] "Select columns based on condition:"
#       name gender state
# r1     sai      M    CA
# r2     ram      M    NY
# r3 deepika      F    DE
# r4 sahithi      F  <NA>
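The same logical vector can be converted into index positions with which(). The sketch below is one possible variant, not the only way to do it: it uses vapply() instead of sapply() so that the result is guaranteed to be a logical vector, and the names is_chr, chr_pos, and by_position are illustrative.

```r
# Sample data rebuilt so the snippet is self-contained
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  dob = as.Date(c('1990-10-02', '1981-03-24', '1987-06-14', '1985-08-16')),
  state = c('CA', 'NY', 'DE', NA),
  row.names = c('r1', 'r2', 'r3', 'r4'),
  stringsAsFactors = FALSE
)

# vapply() declares that every result must be a single logical value
is_chr <- vapply(df, is.character, logical(1))

# which() turns the logical vector into column positions
chr_pos <- which(is_chr)
print(chr_pos)

# Indexing without a comma always returns a data frame
by_position <- df[chr_pos]
print(names(by_position))
```

Positions are convenient for quick inspection, but they break silently when the column order changes, so the logical vector or column names are usually the safer choice in scripts.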

Conclusion

In this article, I explained how to select columns from a data frame based on specific conditions using both dplyr and base R functions. Additionally, I focused on using the dplyr package to select columns based on their data type.

Happy Learning!!