R Filter Data Frame by Multiple Conditions

How do you filter a data frame by multiple conditions in R? You can use the df[] notation, optionally with the which() function, to filter a data frame on multiple conditions. Filtering a data frame means selecting a subset of rows or columns from a larger data frame based on specific criteria, such as rows where a column meets a condition (e.g., values greater than a threshold) or columns chosen by name or data type.

You can also use the filter() function from the dplyr package or the subset() function from base R to filter a data frame on given conditions. In this article, I will explain different ways to filter an R data frame by multiple conditions.

Key Points –

  • In the resulting data frame, rows appear in the same order as in the original data.
  • After filtering, the columns remain unchanged.
  • With dplyr's filter() on a grouped data frame, a group disappears from the result when none of its rows meet the conditions.
  • You can use the logical operators AND (&) and OR (|) to combine multiple conditions when filtering the rows of a data frame.
  • You can refer to a column by its name (df$col_name) or by its index (df[, i]).
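
The last two points can be sketched on a tiny data frame. This is a minimal illustration only: the data frame and its name, score, and group columns are hypothetical values made up for this snippet, not the data used in the rest of the article.

```r
# Small illustrative data frame (hypothetical values, not the article's data)
df <- data.frame(
  name  = c('a', 'b', 'c', 'd'),
  score = c(10, 25, 40, 55),
  group = c('x', 'y', 'x', 'y')
)

# The same column, referenced by name and by position
identical(df$score, df[['score']])
identical(df$score, df[, 2])

# & keeps rows where both conditions hold
and_rows <- df[df$score > 20 & df$group == 'x', ]
and_rows

# | keeps rows where at least one condition holds
or_rows <- df[df$score > 50 | df$group == 'x', ]
or_rows
```

Note that & and | work element by element, so each one produces a TRUE/FALSE value per row.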

Create Data Frame

To run some examples of filtering a data frame, let’s create an R data frame. If your data is in a CSV file, you can import it into an R data frame instead.


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
df

Yields below output.

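# Output:
   id    name gender        dob state
r1 10     sai      M 1990-10-02    CA
r2 11     ram      M 1981-03-24    NY
r3 12 deepika   <NA> 1987-06-14  <NA>
r4 13 sahithi      F 1985-08-16  <NA>
r5 14   kumar      M 1995-03-02    DC
r6 15   scott      M 1991-06-21    DW
r7 16     Don      M 1986-03-24    AZ
r8 17     Lin      F 1990-08-26    PH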

Using df[] to Filter by Multiple Conditions

You can filter a data frame by multiple conditions using the df[] notation alone, without which(). To keep only the rows that satisfy every condition, combine the conditions with the logical AND operator (&), which returns TRUE for a row only when both conditions are TRUE.


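# Filter the data frame by multiple conditions
# using df[] without which()
# !is.na() guards against the NA in gender, which would otherwise add an all-NA row
fil_df <- df[!is.na(df$gender) & df$gender == 'F' & df$state %in% c('PH', NA),]
print("After filtering the data frame:")
fil_df

The above code returns a new data frame containing the rows where gender is F and state is either PH or NA. Without the !is.na(df$gender) guard, row r3 (whose gender is NA) would produce an NA index and show up as an all-NA row.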

Yields below output.


# Output:
[1] "After filtering the data frame:"
> fil_df 
   id    name gender        dob state
r4 13 sahithi      F 1985-08-16  <NA>
r8 17     Lin      F 1990-08-26    PH
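
Because the gender column contains an NA, it helps to look at the logical vector that goes inside df[] before indexing with it. The sketch below rebuilds a trimmed copy of the data frame (only the columns the conditions touch) and shows how == propagates NA while %in% does not. The !is.na() guard is one possible way to handle this, not the only one.

```r
# Trimmed copy of the article's data frame (only the columns used here)
df <- data.frame(
  id     = c(10, 11, 12, 13, 14, 15, 16, 17),
  gender = c('M', 'M', NA, 'F', 'M', 'M', 'M', 'F'),
  state  = c('CA', 'NY', NA, NA, 'DC', 'DW', 'AZ', 'PH'),
  row.names = c('r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8')
)

# Comparing NA with == gives NA, while %in% always gives TRUE or FALSE
cond_gender <- df$gender == 'F'
cond_state  <- df$state %in% c('PH', NA)

# NA & TRUE is NA, so the third element (row r3) is NA
keep <- cond_gender & cond_state
keep

# An NA in a logical index adds a row filled with NA: 3 rows instead of 2
nrow(df[keep, ])

# Guarding with !is.na() turns that NA into FALSE
keep_safe <- !is.na(df$gender) & keep
rownames(df[keep_safe, ])   # "r4" "r8"
```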

Filter Rows by Multiple Conditions using df[] with which()

Alternatively, you can wrap the conditions in which() inside df[]. which() returns the integer positions of the rows where the condition is TRUE, and it skips positions where the condition is NA, so no all-NA rows end up in the result.

To keep the rows that satisfy at least one of several conditions, use the logical OR operator (|), which returns TRUE if at least one of the conditions is TRUE.


# Filter the data frame by multiple conditions
# using df[] with which()
fil_df <- df[which(df$gender == 'F' | df$state != 'CA'),]
print("After filtering the data frame:")
fil_df

The above code returns a new data frame containing the rows where gender is F or state is not CA (California). Row r3, where both columns are NA, is dropped because which() ignores NA.

Yields below output.

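# Output:
[1] "After filtering the data frame:"
> fil_df
   id    name gender        dob state
r2 11     ram      M 1981-03-24    NY
r4 13 sahithi      F 1985-08-16  <NA>
r5 14   kumar      M 1995-03-02    DC
r6 15   scott      M 1991-06-21    DW
r7 16     Don      M 1986-03-24    AZ
r8 17     Lin      F 1990-08-26    PH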
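
To make the role of which() more concrete, the following sketch prints the intermediate values on a trimmed copy of the data frame (only the columns the conditions use). The variable names cond and idx are introduced here for illustration.

```r
# Trimmed copy of the article's data frame (only the columns used here)
df <- data.frame(
  id     = c(10, 11, 12, 13, 14, 15, 16, 17),
  gender = c('M', 'M', NA, 'F', 'M', 'M', 'M', 'F'),
  state  = c('CA', 'NY', NA, NA, 'DC', 'DW', 'AZ', 'PH'),
  row.names = c('r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8')
)

# Logical vector with one value per row; the third value is NA
# because both comparisons are NA for row r3
cond <- df$gender == 'F' | df$state != 'CA'
cond

# which() keeps only the positions that are TRUE and skips FALSE and NA
idx <- which(cond)
idx                    # 2 4 5 6 7 8

# Indexing by position therefore never yields an all-NA row
rownames(df[idx, ])
```

Row r4 is kept even though its state is NA, since TRUE | NA evaluates to TRUE.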

Using the filter() Function to Filter by Multiple Conditions

Similarly, you can use the filter() function from the dplyr package to filter a data frame on multiple conditions. Before using filter(), install the dplyr package with install.packages('dplyr'), then load it with library(dplyr).

Let’s pass multiple conditions on specified column values using logical operators to filter the data frame rows.


# Using dplyr::filter
# Load dplyr package
library(dplyr)

# Rows where gender is F or state is not CA
fil_df <- df %>% filter(gender == 'F' | state != 'CA')
print("After filtering the data frame:")
fil_df

# Rows where gender is F and state is PH or NA
fil_df <- df %>% filter(gender == 'F' & state %in% c('PH', NA))
print("After filtering the data frame:")
fil_df

Yields below output.


# Output:
[1] "After filtering the data frame:"
> fil_df
   id    name gender        dob state
r2 11     ram      M 1981-03-24    NY
r4 13 sahithi      F 1985-08-16  <NA>
r5 14   kumar      M 1995-03-02    DC
r6 15   scott      M 1991-06-21    DW
r7 16     Don      M 1986-03-24    AZ
r8 17     Lin      F 1990-08-26    PH

[1] "After filtering the data frame:"
> fil_df
   id    name gender        dob state
r4 13 sahithi      F 1985-08-16  <NA>
r8 17     Lin      F 1990-08-26    PH
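
filter() also accepts several conditions as separate arguments and combines them with &, and it drops rows for which the condition evaluates to NA. A minimal sketch, assuming dplyr is installed and using a trimmed copy of the data frame (the names by_and and by_comma are introduced here for illustration):

```r
library(dplyr)

# Trimmed copy of the article's data frame (only the columns used here)
df <- data.frame(
  id     = c(10, 11, 12, 13, 14, 15, 16, 17),
  gender = c('M', 'M', NA, 'F', 'M', 'M', 'M', 'F'),
  state  = c('CA', 'NY', NA, NA, 'DC', 'DW', 'AZ', 'PH'),
  row.names = c('r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8')
)

# Conditions joined explicitly with &
by_and <- df %>% filter(gender == 'F' & state %in% c('PH', NA))

# The same conditions passed as separate arguments
by_comma <- df %>% filter(gender == 'F', state %in% c('PH', NA))

# Both forms select the same rows
all.equal(by_and, by_comma)
by_comma$id   # 13 17
```

Row r3 has an NA gender, so its condition is NA and filter() leaves it out without needing an is.na() guard.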

Using the subset() Function to Filter the Data Frame

Finally, you can use the subset() function from base R to filter a data frame on multiple conditions. It takes the data frame as its first argument and a logical expression as its second. Like filter(), it drops rows for which the expression evaluates to NA.


# subset by multiple conditions using |
fil_df <- subset(df, gender == 'F' | state != 'CA')
print("After filtering the data frame:")
fil_df

# subset by multiple conditions using &
fil_df <- subset(df, gender == 'F' & state %in% c('PH', NA))
print("After filtering the data frame:")
fil_df

Yields the same output as above.
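
Since filtering can also mean picking columns, it may be useful to know that subset() has an optional select argument for choosing columns in the same call. A short sketch on a trimmed copy of the data frame (the result name res is introduced here for illustration):

```r
# Trimmed copy of the article's data frame
df <- data.frame(
  id     = c(10, 11, 12, 13, 14, 15, 16, 17),
  name   = c('sai', 'ram', 'deepika', 'sahithi', 'kumar', 'scott', 'Don', 'Lin'),
  gender = c('M', 'M', NA, 'F', 'M', 'M', 'M', 'F'),
  state  = c('CA', 'NY', NA, NA, 'DC', 'DW', 'AZ', 'PH'),
  row.names = c('r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8')
)

# Filter rows with | and keep only the id and name columns
res <- subset(df, gender == 'F' | state != 'CA', select = c(id, name))
res

# Row r3 (NA in both columns) is dropped; r1 fails both conditions
rownames(res)
names(res)
```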

Conclusion

In this article, I have explained how to filter a data frame on multiple conditions in R using the df[] notation with and without which(), the filter() function from the dplyr package, and the base R subset() function. The logical AND operator (&) returns TRUE only if both conditions are TRUE. The logical OR operator (|) returns TRUE if at least one condition is TRUE.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.