Filter in sparklyr | R Interface to Spark

The sparklyr filter() function is a powerful tool for filtering data rows from DataFrame based on a given condition or SQL expression. Sparklyr package allows you to use Spark capabilities in R programming, it provides an interface between R and Apache Spark. In this sparklyr article, we will explore how to apply a filter on DataFrame columns of string, arrays, and struct types using both single and multiple conditions using R language. We will also dive into the process of applying a filter using %in% operator.

By the end of this sparklyr article, you will have an in-depth understanding of how to leverage the sparklyr filter() function to extract the desired data rows from your Spark data frame using a variety of filtering techniques to refine your results. Whether you’re filtering by string, array, or struct types, or using %in% to filter based on a set of specific values, Sparklyr provides a wealth of tools and capabilities to help you effectively manipulate and analyze your data. So let’s dive in and explore the world of Sparklyr filtering!

1. Sparklyr DataFrame filter() Syntax

The code snippet below shows the syntax of the sparklyr filter() function. You will use an expression to specify the condition you wish to apply for filtering and apply it to the data frame.


# Syntax of filter()
filter(dataframe,condition)

Before we start with examples, first let’s create a DataFrame.

 
library(sparklyr)

data <- data.frame(
  id = 1:6,
  name = c("James", "Anna", "Julia", "Maria", "Jen", "Mike"),
  age = c(30, 25, 35, 40, 45, 20),
  city = c("New York", "London", "Paris", "Madrid", "Berlin", "Rome")
)

# Initiate spark connection
sc <- spark_connect(master = 'local', 
                    spark_home = Sys.getenv("SPARK_HOME"), 
                    app_name = "SparkByExamples.com", 
                    method = 'shell',
                    version = '3.0.0')

# Load data into a Spark dataframe
df <- sdf_copy_to(sc, data, 'df')

# View the first 5 rows
df % head(5)


The above code yields the below output;

sparklyr filter

2. DataFrame filter() with Column Condition in sparklyr

Use Column with the condition to filter the rows from sparklyr DataFrame, using this you can also express complex conditions.


library(sparklyr)

# Using equals condition
df %>%
  filter(city == "London") %>%
  View()

#cOutput:
# Source: spark [?? x 4]
#     id name    age city  
#      
#1     2 Anna     25 London

This sparklyr code applies an equals condition to filter the data frame, using the %>% operator to chain together a series of operations. Specifically, the sparklyr filter() function is used to retrieve the rows where the city column matches the value “London”. Finally, the View() function is used to display the filtered data frame.

The resulting output of the first block of code is a data frame that contains a single row. It is displaying the details of a person with id 2, named Anna, aged 25, and residing in London.


# Using not equal condition
df %>%
  filter(city != "London") %>%
  show()

# Output:
# Source: spark [?? x 4]
#     id name    age city    
#        
#1     1 James    30 New York
#2     3 Julia    35 Paris   
#3     4 Maria    40 Madrid  
#4     5 Jen      45 Berlin  
#5     6 Mike     20 Rome   

# Using equal with negate in front
df %>%
  filter(!(city== "London")) %>%
  show() 

Here, the code applies a not equal condition to filter the data frame. Here, the sparklyr filter() function is used again to retrieve the rows where the city column does not match the value “London”. The resulting data frame is then displayed using the show() function.

Alternatively, you can also write the above statement by putting the negation at the front of the expression.


# Using equal with negate in front
df %>%
  filter(!(city== "London")) %>%
  show() 

This yields the same output as above example.

3. Filter with multiple conditions together

You can also filter sparklyr DataFrame with multiple conditions, just combine the multiple conditions together using the conditional operators and pass it as an expression to filter(). Here is an example.


library(sparklyr)

# Using multiple conditions
df %>%
    filter(city != "London" & age > 30) %>%
    show()

Output:
# Source: spark [?? x 4]
#     id name    age city  
#      
#1     3 Julia    35 Paris 
#2     4 Maria    40 Madrid
#3     5 Jen      45 Berlin

This code block employs a series of operations to apply two filter conditions to the data frame: filtering for both city not equal to “London” and age greater than 30. The %>% operator chains these operations together.

To accomplish this, the code utilizes the sparklyr filter() function to filter rows in the data frame where the city column is not equal to “London” and the age column is greater than 30. The & operator combines the two conditions.

4. Filter Based on List Values

If you have a list of values in a R Vector and you would like to filter the rows based on a column value present in the list, use the sparklyr filter() with %in% operator.

 
library(sparklyr)

#Filter in List values
df %>%
    filter(city %in% c('London','New York')) %>%
    show()


Here, c() is a combined function that is used to create a vector. The above code yields the below output:

Output of filter based on list values

5. Filter Based on Starts With, Ends With, Contains

 
library(sparklyr)

# Search for city starts with N
data %>% filter(grepl('N*', city))

# Output:
#  id  name age     city
#1  1 James  30 New York

# Search for city ends with n
data %>% filter(grepl('*n', city))

# Output:
#id name age   city
#1  2 Anna  25 London
#2  5  Jen  45 Berlin

The code block utilizes the grepl() function to specify a pattern that matches the city column in the data frame. Specifically, it filters the rows where the city column begins with the letter “N”, followed by zero or more characters. Then, we apply the filter() function to retrieve the resulting data frame, and we display it using the show() function.

6. Filter based on the pattern in sparklyr

Finally, let’s filter the DataFrame rows using pattern.

 
# Search for name which contains a
data %>% filter(grepl('*a', name))

# Output:
#id  name age     city
#1  1 James  30 New York
#2  2  Anna  25   London
#3  3 Julia  35    Paris
#4  4 Maria  40   Madrid

7. Source code of filter operation

 
library(sparklyr)

data <- data.frame(
  id = 1:6,
  name = c("James", "Anna", "Julia", "Maria", "Jen", "Mike"),
  age = c(30, 25, 35, 40, 45, 20),
  city = c("New York", "London", "Paris", "Madrid", "Berlin", "Rome")
)

# Initiate spark connection
sc <- spark_connect(master = 'local', 
                    spark_home = Sys.getenv("SPARK_HOME"), 
                    app_name = "SparkByExamples.com", 
                    method = 'shell',
                    version = '3.0.0')

# Load data into a Spark dataframe
df <- sdf_copy_to(sc, data, 'df')

# View the first 5 rows
df % head(5)

# Using equals condition
df %>%
  filter(city == "London") %>%
  show()

# Using not equal condition
df %>%
  filter(city != "London") %>%
  show()

# Using equal with negate in front
df %>%
  filter(!(city== "London")) %>%
  show()

# Using multiple condition
df %>%
    filter(city != "London" & age > 30) %>%
    show()

# Filter in List values
df %>%
    filter(city %in% c('London','New York')) %>%
    show()

# Search for city starts with N
data %>% filter(grepl('N*', city))

# Search for city ends with n
data %>% filter(grepl('*n', city))

# Search for name which contains a
data %>% filter(grepl('*a', name))



Conclusion

In conclusion sparklyrr filter() is used to filter rows from the DataFrame/RDD. Sparklyr package allows you to use Spark capabilities in R programming, it provides an interface between R and Apache Spark. I hope you have learned how to apply a filter on DataFrame columns of string, arrays, and struct types using both single and multiple conditions using sparklyr in R language.

Reference

https://spark.rstudio.com/

Leave a Reply