PySpark isin() & SQL IN Operator

  • Post author:
  • Post category:PySpark
  • Post last modified:August 5, 2022

PySpark isin() or IN operator is used to check/filter if the DataFrame values are exists/contains in the list of values. isin() is a function of Column class which returns a boolean value True if the value of the expression is contained by the evaluated values of the arguments.

Sometimes, when you use isin(list_param) from the Column class, you will get an error java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon. In this article I will explain how to overcome this error while using isin() function.

pyspark isin

1. Quick Examples of Using isin()

Following are quick examples of how to use isin() to filter the rows from DataFrame.


# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()

# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()

# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages IN ('Java','Scala')").show()

2. Syntax of isin()

Following is the syntax of isin() function. This function takes *cols as argument.


# Syntax of isin()
Column.isin(*cols)

Let’s create a DataFrame


#Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame    
data = [("James","Java"),("Michael","Spark"),
        ("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()

3. PySpark isin() Example

pyspark.sql.Column.isin() function is used to check if a column value of DataFrame exists/contains in a list of string values and this function mostly used with either where() or filter() functions. Let’s see with an example, below example filter the rows languages column value present in ‘Java‘ & ‘Scala‘. Note that the isin() or IN operator is a shorthand for multiple OR conditions.


# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()

Yields below output.

pyspark isin

4. Using PySpark IN Operator

Let’s see how to use IN operator in PySpark to filter rows.


# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()

5. PySpark SQL IN Operator

In PySpark SQL, isin() function doesn’t work instead you should use IN operator to check values present in a list of values, it is usually used with the WHERE clause. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView().


# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages IN ('Java','Scala')").show()

These results same output as above.

6. Conclusion

PySpark isin() function is used to check if the DataFrame column value exists in a list/array of values. isin() function is from Column class that return a boolean value.

Happy Learning !!

Related Articles

Reference

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing PySpark isin() & SQL IN Operator