PySpark isin() & SQL IN Operator

  • Post author:
  • Post category:PySpark
  • Post last modified:December 8, 2022
  • Reading time:6 mins read

PySpark isin() or IN operator is used to check/filter if the DataFrame values are exists/contains in the list of values. isin() is a function of Column class which returns a boolean value True if the value of the expression is contained by the evaluated values of the arguments.

Sometimes, when you use isin(list_param) from the Column class, you will get an error java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon. In this article I will explain how to overcome this error while using isin() function.

pyspark isin

1. Quick Examples of Using isin()

Following are quick examples of how to use isin() to filter the rows from DataFrame.


# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()

# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()

# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages IN ('Java','Scala')").show()

2. Syntax of isin()

Following is the syntax of isin() function. This function takes *cols as argument.


# Syntax of isin()
Column.isin(*cols)

Let’s create a DataFrame


#Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame    
data = [("James","Java"),("Michael","Spark"),
        ("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()

3. PySpark isin() Example

pyspark.sql.Column.isin() function is used to check if a column value of DataFrame exists/contains in a list of string values and this function mostly used with either where() or filter() functions. Let’s see with an example, below example filter the rows languages column value present in ‘Java‘ & ‘Scala‘. Note that the isin() or IN operator is a shorthand for multiple OR conditions.


# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()

Yields below output.

pyspark isin

4. Using PySpark IN Operator

Let’s see how to use IN operator in PySpark to filter rows.


# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()

5. PySpark SQL IN Operator

In PySpark SQL, isin() function doesn’t work instead you should use IN operator to check values present in a list of values, it is usually used with the WHERE clause. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView().


# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages IN ('Java','Scala')").show()

These results same output as above.

6. Conclusion

PySpark isin() function is used to check if the DataFrame column value exists in a list/array of values. isin() function is from Column class that return a boolean value.

Happy Learning !!

Reference

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ @ LinkedIn

Leave a Reply