You are currently viewing PySpark isin() & SQL IN Operator

In PySpark, the isin() function, or the IN operator is used to check DataFrame values and see if they’re present in a given list of values. This function is part of the Column class and returns True if the value matches any of the provided arguments.

Advertisements

1. isin() Examples

Check the following isin() examples to filter the DataFrame rows.


# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()

# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()

# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages IN ('Java','Scala')").show()

2. isin() Function Syntax

Following is the syntax of the isin() function. This function takes *cols as an argument. It returns a boolean column indicating the presence of each row’s value in the list.


# Syntax of isin()
Column.isin(*cols)

Create a DataFrame to learn with an example.


#Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame    
data = [("James","Java"),("Michael","Spark"),
        ("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()

3. PySpark isin() Example

The isin() function in PySpark is used to checks if the values in a DataFrame column match any of the values in a specified list/array. If a value in the DataFrame column is found in the list, it returns True; otherwise, it returns False. This function is handy for filtering data based on specific values you’re interested in.

Let’s explore an example: the following code snippet filters rows where the value in the “languages” column has either ‘Java’ or ‘Scala’. It’s worth noting that the isin() function or IN operator serves as a shortcut for multiple OR conditions.


# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()

Yields below output.

pyspark isin

4. Using IN Operator

Alternativley, you can also use the IN operator in PySpark to filter rows.


# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()

5. Using SQL IN Operator


In PySpark SQL, the isin() function is not supported. Hence, you should use the IN operator to verify if values exist within a provided list. Typically, it’s utilized alongside the WHERE clause. To utilize SQL, use createOrReplaceTempView() to create a temporary table or view.


# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages IN ('Java','Scala')").show()

These results in the same output.

6. Conclusion

The isin() function in PySpark is used to filter rows in a DataFrame based on whether the values in a specified column match any value in a given list. It returns a boolean column indicating the presence of each row’s value in the list. This function is useful for selecting rows with specific values from a column, similar to SQL’s IN clause.

Happy Learning !!

Reference