In PySpark, the isin() function, or the IN operator is used to check DataFrame values and see if they’re present in a given list of values. This function is part of the Column class and returns True if the value matches any of the provided arguments.
1. isin() Examples
Check the following isin() examples to filter the DataFrame rows.
# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()
from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()
# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()
# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
"languages IN ('Java','Scala')").show()
2. isin() Function Syntax
Following is the syntax of the isin() function. This function takes *cols as an argument. It returns a boolean column indicating the presence of each row’s value in the list.
# Syntax of isin()
Column.isin(*cols)
Create a DataFrame to learn with an example.
#Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
.appName("SparkByExamples.com").getOrCreate()
# Create DataFrame
data = [("James","Java"),("Michael","Spark"),
("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()
3. PySpark isin() Example
The isin() function in PySpark is used to checks if the values in a DataFrame column match any of the values in a specified list/array. If a value in the DataFrame column is found in the list, it returns True; otherwise, it returns False. This function is handy for filtering data based on specific values you’re interested in.
Let’s explore an example: the following code snippet filters rows where the value in the “languages” column has either ‘Java’ or ‘Scala’. It’s worth noting that the isin() function or IN operator serves as a shortcut for multiple OR conditions.
# PySpark isin()
listValues = ["Java","Scala"]
df.filter(df.languages.isin(listValues)).show()
from pyspark.sql.functions import col
df.filter(col("languages").isin(listValues)).show()
Yields below output.
4. Using IN Operator
Alternativley, you can also use the IN
operator in PySpark to filter rows.
# Using IN operator
df.filter("languages in ('Java','Scala')" ).show()
5. Using SQL IN Operator
In PySpark SQL, the isin() function is not supported. Hence, you should use the IN operator to verify if values exist within a provided list. Typically, it’s utilized alongside the WHERE clause. To utilize SQL, use createOrReplaceTempView() to create a temporary table or view.
# PySpark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
"languages IN ('Java','Scala')").show()
These results in the same output.
6. Conclusion
The isin()
function in PySpark is used to filter rows in a DataFrame based on whether the values in a specified column match any value in a given list. It returns a boolean column indicating the presence of each row’s value in the list. This function is useful for selecting rows with specific values from a column, similar to SQL’s IN
clause.
Happy Learning !!
Related Articles
- PySpark SQL Left Semi Join Example
- PySpark NOT isin() or IS NOT IN Operator
- PySpark Get Number of Rows and Columns
- PySpark NOT isin() or IS NOT IN Operator
- PySpark SQL Self Join With Example
- PySpark SQL expr() (Expression ) Function
- PySpark SQL – Working with Unix Time | Timestamp
- PySpark SQL Date and Timestamp Functions