Spark Filter Using contains() Examples


In Spark & PySpark, the contains() function is used to check whether a column value contains a literal string (it matches on part of the string). It is most commonly used to filter rows of a DataFrame.

  • contains() – This method checks whether the string specified as an argument is contained in a DataFrame column; it returns true if it is, otherwise false.
  • This function is available on the Column class.

You can also match by a wildcard character using the like() function, and match by regular expression using the rlike() function.

In order to explain contains() with examples, first let's create a DataFrame with some test data.

// Make sure you create a SparkSession object.
import spark.implicits._

val data = Seq((1,"James Smith"), (2,"Michael Rose"),
  (3,"Robert Williams"), (4,"Rames Rose"), (5,"Rames rose"))
val df = data.toDF("id","name")

1. Filter DataFrame Rows Where a Column Contains a String

The contains() method checks whether a DataFrame column contains the string specified as an argument (it matches on part of the string). It returns true if the string exists and false if not. The example below returns all rows from the DataFrame whose name column contains the string mes.

// Filter all rows that contain the string 'mes' in the 'name' column
import org.apache.spark.sql.functions.col
df.filter(col("name").contains("mes")).show()

// Output:
// +---+-----------+
// | id|       name|
// +---+-----------+
// |  1|James Smith|
// |  4| Rames Rose|
// |  5| Rames rose|
// +---+-----------+

// You can get the same result with like()
df.filter(col("name").like("%mes%")).show()

If you want to filter case-insensitively, refer to the Spark rlike() function, which filters by regular expression.

2. Spark SQL contains() Example

// Register the DataFrame as a temporary view, then filter rows with SQL
df.createOrReplaceTempView("TAB")
spark.sql("select * from TAB where name like '%mes%'").show()

3. PySpark contains() Example

# PySpark contains() Example
# Assuming 'df' is a PySpark DataFrame with the same columns as above
from pyspark.sql.functions import col
df.filter(col("name").contains("mes")).show()


In this Spark and PySpark article, I have covered examples of how to filter DataFrame rows based on whether a column contains a given string.

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love designing, optimizing, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.

