Spark Filter Using contains() Examples

In Spark and PySpark, the contains() function checks whether a column value contains a literal string (a match on part of the string). It is mostly used to filter rows of a DataFrame.

  • contains() – This method checks whether the string passed as an argument occurs in a DataFrame column value; it returns true if it does and false otherwise.
  • This function is available in the Column class.
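As a minimal sketch of the two points above, the following adds the result of contains() as a Boolean column instead of using it in a filter. The object name ContainsAsColumn, the helper withFlag, the column name has_mes, and the local session settings are hypothetical choices for illustration; it assumes Spark is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ContainsAsColumn {
  // contains() is a method on Column and yields another Column of Booleans.
  // Here it is attached as a new column named "has_mes" (hypothetical name).
  def withFlag(df: DataFrame): DataFrame =
    df.withColumn("has_mes", col("name").contains("mes"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are placeholders
    val spark = SparkSession.builder()
      .appName("contains-as-column")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "James Smith"), (2, "Michael Rose")).toDF("id", "name")

    // Shows every row together with its true/false flag
    withFlag(df).show()

    spark.stop()
  }
}
```

Because the result is an ordinary Column, the same expression can be passed to filter(), where(), or select().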

You can also match with wildcard characters using like(), or with a regular expression using rlike().
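To make the difference between the three concrete, here is a small sketch that applies contains(), like(), and rlike() to the same column. The object name ContainsLikeRlike, the helper ids, and the session settings are hypothetical; the patterns are chosen only for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ContainsLikeRlike {
  // Collects the ids of the rows that satisfy the condition, sorted for stable output
  def ids(df: DataFrame, cond: Column): List[Int] =
    df.filter(cond).select("id").collect().map(_.getInt(0)).toList.sorted

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are placeholders
    val spark = SparkSession.builder()
      .appName("contains-like-rlike")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "James Smith"), (2, "Michael Rose"), (3, "Robert Williams"))
      .toDF("id", "name")

    // contains(): literal substring, no pattern syntax
    println(ids(df, col("name").contains("Rose")))

    // like(): SQL wildcards, where % matches any run of characters
    println(ids(df, col("name").like("%Rose")))

    // rlike(): Java regular expression, matched anywhere in the value
    println(ids(df, col("name").rlike("Rose$")))

    spark.stop()
  }
}
```

With this sample data, all three conditions should select only the row for Michael Rose; they differ in how much pattern syntax the argument may carry.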

To explain contains() with examples, let’s first create a DataFrame with some test data.

// Make sure you create a SparkSession object.
import spark.implicits._

val data = Seq((1,"James Smith"), (2,"Michael Rose"),
  (3,"Robert Williams"), (4,"Rames Rose"), (5,"Rames rose"))
val df = data.toDF("id","name")

1. Filter a DataFrame Column Using contains()

The contains() method checks whether a DataFrame string column contains the string passed as an argument (a match on part of the string). It returns true if the substring is present and false otherwise. The example below returns all rows of the DataFrame whose name column contains the string mes.

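// Filter all rows that contain the string 'mes' in the 'name' column
import org.apache.spark.sql.functions.col
df.filter(col("name").contains("mes")).show()

+---+-----------+
| id|       name|
+---+-----------+
|  1|James Smith|
|  4| Rames Rose|
|  5| Rames rose|
+---+-----------+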

// You can also get the same result with like()
df.filter(col("name").like("%mes%")).show()

Note that contains() is case sensitive. To filter case-insensitively, refer to the Spark rlike() function, which filters by regular expression.
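As a hedged sketch of case-insensitive matching, the example below shows two common options: lower-casing the column before calling contains(), and using the (?i) flag of Java regular expressions with rlike(). The object name CaseInsensitiveFilter, the helper ids, and the session settings are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower}

object CaseInsensitiveFilter {
  // Collects the ids of the rows that satisfy the condition, sorted for stable output
  def ids(df: DataFrame, cond: Column): List[Int] =
    df.filter(cond).select("id").collect().map(_.getInt(0)).toList.sorted

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are placeholders
    val spark = SparkSession.builder()
      .appName("case-insensitive-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "James Smith"), (2, "Michael Rose"),
      (3, "Robert Williams"), (4, "Rames Rose"), (5, "Rames rose"))
      .toDF("id", "name")

    // Case-sensitive: matches only the lower-case spelling "rose"
    println(ids(df, col("name").contains("rose")))

    // Option 1: normalise the column with lower(), then search for a lower-case literal
    println(ids(df, lower(col("name")).contains("rose")))

    // Option 2: regular expression with the case-insensitive flag (?i)
    println(ids(df, col("name").rlike("(?i)rose")))

    spark.stop()
  }
}
```

With the test data from this article, the case-sensitive filter should keep only id 5, while both case-insensitive variants should keep ids 2, 4, and 5.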

2. Spark SQL contains() Example

// Register the DataFrame as a temporary view, then filter rows with SQL
df.createOrReplaceTempView("TAB")
spark.sql("select * from TAB where name like '%mes%'").show()
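The query above uses like rather than a substring function. As a sketch of alternatives, the example below filters with instr(), which returns the 1-based position of the substring or 0 when it is absent, and with the SQL contains() function. The latter is assumed to be available only in newer Spark releases (3.3 and later, as far as I know), so treat that line as optional. The object name SqlSubstringFilter, the helper idsFromSql, and the session settings are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SqlSubstringFilter {
  // Runs a query and collects the values of its "id" column, sorted for stable output
  def idsFromSql(spark: SparkSession, query: String): List[Int] =
    spark.sql(query).select("id").collect().map(_.getInt(0)).toList.sorted

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are placeholders
    val spark = SparkSession.builder()
      .appName("sql-substring-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "James Smith"), (2, "Michael Rose"),
      (3, "Robert Williams"), (4, "Rames Rose"), (5, "Rames rose"))
      .toDF("id", "name")
    df.createOrReplaceTempView("TAB")

    // instr() > 0 means the substring occurs somewhere in the value
    println(idsFromSql(spark, "SELECT * FROM TAB WHERE instr(name, 'mes') > 0"))

    // Assumption: the SQL contains() function exists in the Spark version in use;
    // on older versions this query fails with an undefined-function error
    println(idsFromSql(spark, "SELECT * FROM TAB WHERE contains(name, 'mes')"))

    spark.stop()
  }
}
```

The instr() form should select the same rows as the like '%mes%' query and does not depend on a recent Spark version.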

3. PySpark contains() Example

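# PySpark contains() example; assumes df is a PySpark DataFrame with a 'name' column
from pyspark.sql.functions import col
df.filter(col("name").contains("mes")).show()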

In this article, I have covered how to filter DataFrame rows in Spark and PySpark based on whether a column contains a given string, with examples.

Happy Learning !!

Naveen Nelamali
