Spark Filter Using contains() Examples


In Spark & PySpark, the contains() function is used to check whether a column value contains a literal string (it matches on part of the string). It is most commonly used to filter rows of a DataFrame.

  • contains() – This method checks whether the string specified as an argument is contained in a DataFrame column; it returns true if it is, otherwise false.
  • This function is available on the Column class.

You can also match by a wildcard character using the like() function, and match by regular expression using the rlike() function.

In order to explain contains() with examples, first let's create a DataFrame with some test data.

// Make sure you create a SparkSession object.
import spark.implicits._

val data = Seq((1,"James Smith"), (2,"Michael Rose"),
  (3,"Robert Williams"), (4,"Rames Rose"), (5,"Rames rose"))
val df = data.toDF("id","name")

1. Filter DataFrame Rows Where a Column Contains a String

The contains() method checks whether a DataFrame column contains the string specified as an argument (it matches on part of the string). It returns true if the string exists and false if not. The example below returns all rows from the DataFrame whose name column contains the string mes.

// Filter all rows that contain the string 'mes' in the 'name' column
import org.apache.spark.sql.functions.col
df.filter(col("name").contains("mes")).show()

// Output:
// +---+-----------+
// | id|       name|
// +---+-----------+
// |  1|James Smith|
// |  4| Rames Rose|
// |  5| Rames rose|
// +---+-----------+

// You can get the same result with like()
df.filter(col("name").like("%mes%")).show()

If you want to filter case-insensitively, refer to the Spark rlike() function, which filters by regular expression.

2. Spark SQL contains() Example

// Register the DataFrame as a temporary view, then filter rows with SQL
df.createOrReplaceTempView("TAB")
spark.sql("select * from TAB where name like '%mes%'").show()

3. PySpark contains() Example

# PySpark contains() Example
# Assuming 'df' is a PySpark DataFrame with the same columns as above
from pyspark.sql.functions import col
df.filter(col("name").contains("mes")).show()


In this Spark and PySpark article, I have covered examples of how to filter DataFrame rows based on whether a column contains a given string.

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love designing, optimizing, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.

