In Spark & PySpark, the contains() function is used to check whether a DataFrame column value contains a literal string (it matches on part of the string). It is most commonly used to filter rows of a DataFrame.
contains() – This method checks whether the string specified as an argument is contained in a DataFrame column value; it returns true if it is, and false otherwise.
- This function is available on the Column class: org.apache.spark.sql.Column in Spark and pyspark.sql.Column in PySpark.
You can also match by wildcard characters using the like() function, and by regular expression using the rlike() function.
In order to explain contains() with examples, let's first create a DataFrame with some test data.
```scala
// Make sure you create a SparkSession object
import spark.implicits._

val data = Seq(
  (1, "James Smith"),
  (2, "Michael Rose"),
  (3, "Robert Williams"),
  (4, "Rames Rose"),
  (5, "Rames rose")
)
val df = data.toDF("id", "name")
```
1. Filter DataFrame Rows Where a Column Contains a String
The contains() method checks whether a DataFrame column string contains the string specified as an argument (matching on part of the string). It returns true if the string exists and false if not. The example below returns all rows from the DataFrame whose 'name' column contains the string 'mes'.
```scala
// Filter all rows that contain the string 'mes' in the 'name' column
import org.apache.spark.sql.functions.col
df.filter(col("name").contains("mes")).show()

+---+-----------+
| id|       name|
+---+-----------+
|  1|James Smith|
|  4| Rames Rose|
|  5| Rames rose|
+---+-----------+

// You can also use like() with wildcards
df.filter(col("name").like("%mes%")).show()
```
If you want to filter case-insensitively, refer to the Spark rlike() function to filter by regular expression.
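As a minimal sketch of case-insensitive matching on the DataFrame above: you can either lower-case the column before calling contains(), or use rlike() with Java's inline (?i) case-insensitivity flag. Both filters below match "Rames Rose" as well as "Rames rose".

```scala
// Case-insensitive match, option 1: normalize the column with lower()
import org.apache.spark.sql.functions.{col, lower}
df.filter(lower(col("name")).contains("rose")).show()

// Case-insensitive match, option 2: rlike() with the (?i) inline regex flag
df.filter(col("name").rlike("(?i)rose")).show()
```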
2. Spark SQL contains() Example
```scala
// Using Spark SQL to filter rows
df.createOrReplaceTempView("TAB")
spark.sql("select * from TAB where name like '%mes%'").show()
```
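If you prefer not to use LIKE wildcards, Spark SQL's instr() function gives the same substring check: it returns the 1-based position of the substring within the column value, or 0 when the substring is absent, so filtering on instr(...) > 0 is equivalent to the contains() examples above.

```scala
// Equivalent Spark SQL filter using instr(): keeps rows where 'mes' occurs in 'name'
spark.sql("select * from TAB where instr(name, 'mes') > 0").show()
```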
3. PySpark contains() Example
```python
# PySpark example of contains()
from pyspark.sql.functions import col
df.filter(col("name").contains("mes")).show()
```
In this Spark and PySpark article, I have covered examples of how to filter DataFrame rows where a column contains a specified string.
Happy Learning !!
- How to Filter Rows with NULL/NONE (IS NULL & IS NOT NULL) in Spark
- Spark Filter – startsWith(), endsWith() Examples
- Spark Filter using Multiple Conditions
- Spark Filter Rows with NULL Values in DataFrame
- Spark DataFrame Where() To Filter Rows
- Spark DataFrame Where Filter | Multiple Conditions
- Calculate Size of Spark DataFrame & RDD
- Spark Word Count Explained with Example