Similar to the SQL regexp_like() function, Spark and PySpark support regular expression (regex) matching through the rlike() function, which is available in the org.apache.spark.sql.Column class. Use rlike() with a regex to filter rows case-insensitively (ignoring case), to keep only rows whose value is entirely numeric, and more.
- rlike() is a function of org.apache.spark.sql.Column class.
- rlike() is similar to like() but with regex (regular expression) support.
- It can also be used in Spark SQL query expressions.
- It is similar to the regexp_like() function of SQL.
1. rlike() Syntax
Following is the syntax of the rlike() function. It takes a literal regex string as a parameter and returns a boolean column based on whether the regex matches.
// Syntax
def rlike(literal: String): org.apache.spark.sql.Column
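To make the matching rule concrete: rlike() succeeds when the pattern is found anywhere in the value, so the anchors ^ and $ are needed for a whole-value match, and a null value yields null rather than false. The sketch below mimics this for a single value in plain Python, using the standard re module as a stand-in for the Java regex engine that Spark uses. The helper name rlike_value is hypothetical and is not part of any Spark API.

```python
import re
from typing import Optional


def rlike_value(value: Optional[str], pattern: str) -> Optional[bool]:
    """Mimic rlike() for a single value (hypothetical helper, not a Spark API)."""
    if value is None:
        # A null input gives a null result, not False.
        return None
    # re.search looks for the pattern anywhere in the value (partial match).
    return re.search(pattern, value) is not None


print(rlike_value("abc123", "[0-9]+"))    # partial match is enough
print(rlike_value("abc123", "^[0-9]+$"))  # anchors force a whole-value match
print(rlike_value(None, "[0-9]+"))        # null in, null out
```

In a filter, rows for which the result is null are dropped along with the rows for which it is false.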
2. rlike() Usage
The rlike() function can be used to derive a new Spark/PySpark DataFrame column from an existing column, to filter data by matching it against a regular expression, and as part of conditional expressions.
// Usage
import org.apache.spark.sql.functions.col
col("alphanumeric").rlike("^[0-9]*$")
df("alphanumeric").rlike("^[0-9]*$")
3. Spark rlike() Examples
Following are different examples of using the rlike() function with Spark (with Scala), PySpark (Spark with Python), and SQL. For PySpark, add from pyspark.sql.functions import col to use the col() function.
3.1 Filter Rows that Contain Only Numbers
Using the DataFrame API:
// Filter DataFrame rows where the 'alphanumeric' column contains only digits
import org.apache.spark.sql.functions.col
df.filter(col("alphanumeric").rlike("^[0-9]*$")).show()
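One detail of this pattern is worth noting: because * means "zero or more", ^[0-9]*$ also matches an empty string. If empty values should be excluded, ^[0-9]+$ is the usual alternative. The plain-Python sketch below checks both patterns against a few made-up sample values. It uses the re module as a stand-in, on the assumption that these simple patterns behave the same way in the Java regex engine Spark uses.

```python
import re

# Made-up sample values for illustration only.
samples = ["12345", "A123", "", "007", "12 34"]

digits_or_empty = re.compile(r"^[0-9]*$")  # zero or more digits
digits_only = re.compile(r"^[0-9]+$")      # one or more digits

star_matches = [s for s in samples if digits_or_empty.search(s)]
plus_matches = [s for s in samples if digits_only.search(s)]

print(star_matches)  # ['12345', '', '007']
print(plus_matches)  # ['12345', '007']
```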
3.2 Filter Rows with a Case-Insensitive Match
Below is an example of a regular expression that filters rows case-insensitively (it keeps rows whose name column contains the string rose, in any letter case).
// Filter rows whose 'name' value contains "rose", ignoring case
import org.apache.spark.sql.functions.col
df.filter(col("name").rlike("(?i)rose")).show()
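The (?i) prefix is an inline flag that turns on case-insensitive matching for the rest of the pattern. Whether the filter then means "contains" or "ends with" depends only on the anchors. A small plain-Python sketch, with made-up names and the re module standing in for Spark's Java regex engine:

```python
import re

# Made-up sample values for illustration only.
names = ["Michael Rose", "ROSEMARY Smith", "James", "rames rose"]

# No anchors: "rose" may appear anywhere in the value.
contains_rose = [n for n in names if re.search(r"(?i)rose", n)]
# Trailing $ anchor: the value must end with "rose".
ends_with_rose = [n for n in names if re.search(r"(?i)rose$", n)]

print(contains_rose)   # ['Michael Rose', 'ROSEMARY Smith', 'rames rose']
print(ends_with_rose)  # ['Michael Rose', 'rames rose']
```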
4. PySpark SQL rlike() Function Example
Let's see an example of using rlike() to evaluate a regular expression. In the examples below, I use rlike() to filter PySpark DataFrame rows by matching a regular expression (regex) while ignoring case, and to keep only rows whose column value is entirely numeric.
rlike() evaluates the regex against the Column value and returns a Column of Boolean type.
rlike() is a function on the Column type; for more examples, refer to PySpark Column Type & its Functions.
# PySpark Example
from pyspark.sql.functions import col

# Filter DataFrame rows where the 'alphanumeric' column contains only digits
df.filter(col("alphanumeric").rlike("^[0-9]*$")).show()

# Filter rows whose 'name' value contains "rose", ignoring case
df.filter(col("name").rlike("(?i)rose")).show()
5. Spark SQL rlike() Function
Similar to SQL regexp_like(), Spark SQL has an rlike() function that takes a regular expression (regex) as input and matches the input column value against it.
// Filter rows where the 'alphanumeric' column contains only digits
df.createOrReplaceTempView("DATA")
spark.sql("select * from DATA where rlike(alphanumeric,'^[0-9]*$')").show()
In this Spark and PySpark article, I have covered examples of how to use rlike() with regular expressions to filter DataFrame rows, including case-insensitive matching of a string contained in a column value and keeping only rows that have numeric values.
Happy Learning !!