PySpark startswith() and endswith() Functions

PySpark startswith() and endswith() are string functions used to check whether a string or column begins with a specified prefix and whether it ends with a specified suffix, respectively. When used with filter(), they filter DataFrame rows based on a column’s initial or final characters: startswith() keeps rows where the column value begins with a given substring, while endswith() keeps rows where the column value concludes with a given substring.


Key Points

  • PySpark’s startswith() function checks if a string or column begins with a specified prefix, providing a boolean result. It’s useful for filtering or transforming data based on the initial characters of strings.
  • The endswith() function checks if a string or column ends with a specified suffix. It produces a boolean outcome, aiding in data processing involving the final characters of strings.
  • Both startswith() and endswith() functions in PySpark are case-sensitive by default. Users can employ additional functions like lower() or upper() for case-insensitive matching if needed, as illustrated in the sketch after this list.
  • These functions are commonly applied to DataFrame columns for efficient string manipulation. They assist in creating new columns or filtering rows based on the starting or ending patterns within strings.
  • Similar to PySpark contains(), both startswith() and endswith() functions yield boolean results, indicating whether the specified prefix or suffix is present in the strings or columns being evaluated.
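As a quick illustration of the last two points, below is a minimal sketch (it assumes the sample df with id and name columns that is created in the next section): it derives a boolean flag column with withColumn(), and performs a case-insensitive prefix match by lowercasing the column first.


# A minimal sketch, assuming the sample df (id, name) created below
from pyspark.sql.functions import col, lower

# Derive a new boolean column: True where name begins with "Ro"
df_flag = df.withColumn("starts_with_ro", col("name").startswith("Ro"))

# Case-insensitive prefix match: lowercase the column before comparing
df.filter(lower(col("name")).startswith("ro")).show()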

Also, we can use the PySpark SQL operators LIKE, RLIKE, and ILIKE to achieve functionality similar to startswith() and endswith() string matching.

Let’s create a DataFrame to explain these methods with examples.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [(1, "Rose Alen"), (2, "Michael Peter"), (3, "Robert Liu"), 
        (4, "Rames Jordan"), (5, "Conner Kinsley")]

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Show the DataFrame
df.show()

Yields the below output.


# Output
+---+--------------+
| id|          name|
+---+--------------+
|  1|     Rose Alen|
|  2| Michael Peter|
|  3|    Robert Liu|
|  4|  Rames Jordan|
|  5|Conner Kinsley|
+---+--------------+

PySpark startswith() Example

The Column.startswith() function in PySpark is used to check if a DataFrame column begins with a specified string. When used with the filter() or where() functions, it returns only the rows where the column value starts with the given prefix.

Apache Spark 3.5 introduced a new SQL function, pyspark.sql.functions.startswith(). This function takes the input column/string and the prefix as arguments. Other than that, its behavior is exactly the same as the Column function.

Syntax


# Column function
pyspark.sql.Column.startswith(prefix: ColumnOrName)

# SQL function
pyspark.sql.functions.startswith(str: ColumnOrName, prefix: ColumnOrName)

Parameters:

  • str: Column or str: A column of string.
  • prefix: Column or str: A column of string, the prefix.

Returns a boolean. The value is True if str starts with prefix, and NULL if either input expression is NULL; otherwise, it returns False. Both str and prefix must be of STRING or BINARY type.
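To see these boolean/NULL semantics outside of a filter, you can select the result of startswith() directly. A minimal sketch, assuming a small hypothetical DataFrame that contains a NULL name:


# Hypothetical DataFrame with a NULL name to illustrate the NULL semantics
df_null = spark.createDataFrame([(1, "Rose Alen"), (2, None)], ["id", "name"])

# startswith() yields true/false for non-null values and NULL for NULL input
df_null.select("name", col("name").startswith("Ro").alias("starts_with_ro")).show()

# Output (rendering of NULL varies by Spark version)
# +---------+--------------+
# |     name|starts_with_ro|
# +---------+--------------+
# |Rose Alen|          true|
# |     NULL|          NULL|
# +---------+--------------+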


# Applying startswith()
df.filter(col("name").startswith("Ro")).show()

# SQL function. Available since Spark version 3.5
from pyspark.sql.functions import startswith
df.filter(startswith(df.name, "Ro")).show()

The above snippet returns the rows from the DataFrame where the name column starts with “Ro”, as shown below.


# Output
+---+----------+
| id|      name|
+---+----------+
|  1| Rose Alen|
|  3|Robert Liu|
+---+----------+

The below code demonstrates how to apply “not startswith()”. It filters and shows the DataFrame, excluding names that start with “Ro”. The tilde (~) operator is used to negate the condition.


# Filter and show the DataFrame excluding names starting with "Ro"
df.filter(~col("name").startswith("Ro")).show()

# Alternatively, compare the boolean result with False
df.filter(col("name").startswith("Ro") == False).show()

# Output
+---+--------------+
| id|          name|
+---+--------------+
|  2| Michael Peter|
|  4|  Rames Jordan|
|  5|Conner Kinsley|
+---+--------------+

PySpark endswith() Example

The PySpark Column.endswith() function checks if a string or column ends with a specified suffix. When used with filter(), it filters DataFrame rows based on a specific column’s values ending with a given substring. This function is part of PySpark’s repertoire for string manipulation, allowing users to selectively retain or exclude rows based on the trailing characters of a particular column.

Apache Spark 3.5 also introduced a new SQL function, pyspark.sql.functions.endswith(). This function takes the input column/string and the suffix as arguments. Other than that, its behavior is exactly the same as the Column function.

Syntax


# Column function
pyspark.sql.Column.endswith(suffix: ColumnOrName)

# SQL Function. Available since Spark version 3.5
pyspark.sql.functions.endswith(str: ColumnOrName, suffix: ColumnOrName)

Parameters:

  • str: Column or str: A column of string.
  • suffix: Column or str: A column of string, the suffix.

This method is case-sensitive. The below example returns all rows from the DataFrame where the “name” column ends with the string “Alen”.


# Apply filter with endswith()
df.filter(col("name").endswith("Alen")).show()

# Using SQL function. Available since Spark version 3.5
from pyspark.sql.functions import endswith
df.filter(endswith(df.name, "Alen")).show()

Yields the below output.


# Output
+---+------------+
| id|        name|
+---+------------+
|  1|   Rose Alen|
+---+------------+
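As with startswith() earlier, you can negate the condition with the tilde (~) operator to exclude rows ending with a given suffix. A minimal sketch:


# Exclude rows where the name column ends with "Alen"
df.filter(~col("name").endswith("Alen")).show()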

Using like() and rlike() to filter starts with

You can also use the LIKE or RLIKE operator to filter data based on a column starting with a certain string.


# Starts with using like()
result_df = df.filter(col("name").like("Ro%"))

# Starts with using rlike(), which accepts a regular expression
result_df = df.filter(col("name").rlike("^Ro"))

In this example, the like("Ro%") condition filters rows where the ‘name’ column starts with the string ‘Ro’, and the rlike("^Ro") condition does the same using a regular expression anchor. You can modify the patterns to match your specific requirements.
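Besides the % wildcard, SQL LIKE also supports _, which matches exactly one character. A small illustration on the sample DataFrame (the pattern is only for demonstration):


# "R_se%" matches an "R", any single character, "se", then anything,
# so it matches "Rose Alen" in the sample data
df.filter(col("name").like("R_se%")).show()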

If you want to perform a case-insensitive match, you can use the ilike() function instead:


# Case insensitive to filter starts with
result_df = df.filter(col("name").ilike("ro%"))

For ends with, place the % wildcard at the beginning of the pattern:


# Ends with using like()
result_df = df.filter(col("name").like("%Alen"))

Using PySpark SQL Query to filter starts with and ends with a string

To filter rows where a column starts with a specific prefix and ends with a specific suffix using PySpark SQL, you can use either the startswith() and endswith() SQL functions or the LIKE operator with the % wildcard. Here’s an example:


# Register the DataFrame as a temporary table
df.createOrReplaceTempView("student")

# Use the SQL startswith() and endswith() functions (available since Spark 3.5)
result_df = spark.sql(""" SELECT * FROM student
    WHERE startswith(name,'Ro') AND endswith(name,'Alen')""")
result_df.show()

# Use a SQL query to filter rows where the 'name' column starts with 'Ro' and ends with 'Alen'
result_df = spark.sql(""" SELECT * FROM student
    WHERE name LIKE 'Ro%' AND name LIKE '%Alen'
""")

# Show the result
result_df.show()

In this example, the condition filters rows where the ‘name’ column starts with ‘Ro’ and ends with ‘Alen’.

This example yields the below output.


# Output
+---+------------+
| id|        name|
+---+------------+
|  1|   Rose Alen|
+---+------------+

ILIKE is used for case-insensitive matching.


# Use a SQL query to filter rows where the 'name' column starts with 'ro' and ends with 'alen'
result_df = spark.sql("""
    SELECT *
    FROM student
    WHERE name ILIKE 'ro%' AND name ILIKE '%alen'
""")

# Show the result; this matches the same "Rose Alen" row case-insensitively
result_df.show()

Conclusion

PySpark startswith() and endswith() functions offer powerful tools for precise string-based filtering in DataFrames. These functions empower users to tailor their analyses by selectively including or excluding rows based on specific prefixes or suffixes within a column. startswith() filters rows where a specified substring serves as the prefix, while endswith() filters rows where the column value concludes with a given substring. Similarly, you can also use the LIKE operator in combination with the % wildcard to achieve the same results.

Keep Learning!!

Prabha

Prabha is an accomplished data engineer with a wealth of experience in architecting, developing, and optimizing data pipelines and infrastructure. With a strong foundation in software engineering and a deep understanding of data systems, Prabha excels in building scalable solutions that handle diverse and large datasets efficiently. At SparkbyExamples.com, Prabha writes about her experience in Spark, PySpark, Python, and Pandas.