PySpark – Get substring() from a column

In PySpark, the substring() function is used to extract a substring from a DataFrame string column by providing the starting position and the length of the substring you want to extract.


In this tutorial, I explain with examples how to get a substring of a column using substring() from pyspark.sql.functions and substr() from the pyspark.sql.Column type.

1. Using SQL function substring()

Using the substring() function from the pyspark.sql.functions module, we can extract a substring or slice of a string from a DataFrame column by providing the starting position and the length of the slice.


Syntax
substring(str, pos, len)

Note: The position is not zero-based; it is a 1-based index.

Below is an example of PySpark substring() using withColumn().


# Imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring

# Create SparkSession
spark = SparkSession.builder.appName("stringoperations").getOrCreate()

# Create sample data
data = [(1,"20200828"),(2,"20180525")]
columns = ["id","date"]
df = spark.createDataFrame(data, columns)

# Extract year, month, and day from the date column
df = df.withColumn('year', substring('date', 1, 4)) \
    .withColumn('month', substring('date', 5, 2)) \
    .withColumn('day', substring('date', 7, 2))
df.printSchema()
df.show(truncate=False)

In the above example, we created a DataFrame with two columns, id and date. Here, date is a string in the form “yearmonthday” (for example, 20200828). I used substring() on the date column to return substrings of the date as year, month, and day respectively. Below is the output.


# Output
+---+--------+----+-----+---+
|id |date    |year|month|day|
+---+--------+----+-----+---+
|1  |20200828|2020|08   |28 |
|2  |20180525|2018|05   |25 |
+---+--------+----+-----+---+
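
As an aside, the substring() function (like the underlying Spark SQL function) also accepts a negative position, which counts back from the end of the string. Below is a minimal sketch on the same DataFrame, shown here only as an illustration; the alias last_two is just a name chosen for this example.

# Negative position counts back from the end of the string (illustrative sketch)
df.select('date', substring('date', -2, 2).alias('last_two')).show(truncate=False)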

2. Using substring() with select()

In PySpark, we can also get the substring() of a column using select(). The above example can be written as below.


# substring() with select()
df1 = df.select('date', substring('date', 1, 4).alias('year'), \
                substring('date', 5, 2).alias('month'), \
                substring('date', 7, 2).alias('day'))
df1.show(truncate=False)

3. Using substring() with selectExpr()

Below is a sample example using selectExpr() to get a substring of the date column as year, month, and day. The code below gives the same output as above.


# substring() with selectExpr()
df2 = df.selectExpr('date', 'substring(date, 1, 4) as year', \
                    'substring(date, 5, 2) as month', \
                    'substring(date, 7, 2) as day')
df2.show(truncate=False)
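
Since selectExpr() takes SQL expressions, the same result can also be produced with a plain SQL query. Below is a minimal sketch; the temporary view name tbl is just an illustrative choice.

# Same substring expressions via Spark SQL on a temporary view
df.createOrReplaceTempView("tbl")
spark.sql("""
    SELECT date,
           substring(date, 1, 4) AS year,
           substring(date, 5, 2) AS month,
           substring(date, 7, 2) AS day
    FROM tbl
""").show(truncate=False)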

4. Using substr() from Column type

Below is an example of getting a substring using the substr() function from the pyspark.sql.Column type in PySpark.


# substr() from Column type
df3 = df.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))
df3.show(truncate=False)

The above example gives the same output as the earlier examples.
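
Note that Column.substr() also accepts Column arguments for the position and length, not just integers. Below is a minimal sketch using lit(), assuming both arguments are passed as columns of the same type.

# Position and length passed as Column expressions via lit() (illustrative sketch)
from pyspark.sql.functions import lit
df.withColumn('year', col('date').substr(lit(1), lit(4))).show(truncate=False)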

Complete Example of PySpark substring()


# Complete example of PySpark substring()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring

spark = SparkSession.builder.appName("stringoperations").getOrCreate()

data = [(1,"20200828"),(2,"20180525")]
columns = ["id","date"]
df = spark.createDataFrame(data, columns)

# Using SQL function substring()
df = df.withColumn('year', substring('date', 1, 4)) \
    .withColumn('month', substring('date', 5, 2)) \
    .withColumn('day', substring('date', 7, 2))
df.printSchema()
df.show(truncate=False)

# Using select()
df1 = df.select('date', substring('date', 1, 4).alias('year'), \
                substring('date', 5, 2).alias('month'), \
                substring('date', 7, 2).alias('day'))

# Using selectExpr()
df2 = df.selectExpr('date', 'substring(date, 1, 4) as year', \
                    'substring(date, 5, 2) as month', \
                    'substring(date, 7, 2) as day')

# Using substr() from Column type
df3 = df.withColumn('year', col('date').substr(1, 4)) \
    .withColumn('month', col('date').substr(5, 2)) \
    .withColumn('day', col('date').substr(7, 2))

df3.show()

Conclusion

In this article, you have learned different ways of getting a substring of a column in a PySpark DataFrame. I hope you liked it! Keep practicing, and do leave a comment if you have any questions!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.
