In PySpark, the substring()
function is used to extract the substring from a DataFrame string column by providing the position and length of the string you wanted to extract.
In this tutorial, I have explained with an example of getting substring of a column using substring()
from pyspark.sql.functions and using substr()
from pyspark.sql.Column
type.
Using SQL function substring()
Using the substring()
function of pyspark.sql.functions
module we can extract a substring or slice of a string from the DataFrame column by providing the position and length of the string you wanted to slice.
Syntax
substring(str, pos, len)
Note: Please note that the position is not zero based, but 1 based index.
Below is an example of Pyspark substring() using withColumn().
# Imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring
spark=SparkSession.builder.appName("stringoperations").getOrCreate()
# Create Sample Data
data = [(1,"20200828"),(2,"20180525")]
columns=["id","date"]
df=spark.createDataFrame(data,columns)
df.withColumn('year', substring('date', 1,4))\
.withColumn('month', substring('date', 5,2))\
.withColumn('day', substring('date', 7,2))
df.printSchema()
df.show(truncate=False)
In the above example, we have created a DataFrame with two columns, id, and date. Here date is in the form “year month day”. Here I have used substring() on the date column to return sub-strings of date as year, month, and day respectively. Below is the output.
# Output
+---+--------+----+-----+---+
|id |date |year|month|day|
+---+--------+----+-----+---+
|1 |20200828|2020|08 |28 |
|2 |20180525|2018|05 |25 |
+---+--------+----+-----+---+
2. Using substring() with select()
In Pyspark we can get substring() of a column using select. Above example can bed written as below.
# substring() with select()
df.select('date', substring('date', 1,4).alias('year'), \
substring('date', 5,2).alias('month'), \
substring('date', 7,2).alias('day'))
3.Using substring() with selectExpr()
Sample example using selectExpr to get a substring of column(date) as year, month, day. Below is the code that gives the same output as above.
# substring() with selectExpr()
df.selectExpr('date', 'substring(date, 1,4) as year', \
'substring(date, 5,2) as month', \
'substring(date, 7,2) as day')
4. Using substr() from Column type
Below is the example of getting substring using substr()
function from pyspark.sql.Column
type in Pyspark.
df3=df.withColumn('year', col('date').substr(1, 4))\
.withColumn('month',col('date').substr(5, 2))\
.withColumn('day', col('date').substr(7, 2))
The above example gives output the same as the above-mentioned examples.
Complete Example of PySpark substring()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring
spark=SparkSession.builder.appName("stringoperations").getOrCreate()
data = [(1,"20200828"),(2,"20180525")]
columns=["id","date"]
df=spark.createDataFrame(data,columns)
#Using SQL function substring()
df.withColumn('year', substring('date', 1,4))\
.withColumn('month', substring('date', 5,2))\
.withColumn('day', substring('date', 7,2))
df.printSchema()
df.show(truncate=False)
#Using select
df1=df.select('date', substring('date', 1,4).alias('year'), \
substring('date', 5,2).alias('month'), \
substring('date', 7,2).alias('day'))
#Using with selectExpr
df2=df.selectExpr('date', 'substring(date, 1,4) as year', \
'substring(date, 5,2) as month', \
'substring(date, 7,2) as day')
#Using substr from Column type
df3=df.withColumn('year', col('date').substr(1, 4))\
.withColumn('month',col('date').substr(5, 2))\
.withColumn('day', col('date').substr(7, 2))
df3.show()
Conclusion
In this session, we have learned different ways of getting substring of a column in PySpark DataFarme. I hope you liked it! Keep practicing. And do comment in the comment section for any kind of questions!!
Good work
Thanks Giri.