Site icon Spark By {Examples}

Pyspark – Get substring() from a column

pyspark substring example

In PySpark, the substring() function is used to extract the substring from a DataFrame string column by providing the position and length of the string you wanted to extract.

In this tutorial, I have explained with an example of getting substring of a column using substring() from pyspark.sql.functions and using substr() from pyspark.sql.Column type.

Using SQL function substring()

Using the substring() function of pyspark.sql.functions module we can extract a substring or slice of a string from the DataFrame column by providing the position and length of the string you wanted to slice.


Syntax
substring(str, pos, len)

Note: Please note that the position is not zero based, but 1 based index.

Below is an example of Pyspark substring() using withColumn().


# Imports
import pyspark
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col, substring
spark=SparkSession.builder.appName("stringoperations").getOrCreate()
# Create Sample Data
data = [(1,"20200828"),(2,"20180525")]
columns=["id","date"]
df=spark.createDataFrame(data,columns)
df.withColumn('year', substring('date', 1,4))\
    .withColumn('month', substring('date', 5,2))\
    .withColumn('day', substring('date', 7,2))
df.printSchema()
df.show(truncate=False)

In the above example, we have created a DataFrame with two columns, id, and date. Here date is in the form “year month day”. Here I have used substring() on the date column to return sub-strings of date as year, month, and day respectively. Below is the output.


# Output
+---+--------+----+-----+---+
|id |date    |year|month|day|
+---+--------+----+-----+---+
|1  |20200828|2020|08   |28 |
|2  |20180525|2018|05   |25 |
+---+--------+----+-----+---+

2. Using substring() with select()

In Pyspark we can get substring() of a column using select. Above example can bed written as below.


# substring() with select()
df.select('date', substring('date', 1,4).alias('year'), \
                  substring('date', 5,2).alias('month'), \
                  substring('date', 7,2).alias('day'))  

3.Using substring() with selectExpr()

Sample example using selectExpr to get a substring of column(date) as year, month, day. Below is the code that gives the same output as above.


# substring() with selectExpr()
df.selectExpr('date', 'substring(date, 1,4) as year', \
                  'substring(date, 5,2) as month', \
                 'substring(date, 7,2) as day')

4. Using substr() from Column type

Below is the example of getting substring using substr() function from pyspark.sql.Column type in Pyspark.


df3=df.withColumn('year', col('date').substr(1, 4))\
  .withColumn('month',col('date').substr(5, 2))\
  .withColumn('day', col('date').substr(7, 2))

The above example gives output the same as the above-mentioned examples.

Complete Example of PySpark substring()


import pyspark
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col, substring
spark=SparkSession.builder.appName("stringoperations").getOrCreate()
data = [(1,"20200828"),(2,"20180525")]
columns=["id","date"]
df=spark.createDataFrame(data,columns)

#Using SQL function substring()
df.withColumn('year', substring('date', 1,4))\
    .withColumn('month', substring('date', 5,2))\
    .withColumn('day', substring('date', 7,2))
df.printSchema()
df.show(truncate=False)

#Using select    
df1=df.select('date', substring('date', 1,4).alias('year'), \
                  substring('date', 5,2).alias('month'), \
                  substring('date', 7,2).alias('day'))
    
#Using with selectExpr
df2=df.selectExpr('date', 'substring(date, 1,4) as year', \
                  'substring(date, 5,2) as month', \
                  'substring(date, 7,2) as day')

#Using substr from Column type
df3=df.withColumn('year', col('date').substr(1, 4))\
  .withColumn('month',col('date').substr(5, 2))\
  .withColumn('day', col('date').substr(7, 2))

df3.show()

Conclusion

In this session, we have learned different ways of getting substring of a column in PySpark DataFarme. I hope you liked it! Keep practicing. And do comment in the comment section for any kind of questions!!

Exit mobile version