
Problem: In Spark or PySpark, how do you remove white spaces (blanks) from a DataFrame string column, similar to the SQL trim() function that removes leading and trailing white spaces?

PySpark – How to Trim String Column on DataFrame

Solution: Spark Trim String Column on DataFrame (Left & Right)

In Spark & PySpark (Spark with Python) you can remove white spaces or trim a string column by using the pyspark.sql.functions.trim() SQL function. To remove only leading white spaces use ltrim(), and to remove only trailing white spaces use rtrim(); let's see these with examples.

Spark Example to Remove White Spaces

In Spark with Scala, use org.apache.spark.sql.functions.trim() to remove white spaces from DataFrame columns.


import spark.implicits._
val data = Seq((1,"ABC    "), (2,"     DEF"),(3,"        GHI    ") )
val df = data.toDF("col1","col2")
df.show()

import org.apache.spark.sql.functions.{trim,ltrim,rtrim,col}

//using withColumn to remove white spaces
df.withColumn("col2",trim(col("col2"))).show()
df.withColumn("col2",ltrim(col("col2"))).show()
df.withColumn("col2",rtrim(col("col2"))).show()

//Using select to remove white spaces
df.select(col("col1"),trim(col("col2")).as("col2")).show()

//Using SQL Expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()

Trim All String Columns

If you have multiple string columns and want to trim all of them, use the approach below. First, filter the schema to collect only the string columns into a list, then loop over that list and apply trim() to each column, accumulating the result in a new DataFrame (the original df is a val and cannot be reassigned).


import org.apache.spark.sql.types.StringType

//collect only the StringType columns, then apply trim() to each of them
val trimColumns = df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
var trimmedDF = df
trimColumns.foreach(f => {
    trimmedDF = trimmedDF.withColumn(f.name, trim(col(f.name)))
})
trimmedDF.show()

PySpark Trim String Column on DataFrame

Similarly, trim(), rtrim(), and ltrim() are available in PySpark. The examples below explain how to use these functions.


data = [(1,"ABC    "), (2,"     DEF"),(3,"        GHI    ") ]
df = spark.createDataFrame(data = data, schema = ["col1","col2"])
df.show()

from pyspark.sql.functions import trim,ltrim,rtrim

#using withColumn
df.withColumn("col2",trim("col2")).show()
df.withColumn("col2",ltrim("col2")).show()
df.withColumn("col2",rtrim("col2")).show()

#Using select
df.select("col1",trim("col2").alias('col2')).show()

# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()

Conclusion

In this article, you have learned how to remove leading and trailing white spaces using trim(), only trailing spaces using rtrim(), and only leading spaces using ltrim() on Spark & PySpark DataFrame string columns, with examples.

Happy Learning !!

