Spark Trim String Column on DataFrame

Problem: In Spark or PySpark how to remove white spaces (blanks) in DataFrame string column similar to trim() in SQL that removes left and right white spaces.

PySpark – How to Trim String Column on DataFrame

Solution: Spark Trim String Column on DataFrame (Left & Right)

In Spark & PySpark (Spark with Python) you can remove whitespaces or trim by using pyspark.sql.functions.trim() SQL functions. To remove only left white spaces use ltrim() and to remove right side use rtim() functions, let’s see with examples.

Spark Example to Remove White Spaces

In Spark with Scala use org.apache.spark.sql.functions.trim() to remove white spaces on DataFrame columns.


import spark.implicits._
val data = Seq((1,"ABC    "), (2,"     DEF"),(3,"        GHI    ") )
val df = data.toDF("col1","col2")
df.show()

import org.apache.spark.sql.functions.{trim,ltrim,rtrim,col}

//using withColumn to remove white spaces
df.withColumn("col2",trim(col("col2"))).show()
df.withColumn("col2",ltrim(col("col2"))).show()
df.withColumn("col2",rtrim(col("col2"))).show()

//Using select to remove white spaces
df.select(col("col1"),trim(col("col2")).as("col2")).show()

//Using SQL Expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()

Trim All String Columns

In case if you have multiple string columns and you wanted to trim all columns you below approach. Here first we should filter out non string columns into list and use column from the filter list to trim all string columns.


val trimColumns=df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
trimColumns.foreach(f=>{
    df=df.withColumn(f.name,trim(col(f.name)))
})
df.show()

PySpark Trim String Column on DataFrame

Similarly, trim(), rtrim(), ltrim() are available in PySpark,Below examples explains how to use these functions.


data = [(1,"ABC    "), (2,"     DEF"),(3,"        GHI    ") ]
df = spark.createDataFrame(data = data, schema = ["col1","col2"])
df.show()

from pyspark.sql.functions import trim,ltrim,rtrim

#using withColumn
df.withColumn("col2",trim("col2")).show()
df.withColumn("col2",ltrim("col2")).show()
df.withColumn("col2",rtrim("col2")).show()

#Using select
df.select("col1",trim("col2").alias('col2')).show()

# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()

Conclusion

In this simple article you have learned how to remove all white spaces using trim(), only right spaces using rtrim() and left spaces using ltrim() on Spark & PySpark DataFrame string columns with examples.

Happy Learning !!

Reference

Naveen (NNK)

I am Naveen (NNK) working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passion about sharing my knowledge in Apache Spark, Hive, PySpark, R etc.

Leave a Reply

This Post Has 2 Comments

  1. Irving

    Would be better if you post the results of the script. I simply enjoy every explanation of this site, but that one was not that good :/

  2. Irving

    Would be better if you post the results of the script. I simply enjoy every explanation of this site, but that one was not that good :/

You are currently viewing Spark Trim String Column on DataFrame