Spark Trim String Column on DataFrame

Problem: In Spark or PySpark how to remove white spaces (blanks) in DataFrame string column similar to trim() in SQL that removes left and right white spaces.

PySpark – How to Trim String Column on DataFrame

Solution: Spark Trim String Column on DataFrame (Left & Right)

In Spark & PySpark (Spark with Python) you can remove whitespaces or trim by using pyspark.sql.functions.trim() SQL functions. To remove only left white spaces use ltrim() and to remove right side use rtim() functions, let’s see with examples.

Spark Example to Remove White Spaces

In Spark with Scala use org.apache.spark.sql.functions.trim() to remove white spaces on DataFrame columns.


import spark.implicits._
val data = Seq((1,"ABC    "), (2,"     DEF"),(3,"        GHI    ") )
val df = data.toDF("col1","col2")
df.show()

import org.apache.spark.sql.functions.{trim,ltrim,rtrim,col}

//using withColumn to remove white spaces
df.withColumn("col2",trim(col("col2"))).show()
df.withColumn("col2",ltrim(col("col2"))).show()
df.withColumn("col2",rtrim(col("col2"))).show()

//Using select to remove white spaces
df.select(col("col1"),trim(col("col2")).as("col2")).show()

//Using SQL Expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()

Trim All String Columns

In case if you have multiple string columns and you wanted to trim all columns you below approach. Here first we should filter out non string columns into list and use column from the filter list to trim all string columns.


val trimColumns=df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
trimColumns.foreach(f=>{
    df=df.withColumn(f.name,trim(col(f.name)))
})
df.show()

PySpark Trim String Column on DataFrame

Similarly, trim(), rtrim(), ltrim() are available in PySpark,Below examples explains how to use these functions.


data = [(1,"ABC    "), (2,"     DEF"),(3,"        GHI    ") ]
df = spark.createDataFrame(data = data, schema = ["col1","col2"])
df.show()

from pyspark.sql.functions import trim,ltrim,rtrim

#using withColumn
df.withColumn("col2",trim("col2")).show()
df.withColumn("col2",ltrim("col2")).show()
df.withColumn("col2",rtrim("col2")).show()

#Using select
df.select("col1",trim("col2").alias('col2')).show()

# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()

Conclusion

In this simple article you have learned how to remove all white spaces using trim(), only right spaces using rtrim() and left spaces using ltrim() on Spark & PySpark DataFrame string columns with examples.

Happy Learning !!

Reference

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

Spark Trim String Column on DataFrame