Problem: In Spark or PySpark how to remove white spaces (blanks) in DataFrame string column similar to trim() in SQL that removes left and right white spaces.
PySpark – How to Trim String Column on DataFrame
Solution: Spark Trim String Column on DataFrame (Left & Right)
In Spark & PySpark (Spark with Python) you can remove whitespaces or trim by using pyspark.sql.functions.trim()
SQL functions. To remove only left white spaces use ltrim()
and to remove right side use rtim()
functions, let’s see with examples.
Spark Example to Remove White Spaces
In Spark with Scala use org.apache.spark.sql.functions.trim()
to remove white spaces on DataFrame columns.
import spark.implicits._
val data = Seq((1,"ABC "), (2," DEF"),(3," GHI ") )
val df = data.toDF("col1","col2")
df.show()
import org.apache.spark.sql.functions.{trim,ltrim,rtrim,col}
//using withColumn to remove white spaces
df.withColumn("col2",trim(col("col2"))).show()
df.withColumn("col2",ltrim(col("col2"))).show()
df.withColumn("col2",rtrim(col("col2"))).show()
//Using select to remove white spaces
df.select(col("col1"),trim(col("col2")).as("col2")).show()
//Using SQL Expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()
Trim All String Columns
In case if you have multiple string columns and you wanted to trim all columns you below approach. Here first we should filter out non string columns into list and use column from the filter list to trim all string columns.
val trimColumns=df.schema.fields.filter(_.dataType.isInstanceOf[StringType])
trimColumns.foreach(f=>{
df=df.withColumn(f.name,trim(col(f.name)))
})
df.show()
PySpark Trim String Column on DataFrame
Similarly, trim()
, rtrim()
, ltrim()
are available in PySpark,Below examples explains how to use these functions.
data = [(1,"ABC "), (2," DEF"),(3," GHI ") ]
df = spark.createDataFrame(data = data, schema = ["col1","col2"])
df.show()
from pyspark.sql.functions import trim,ltrim,rtrim
#using withColumn
df.withColumn("col2",trim("col2")).show()
df.withColumn("col2",ltrim("col2")).show()
df.withColumn("col2",rtrim("col2")).show()
#Using select
df.select("col1",trim("col2").alias('col2')).show()
# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()
Conclusion
In this simple article you have learned how to remove all white spaces using trim(), only right spaces using rtrim() and left spaces using ltrim() on Spark & PySpark DataFrame string columns with examples.
Happy Learning !!
Related Articles
- Spark regexp_replace() – Replace String Value
- Spark Check String Column Has Numeric Values
- Spark Check Column Data Type is Integer or String
- Spark Find Count of NULL, Empty String Values
- Spark Cast String Type to Integer Type (int)
- Spark – Convert array of String to a String column
- Spark split() function to convert string to Array column
Would be better if you post the results of the script. I simply enjoy every explanation of this site, but that one was not that good :/
Would be better if you post the results of the script. I simply enjoy every explanation of this site, but that one was not that good :/