To replace empty string values with NULL on a Spark DataFrame, use the when().otherwise() SQL functions. In this article, I will explain how to replace an empty value with null on a single column, on all columns, and on a selected list of DataFrame columns, with Scala examples.
Related: How to get Count of NULL, Empty String Values in Spark DataFrame
Let’s create a DataFrame with empty values on some rows.
import spark.implicits._

val data = Seq(("", "CA"), ("Julia", ""), ("Robert", ""), ("", "NJ"))
val df = data.toDF("name", "state")
df.show()
//+------+-----+
//| name|state|
//+------+-----+
//| | CA|
//| Julia| |
//|Robert| |
//| | NJ|
//+------+-----+
Spark Replace Empty Value with NULL
To replace an empty value with null on a single DataFrame column, use withColumn() together with the when().otherwise() functions.
//Replace empty string with null value
import org.apache.spark.sql.functions.{col, when}

df.withColumn("name",
    when(col("name") === "", null)
      .otherwise(col("name")))
  .show()
//+------+-----+
//| name|state|
//+------+-----+
//| null| CA|
//| Julia| |
//|Robert| |
//| null| NJ|
//+------+-----+
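As an alternative, Spark SQL's built-in nullif() function returns null when its two arguments are equal, which achieves the same result in a single expression. A minimal sketch, assuming the same df as above:

```scala
import org.apache.spark.sql.functions.expr

// nullif(name, '') yields null when name equals the empty string,
// otherwise it returns name unchanged
df.withColumn("name", expr("nullif(name, '')"))
  .show()
```

This avoids writing the when().otherwise() pair by hand, at the cost of embedding the column name in a SQL string.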
Replace Empty Value with NULL on All DataFrame Columns
To replace an empty value with null on all DataFrame columns, use df.columns to get all column names as an Array[String], then map over it, applying the condition to each column to build an Array[Column].
The :_* (colon underscore star) syntax is Scala's vararg expansion: it unpacks the Array[Column] so it can be passed to select(), which expects Column* arguments.
//Replace empty string with null for all columns
import org.apache.spark.sql.Column

def replaceEmptyCols(columns: Array[String]): Array[Column] = {
  columns.map(c => when(col(c) === "", null).otherwise(col(c)).alias(c))
}
df.select(replaceEmptyCols(df.columns):_*).show()
//+------+-----+
//| name|state|
//+------+-----+
//| null| CA|
//| Julia| null|
//|Robert| null|
//| null| NJ|
//+------+-----+
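An equivalent approach, if you prefer not to build an Array[Column], is to fold over df.columns with withColumn(), accumulating one replacement per column. A sketch under the same setup:

```scala
import org.apache.spark.sql.functions.{col, when}

// Accumulate one withColumn() call per column name;
// each step returns a new DataFrame with that column replaced
val df2 = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "", null).otherwise(col(c)))
}
df2.show()
```

Both forms produce the same result; the select() version builds a single projection, while foldLeft chains one withColumn() per column.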
Replace Empty Value with NULL on Selected Columns
Similarly, you can replace a selected list of columns: specify the columns you want to replace in a list and pass that list to the replaceEmptyCols() function.
//Replace empty string with null on selected columns
val selCols = List("name", "state")
df.select(replaceEmptyCols(selCols.toArray): _*).show()
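To verify the replacement worked, you can count the nulls per column afterwards using the count(when(isNull)) pattern, since count() ignores nulls. A small check, assuming you assign the result of the select above to a val (df2 here is an illustrative name):

```scala
import org.apache.spark.sql.functions.{col, count, when}

// Assign the replaced DataFrame (illustrative name)
val df2 = df.select(replaceEmptyCols(selCols.toArray): _*)

// Count null values in each selected column;
// when() without otherwise() yields null for non-matches, which count() skips
df2.select(selCols.map(c => count(when(col(c).isNull, c)).alias(c)): _*)
  .show()
```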
Complete Example
Following is a complete example of replacing empty values with null.
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object SparkReplaceEmptyNull extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  import spark.implicits._

  val data = Seq(("", "CA"), ("Julia", ""), ("Robert", ""), ("", "NJ"))
  val df = data.toDF("name", "state")
  df.show()

  //Replace empty string with null value
  df.withColumn("name",
      when(col("name") === "", null)
        .otherwise(col("name")))
    .show()

  //Replace empty string with null for all columns
  def replaceEmptyCols(columns: Array[String]): Array[Column] = {
    columns.map(c => when(col(c) === "", null).otherwise(col(c)).alias(c))
  }
  df.select(replaceEmptyCols(df.columns): _*).show()

  //Replace empty string with null on selected columns
  val selCols = List("name", "state")
  df.select(replaceEmptyCols(selCols.toArray): _*).show()
}
Conclusion
In summary, you have learned how to replace empty string values with NULL on a single column, all columns, and a selected list of DataFrame columns using Scala examples.