Spark Replace Empty Value With NULL on DataFrame

In order to replace empty string value with NULL on Spark DataFrame use when().otherwise() SQL functions. In this article, I will explain how to replace an empty value with null on a single column, all columns, selected list of columns of DataFrame with Scala examples.

Related: How to get Count of NULL, Empty String Values in Spark DataFrame

Let’s create a DataFrame with empty values on some rows.


val data = Seq(("","CA"), ("Julia",""),("Robert",""),("","NJ"))
import spark.sqlContext.implicits._
val df = data.toDF("name","state")
df.show()

//+------+-----+
//|  name|state|
//+------+-----+
//|      |   CA|
//| Julia|     |
//|Robert|     |
//|      |   NJ|
//+------+-----+

Spark Replace Empty Value with NULL

In order to replace empty value with null on single DataFrame column, you can use withColumn() and when().otherwise() function.


//Replace empty string with null value
import org.apache.spark.sql.functions.{col,when}
df.withColumn("name",
       when(col("name")==="" ,null)
          .otherwise(col("name")))
  .show()

//+------+-----+
//|  name|state|
//+------+-----+
//|  null|   CA|
//| Julia|     |
//|Robert|     |
//|  null|   NJ|
//+------+-----+

Replace Empty Value with NULL on All DataFrame Columns

To replace an empty value with null on all DataFrame columns, use df.columns to get all DataFrame columns as Array[String], loop through this by applying conditions and create an Array[Column].

(colon underscore star) :_* is a Scala operator which “unpacked” as a Array[Column]*.


//Replace empty string with null for all columns
def replaceEmptyCols(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      when(col(c)==="" ,null).otherwise(col(c)).alias(c)
    })
}
df.select(replaceEmptyCols(df.columns):_*).show()

//+------+-----+
//|  name|state|
//+------+-----+
//|  null|   CA|
//| Julia| null|
//|Robert| null|
//|  null|   NJ|
//+------+-----+

Replace Empty Value with NULL on Selected Columns

Similarly, you can also replace a selected list of columns, specify all columns you wanted to replace in a list and pass that list to replaceEmptyCols() funciton.


//Replace empty string with null on selected columns
val selCols=List("name","state")
df.select(replaceEmptyCols(selCols.toArray):_*).show()

Complete Example

Following is a complete example of replace empty value with null.


import org.apache.spark.sql.{Column, SparkSession}

object SparkReplaceEmptyNull extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("","CA"), ("Julia",""),("Robert",""),("","NJ"))
  import spark.sqlContext.implicits._
  val df = data.toDF("name","state")
  df.show()

  //Replace empty string with null value
  import org.apache.spark.sql.functions.{col,when}
  df.withColumn("name",
       when(col("name")==="" ,null)
          .otherwise(col("name")))
    .show()

  //Replace empty string with null for all columns
  def replaceEmptyCols(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      when(col(c)==="" ,null).otherwise(col(c)).alias(c)
    })
  }
  df.select(replaceEmptyCols(df.columns):_*).show()

  //Replace empty string with null on selected columns
  val selCols=List("name","state")
  df.select(replaceEmptyCols(selCols.toArray):_*).show()
}

Conclusion

In summary, you have learned how to replace empty string value with NULL on single, all and selected DataFrame columns using Scala example.

Reference

Spark replace empty null

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Spark Replace Empty Value With NULL on DataFrame