Spark Replace Empty Value With NULL on DataFrame

In order to replace empty string value with NULL on Spark DataFrame use when().otherwise() SQL functions. In this article, I will explain how to replace an empty value with null on a single column, all columns, selected list of columns of DataFrame with Scala examples.

Related: How to get Count of NULL, Empty String Values in Spark DataFrame

Let’s create a DataFrame with empty values on some rows.


val data = Seq(("","CA"), ("Julia",""),("Robert",""),("","NJ"))
import spark.sqlContext.implicits._
val df = data.toDF("name","state")
df.show()

//+------+-----+
//|  name|state|
//+------+-----+
//|      |   CA|
//| Julia|     |
//|Robert|     |
//|      |   NJ|
//+------+-----+

Spark Replace Empty Value with NULL

In order to replace empty value with null on single DataFrame column, you can use withColumn() and when().otherwise() function.


//Replace empty string with null value
import org.apache.spark.sql.functions.{col,when}
df.withColumn("name",
       when(col("name")==="" ,null)
          .otherwise(col("name")))
  .show()

//+------+-----+
//|  name|state|
//+------+-----+
//|  null|   CA|
//| Julia|     |
//|Robert|     |
//|  null|   NJ|
//+------+-----+

Replace Empty Value with NULL on All DataFrame Columns

To replace an empty value with null on all DataFrame columns, use df.columns to get all DataFrame columns as Array[String], loop through this by applying conditions and create an Array[Column].

(colon underscore star) :_* is a Scala operator which “unpacked” as a Array[Column]*.


//Replace empty string with null for all columns
def replaceEmptyCols(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      when(col(c)==="" ,null).otherwise(col(c)).alias(c)
    })
}
df.select(replaceEmptyCols(df.columns):_*).show()

//+------+-----+
//|  name|state|
//+------+-----+
//|  null|   CA|
//| Julia| null|
//|Robert| null|
//|  null|   NJ|
//+------+-----+

Replace Empty Value with NULL on Selected Columns

Similarly, you can also replace a selected list of columns, specify all columns you wanted to replace in a list and pass that list to replaceEmptyCols() funciton.


//Replace empty string with null on selected columns
val selCols=List("name","state")
df.select(replaceEmptyCols(selCols.toArray):_*).show()

Complete Example

Following is a complete example of replace empty value with null.


import org.apache.spark.sql.{Column, SparkSession}

object SparkReplaceEmptyNull extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("","CA"), ("Julia",""),("Robert",""),("","NJ"))
  import spark.sqlContext.implicits._
  val df = data.toDF("name","state")
  df.show()

  //Replace empty string with null value
  import org.apache.spark.sql.functions.{col,when}
  df.withColumn("name",
       when(col("name")==="" ,null)
          .otherwise(col("name")))
    .show()

  //Replace empty string with null for all columns
  def replaceEmptyCols(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      when(col(c)==="" ,null).otherwise(col(c)).alias(c)
    })
  }
  df.select(replaceEmptyCols(df.columns):_*).show()

  //Replace empty string with null on selected columns
  val selCols=List("name","state")
  df.select(replaceEmptyCols(selCols.toArray):_*).show()
}

Conclusion

In summary, you have learned how to replace empty string value with NULL on single, all and selected DataFrame columns using Scala example.

Reference

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply