Spark Replace NULL Values on DataFrame

In Spark, the fill() function of the DataFrameNaFunctions class is used to replace NULL values in DataFrame columns with zero (0), an empty string, a space, or any constant literal value.

While working with Spark DataFrames we often need to replace null values, as certain operations on null values throw a NullPointerException; hence, we need to gracefully handle nulls as a first step before processing. Also, when writing to a file, it’s a best practice to replace null values, otherwise they end up as nulls in the output file.

As part of the cleanup, sometimes you may need to Drop Rows with NULL Values in Spark DataFrame and Filter Rows by checking IS NULL/NOT NULL, as sketched below.
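
For reference, here is a minimal sketch of those two related operations, using the standard na.drop() and Column.isNotNull APIs; it assumes df is the DataFrame loaded later in this article and type is one of its columns.

import org.apache.spark.sql.functions.col

//Drop rows that contain a null value in any column
df.na.drop().show(false)

//Keep only the rows where the type column is not null
df.filter(col("type").isNotNull).show(false)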

DataFrame API provides DataFrameNaFunctions class with fill() function to replace null values on DataFrame. This function has several overloaded signatures that take different data types as parameters.

In this article, we use a subset of these and learn different ways to replace null values with an empty string, a constant value, and zero (0) on DataFrame columns of integer, string, array, and map types, with Scala examples.

Before we start, let’s read a CSV file into a Spark DataFrame. The file has missing values on certain rows of the String and Integer columns, and Spark assigns null to these missing values.

The file we are using here, small_zipcode.csv, is available on GitHub.


val filePath = "src/main/resources/small_zipcode.csv"
val df = spark.read.options(
    Map("inferSchema"->"true","delimiter"->",","header"->"true"))
  .csv(filePath)
df.printSchema()
df.show(false)

This yields the below output. As you can see, the type, city, and population columns have null values.


root
 |-- id: integer (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |null      |
+---+-------+--------+-------------------+-----+----------+
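
If you don’t have the CSV file handy, you can build an equivalent DataFrame in memory instead. This is a sketch based on the values shown above, using Option/None so the null pattern matches the CSV output; sampleDF is just an illustrative name.

import spark.implicits._

//In-memory equivalent of small_zipcode.csv; None becomes null in the DataFrame
val sampleDF = Seq(
  (1, 704, Some("STANDARD"), None, "PR", Some(30100)),
  (2, 704, None, Some("PASEO COSTA DEL SUR"), "PR", None),
  (3, 709, None, Some("BDA SAN LUIS"), "PR", Some(3700)),
  (4, 76166, Some("UNIQUE"), Some("CINGULAR WIRELESS"), "TX", Some(84000)),
  (5, 76177, Some("STANDARD"), None, "TX", None)
).toDF("id", "zipcode", "type", "city", "state", "population")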

Now, let’s see how to replace these null values.

Spark Replace NULL Values with Zero (0)

The Spark fill(value:Long) signatures available in DataFrameNaFunctions are used to replace NULL values with a numeric value, either zero (0) or any constant value, for all integer and long datatype columns of a Spark DataFrame or Dataset.

Syntax:


fill(value : scala.Long) : org.apache.spark.sql.DataFrame
fill(value : scala.Long, cols : scala.Array[scala.Predef.String]) :  
      org.apache.spark.sql.DataFrame

For example:


//Replace all integer and long columns
df.na.fill(0)
    .show(false)

//Replace with specific columns
df.na.fill(0,Array("population"))
  .show(false)

Both statements above yield the same output below, since population is the only numeric column with null values. Note that fill(0) replaces nulls only on numeric (integer and long) columns.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |0         |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |0         |
+---+-------+--------+-------------------+-----+----------+

Spark Replace Null Values with Empty String

The Spark fill(value:String) signatures are used to replace null values with an empty string or any constant String value on DataFrame or Dataset columns.

Syntax:


fill(value : scala.Predef.String) : org.apache.spark.sql.DataFrame
fill(value : scala.Predef.String, cols : scala.Array[scala.Predef.String]) :
      org.apache.spark.sql.DataFrame

The first signature replaces nulls on all String columns with the given value; in our example, it replaces nulls on the type and city columns with an empty string.


  df.na.fill("").show(false)

This yields the below output, replacing all NULL values on String columns with an empty/blank string.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|                   |PR   |30100     |
|2  |704    |        |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |        |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|                   |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Now, let’s use the second signature to replace specific values on specific columns. The below example replaces the type column with an empty string and the city column with the value “unknown”.


  df.na.fill("unknown",Array("city"))
    .na.fill("",Array("type"))
    .show(false)

This yields the below output, replacing null values with an empty string on the type column and with the constant value “unknown” on the city column.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|unknown            |PR   |30100     |
|2  |704    |        |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |        |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|unknown            |TX   |null      |
+---+-------+--------+-------------------+-----+----------+
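
As an alternative to chaining two fill() calls, the Scala API of DataFrameNaFunctions also provides a fill overload that takes a Map of column name to replacement value, so the two replacements above can be expressed in a single call. A minimal sketch of the same example:

//Per-column replacement values in a single fill() call
df.na.fill(Map("type" -> "", "city" -> "unknown"))
  .show(false)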

Replace NULL Values on List and Map Columns
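
The fill() function does not support complex types such as arrays and maps, so for list and map columns a common workaround is to substitute an empty collection yourself, for example with coalesce() and typedLit(). The sketch below assumes hypothetical columns codes (array of strings) and attributes (map of string to string) on a DataFrame dfWithComplexTypes; these are not part of small_zipcode.csv.

import org.apache.spark.sql.functions.{coalesce, col, typedLit}

//Replace a null array column with an empty array and a null map column with an empty map
//(dfWithComplexTypes, codes, and attributes are hypothetical, for illustration only)
val cleanedDF = dfWithComplexTypes
  .withColumn("codes", coalesce(col("codes"), typedLit(Seq.empty[String])))
  .withColumn("attributes", coalesce(col("attributes"), typedLit(Map.empty[String, String])))
cleanedDF.show(false)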

Complete Code

Below is the complete code with a Scala example. You can use it by copying it from here, or download the source code from GitHub.


package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.SparkSession

object HandleNullExample extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val filePath="src/main/resources/small_zipcode.csv"
  val df = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv(filePath)
  df.printSchema()
  df.show(false)

  //Replace null with 0 on all integer and long columns
  df.na.fill(0)
    .show(false)

  //Replace null with 0 only on the population column
  df.na.fill(0,Array("population"))
    .show(false)

  //Replace null with an empty string on all string columns
  df.na.fill("")
    .show(false)

  //Replace null with "unknown" on city and with an empty string on type
  df.na.fill("unknown",Array("city"))
    .na.fill("",Array("type"))
    .show(false)

  // Array and map columns
}

The source code is also available at the GitHub project for reference.

Conclusion

In this Spark article, you have learned how to replace null values with zero or an empty string on integer and string columns respectively.

Thanks for reading. If you like the articles here, please leave a comment or any suggestions for improvement in the comments section!

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium