Spark Find Count of NULL, Empty String Values

Problem: Could you please explain how to find/calculate the count of NULL or empty string values for all columns, or a list of selected columns, in a Spark DataFrame using a Scala example?

Solution: In a Spark DataFrame you can find the count of null or empty/blank string values in a column by using isNull() of the Column class together with the Spark SQL functions count() and when(). Whether a column value is empty or blank can be checked with col("col_name") === "".

Related: How to Drop Rows with NULL Values in Spark DataFrame

First, let’s create a DataFrame with some null and empty/blank string values.


val data = Seq(("","CA"), ("Julia",""), ("Robert",null), (null,""))
import spark.implicits._
val df = data.toDF("name","state")
df.show(false)

//+------+-----+
//|name  |state|
//+------+-----+
//|      |CA   |
//|Julia |     |
//|Robert|null |
//|null  |     |
//+------+-----+

1. Spark Find Count of Null, Empty String of a DataFrame Column

To find null or empty values in a single column, simply use the Spark DataFrame filter() with multiple conditions and apply the count() action. The below example finds the number of records with a null or empty value in the name column.


//Find count of null & empty string for a single column
import org.apache.spark.sql.functions.col
println(df.filter(col("name").isNull || col("name") === "").count())
//2
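
Alternatively, the same condition can be written as a SQL predicate string passed to where(); a minimal sketch (not from the original example) that returns the same count:


//Equivalent check expressed as a SQL predicate string
println(df.where("name IS NULL OR name = ''").count())
//2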

2. Find Count of Null on All DataFrame Columns

df.columns returns all DataFrame column names as an Array[String]; you need to loop through this array and build a count expression for each column. In the below example, isNull() is a Column class function that is used to check for null values. Note that this example counts nulls only; a sketch extending the check to NaN values follows the output below.


// Find count of null values of all DataFrame columns
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when, count}
def countCols(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      count(when(col(c).isNull,c)).alias(c)
    })
}
df.select(countCols(df.columns):_*).show()

//+----+-----+
//|name|state|
//+----+-----+
//|   1|    1|
//+----+-----+
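
NaN is distinct from null and occurs only in floating-point columns, so the example above counts nulls only. As a minimal sketch, assuming a hypothetical DataFrame numDf with Double columns, the check could be extended with the isnan() function:


// Sketch: extend the null check with isnan(); valid only for floating-point
// columns, so numDf is a hypothetical DataFrame with Double columns
import org.apache.spark.sql.functions.isnan
def countNullOrNan(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      count(when(col(c).isNull || isnan(col(c)),c)).alias(c)
    })
}
//numDf.select(countNullOrNan(numDf.columns):_*).show()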

Note: If you have NULL as a string literal, this example doesn’t count it; I have covered this in the next section, so keep reading.

To find the count for a list of selected columns, use a list of column names instead of df.columns. The below example yields the same output as above.


//Find count from selected columns
val selCols=List("name","state")
df.select(countCols(selCols.toArray):_*).show()
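
If you need the counts as plain Scala values rather than a displayed DataFrame, you can collect the single result row; a minimal sketch:


//Collect the one-row result and read the counts back as Scala Longs
val row = df.select(countCols(selCols.toArray):_*).first()
val nullCounts = selCols.zip(row.toSeq.map(_.asInstanceOf[Long])).toMap
println(nullCounts)
//Map(name -> 1, state -> 1)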

3. Spark Find Count of NULL, Empty String Literal Values

In case you have NULL string literals as well as empty values, use contains() of the Spark Column class to find the count for all or selected DataFrame columns.


// Find count for null, empty string and "NULL"/"null" literal values
import org.apache.spark.sql.functions.{col, when, count}
def countCols2(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      count(when(col(c).isNull || 
                 col(c)==="" || 
                 col(c).contains("NULL") || 
                 col(c).contains("null"),c)
           ).alias(c)
    })
}
df.select(countCols2(df.columns):_*).show()

//+----+-----+
//|name|state|
//+----+-----+
//|   2|    3|
//+----+-----+
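
Keep in mind that contains() matches substrings, so a value such as "NULLABLE" would also be counted. If you want to match the NULL literal exactly and case-insensitively, a stricter variant (a sketch, not from the original article) can compare the trimmed, upper-cased value instead; as a side benefit, trim() also catches whitespace-only values:


//Stricter variant: count nulls, blank strings & exact "NULL" literals only
import org.apache.spark.sql.functions.{trim, upper}
def countCols3(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      count(when(col(c).isNull || 
                 trim(col(c))==="" || 
                 upper(trim(col(c)))==="NULL",c)
           ).alias(c)
    })
}
df.select(countCols3(df.columns):_*).show()

//+----+-----+
//|name|state|
//+----+-----+
//|   2|    3|
//+----+-----+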

4. Complete Example of Calculating NULL or Empty String

Following is a complete example of how to calculate the count of null or empty string values of DataFrame columns.


import org.apache.spark.sql.{Column, SparkSession}

object SparkReplaceEmptyNull extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("","CA"), ("Julia",""),("Robert",null),(null,""))
  import spark.implicits._
  val df = data.toDF("name","state")
  df.show(false)

  import org.apache.spark.sql.functions.{col,when, count}
  def countCols(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      count(when(col(c).isNull,c)).alias(c)
    })
  }
  df.select(countCols(df.columns):_*).show()

  //Find count of null, empty & null literal values from selected columns
  val selCols=List("name","state")
  df.select(countCols2(selCols.toArray):_*).show()

  def countCols2(columns:Array[String]):Array[Column]={
    columns.map(c=>{
      count(when(col(c).isNull ||
                 col(c)==="" ||
                 col(c).contains("NULL") ||
                 col(c).contains("null"),c)
           ).alias(c)
    })
  }
}
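
For completeness, the same counts can also be computed with plain Spark SQL on a temporary view; a minimal sketch, assuming the df created above (the view name persons is arbitrary):


//Same null & empty counts via Spark SQL on a temporary view
df.createOrReplaceTempView("persons")
spark.sql(
  """SELECT count(CASE WHEN name IS NULL OR name = '' THEN 1 END) AS name,
    |       count(CASE WHEN state IS NULL OR state = '' THEN 1 END) AS state
    |FROM persons""".stripMargin).show()

//+----+-----+
//|name|state|
//+----+-----+
//|   2|    3|
//+----+-----+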

Conclusion

In this Spark article, I have explained how to find the count of null values, NULL string literals, and empty/blank values for all DataFrame columns as well as selected columns, using Scala examples.

Happy Learning !!
