Problem: How do you find the count of NULL or empty string values for all columns, or for a list of selected columns, in a Spark DataFrame using Scala?
Solution: In a Spark DataFrame, you can find the count of null or empty/blank string values in a column by using isNull()
of the Column class together with the Spark SQL functions count()
and when(). Whether a column value is an empty or blank string can be checked with col("col_name") === "".
Related: How to Drop Rows with NULL Values in Spark DataFrame
First, let’s create a DataFrame with some null and empty/blank string values.
import spark.implicits._
val data = Seq(("","CA"), ("Julia",""), ("Robert",null), (null,""))
val df = data.toDF("name","state")
df.show(false)
//+------+-----+
//|name |state|
//+------+-----+
//| |CA |
//|Julia | |
//|Robert|null |
//|null | |
//+------+-----+
1. Spark Find Count of Null, Empty String of a DataFrame Column
To find null or empty values in a single column, use Spark DataFrame filter() with multiple conditions and apply the count()
action. The example below finds the number of records with a null or empty value in the name
column.
//Find count of null & empty string for single column
import org.apache.spark.sql.functions.col
print(df.filter(col("name").isNull || col("name") === "").count())
//2
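The same filter can also be written as a SQL expression string instead of Column operators; a small equivalent sketch:

```scala
// Same count expressed as a SQL predicate string
val nullOrEmptyCount = df.filter("name IS NULL OR name = ''").count()
print(nullOrEmptyCount)
//2
```

Both forms compile to the same plan, so the choice is purely stylistic.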
2. Find Count of Null on All DataFrame Columns
df.columns
returns all DataFrame column names as an array. Loop through it and build a count expression for each column; in the example below, isNull()
is a Column class function used to check for null values.
// Find count of null values for all DataFrame columns
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when, count}
def countCols(columns: Array[String]): Array[Column] = {
  columns.map(c => count(when(col(c).isNull, c)).alias(c))
}
df.select(countCols(df.columns):_*).show()
//+----+-----+
//|name|state|
//+----+-----+
//| 1| 1|
//+----+-----+
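The example above counts only SQL nulls. For numeric columns you may also want to treat NaN as missing; a hedged sketch of such a variant (a hypothetical helper, not part of the original example) could guard on the column's data type, since isnan() is only meaningful for float/double columns:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, isnan, when}
import org.apache.spark.sql.types.{DoubleType, FloatType}

// Hypothetical helper: count null (and, for float/double columns, NaN) values per column
def countNullOrNaN(df: DataFrame): Array[Column] = {
  df.schema.fields.map { f =>
    val c = col(f.name)
    val cond = f.dataType match {
      case DoubleType | FloatType => c.isNull || isnan(c)  // NaN only exists for floating types
      case _                      => c.isNull
    }
    count(when(cond, f.name)).alias(f.name)
  }
}

// Usage: df.select(countNullOrNaN(df): _*).show()
```

For the all-string DataFrame in this article the result is identical to countCols, so the output above is unchanged.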
Note: If a column contains NULL as a string literal, this example doesn’t count it. I have covered that case in the next section, so keep reading.
To find counts for a list of selected columns, pass a list of column names instead of df.columns
. The example below yields the same output as above.
//Find count from selected columns
val selCols=List("name","state")
df.select(countCols(selCols.toArray):_*).show()
3. Spark Find Count of NULL, Empty String Literal Values
If your data contains the string literal NULL
(or null) in addition to empty values, use contains() of the Spark Column class to find the count for all or selected DataFrame columns.
// Find count for empty string, null, and "NULL"/"null" string literal values
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when, count}
def countCols2(columns: Array[String]): Array[Column] = {
  columns.map(c =>
    count(when(col(c).isNull ||
      col(c) === "" ||
      col(c).contains("NULL") ||
      col(c).contains("null"), c)
    ).alias(c)
  )
}
df.select(countCols2(df.columns):_*).show()
//+----+-----+
//|name|state|
//+----+-----+
//| 2| 3|
//+----+-----+
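If blank values may also contain whitespace (e.g. " "), a variant using trim() from Spark SQL functions would catch those as well. This is a sketch of an assumed extension, not shown in the output above:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, count, trim, when}

// Count null, empty, whitespace-only, and "NULL"/"null" literal values per column
def countBlankCols(columns: Array[String]): Array[Column] = {
  columns.map(c =>
    count(when(col(c).isNull ||
      trim(col(c)) === "" ||                 // catches "" and "   "
      trim(col(c)).isin("NULL", "null"), c)  // exact literal match after trimming
    ).alias(c)
  )
}

// Usage: df.select(countBlankCols(df.columns): _*).show()
```

Note that isin() matches the whole trimmed value exactly, which is stricter than contains() used above.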
4. Complete Example of Calculating NULL or Empty String
Following is a complete example of how to calculate the count of null or empty string values in DataFrame columns.
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when, count}

object SparkReplaceEmptyNull extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  import spark.implicits._
  val data = Seq(("","CA"), ("Julia",""), ("Robert",null), (null,""))
  val df = data.toDF("name","state")
  df.show(false)

  // Count null values for all columns
  def countCols(columns: Array[String]): Array[Column] = {
    columns.map(c => count(when(col(c).isNull, c)).alias(c))
  }

  // Count null, empty string, and "NULL"/"null" literal values
  def countCols2(columns: Array[String]): Array[Column] = {
    columns.map(c =>
      count(when(col(c).isNull ||
        col(c) === "" ||
        col(c).contains("NULL") ||
        col(c).contains("null"), c)
      ).alias(c)
    )
  }

  df.select(countCols(df.columns):_*).show()

  //Find count from selected columns
  val selCols = List("name","state")
  df.select(countCols2(selCols.toArray):_*).show()
}
Conclusion
In this Spark article, I have explained how to find the count of null values, "null" string literals, and empty/blank values for all DataFrame columns and for selected columns using Scala examples.
Happy Learning !!
Related Articles
- Spark Word Count Explained with Example
- Spark SQL – Count Distinct from DataFrame
- What is Spark Streaming Checkpoint?
- Spark Read Multiple CSV Files
- Spark Drop DataFrame from Cache
- Difference in DENSE_RANK and ROW_NUMBER in Spark
- Spark printSchema() Example
- Spark Merge Two DataFrames with Different Columns or Schema