Spark – Check if DataFrame or Dataset is empty?

In Spark, isEmpty of the DataFrame class is used to check if a DataFrame or Dataset is empty; it returns true when empty and false otherwise. Besides this, Spark also provides several other ways to check whether a DataFrame is empty. In this article, I will explain these different approaches and compare their performance to see which one is best to use.

First, let’s create an empty DataFrame.


val df = spark.emptyDataFrame
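Note that spark.emptyDataFrame has no columns. If you want an empty Dataset that still carries a schema for testing, one option is to build it from a case class; a minimal sketch, assuming a local SparkSession and a hypothetical Person case class used only for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical type used only for this example
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("EmptyCheckExample")
  .master("local[1]")
  .getOrCreate()

import spark.implicits._

// Empty, but with the Person schema (name: string, age: int)
val emptyDs = spark.emptyDataset[Person]
emptyDs.printSchema()
```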

Using isEmpty of the DataFrame or Dataset

The isEmpty function of the DataFrame or Dataset (available since Spark 2.4) returns true when the Dataset is empty and false when it is not.


df.isEmpty

Alternatively, you can also check whether the DataFrame is empty by fetching at most one row.


df.head(1).isEmpty

Note that calling df.head() or df.first() on an empty DataFrame throws a java.util.NoSuchElementException: next on empty iterator exception.
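If you do need the first row and the DataFrame might be empty, you can guard the call instead of letting it throw; a sketch (not from the original article) using scala.util.Try:

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, Row}

// head(1) fetches at most one row, so this avoids both a full scan
// and the NoSuchElementException that df.head() throws when empty.
def isEmptySafe(df: DataFrame): Boolean = df.head(1).isEmpty

// When you actually want the row, wrap first() in Try:
// yields None for an empty DataFrame instead of throwing.
def firstRowOption(df: DataFrame): Option[Row] = Try(df.first()).toOption
```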

You can also use the check below, but it is not as efficient as the approaches above, so use it only when you have a small dataset. df.count() computes the count across all partitions on all nodes, so avoid it when you have millions of records.


println(df.count() > 0)
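If you still prefer a count-based check, limiting the DataFrame first keeps Spark from counting every row; a common sketch (not from the original article):

```scala
// Materializes at most one row, unlike a full df.count()
// over all partitions.
val notEmpty = df.limit(1).count() > 0
```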

Using isEmpty of the RDD

This is the most performant way to check whether a DataFrame or Dataset is empty.


df.rdd.isEmpty()

Conclusion

In summary, we can check whether a Spark DataFrame is empty by using the isEmpty function of the DataFrame, Dataset, or RDD. If you have performance issues calling it on a DataFrame, you can try df.rdd.isEmpty.

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.
