Spark – How to create an empty Dataset?

In this article, I will explain how to create an empty Spark Dataset with several Scala examples. Before we start, let's look at one of the many scenarios where we need to create an empty Dataset.

While working with files in Spark, we sometimes may not receive a file for processing; however, we still need to create a Dataset similar to the one we create when a file arrives. If we don't create it with the same schema, our operations/transformations on the Dataset fail because we refer to columns that may not be present.

To handle situations like these, we should always create a Dataset with the same schema, meaning the same column names and data types, regardless of whether the file exists or an empty file is being processed.
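The scenario above can be sketched as a simple fallback: read the file when it exists, otherwise produce an empty Dataset with the same schema. The file path here is a hypothetical example, and this assumes the `spark` session, `Name` case class, and `spark.implicits._` import defined in the next section.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical input path, for illustration only
val inputPath = "data/names.csv"

// Fall back to an empty Dataset[Name] when the file is missing,
// so downstream transformations see the same columns either way.
val names: Dataset[Name] =
  if (Files.exists(Paths.get(inputPath)))
    spark.read.option("header", "true").csv(inputPath).as[Name]
  else
    Seq.empty[Name].toDS() // same schema, zero rows
```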

First, let's create a SparkSession, a StructType schema, and a case class, which we will use throughout the examples.


  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  import spark.implicits._

  val schema = StructType(
    StructField("firstName", StringType, true) ::
      StructField("lastName", StringType, true) ::
      StructField("middleName", StringType, true) :: Nil)

  val colSeq = Seq("firstName","lastName","middleName")

  case class Name(firstName: String, lastName: String, middleName: String)

Using emptyDataset() to create empty Dataset

SparkSession provides an emptyDataset() method, which returns an empty Dataset. Since we call it with an encoder for the Name case class, the resulting Dataset carries the schema derived from Name (three columns, zero rows). The following sections show other ways to achieve the same result.

val ds = spark.emptyDataset[Name]

Using createDataset() to create empty Dataset

We can also create an empty Dataset using the createDataset() method of SparkSession. The snippet below shows examples using empty Seq objects and an empty RDD.


spark.createDataset(Seq.empty[Name])
spark.createDataset(Seq.empty[(String,String,String)])
spark.createDataset(spark.sparkContext.emptyRDD[Name])

Using implicit encoder

Let's see another way, which uses the implicit encoders brought in by importing spark.implicits._.


val ds = Seq.empty[(String,String,String)].toDS()

Using case class

We can also create an empty Dataset with the desired schema from a Scala case class.


val ds = Seq.empty[Name].toDS()

All case-class based examples above produce a Dataset with zero records and the schema below. The tuple-based variants have the same types but default column names (_1, _2, _3).


root
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- middleName: string (nullable = true)
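For the tuple-based variants, the colSeq defined earlier can be used to replace the default _1, _2, _3 column names. A small sketch:

```scala
// Rename the default tuple columns using colSeq, then convert
// back to a typed Dataset[Name]. toDF(...) returns a DataFrame;
// .as[Name] re-applies the case-class encoder.
val renamed = Seq.empty[(String, String, String)]
  .toDS()
  .toDF(colSeq: _*)
  .as[Name]

renamed.printSchema() // now shows firstName, lastName, middleName
```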

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment using Scala and Maven.
