Spark - How to create an empty DataFrame?


Post author: Naveen Nelamali
Post category: Apache Spark
Post last modified: March 27, 2024

In this article, I will explain how to create an empty Spark DataFrame, with several Scala examples. First, here is one of the many scenarios in which an empty DataFrame is needed.

While working with files, we sometimes do not receive a file for processing, yet we still need to create a DataFrame with the same schema as the one we create when a file does arrive. If the schema differs, later operations and transformations on the DataFrame fail because they refer to columns that may not be present.

To handle situations like this, we need to create a DataFrame with the same schema, meaning the same column names and data types, regardless of whether the file exists or is empty.

First, let's create the schema, the column names, and the case class that I will use in the rest of the article.


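  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Brings toDF() and the implicit encoders into scope
  import spark.implicits._

  // All three columns are nullable strings, matching the case class below
  val schema = StructType(
    StructField("firstName", StringType, true) ::
      StructField("lastName", StringType, true) ::
      StructField("middleName", StringType, true) :: Nil)

  val colSeq = Seq("firstName","lastName","middleName")

  case class Name(firstName: String, lastName: String, middleName: String)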

1. Creating an empty DataFrame (Spark 2.x and above)

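SparkSession provides emptyDataFrame (accessed without parentheses), which returns an empty DataFrame with an empty schema, that is, no columns at all. Usually, though, we want an empty DataFrame with a specified StructType schema, which the remaining sections cover.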

val df = spark.emptyDataFrame
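
To see why this is rarely enough on its own, here is a minimal sketch that inspects the result. It assumes a local SparkSession (the app name is made up for illustration) and can be pasted into a Scala script or spark-shell.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("EmptyDataFrameCheck") // hypothetical app name
  .getOrCreate()

val df = spark.emptyDataFrame

df.printSchema()               // prints only the "root" line, with no fields under it
println(df.columns.isEmpty)    // true: the DataFrame has no columns
println(df.count())            // 0: the DataFrame has no rows
```

Because there are no columns, an expression such as df.select("firstName") cannot be resolved against this DataFrame, which is the failure described in the introduction.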

2. Create empty DataFrame with schema (StructType)

Use createDataFrame() from SparkSession, passing an empty RDD[Row] together with the schema.


val df = spark.createDataFrame(spark.sparkContext
      .emptyRDD[Row], schema)
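
As a self-contained sketch of the same call, the snippet below rebuilds the three-column string schema inline and checks that the columns exist even though there are no rows. The app name is an assumption made for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("EmptyWithSchema") // hypothetical app name
  .getOrCreate()

// Three nullable string columns
val schema = StructType(Seq(
  StructField("firstName", StringType, nullable = true),
  StructField("lastName", StringType, nullable = true),
  StructField("middleName", StringType, nullable = true)
))

// An empty RDD of Row plus an explicit schema gives an empty, typed DataFrame
val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

df.printSchema()
println(df.count()) // 0

// Column references resolve because the schema is present
val selected = df.select("firstName", "lastName")
println(selected.columns.mkString(", "))
```

The benefit of this approach is that the schema is fully explicit, including nullability, so it can be shared with the code path that reads a real file.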

3. Using implicit encoder

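Let’s see another way, which uses implicit encoders: convert an empty Seq of tuples with toDF() and pass the column names. This requires import spark.implicits._ to be in scope.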


Seq.empty[(String,String,String)].toDF(colSeq:_*)
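
A minimal runnable sketch of this approach follows, assuming a local SparkSession with a made-up app name. The number of names passed to toDF() has to match the arity of the tuple; if no names are passed, the columns are called _1, _2 and _3.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("EmptyFromTuples") // hypothetical app name
  .getOrCreate()

// Needed for toDF() on a local Seq
import spark.implicits._

val colSeq = Seq("firstName", "lastName", "middleName")

// Column types come from the tuple's type parameters, names from colSeq
val df = Seq.empty[(String, String, String)].toDF(colSeq: _*)
df.printSchema()

// Without explicit names, Spark falls back to the tuple field names
val unnamed = Seq.empty[(String, String, String)].toDF()
println(unnamed.columns.mkString(", ")) // _1, _2, _3
```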

4. Using case class

We can also create an empty DataFrame with the schema we want from a Scala case class.


Seq.empty[Name].toDF()
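
The sketch below shows the case class route end to end, assuming a local SparkSession with a made-up app name. In a compiled application the case class generally needs to be declared outside the method that uses it so that Spark can derive an encoder. spark.emptyDataset[Name] is shown as an alternative that keeps the typed Dataset.

```scala
import org.apache.spark.sql.SparkSession

// Declared at top level so an encoder can be derived for it
case class Name(firstName: String, lastName: String, middleName: String)

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("EmptyFromCaseClass") // hypothetical app name
  .getOrCreate()

import spark.implicits._

// Column names and types are taken from the case class fields
val df = Seq.empty[Name].toDF()
df.printSchema()

// Alternative: start from an empty typed Dataset[Name]
val ds = spark.emptyDataset[Name]
println(ds.toDF().schema == df.schema) // true: both derive the schema from Name
```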

Examples 2 through 4 produce the schema below, with zero records in the DataFrame. The DataFrame from example 1 (spark.emptyDataFrame) has no columns at all.


root
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- middleName: string (nullable = true)
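
Returning to the missing-file scenario from the introduction, one way to apply this is a small fallback helper. This is only a sketch under stated assumptions: readOrEmpty and the path are hypothetical, the input is assumed to be a CSV file with a header row, and the existence check covers the local filesystem only (HDFS or object storage would need a different check).

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("EmptyFallbackSketch") // hypothetical app name
  .getOrCreate()

val schema = StructType(Seq(
  StructField("firstName", StringType, nullable = true),
  StructField("lastName", StringType, nullable = true),
  StructField("middleName", StringType, nullable = true)
))

// Hypothetical helper: read the file when it exists, otherwise
// return an empty DataFrame with the same schema
def readOrEmpty(path: String): DataFrame = {
  if (Files.exists(Paths.get(path))) {
    spark.read.schema(schema).option("header", "true").csv(path)
  } else {
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  }
}

// Hypothetical path that is assumed not to exist
val df = readOrEmpty("/tmp/does-not-exist/names.csv")

// Downstream code can reference the columns in both cases
df.select("firstName", "lastName").show()
```

Either branch returns the same column names and types, so transformations written against the schema do not need to know whether a file arrived.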

Happy Learning !!

