Spark – How to create an empty RDD?

We often need to create an empty RDD in Spark, and it can be created in several ways: with partitions, without partitions, and as a pair RDD. In this article, we will see these with Scala, Java, and PySpark examples.

1. Spark sc.emptyRDD – Creates an empty RDD with no partitions

In Spark, calling the emptyRDD() function on the SparkContext object creates an empty RDD with no partitions or elements. The examples below create an empty RDD.


// Spark sc.emptyRDD - Creates empty RDD with no partition
val spark:SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()
val rdd = spark.sparkContext.emptyRDD // creates EmptyRDD[0]
val rddString = spark.sparkContext.emptyRDD[String] // creates EmptyRDD[1]
println(rdd)
println(rddString)
println("Num of Partitions: "+rdd.getNumPartitions) // returns o partition

From the above, spark.sparkContext.emptyRDD creates an EmptyRDD[0] and spark.sparkContext.emptyRDD[String] creates an EmptyRDD[1] of String type. Both of these empty RDDs are created with 0 partitions. The println() statements from this example yield the output below.


EmptyRDD[0] at emptyRDD at CreateEmptyRDD.scala:12
EmptyRDD[1] at emptyRDD at CreateEmptyRDD.scala:13
Num of Partitions: 0

Note that writing an empty RDD just creates an output folder containing a zero-size _SUCCESS file and its ._SUCCESS.crc checksum file, with no part files. On Windows, if the Hadoop winutils binaries are not set up, the same call can instead fail with an IOException like the one below.


rdd.saveAsTextFile("test.txt")
//on Windows without winutils this outputs
java.io.IOException: (null) entry in command string: null chmod 0644

Once we have an empty RDD, we can easily create an empty DataFrame from the rdd object.
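
For example, here is a minimal sketch of creating an empty DataFrame from an empty RDD by supplying a schema (the column name below is an illustrative assumption).


// Create an empty DataFrame from an empty RDD[Row] and a schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
emptyDF.printSchema()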

2. Create an Empty RDD with Partition

Using Spark sc.parallelize() we can create an empty RDD with partitions. Writing a partitioned RDD to a file results in the creation of multiple (empty) part files.


// Create an Empty RDD with Partition
val rdd2 = spark.sparkContext.parallelize(Seq.empty[String])
println(rdd2)
println("Num of Partitions: "+rdd2.getNumPartitions)

From the above, spark.sparkContext.parallelize(Seq.empty[String]) creates a ParallelCollectionRDD[2] with 3 partitions. It gets 3 partitions because we set the master to local[3], which makes the default parallelism 3.


ParallelCollectionRDD[2] at parallelize at CreateEmptyRDD.scala:21
Num of Partitions: 3
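
If you need a specific number of partitions, parallelize() also accepts the number of slices as a second argument. A minimal sketch:


// Request 5 partitions explicitly via the numSlices argument
val rdd3 = spark.sparkContext.parallelize(Seq.empty[String], 5)
println("Num of Partitions: "+rdd3.getNumPartitions) // returns 5 partitions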

Here is another example using sc.parallelize(). Note that this RDD is not truly empty; it contains a single element, an empty string.


val emptyRDD = sc.parallelize(Seq(""))

3. Creating an Empty pair RDD

Most often we use RDDs with pairs, hence here is another example of creating an RDD with a pair. This example creates an empty RDD of String & Int pairs.


// Create an empty pair RDD of (String, Int)
type pairRDD = (String,Int)
var resultRDD = spark.sparkContext.emptyRDD[pairRDD]

Yields below output.


// Output:
EmptyRDD[3] at emptyRDD at CreateEmptyRDD.scala:30
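
A common real-time use for an empty pair RDD is as a seed that results are unioned into inside a loop. Below is a minimal sketch; the list of words is an illustrative assumption.


// Union (String, Int) results into the initially empty pair RDD
for (word <- Seq("spark", "rdd")) { // illustrative input
  resultRDD = resultRDD.union(spark.sparkContext.parallelize(Seq((word, 1))))
}
resultRDD.collect().foreach(println) // prints (spark,1) and (rdd,1)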

4. Java – creating an empty RDD

Similar to Scala, in Java we can also create an empty RDD by calling the emptyRDD() function on the JavaSparkContext object.


// Create a JavaSparkContext and call emptyRDD() on it
SparkConf conf = new SparkConf().setMaster("local[3]").setAppName("SparkByExamples.com");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> emptyRDD = jsc.emptyRDD(); // element type is inferred from the declaration

5. PySpark – creating an empty RDD
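
PySpark provides the same emptyRDD() function on the SparkContext object. Below is a minimal sketch mirroring the Scala example above.


# PySpark - creates an empty RDD with no partitions
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[3]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)
print("Num of Partitions: " + str(emptyRDD.getNumPartitions())) # returns 0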

6. Complete example in Scala

The complete code is also available in the GitHub project.


package com.sparkbyexamples.spark.rdd

import org.apache.spark.sql.SparkSession

object CreateEmptyRDD extends App{

  val spark:SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val rdd = spark.sparkContext.emptyRDD // creates EmptyRDD[0]
  val rddString = spark.sparkContext.emptyRDD[String] // creates EmptyRDD[1]

  println(rdd)
  println(rddString)
  println("Num of Partitions: "+rdd.getNumPartitions) // returns o partition

  // RddString.saveAsTextFile("test.txt") 

  val rdd2 = spark.sparkContext.parallelize(Seq.empty[String])
  println(rdd2)
  println("Num of Partitions: "+rdd2.getNumPartitions)

  // Rdd2.saveAsTextFile("test2.txt")

  // Pair RDD

  type dataType = (String,Int)
  var pairRDD = spark.sparkContext.emptyRDD[dataType]
  println(pairRDD)

}

In this article, you have learned how to create an empty RDD in Spark with partitions, without partitions, and finally as a pair RDD. Hope it helps you.

Happy Learning !!
