PySpark – Create an Empty DataFrame & RDD

  • Post category: PySpark
  • Post last modified: January 17, 2024

In this article, I will explain how to create an empty PySpark DataFrame or RDD manually, with or without a schema (column names), in different ways. Below, I describe one of the many scenarios where we need to create an empty DataFrame.

While working with files, sometimes we may not receive a file for processing, yet we still need to create a DataFrame manually with the schema we expect. If we don't create it with the matching schema, our operations/transformations on the DataFrame (like union) fail because we refer to columns that may not be present.

To handle situations like these, we always need to create a DataFrame with the same schema, meaning the same column names and data types, regardless of whether the file exists or an empty file is processed.

1. Create Empty RDD in PySpark

Create an empty RDD by using emptyRDD() of SparkContext, for example spark.sparkContext.emptyRDD().


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Creates an empty RDD
emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)

#Displays
#EmptyRDD[188] at emptyRDD

Alternatively, you can also get an empty RDD by using spark.sparkContext.parallelize([]).


#Creates an empty RDD using parallelize
rdd2 = spark.sparkContext.parallelize([])
print(rdd2)

#Displays
#ParallelCollectionRDD[206] at readRDDFromFile at PythonRDD.scala:262

Note: If you try to perform operations on an empty RDD, you will get ValueError("RDD is empty").

2. Create Empty DataFrame with Schema (StructType)

In order to create an empty PySpark DataFrame manually with a schema (column names and data types), first create the schema using StructType and StructField.


#Create Schema
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

Now use the empty RDD created above and pass it to createDataFrame() of SparkSession along with the schema for column names & data types.


#Create empty DataFrame from empty RDD
df = spark.createDataFrame(emptyRDD, schema)
df.printSchema()

This yields the below schema for the empty DataFrame.


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)

3. Convert Empty RDD to DataFrame

You can also create an empty DataFrame by converting the empty RDD to a DataFrame using toDF().


#Convert empty RDD to DataFrame
df1 = emptyRDD.toDF(schema)
df1.printSchema()

4. Create Empty DataFrame with Schema (without RDD)

So far I have covered creating an empty DataFrame from an RDD, but here I will create it manually with a schema and without an RDD.


#Create empty DataFrame directly.
df2 = spark.createDataFrame([], schema)
df2.printSchema()

5. Create Empty DataFrame without Schema (no columns)

To create an empty DataFrame without a schema (no columns), just create an empty schema and use it while creating the PySpark DataFrame.


#Create empty DataFrame with no schema (no columns)
df3 = spark.createDataFrame([], StructType([]))
df3.printSchema()

#Prints the below empty schema
#root

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
