PySpark – Create an Empty DataFrame & RDD

  • Post category: PySpark
  • Post last modified: January 17, 2024

In this article, I will explain how to create an empty PySpark DataFrame or RDD manually, with or without a schema (column names), in different ways. Below, I describe one of the many scenarios where we need to create an empty DataFrame.

While working with files, sometimes we may not receive a file for processing, yet we still need to create a DataFrame manually with the schema we expect. If we don't create it with the matching schema, our operations/transformations on the DataFrame (like union) fail because we refer to columns that may not be present.

To handle situations like these, we always need to create a DataFrame with the same schema, meaning the same column names and data types, regardless of whether the file exists or an empty file is processed.

1. Create Empty RDD in PySpark

Create an empty RDD by using emptyRDD() of SparkContext, for example spark.sparkContext.emptyRDD().


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Creates an empty RDD
emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)

#Displays
#EmptyRDD[188] at emptyRDD

Alternatively, you can also get an empty RDD by using spark.sparkContext.parallelize([]).


#Creates an empty RDD using parallelize
rdd2 = spark.sparkContext.parallelize([])
print(rdd2)

#Displays
#ParallelCollectionRDD[206] at readRDDFromFile at PythonRDD.scala:262

Note: If you try to perform operations on an empty RDD, you will get ValueError("RDD is empty").

2. Create Empty DataFrame with Schema (StructType)

In order to create an empty PySpark DataFrame manually with a schema (column names and data types), first create the schema using StructType and StructField.


#Create Schema
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

Now use the empty RDD created above and pass it to createDataFrame() of SparkSession along with the schema for column names & data types.


#Create empty DataFrame from empty RDD
df = spark.createDataFrame(emptyRDD, schema)
df.printSchema()

This yields the below schema for the empty DataFrame.


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)

3. Convert Empty RDD to DataFrame

You can also create an empty DataFrame by converting the empty RDD to a DataFrame using toDF().


#Convert empty RDD to DataFrame
df1 = emptyRDD.toDF(schema)
df1.printSchema()

4. Create Empty DataFrame with Schema (without RDD)

So far I have covered creating an empty DataFrame from an RDD, but here I will create it manually with a schema and without an RDD.


#Create empty DataFrame directly.
df2 = spark.createDataFrame([], schema)
df2.printSchema()

5. Create Empty DataFrame without Schema (no columns)

To create an empty DataFrame without a schema (no columns), just create an empty schema and use it while creating the PySpark DataFrame.


#Create empty DataFrame with no schema (no columns)
df3 = spark.createDataFrame([], StructType([]))
df3.printSchema()

#Prints the below empty schema
#root

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
