• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:5 mins read
You are currently viewing PySpark Convert DataFrame to RDD

PySpark dataFrameObject.rdd is used to convert PySpark DataFrame to RDD; there are several transformations that are not available in DataFrame but present in RDD hence you often required to convert PySpark DataFrame to RDD.

Since PySpark 1.3, it provides a property .rdd on DataFrame which returns the PySpark RDD class object of DataFrame (converts DataFrame to RDD).


rddObj=df.rdd

Convert PySpark DataFrame to RDD

PySpark DataFrame is a list of Row objects, when you run df.rdd, it returns the value of type RDD<Row>, let’s see with an example. First create a simple DataFrame


data = [('James',3000),('Anna',4001),('Robert',6200)]
df = spark.createDataFrame(data,["name","salary"])
df.show()

#Displays
+------+------+
|  name|salary|
+------+------+
| James|  3000|
|  Anna|  4001|
|Robert|  6200|
+------+------+

Example

The below example converts DataFrame to RDD and displays the RDD after collect().


#converts DataFrame to rdd
rdd=df.rdd

print(rdd.collect())
#Displays
[Row(name='James', salary=3000), Row(name='Anna', salary=4001), Row(name='Robert', salary=6200)]

PySpark doesn’t have a partitionBy(), map(), mapPartitions() transformations and these are present in RDD so let’s see an example of converting DataFrame to RDD and applying map() transformation. Read PySpark map() vs mapPartitions() to learn the differences.


#Apply map() transformation
rdd2=df.rdd.map(lambda x: [x[0],x[1]*20/100])
print(rdd2.collect())

#Displays
[['James', 600.0], ['Anna', 800.2], ['Robert', 1240.0]]

In case if you wanted to Convert PySpark DataFrame to Python List

Below is an example of Convert RDD back to PySpark DataFrame by using toDF() function.


#Conver back to DataFrame
df2=rdd2.toDF(["name","bonus"])
df2.show()

#Displays
+------+------+
|  name| bonus|
+------+------+
| James| 600.0|
|  Anna| 800.2|
|Robert|1240.0|
+------+------+

Conclusion

Spark DataFrame doesn’t have methods like map(), mapPartitions() and partitionBy() instead they are available on RDD hence you often need to convert DataFrame to RDD and back to DataFrame.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium