PySpark's dataFrameObject.rdd is used to convert a PySpark DataFrame to an RDD. Several transformations are not available on a DataFrame but are present on an RDD, so you often need to convert a PySpark DataFrame to an RDD. Since PySpark 1.3, DataFrame provides a property .rdd that returns the underlying PySpark RDD of the DataFrame (converts the DataFrame to an RDD).
rddObj = df.rdd
Convert PySpark DataFrame to RDD
A PySpark DataFrame is a distributed collection of Row objects; when you run df.rdd, it returns a value of type RDD[Row]. Let's see this with an example. First, create a simple DataFrame:
from pyspark.sql import SparkSession

# Create a SparkSession (the entry point for the DataFrame API)
spark = SparkSession.builder.appName("DataFrameToRDD").getOrCreate()

data = [('James',3000),('Anna',4001),('Robert',6200)]
df = spark.createDataFrame(data, ["name","salary"])
df.show()
#Displays
+------+------+
| name|salary|
+------+------+
| James| 3000|
| Anna| 4001|
|Robert| 6200|
+------+------+
Example
The example below converts the DataFrame to an RDD and displays the RDD contents after collect().
# Convert DataFrame to RDD
rdd = df.rdd
print(rdd.collect())
#Displays
[Row(name='James', salary=3000), Row(name='Anna', salary=4001), Row(name='Robert', salary=6200)]
A PySpark DataFrame doesn't have transformations such as partitionBy(), map(), and mapPartitions(); these are present on RDD, so let's see an example of converting a DataFrame to an RDD and applying the map() transformation. Read PySpark map() vs mapPartitions() to learn the differences.
# Apply map() transformation: compute a 20% bonus for each row
rdd2 = df.rdd.map(lambda x: [x[0], x[1] * 20 / 100])
print(rdd2.collect())
#Displays
[['James', 600.0], ['Anna', 800.2], ['Robert', 1240.0]]
In case you want to convert a PySpark DataFrame to a Python list, see Convert PySpark DataFrame to Python List.
Below is an example of converting the RDD back to a PySpark DataFrame using the toDF() function.
# Convert back to DataFrame
df2 = rdd2.toDF(["name","bonus"])
df2.show()
#Displays
+------+------+
| name| bonus|
+------+------+
| James| 600.0|
| Anna| 800.2|
|Robert|1240.0|
+------+------+
Conclusion
A Spark DataFrame doesn't have methods like map(), mapPartitions(), and partitionBy(); instead, these are available on RDD. Hence, you often need to convert a DataFrame to an RDD and back to a DataFrame.
Happy Learning !!
Related Articles
- PySpark RDD Actions with examples
- PySpark RDD Transformations with examples
- Convert PySpark RDD to DataFrame
- PySpark Create RDD with Examples
- PySpark Row using on DataFrame and RDD
- PySpark – Create an Empty DataFrame & RDD
- PySpark parallelize() – Create RDD from a list data
- PySpark repartition() – Explained with Examples
- PySpark SparkContext Explained