(Spark with Python) A PySpark DataFrame can be converted to a Python pandas DataFrame using the toPandas() method. In this article, I will explain how to create a pandas DataFrame from a PySpark (Spark) DataFrame with examples.
Before we start, let's first understand the main difference between pandas and PySpark: operations in PySpark run faster than in pandas due to its distributed nature and parallel execution across multiple cores and machines.
In other words, pandas runs operations on a single node whereas PySpark runs on multiple machines. If you are working on a Machine Learning application dealing with larger datasets, PySpark processes operations many times faster than pandas. Refer to pandas DataFrame Tutorial beginners guide with examples.
After processing data in PySpark, we often need to convert it back to a pandas DataFrame for further processing in a Machine Learning application or any other Python application.
Prepare PySpark DataFrame
In order to explain with an example, first let’s create a PySpark DataFrame.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","","Smith","36636","M",60000),
("Michael","Rose","","40288","M",70000),
("Robert","","Williams","42114","",400000),
("Maria","Anne","Jones","39192","F",500000),
("Jen","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)
This yields below schema and result of the DataFrame.
root
|-- first_name: string (nullable = true)
|-- middle_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- dob: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: long (nullable = true)
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|Michael |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Jen |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
Convert PySpark DataFrame to Pandas DataFrame
PySpark DataFrame provides a method toPandas()
to convert it to a Python pandas DataFrame.
toPandas()
collects all records of the PySpark DataFrame to the driver program, so it should be done only on a small subset of the data. Running it on a larger dataset results in a memory error and crashes the application. To deal with a larger dataset, you can also try increasing the memory on the driver.
pandasDF = pysparkDF.toPandas()
print(pandasDF)
This yields the below pandas DataFrame. Note that pandas adds a sequence number to the result as a row index. You can rename pandas columns by using the rename() function.
  first_name middle_name last_name    dob gender  salary
0      James                 Smith  36636      M   60000
1    Michael        Rose            40288      M   70000
2     Robert              Williams  42114         400000
3      Maria        Anne     Jones  39192      F  500000
4        Jen        Mary     Brown             F       0
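As mentioned above, columns of the converted pandas DataFrame can be renamed with the rename() function. A minimal sketch on a small pandas DataFrame mirroring the converted result (the new column names here are illustrative):

```python
import pandas as pd

# Small pandas DataFrame mirroring part of the converted result
pandasDF = pd.DataFrame(
    {"first_name": ["James", "Michael"], "salary": [60000, 70000]}
)

# rename() takes a mapping of old column names to new ones and
# returns a new DataFrame, leaving the original unchanged
renamedDF = pandasDF.rename(columns={"first_name": "fname", "salary": "pay"})
print(renamedDF.columns.tolist())  # ['fname', 'pay']
```

Pass inplace=True to rename() if you want to modify the DataFrame in place instead of getting a copy back.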
I have a dedicated Python pandas Tutorial with Examples where I explain pandas concepts in detail.
Convert Spark Nested Struct DataFrame to Pandas
Most of the time, data in a PySpark DataFrame will be in a structured format, meaning one column contains other columns, so let’s see how it converts to pandas. Here is an example with a nested struct where firstname
, middlename
and lastname
are part of the name
column.
# Nested structure elements
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
dataStruct = [(("James","","Smith"),"36636","M","3000"), \
(("Michael","Rose",""),"40288","M","4000"), \
(("Robert","","Williams"),"42114","M","4000"), \
(("Maria","Anne","Jones"),"39192","F","4000"), \
(("Jen","Mary","Brown"),"","F","-1") \
]
schemaStruct = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('dob', StringType(), True),
StructField('gender', StringType(), True),
StructField('salary', StringType(), True)
])
df = spark.createDataFrame(data=dataStruct, schema = schemaStruct)
df.printSchema()
pandasDF2 = df.toPandas()
print(pandasDF2)
Converting the structured DataFrame to a pandas DataFrame produces the below output.
                   name    dob gender salary
0      (James, , Smith)  36636      M   3000
1     (Michael, Rose, )  40288      M   4000
2  (Robert, , Williams)  42114      M   4000
3  (Maria, Anne, Jones)  39192      F   4000
4    (Jen, Mary, Brown)             F     -1
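Notice that the name column arrives in pandas as struct values rather than separate columns. If you want flat columns on the pandas side, one approach (a sketch, assuming each entry behaves like a (firstname, middlename, lastname) tuple, which is how the struct values unpack) is to expand them:

```python
import pandas as pd

# Mimic the converted DataFrame: each 'name' entry is a (first, middle, last) tuple
pandasDF2 = pd.DataFrame({
    "name": [("James", "", "Smith"), ("Michael", "Rose", "")],
    "salary": ["3000", "4000"],
})

# Expand the tuples into three separate columns
nameDF = pd.DataFrame(
    pandasDF2["name"].tolist(),
    columns=["firstname", "middlename", "lastname"],
)

# Recombine with the remaining columns
flatDF = pd.concat([nameDF, pandasDF2.drop(columns=["name"])], axis=1)
print(flatDF.columns.tolist())
```

Alternatively, you can flatten on the Spark side before converting, by selecting the individual struct fields (e.g. name.firstname) into top-level columns.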
Conclusion
In this simple article, you have learned how to convert a Spark DataFrame to pandas using the toPandas()
function of the Spark DataFrame, and you have also seen a similar example with complex nested struct elements. Remember that toPandas()
collects all records of the DataFrame to the driver program and should be done only on a small subset of the data.
Happy Learning !!
Reference: https://docs.databricks.com/spark/latest/spark-sql/spark-pandas.html