Pandas vs PySpark DataFrame With Examples

Let’s learn the differences between Pandas and PySpark DataFrames: their definitions, features, and advantages, how to create each one, and how to convert one to the other, with examples.

What is Pandas?

Pandas is one of the most widely used open-source Python libraries for working with structured, tabular data. The Pandas library is heavily used for data analytics, machine learning, data science projects, and much more.

Pandas can load data by reading CSV, JSON, SQL, and many other formats, and it creates a DataFrame, which is a structured object containing rows and columns (similar to a SQL table).

It doesn’t support distributed processing, so you would always need to increase the resources of a single machine when you need additional horsepower to support your growing data.

Pandas DataFrames are mutable and eagerly evaluated (not lazy), and statistical functions are applied to each column by default. You can learn more about pandas in the pandas DataFrame Tutorial For Beginners Guide.

Pandas DataFrame Example

In order to use the Pandas library in Python, you need to import it using import pandas as pd.

The example below creates a Pandas DataFrame from a list.


import pandas as pd    
data = [["James","","Smith",30,"M",60000], 
        ["Michael","Rose","",50,"M",70000], 
        ["Robert","","Williams",42,"",400000], 
        ["Maria","Anne","Jones",38,"F",500000], 
        ["Jen","Mary","Brown",45,None,0]] 
columns=['First Name','Middle Name','Last Name','Age','Gender','Salary']

# Create the pandas DataFrame 
pandasDF=pd.DataFrame(data=data, columns=columns) 
  
# print dataframe. 
print(pandasDF)

This outputs the below data on the console. Note that Pandas adds an index sequence number to every DataFrame.

Pandas vs PySpark DataFrame

Pandas Transformations

Below are some transformations you can perform on a Pandas DataFrame. Note that statistical functions are calculated on each column by default; you don’t have to explicitly specify which columns you want to apply them to. Even the count() function returns the count of each column (ignoring null/None values).

  • df.count() – Returns the count of each column (the count includes only non-null values).
  • df.corr() – Returns the correlation between columns in a data frame.
  • df.head(n) – Returns first n rows from the top.
  • df.max() – Returns the maximum of each column.
  • df.mean() – Returns the mean of each column.
  • df.median() – Returns the median of each column.
  • df.min() – Returns the minimum value in each column.
  • df.std() – Returns the standard deviation of each column.
  • df.tail(n) – Returns last n rows.

print(pandasDF.count())
First Name     5
Middle Name    5
Last Name      5
Age            5
Gender         4
Salary         5

print(pandasDF.max())
First Name       Robert
Middle Name        Rose
Last Name      Williams
Age                  50
Salary           500000

print(pandasDF.mean())
Age           41.0
Salary    206000.0
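
As a small addition to the examples above, here are a few more of the transformations from the list, run on the same pandasDF (the numeric columns are selected explicitly so that corr(), median(), and std() behave the same across pandas versions):

print(pandasDF.head(2))                     # first 2 rows
print(pandasDF.tail(2))                     # last 2 rows
print(pandasDF[["Age","Salary"]].corr())    # correlation between the numeric columns
print(pandasDF[["Age","Salary"]].median())  # median of each numeric column
print(pandasDF[["Age","Salary"]].std())     # standard deviation of each numeric column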

What is PySpark?

In very simple words, Pandas runs operations on a single machine whereas PySpark runs on multiple machines. If you are working on a machine learning application where you are dealing with larger datasets, PySpark is a better fit, as it can process operations many times (up to 100x) faster than Pandas.

PySpark is widely used in the data science and machine learning community, as it works alongside many popular Python data science libraries such as NumPy and TensorFlow and processes large datasets efficiently. PySpark has been used by many organizations like Walmart, Trivago, Sanofi, Runtastic, and many more.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark we can run applications in parallel on a distributed cluster (multiple nodes) or even on a single node.

Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.

Difference Between Pandas vs PySpark
source: https://databricks.com/

Spark is basically written in Scala, and later, due to its industry adoption, its API PySpark was released for Python using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the machine learning community), which comes with a lot of useful tools like the Spyder IDE and Jupyter Notebook to run PySpark applications.

PySpark Features

  • In-memory computation
  • Distributed processing using parallelize (see the short sketch after this list)
  • Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
  • Fault-tolerant
  • Immutable
  • Lazy evaluation
  • Cache & persistence
  • Built-in optimization when using DataFrames
  • Supports ANSI SQL
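
The short sketch below illustrates two of the features above, distributed processing with parallelize and cache & persistence; the application name is just a placeholder for illustration:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is a placeholder
spark = SparkSession.builder.appName('FeaturesSketch').getOrCreate()

# Distributed processing using parallelize: the data is split across partitions
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.sum())       # action that runs in parallel across the partitions

# Cache & persistence with lazy evaluation
df = spark.range(10)   # lazily evaluated DataFrame with ids 0..9
df.cache()             # keep the data in memory after the first action
print(df.count())      # the first action materializes the DataFrame and populates the cache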

PySpark Advantages

  • PySpark is a general-purpose, in-memory, distributed processing engine that allows you to process data efficiently in a distributed fashion.
  • Applications running on PySpark can be up to 100x faster than traditional single-machine systems.
  • You will get great benefits from using PySpark for data ingestion pipelines.
  • Using PySpark we can process data from Hadoop HDFS, AWS S3, and many file systems.
  • PySpark is also used to process real-time data using Streaming and Kafka.
  • Using PySpark Streaming you can also stream files from the file system as well as from a socket (a short sketch follows this list).
  • PySpark natively has machine learning and graph libraries.
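
As a taste of the streaming capability mentioned above, below is a minimal Structured Streaming sketch that reads lines from a socket and writes each micro-batch to the console. It assumes a socket server is listening on localhost:9999 (for example, one started with nc -lk 9999), and the application name is just a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('StreamingSketch').getOrCreate()

# Read a stream of lines from a socket (host/port are assumptions for this sketch)
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Write each micro-batch to the console until the query is stopped
query = lines.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()
query.awaitTermination()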

PySpark Modules & Packages

  • PySpark RDD (pyspark.RDD)
  • PySpark DataFrame and SQL (pyspark.sql)
  • PySpark Streaming (pyspark.streaming)
  • PySpark MLlib (pyspark.ml, pyspark.mllib)
  • PySpark GraphFrames (GraphFrames)
  • PySpark Resource (pyspark.resource) – new in PySpark 3.0
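
For reference, typical imports from the packages listed above look like the following (GraphFrames is a separate package that has to be installed and added to the Spark session before it can be imported):

from pyspark import RDD                            # low-level RDD API
from pyspark.sql import SparkSession, DataFrame    # DataFrame and SQL
from pyspark.streaming import StreamingContext     # legacy DStream streaming API
from pyspark.ml.feature import VectorAssembler     # DataFrame-based machine learning
from pyspark.mllib.stat import Statistics          # RDD-based MLlib
import pyspark.resource                            # resource profiles, new in PySpark 3.0
# from graphframes import GraphFrame               # requires the separate graphframes package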

PySpark DataFrame Example

A PySpark DataFrame is immutable (it cannot be changed once created) and fault-tolerant, and its transformations are lazily evaluated (they are not executed until actions are called). PySpark DataFrames are distributed across the cluster (meaning the data in a PySpark DataFrame is stored on different machines in the cluster), and any operations in PySpark execute in parallel on all machines.


from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
               .appName('SparkByExamples.com') \
               .getOrCreate()

data = [("James","","Smith",30,"M",60000),
        ("Michael","Rose","",50,"M",70000),
        ("Robert","","Williams",42,"",400000),
        ("Maria","Anne","Jones",38,"F",500000),
        ("Jen","Mary","Brown",45,"F",0)]

columns = ["first_name","middle_name","last_name","Age","gender","salary"]
pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)

This outputs the below schema and DataFrame.

pandas vs pyspark dataframe

Reading a CSV file.


#Read a CSV file
df = spark.read.csv("/tmp/resources/zipcodes.csv")
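
By default, spark.read.csv() treats the first line as data and reads every column as a string. If the file has a header row, a small variation of the read above asks Spark to use it and to infer the column types:

#Read the same CSV file, using the first line as a header and inferring column types
df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/tmp/resources/zipcodes.csv")
df.printSchema()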

PySpark Transformations

PySpark transformations are lazy in nature, meaning they do not execute until actions are called.


from pyspark.sql.functions import mean, col, max
#Example 1 - mean of the age and salary columns
pysparkDF.select(mean("age"), mean("salary")) \
         .show()
#Example 2 - group by gender and aggregate
pysparkDF.groupBy("gender") \
         .agg(mean("age"), mean("salary"), max("salary")) \
         .show()
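
To make the lazy behavior concrete, here is a small sketch using the same pysparkDF: the filter() call is a transformation and only builds a query plan, and nothing runs on the cluster until an action such as count() is called.

#Lazy evaluation: filter() is a transformation and only builds a plan
highEarnersDF = pysparkDF.filter(col("salary") > 100000)
#count() is an action and triggers the actual execution
print(highEarnersDF.count())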

PySpark SQL Compatible

PySpark supports SQL queries to run transformations. All you need to do is create a Table/View from the PySpark DataFrame.


pysparkDF.createOrReplaceTempView("Employee")
spark.sql("select * from Employee where salary > 100000").show()
#Prints result
+----------+-----------+---------+---+------+------+
|first_name|middle_name|last_name|Age|gender|salary|
+----------+-----------+---------+---+------+------+
|    Robert|           | Williams| 42|      |400000|
|     Maria|       Anne|    Jones| 38|     F|500000|
+----------+-----------+---------+---+------+------+
spark.sql("select mean(age),mean(salary) from Employee").show()
#Prints result
+---------+------------+
|mean(age)|mean(salary)|
+---------+------------+
|     41.0|    206000.0|
+---------+------------+

Create PySpark DataFrame from Pandas

Due to parallel execution on all cores across multiple machines, PySpark runs operations faster than Pandas; hence, we often need to convert a Pandas DataFrame to a PySpark (Spark with Python) DataFrame for better performance. This is one of the major differences between Pandas and PySpark DataFrames.


#Create PySpark DataFrame from Pandas
pysparkDF2 = spark.createDataFrame(pandasDF) 
pysparkDF2.printSchema()
pysparkDF2.show()

Create Pandas from PySpark DataFrame

Once the transformations are done in Spark, you can easily convert the DataFrame back to Pandas using the toPandas() method.

Note: the toPandas() method is an action that collects the data into Spark driver memory, so you have to be very careful when dealing with large datasets. You will get an out-of-memory error if the collected data doesn’t fit in Spark driver memory.


#Convert PySpark to Pandas
pandasDF = pysparkDF.toPandas()
print(pandasDF)
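
A common way to stay within driver memory is to reduce the data on the Spark side before collecting it. The sketch below, using the same pysparkDF, selects only the columns and rows of interest and caps the result with limit() before calling toPandas():

#Reduce the data on the Spark side before collecting it to the driver
smallPandasDF = pysparkDF.select("first_name", "salary") \
                         .filter("salary > 100000") \
                         .limit(1000) \
                         .toPandas()
print(smallPandasDF)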

Use Apache Arrow to Transfer between Python & JVM

Apache Spark uses Apache Arrow, an in-memory columnar format, to transfer data between Python and the JVM. You need to enable Arrow, as it is disabled by default. You also need to have Apache Arrow (PyArrow) installed on all Spark cluster nodes, for example using pip install pyspark[sql] or by installing PyArrow for Python directly.


spark.conf.set("spark.sql.execution.arrow.enabled","true")

You need to have a Spark-compatible version of Apache Arrow installed to use the above statement. If you have not installed Apache Arrow, you get the error below.


\apps\Anaconda3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.15.1 must be installed; however, it was not found.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

When an error occurs, Spark automatically falls back to the non-Arrow implementation; this behavior can be controlled by spark.sql.execution.arrow.pyspark.fallback.enabled.


spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")

Note: Apache Arrow currently supports all Spark SQL data types except MapType, ArrayType of TimestampType, and nested StructType.

How to Decide Between Pandas vs PySpark

Below are a few considerations for when to choose PySpark over Pandas:

  • If your data is huge and grows significantly over the years and you want to improve your processing time.
  • If you want fault tolerance.
  • If you need ANSI SQL compatibility.
  • If you want a choice of language (Spark supports Python, Scala, Java & R).
  • When you want machine learning capability.
  • If you would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
  • If you want to stream data and process it in real time.

Conclusion

In this article, I have covered, at a very high level, the differences between Pandas and PySpark DataFrames, their features, how to create each one, and how to convert one to the other as needed.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and machine learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
