Let’s learn the difference between Pandas and PySpark DataFrames: their definitions, features, and advantages, how to create them, and how to convert one to the other, with examples.
What is Pandas?
Pandas is one of the most widely used open-source Python libraries for working with structured, tabular data. It is heavily used in data analytics, machine learning, and data science projects.
Pandas can load data from CSV, JSON, SQL, and many other formats into a DataFrame, a structured object containing rows and columns (similar to a SQL table).
Pandas does not support distributed processing, so when your data grows you have to add resources to a single machine to get more horsepower.
Pandas DataFrames are mutable and are not lazy: operations run as soon as they are called, and statistical functions are applied to each column by default. You can learn more about pandas in the pandas DataFrame Tutorial For Beginners Guide.
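The points above (loading from a format such as CSV, mutability, and eager execution) can be sketched in a few lines. This is only an illustration: the inline CSV text and the names `csv_text` and `people_df` are made up for the example, and a real job would pass a file path to `pd.read_csv()` instead.

```python
import io
import pandas as pd

# Hypothetical sample data, kept inline so the example is self-contained
csv_text = """name,age,salary
James,30,60000
Maria,38,500000
"""

# read_csv parses the text into rows and columns and infers column types
people_df = pd.read_csv(io.StringIO(csv_text))

# Mutable: this assignment changes the existing DataFrame in place
people_df.loc[0, "age"] = 31

# Eager: the new column is computed immediately, not deferred
people_df["monthly_salary"] = people_df["salary"] / 12

print(people_df)
```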
Pandas DataFrame Example
To use the Pandas library in Python, import it with
import pandas as pd.
The example below creates a Pandas DataFrame from a list of rows.
import pandas as pd

data = [["James","","Smith",30,"M",60000],
        ["Michael","Rose","",50,"M",70000],
        ["Robert","","Williams",42,"",400000],
        ["Maria","Anne","Jones",38,"F",500000],
        ["Jen","Mary","Brown",45,None,0]]
columns = ['First Name','Middle Name','Last Name','Age','Gender','Salary']

# Create the pandas DataFrame
pandasDF = pd.DataFrame(data=data, columns=columns)

# Print the DataFrame
print(pandasDF)
This prints the DataFrame to the console. Note that Pandas adds a sequential index number to every DataFrame.
Below are some transformations you can perform on a Pandas DataFrame. Statistical functions are calculated for each column by default, so you don’t have to specify which column to apply them to. Even the count() function returns a count for each column (ignoring null/None values).
df.count() – Returns the count of each column (only non-null values are counted).
df.corr() – Returns the correlation between columns in a DataFrame.
df.head(n) – Returns the first n rows.
df.max() – Returns the maximum of each column.
df.mean() – Returns the mean of each column.
df.median() – Returns the median of each column.
df.min() – Returns the minimum of each column.
df.std() – Returns the standard deviation of each column.
df.tail(n) – Returns the last n rows.
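print(pandasDF.count())
First Name     5
Middle Name    5
Last Name      5
Age            5
Gender         4
Salary         5

# Output as produced by older pandas releases, which skipped the Gender
# column because it mixes strings and None
print(pandasDF.max())
First Name       Robert
Middle Name        Rose
Last Name      Williams
Age                  50
Salary           500000

# numeric_only=True is needed on pandas 2.0+, which no longer
# silently skips non-numeric columns
print(pandasDF.mean(numeric_only=True))
Age           41.0
Salary    206000.0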
What is PySpark?
In very simple terms, Pandas runs operations on a single machine whereas PySpark can run them on multiple machines. If you are working on a Machine Learning application that deals with large datasets, PySpark is a good fit: by spreading the work across a cluster it can process such data many times faster than Pandas (figures of up to 100x are often quoted, though the actual gain depends on the workload and cluster).
PySpark is widely used in the Data Science and Machine Learning community, partly because many popular data science libraries, including NumPy and TensorFlow, are written in Python, and partly because it processes large datasets efficiently. PySpark has been used by many organizations such as Walmart, Trivago, Sanofi, Runtastic, and many more.
PySpark is a Spark library written in Python for running Python applications with Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes) or even on a single node.
Apache Spark is an analytical processing engine for large-scale distributed data processing and machine learning applications.
Spark itself is written in Scala. As its industry adoption grew, its Python API, PySpark, was released, built on Py4J.
Py4J is a Java library integrated within PySpark that lets Python dynamically interface with JVM objects. To run PySpark, you therefore need Java installed along with Python and Apache Spark.
Additionally, for development you can use the Anaconda distribution (widely used in the Machine Learning community), which comes with useful tools such as Spyder IDE and Jupyter Notebook for running PySpark applications.
PySpark Features
- In-memory computation
- Distributed processing using parallelize
- Can be used with many cluster managers (Spark, YARN, Mesos, etc.)
- Lazy evaluation
- Cache & persistence
- Built-in optimization when using DataFrames
- Supports ANSI SQL
PySpark Advantages
- PySpark is a general-purpose, in-memory, distributed processing engine that lets you process data efficiently in a distributed fashion.
- Applications running on PySpark can be up to 100x faster than traditional systems, depending on the workload.
- PySpark is well suited to data ingestion pipelines.
- Using PySpark we can process data from Hadoop HDFS, AWS S3, and many other file systems.
- PySpark can also process real-time data using Streaming and Kafka.
- Using PySpark streaming you can also stream files from the file system and stream from a socket.
- PySpark natively has machine learning and graph libraries.
PySpark Modules & Packages
- PySpark RDD (pyspark.RDD)
- PySpark DataFrame and SQL (pyspark.sql)
- PySpark Streaming (pyspark.streaming)
- PySpark MLlib (pyspark.ml, pyspark.mllib)
- PySpark GraphFrames (GraphFrames)
- PySpark Resource (pyspark.resource), new in PySpark 3.0
PySpark DataFrame Example
PySpark DataFrames are immutable (they cannot be changed once created) and fault-tolerant, and their transformations are lazily evaluated (they are not executed until an action is called). PySpark DataFrames are distributed across the cluster (their data is stored on different machines), and operations on them execute in parallel on those machines.
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

data = [("James","","Smith",30,"M",60000),
        ("Michael","Rose","",50,"M",70000),
        ("Robert","","Williams",42,"",400000),
        ("Maria","Anne","Jones",38,"F",500000),
        ("Jen","Mary","Brown",45,"F",0)]
columns = ["first_name","middle_name","last_name","Age","gender","salary"]

pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)
This prints the schema and the DataFrame.
Reading a CSV file.
# Read a CSV file
df = spark.read.csv("/tmp/resources/zipcodes.csv")
PySpark transformations are lazy, meaning they do not execute until an action is called.
from pyspark.sql.functions import mean, col, max

# Example 1: show() is an action and returns None, so its result is not assigned
pysparkDF.select(mean("age"), mean("salary")).show()

# Example 2
pysparkDF.groupBy("gender") \
    .agg(mean("age"), mean("salary"), max("salary")) \
    .show()
PySpark SQL Compatible
PySpark supports SQL queries to run transformations. All you need to do is create a Table/View from the PySpark DataFrame.
pysparkDF.createOrReplaceTempView("Employee")
spark.sql("select * from Employee where salary > 100000").show()

# Prints result
+----------+-----------+---------+---+------+------+
|first_name|middle_name|last_name|Age|gender|salary|
+----------+-----------+---------+---+------+------+
|    Robert|           | Williams| 42|      |400000|
|     Maria|       Anne|    Jones| 38|     F|500000|
+----------+-----------+---------+---+------+------+

spark.sql("select mean(age),mean(salary) from Employee").show()

# Prints result
+---------+------------+
|mean(age)|mean(salary)|
+---------+------------+
|     41.0|    206000.0|
+---------+------------+
Create PySpark DataFrame from Pandas
Because PySpark executes in parallel on all cores of multiple machines, it runs operations on large datasets faster than Pandas, so we often need to convert a Pandas DataFrame to PySpark (Spark with Python) for better performance. This is one of the major differences between Pandas and PySpark DataFrames.
# Create PySpark DataFrame from Pandas
pysparkDF2 = spark.createDataFrame(pandasDF)
pysparkDF2.printSchema()
pysparkDF2.show()
Create Pandas from PySpark DataFrame
Once the transformations are done in Spark, you can easily convert the result back to Pandas using the toPandas() method.
toPandas() is an action that collects the data into the Spark driver's memory, so you have to be very careful with large datasets. You will get an out-of-memory error if the collected data doesn’t fit in the driver's memory.
# Convert PySpark to Pandas
pandasDF = pysparkDF.toPandas()
print(pandasDF)
Use Apache Arrow to Transfer between Python & JVM
Apache Spark can use Apache Arrow, an in-memory columnar format, to transfer data between Python and the JVM. Arrow is disabled by default; enable it by setting spark.sql.execution.arrow.pyspark.enabled to true. You also need Apache Arrow (PyArrow) installed on all Spark cluster nodes, either with
pip install pyspark[sql] or by installing it directly from Apache Arrow for Python.
You need a Spark-compatible version of Apache Arrow installed to use spark.sql.execution.arrow.pyspark.enabled. If Apache Arrow is not installed, you get the warning below.
\apps\Anaconda3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: PyArrow >= 0.15.1 must be installed; however, it was not found. Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
When an error occurs, Spark automatically falls back to the non-Arrow implementation. This behavior is controlled by spark.sql.execution.arrow.pyspark.fallback.enabled.
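Note: As of Spark 3.0, Arrow-based conversion supports all Spark SQL data types except MapType, ArrayType of TimestampType, and nested StructType. The exact list varies by Spark version, so check the documentation for your release.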
How to Decide Between Pandas vs PySpark
Below are a few considerations for when to choose PySpark over Pandas:
- Your data is huge and grows significantly over the years, and you want to improve your processing time.
- You want fault tolerance.
- You need ANSI SQL compatibility.
- You want a choice of language (Spark supports Python, Scala, Java & R).
- You want machine-learning capability.
- You would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
- You want to stream data and process it in real time.
In this article, I have covered at a very high level the difference between Pandas and PySpark DataFrames, their features, how to create each one, and how to convert between them as needed.
Happy Learning !!