Before we learn about PySpark, let's first understand what Apache Spark is. In simple words, Apache Spark is an open-source framework written in Scala for processing large datasets in a distributed manner (across a cluster). Thanks to its in-memory processing, Spark can run workloads up to 100 times faster than traditional disk-based processing.
What is PySpark?
PySpark is the Python API for Apache Spark, used to process large datasets on a distributed cluster. It lets you write applications in Python while using Apache Spark's processing capabilities.
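To get a feel for it, here is a minimal sketch of a PySpark application: create a SparkSession, build a small DataFrame, and display it. The application name and sample data below are made up for illustration.

```python
# Minimal PySpark sketch: start a SparkSession, create a DataFrame, show it.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HelloPySpark") \
    .getOrCreate()

# Illustrative sample data (placeholder values).
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()

spark.stop()
```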

As mentioned at the beginning, Spark is written in Scala, and owing to its adoption in industry, an equivalent PySpark API has been released for Python using Py4J.
Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects. Hence, to run PySpark you also need Java installed, along with Python and Apache Spark.
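As a rough illustration of the Py4J bridge, the SparkContext keeps an internal `_jvm` attribute that exposes the gateway to JVM objects. This is an internal detail rather than a public API, shown here only to make the Python-to-JVM hop visible.

```python
# Illustration only: PySpark keeps a Py4J gateway to the JVM.
# The _jvm attribute is an internal PySpark detail, used here purely
# to show that Python calls are forwarded to Java objects.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Py4JDemo").getOrCreate()
jvm = spark.sparkContext._jvm          # Py4J gateway to the JVM
print(jvm.java.lang.System.getProperty("java.version"))
spark.stop()
```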
Additionally, for development you can use the Anaconda distribution (widely used in the machine learning community), which comes with many useful tools such as the Spyder IDE and Jupyter Notebook for running PySpark applications.
Who uses it?
PySpark is widely used in the data science and machine learning community, as many popular data science libraries, including NumPy and TensorFlow, are written in Python. PySpark provides a robust and cost-effective way to run machine learning applications on billions of records across distributed clusters, far faster than a traditional Python application running on a single machine.
PySpark has been used by many organizations, such as Amazon, Walmart, Trivago, Sanofi, Runtastic, and more. It is also used across different sectors:
- Health
- Financials
- Education
- Entertainment
- Utilities
- E-commerce and many more
PySpark modules
- PySpark RDD (pyspark.RDD)
- PySpark DataFrame and SQL (pyspark.sql)
- PySpark Streaming (pyspark.streaming)
- PySpark MLlib (pyspark.ml, pyspark.mllib)
- PySpark GraphFrames (GraphFrames)
- PySpark Resource (pyspark.resource), new in PySpark 3.0
PySpark use case
Batch processing
PySpark RDDs and DataFrames are used to build batch pipelines where high throughput is needed.
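For example, a batch job typically reads an input source, aggregates, and writes the result out. The input/output paths and column names below are placeholders chosen for illustration.

```python
# Sketch of a batch pipeline: read CSV, aggregate, write Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BatchExample").getOrCreate()

# Placeholder path and columns; adjust to your data.
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

daily_totals = (orders
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount")))

daily_totals.write.mode("overwrite").parquet("/data/daily_totals")
spark.stop()
```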
Realtime processing
PySpark Streaming is used for real-time processing.
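Below is a sketch of a DStream word count with pyspark.streaming; it assumes something is writing text lines to a local socket (for example, `nc -lk 9999` in another terminal). For newer applications, Structured Streaming (spark.readStream in pyspark.sql) is the recommended API.

```python
# Sketch of a DStream word count over a socket source.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```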
Machine Learning
PySpark ML and MLlib are used for machine learning.
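As a minimal sketch of pyspark.ml, the following trains a logistic regression model on a tiny, made-up dataset; the feature values and labels are illustrative only.

```python
# Sketch of pyspark.ml: fit logistic regression on a toy DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLExample").getOrCreate()

# Toy training data (placeholder values).
train = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1]), 0.0),
    (Vectors.dense([2.0, 1.0]), 1.0),
    (Vectors.dense([2.2, 1.3]), 1.0),
    (Vectors.dense([0.1, 1.2]), 0.0),
], ["features", "label"])

lr = LogisticRegression(maxIter=10)
model = lr.fit(train)
model.transform(train).select("features", "label", "prediction").show()

spark.stop()
```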
Graph processing
GraphFrames is used for graph processing (the GraphX library is available only in Spark's Scala and Java APIs).
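A short sketch with the external GraphFrames package is shown below. GraphFrames is not bundled with PySpark, and the package coordinates in the comment are only an example; they depend on your Spark and Scala versions.

```python
# Sketch using the external GraphFrames package, typically launched with
# something like:
#   pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
# (example coordinates; pick the build matching your Spark version)
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphExample").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst".
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()           # number of incoming edges per vertex
spark.stop()
```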