What is PySpark and who uses it?

Before we learn what PySpark is, let's first understand what Apache Spark is. In simple words, Apache Spark is an open-source framework written in Scala for processing large datasets in a distributed manner (across a cluster). Spark can run up to 100 times faster than traditional disk-based processing due to its in-memory computation.


What is PySpark?

PySpark is the Python API for Apache Spark, used to process large datasets in a distributed cluster. It lets you write Python applications that use Apache Spark's capabilities.

[Image: What is PySpark. Source: https://databricks.com/]
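
To make this concrete, below is a minimal sketch of a PySpark program (the app name and sample data are illustrative): it creates a SparkSession, the entry point to any PySpark application, builds a small DataFrame, and displays it.

```python
# A minimal sketch of a PySpark application, assuming PySpark is installed
# (pip install pyspark) along with Java.
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to PySpark.
spark = SparkSession.builder \
    .appName("HelloPySpark") \
    .master("local[*]") \
    .getOrCreate()

# Build a small DataFrame and show its contents.
data = [("James", 30), ("Anna", 25), ("Robert", 41)]
df = spark.createDataFrame(data, schema=["name", "age"])
df.show()

spark.stop()
```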

As mentioned at the beginning, Spark is written in Scala, and due to its industry adoption, an equivalent API, PySpark, was released for Python using Py4J.

Py4J is a Java library integrated within PySpark that allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed, along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the machine learning community), which comes with many useful tools such as the Spyder IDE and Jupyter Notebook to run PySpark applications.

Who uses it?

PySpark is widely used in the data science and machine learning community, as many popular data science libraries, including NumPy and TensorFlow, are written in Python. PySpark brings robust and cost-effective ways to run machine learning applications on billions or trillions of records across distributed clusters, up to 100 times faster than traditional Python applications.

PySpark has been used by many organizations like Amazon, Walmart, Trivago, Sanofi, Runtastic, and many more. PySpark is also used in different sectors:

  • Health
  • Finance
  • Education
  • Entertainment
  • Utilities
  • E-commerce and many more

PySpark modules

  • PySpark RDD (pyspark.RDD)
  • PySpark DataFrame and SQL (pyspark.sql)
  • PySpark Streaming (pyspark.streaming)
  • PySpark MLlib (pyspark.ml, pyspark.mllib)
  • PySpark GraphFrames (GraphFrames)
  • PySpark Resource (pyspark.resource), new in PySpark 3.0
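
The sketch below shows where these modules live as imports; everything here ships with the standard PySpark distribution except GraphFrames, which is installed separately (the app name is illustrative):

```python
# A quick tour of the modules listed above.
from pyspark.sql import SparkSession                       # DataFrame and SQL
from pyspark.streaming import StreamingContext             # DStream-based streaming (legacy)
from pyspark.ml.classification import LogisticRegression   # DataFrame-based ML
from pyspark.mllib.stat import Statistics                  # RDD-based ML (legacy)
from pyspark.resource import ResourceProfileBuilder        # resource API (this builder class arrived in PySpark 3.1)

spark = SparkSession.builder.appName("ModulesTour").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])            # pyspark.RDD
print(rdd.count())  # prints 3
```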

PySpark use cases

Batch processing

PySpark RDDs and DataFrames are used to build batch pipelines where you need high throughput.
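
For example, here is a minimal sketch of a batch pipeline that reads a CSV file, aggregates it, and writes the result as Parquet; the file paths and column names are illustrative assumptions:

```python
# A minimal batch pipeline sketch: read CSV, aggregate, write Parquet.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("BatchPipeline").getOrCreate()

# Read the raw batch input (header and schema inference for brevity).
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# Aggregate revenue per customer.
revenue = orders.groupBy("customer_id") \
    .agg(F.sum("amount").alias("total_amount"))

# Persist the result for downstream jobs.
revenue.write.mode("overwrite").parquet("/data/revenue_by_customer")
```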

Real-time processing

PySpark Streaming is used for real-time processing.
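
Below is a minimal sketch using Structured Streaming, the newer DataFrame-based streaming API, to count words arriving on a local TCP socket (the host and port are illustrative; you can feed it locally with `nc -lk 9999`):

```python
# A minimal real-time word count with Structured Streaming.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a stream of lines from a local TCP socket.
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split lines into words and count them continuously.
counts = lines.select(F.explode(F.split(lines.value, " ")).alias("word")) \
    .groupBy("word").count()

# Print each micro-batch result to the console until stopped.
counts.writeStream.outputMode("complete").format("console").start() \
    .awaitTermination()
```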

Machine Learning

PySpark ML and MLlib are used for machine learning.
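
As an example, here is a minimal sketch of training a classifier with pyspark.ml; the toy data and feature columns are illustrative assumptions:

```python
# A minimal sketch of training a model with pyspark.ml.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLExample").getOrCreate()

# Toy training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 3.3, 1.0), (0.2, 0.9, 0.0)],
    ["f1", "f2", "label"],
)

# pyspark.ml models expect features assembled into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression(maxIter=10).fit(assembler.transform(train))

# Score the training set and inspect predictions.
model.transform(assembler.transform(train)).select("label", "prediction").show()
```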

Graph processing

PySpark GraphFrames is used for graph processing (Spark's GraphX API is available only in Scala and Java).
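
Below is a minimal sketch using GraphFrames; since it is a separate package, this assumes you launched PySpark with a matching --packages coordinate (the example coordinate and toy graph are illustrative):

```python
# A minimal graph processing sketch with GraphFrames. Assumes launch like:
#   pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
# (pick the coordinate matching your Spark/Scala version).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphExample").getOrCreate()

# Vertices and edges are ordinary DataFrames with "id" / "src", "dst" columns.
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                            # simple structural metric
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()  # PageRank scores
```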

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
