Install PySpark 3.5 using pip or conda

There are multiple ways to install PySpark depending on your environment and use case. You can install just the PySpark package and connect to an existing cluster, or install the complete Apache Spark distribution (which includes the PySpark package) to set up your own cluster.

In this article, I will cover, step by step, how to install PySpark using pip, Anaconda (the conda command), and manually on Windows and Mac.

Ways to Install

  1. Manually download and install it yourself.
  2. Use Python pip to set up PySpark and connect to an existing cluster.
  3. Use Anaconda to set up PySpark with all its features.

1. Install Python

Regardless of which process you use, you need Python installed to run PySpark. If you already have Python, skip this step. Check whether you have Python by running python --version or python3 --version from the command line.
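
For example, either of the following should print the installed version (which command works depends on how Python was installed on your system):

# Check the installed Python version
python --version
# or, on systems where Python 3 is installed as python3
python3 --version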

On Windows – Download Python from Python.org and install it.

On Mac – Install Python using the below command. If you don’t have Homebrew, install it first by following https://brew.sh/.


# install Python
brew install python

2. Install Java

PySpark runs on Java under the hood, hence you need Java installed on your Windows or Mac machine. Since Java is a third-party dependency, you can install it with Homebrew on Mac, and download and install it manually on Windows. Since Oracle Java is no longer open source, I am using OpenJDK version 11.

On Windows – Download OpenJDK from adoptopenjdk and install it.

On Mac – Run the below command on the terminal to install Java.


# install Java
brew install openjdk@11
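
Optionally, confirm that Java is reachable from your terminal. Note that Homebrew may require an extra PATH step for openjdk@11; it prints the exact instructions after installation.

# Verify the Java installation
java -version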

3. Install PySpark

3.1. Manually Download & Install PySpark

PySpark is a Spark library written in Python that lets you run Python applications using Apache Spark capabilities. Hence, you can install PySpark with all its features by installing Apache Spark.

On the Apache Spark download page, select the link “Download Spark (point 3)” to download. If you want a different version of Spark & Hadoop, select it from the drop-downs; the link at point 3 changes to the selected version and gives you an updated download link.


After the download, untar the binary and copy the extracted folder spark-3.5.0-bin-hadoop3 to /your/home/directory/.

On Windows – untar the binary using 7zip.

On Mac – Run the following command


# Untar the tar file
tar -xzf spark-3.5.0-bin-hadoop3.tgz
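
As a sketch, assuming you downloaded and extracted the archive in your current directory and want to keep Spark under /your/home/directory/ (the placeholder path used throughout this article), the extracted folder can be moved like this:

# Move the extracted folder to your chosen location (path is a placeholder)
mv spark-3.5.0-bin-hadoop3 /your/home/directory/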

Now set the following environment variables.

On Windows – set the following environment variables. If you are not sure, Google it.


SPARK_HOME  = c:\your\home\directory\spark-3.5.0-bin-hadoop3
HADOOP_HOME = c:\your\home\directory\spark-3.5.0-bin-hadoop3
PATH = %PATH%;%SPARK_HOME%\bin
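
If you prefer the command line over the Environment Variables dialog, SPARK_HOME and HADOOP_HOME can also be set with setx (a sketch; setx writes user-level variables that only take effect in newly opened command prompts, and PATH is usually safer to edit through the dialog):

# Set user-level variables from a Command Prompt (open a new prompt afterwards)
setx SPARK_HOME "c:\your\home\directory\spark-3.5.0-bin-hadoop3"
setx HADOOP_HOME "c:\your\home\directory\spark-3.5.0-bin-hadoop3"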

On Mac – Depending on your shell, open the .bash_profile, .bashrc, or .zshrc file and add the following lines. After adding them, re-open the terminal session.


export SPARK_HOME=/your/home/directory/spark-3.5.0-bin-hadoop3
export HADOOP_HOME=/your/home/directory/spark-3.5.0-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
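
Alternatively, to apply the changes without opening a new terminal, reload the profile you edited, for example (assuming zsh):

# Reload the shell profile (use .bash_profile or .bashrc if that is what you edited)
source ~/.zshrc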

The following step is required only for Windows. Download the winutils.exe file from the winutils repository and copy it to the %SPARK_HOME%\bin folder. Winutils binaries are different for each Hadoop version, hence download the right version for yours.

This completes installing Apache Spark to run PySpark on Windows or Mac.

3.2. PySpark Install Using pip

Alternatively, you can install just the PySpark package by using pip, the Python package installer.

Note that using Python pip you can install only the PySpark package, which is useful for testing your jobs locally or running them on an existing cluster managed by Yarn, Standalone, or Mesos. It does not contain the features/libraries needed to set up your own cluster. If you want PySpark with all its features, including starting your own cluster, install it from Anaconda or by using the approach above.

Install pip on Mac & Windows – Follow the instructions at https://pip.pypa.io/en/stable/installing/ to install pip.

For Python users, PySpark provides pip installation from PyPI. Python pip is a package manager used to install and uninstall third-party packages that are not part of the Python standard library. Using pip, you can install/uninstall/upgrade/downgrade any Python library that is part of the Python Package Index.

If you already have pip installed, upgrade pip to the latest version before installing PySpark.
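
For example (on some systems the command is pip3):

# Upgrade pip to the latest version
pip install --upgrade pip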


# Install pyspark from pip
pip install pyspark

This pip command collects the PySpark package and installs it; you will see the download and installation progress on the console.


As mentioned earlier, this does not contain all the features of Apache Spark, hence you cannot set up your own cluster with it, but you can use it to connect to an existing cluster to run jobs or to run jobs locally.
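
To confirm what was installed, you can ask pip for the package details (assuming pip installed it into the Python environment you are currently using):

# Show the installed PySpark package and its version
pip show pyspark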

3.3 Using Anaconda

Follow the article Install PySpark using Anaconda & run Jupyter notebook.
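
As a quick sketch of the conda route, assuming Anaconda or Miniconda is already installed, PySpark is available from the conda-forge channel; the environment name pyspark_env below is just an example:

# Create an environment with PySpark from conda-forge and activate it
conda create -n pyspark_env -c conda-forge pyspark
conda activate pyspark_env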

4. Test PySpark Install from Shell

Regardless of which method you used, once PySpark is successfully installed, launch the PySpark shell by entering pyspark from the command line. The PySpark shell is a REPL that is used to test and learn PySpark statements.
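
For example, launching the shell and running a one-line statement is enough to confirm the install (the shell pre-creates a SparkSession named spark for you):

# Launch the PySpark shell
pyspark

# Inside the shell, run a simple statement using the pre-created SparkSession
>>> spark.range(5).show()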


To submit a job to the cluster, use the spark-submit command that comes with the install.
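
A minimal sketch of a submit command is shown below; my_app.py and the yarn master are placeholders for your own script and cluster manager:

# Submit a PySpark application (script name and master are placeholders)
spark-submit --master yarn my_app.py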

If you encounter any issues setting up PySpark on Mac or Windows following the above steps, please comment. I will be happy to help you and correct the steps.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium