  • Post category: PySpark
  • Post last modified: March 27, 2024

This article walks you through installing PySpark on an Ubuntu server; the same instructions apply to other Linux distributions such as CentOS and Debian. Since most production PySpark applications run on Linux-based operating systems, it is essential to know how to install and run PySpark applications on Unix-like systems such as an Ubuntu server.

I used the following steps to set up my PySpark cluster on an Ubuntu server.

Prerequisites:

  • A running Ubuntu server
  • Root (or sudo) access to the Ubuntu server
  • If you want to run PySpark on Hadoop & YARN, install and set up a Hadoop cluster and YARN before proceeding with this article.
  • If you just want to install PySpark and run it in standalone mode, proceed with this article.

1. Java Install On Ubuntu

Java or the Java Development Kit (JDK) is required to run PySpark because PySpark is a Python library that provides an interface to the Apache Spark platform, which is primarily implemented in Scala and runs on the Java Virtual Machine (JVM).

Since Oracle Java requires a license, I am using OpenJDK here. If you want to use Java from Oracle or another vendor, feel free to do so. I will be using JDK 8.


# Install JDK
sudo apt-get -y install openjdk-8-jdk-headless

After the JDK install completes, check whether it installed successfully by running java -version.


2. Python Install On Ubuntu

Python Installation is needed to run PySpark examples (Spark with Python) on the Ubuntu server. Most Ubuntu systems come with Python pre-installed. You can check your Python version using python --version or python3 --version. If it’s not installed, you can install Python 3 with the following command.


# Install Python3
sudo apt install python3
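After installing, it is worth confirming that the interpreter meets PySpark's minimum version; for Spark 3.5.x that is Python 3.8 or newer (adjust the floor for your Spark release). A quick check:

```shell
# Confirm Python 3 is on PATH and meets PySpark 3.5's minimum (3.8+)
python3 --version
python3 -c 'import sys; assert sys.version_info >= (3, 8), sys.version'
```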

3. Install PySpark on Linux Ubuntu

PySpark relies on Apache Spark, which you can download from the official Apache Spark website or install with a package manager. I recommend downloading the Spark package from the Apache Spark website to get the latest version. To install PySpark on Ubuntu, open the Apache Spark download site, go to the "Download Apache Spark" section, and click the link at point 3. This takes you to a page listing mirror URLs; copy the link from one of the mirror sites.


If you want a different version of Spark & Hadoop, select it from the drop-downs (points 1 and 2); the link at point 3 changes to the selected version and gives you an updated download link.

Use the wget command to download Apache Spark to your Ubuntu server.


# Download Apache Spark
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

This downloads the Spark tar archive. Now extract its contents using the tar command (tar is a file archiving tool). Once the extraction completes, rename the folder to spark.


# Untar the downloaded file
tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Rename folder to spark
mv spark-3.5.0-bin-hadoop3 spark
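If the tar flags are unfamiliar: -x extracts, -z filters through gzip, and -f names the archive file. A throwaway demonstration of the same pattern in a scratch directory (nothing Spark-specific):

```shell
# Create, extract, and inspect a small gzip-compressed tar archive
mkdir -p scratch/demo-src
echo "hello" > scratch/demo-src/file.txt
tar -czf scratch/demo.tgz -C scratch demo-src   # -c create, -z gzip, -f file
rm -rf scratch/demo-src                          # remove the original
tar -xzf scratch/demo.tgz -C scratch             # -x extract it back
cat scratch/demo-src/file.txt
```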

Alternatively, you can install PySpark with pip; the pip package bundles Spark itself, so no separate download is required.


# Install PySpark using pip
pip install pyspark

4. PySpark Environment Variables

To make PySpark accessible from the command line, add the following lines to your ~/.bashrc or ~/.zshrc file. I am using the .bashrc file, so I am adding the lines there.


sparkuser@sparknode:~$ vi ~/.bashrc

# Add below lines at the end of the .bashrc file.
export SPARK_HOME=/home/sparkuser/spark
export PATH=$PATH:$SPARK_HOME/bin
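The same edit can be scripted. This sketch appends the exports only if they are not already present; it assumes the spark folder from step 3 sits in your home directory, so adjust the path if yours differs:

```shell
# Append Spark environment variables to ~/.bashrc, skipping if already added
if ! grep -q "SPARK_HOME" ~/.bashrc 2>/dev/null; then
  {
    echo "export SPARK_HOME=$HOME/spark"
    echo 'export PATH=$PATH:$SPARK_HOME/bin'
  } >> ~/.bashrc
fi
```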

Now load the environment variables into the current session by running the command below.


# Source the bashrc file to reload
sparkuser@sparknode:~$ source ~/.bashrc

If you added them to the .profile file instead, restart your session by logging out and back in.

5. Validate PySpark Installation on Ubuntu

With this, the Apache Spark installation on Linux Ubuntu is complete. Now let's run a sample example that comes with the Spark binary distribution.

Here I will use spark-submit to run an example Python application that estimates the value of Pi; the argument 10 is the number of partitions to use for the computation. You can find spark-submit in the $SPARK_HOME/bin directory.


# Run spark example
cd $SPARK_HOME
./bin/spark-submit examples/src/main/python/pi.py 10

6. PySpark Shell

The Apache Spark binary comes with an interactive PySpark shell. To start a shell and use the Python language, go to your $SPARK_HOME/bin directory and type "pyspark". This command loads Spark and displays which version of Spark you are using.


To run PySpark interactively, open the pyspark shell by running $SPARK_HOME/bin/pyspark. Make sure you have Python installed before starting the shell.

By default, the PySpark shell provides spark (SparkSession) and sc (SparkContext) objects ready to use. For example, spark.version returns the Spark version, and spark.range(10).count() runs a small distributed job.

The PySpark shell also starts a Spark context web UI, which by default is accessible at http://ip-address:4040.

7. Spark Web UI

Apache Spark ships with a suite of web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and Spark configurations on the Linux Ubuntu server. On the Spark Web UI, you can see how Spark actions and transformation operations are executed. You can access it by opening http://ip-address:4040/, replacing ip-address with your server's IP.


8. Spark History server

The Spark History Server keeps a log of all completed applications you submit via spark-submit or the pyspark shell.

Create the $SPARK_HOME/conf/spark-defaults.conf file and add the configurations below.


# Enable to store the event log
spark.eventLog.enabled true

# Location where to store event log
spark.eventLog.dir file:///tmp/spark-events

# Location from where history server to read event log
spark.history.fs.logDirectory file:///tmp/spark-events
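The three settings above can be written in one step with a heredoc; the paths are the article's examples, so point SPARK_HOME and the event-log directory at your own locations:

```shell
# Write spark-defaults.conf with the event-log settings in one step
SPARK_CONF_DIR="${SPARK_HOME:-$HOME/spark}/conf"
mkdir -p "$SPARK_CONF_DIR"
cat > "$SPARK_CONF_DIR/spark-defaults.conf" <<'EOF'
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.fs.logDirectory file:///tmp/spark-events
EOF
```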

Create the event log directory; Spark writes logs there for all applications you submit.


sparkuser@sparknode:~$ mkdir /tmp/spark-events

Run $SPARK_HOME/sbin/start-history-server.sh to start the history server.


sparkuser@sparknode:~$ $SPARK_HOME/sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/sparkuser/spark/logs/spark-sparkuser-org.apache.spark.deploy.history.HistoryServer-1-sparknode.out

With this configuration, the history server runs on port 18080 by default.


Run the Pi example again with spark-submit and refresh the History Server page, which should show the recent run.

Conclusion

You've now successfully installed PySpark on your Ubuntu system and can start using it for distributed data processing and analysis. In summary, you have learned the steps to install PySpark on a Linux-based Ubuntu server, how to start the History Server, and how to access the web UIs.


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium