How do you install Apache Spark on an Ubuntu server? In this article, I will guide you step by step through installing Apache Spark on Ubuntu; the same steps work on other Linux distributions such as CentOS and Debian. In production, Spark applications almost always run on Linux, so it is worth knowing how to install and run Spark on a Linux server such as Ubuntu.
I followed the steps below to set up my own Apache Spark cluster on an Ubuntu server.
Prerequisites:
- A running Ubuntu server
- Root (sudo) access to the Ubuntu server
- If you want to run Apache Spark on Hadoop and YARN, install and set up a Hadoop cluster with YARN before proceeding with this article.
If you just want to run Spark in standalone mode, proceed with this article.
1. Java Install On Ubuntu
Apache Spark is written in Scala, which runs on the Java Virtual Machine (JVM), so you need Java installed to run Spark. Because Oracle Java comes with licensing restrictions, I am using OpenJDK here; if you prefer Java from Oracle or another vendor, feel free to use it. I will be using JDK 8.
# Install JDK
sudo apt-get -y install openjdk-8-jdk-headless
After installing the JDK, check that it installed successfully by running java -version.
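If you want to check the Java version from a script rather than by eye, the sketch below may help. It assumes the version string has one of the two usual shapes ("1.8.0_392" for Java 8, "17.0.9" for Java 9 and later); the function names java_major_version and installed_java_version are my own and are not part of Java or Spark.

```shell
# Extract the major Java version from a version string.
# Assumes "1.8.0_392" (legacy scheme) or "17.0.9" (modern scheme).
java_major_version() {
  local version="$1"
  local first="${version%%.*}"
  if [ "$first" = "1" ]; then
    # Legacy scheme: 1.8.0_392 -> 8
    local rest="${version#1.}"
    echo "${rest%%.*}"
  else
    # Modern scheme: 17.0.9 -> 17
    echo "$first"
  fi
}

# Read the version string reported by the installed JVM.
# `java -version` writes to stderr; the version is the quoted token on line 1.
installed_java_version() {
  java -version 2>&1 | head -n 1 | sed -E 's/.*"([^"]+)".*/\1/'
}

# Example (not run automatically):
#   java_major_version "$(installed_java_version)"
```

With the JDK 8 package installed above, the example call should report the major version 8.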
2. Python Install On Ubuntu
You can skip this section if you only want to run Spark with Scala or Java on the Ubuntu server.
Python is needed only if you want to run PySpark examples (Spark with Python).
# Install Python3
sudo apt install python3
3. Install Apache Spark on Linux Ubuntu
To install Apache Spark on Ubuntu, open the Apache Spark download site, go to the Download Apache Spark section, and click the link in point 3. This takes you to a page with mirror URLs; copy the link from one of the mirror sites.
If you want a different version of Spark or Hadoop, select it from the drop-downs in points 1 and 2; the link in point 3 then updates to the selected version.
Use the wget command to download Apache Spark to your Ubuntu server.
# Download Apache Spark
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
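Before extracting the archive, it can be worth checking that the download is intact. Apache publishes a SHA-512 checksum alongside each release file. The sketch below assumes you have copied the expected hash by hand from that checksum file; verify_sha512 is a hypothetical helper name, not a Spark or Apache tool.

```shell
# Compare a file's SHA-512 digest against an expected hex string.
# Returns 0 on match, 1 on mismatch.
verify_sha512() {
  local file="$1"
  local expected
  local actual
  # Normalise to lower case, since published hashes may use upper-case hex.
  expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  actual="$(sha512sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}

# Example (not run automatically; paste the real hash in place of the placeholder):
#   verify_sha512 spark-3.5.0-bin-hadoop3.tgz "<hash from the .sha512 file>"
```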
Once the download is complete, extract the archive with the tar command (tar is a file archiving tool). Then rename the extracted folder to spark.
# Untar the downloaded file
tar -xzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 spark
4. Spark Environment Variables
Add the Apache Spark environment variables to your .bashrc or .profile file. Open the file in the vi editor and add the variables below.
sparkuser@sparknode:~$ vi ~/.bashrc
# Add below lines at the end of the .bashrc file.
export SPARK_HOME=/home/sparkuser/spark
export PATH=$PATH:$SPARK_HOME/bin
Now load the environment variables into the current session by running the command below.
# Source the bashrc file to reload
sparkuser@sparknode:~$ source ~/.bashrc
If you added the variables to the .profile file instead, restart your session by logging out and back in.
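If you script this step instead of editing the file in vi, re-running the script can leave duplicate export lines behind. One possible way to avoid that is sketched below; append_once is a helper name I made up, and the paths are the ones used in this article.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
  local file="$1"
  local line="$2"
  touch "$file"
  # -F: fixed string, -x: match the whole line, -q: quiet
  if ! grep -Fxq -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Example (not run automatically):
#   append_once ~/.bashrc 'export SPARK_HOME=/home/sparkuser/spark'
#   append_once ~/.bashrc 'export PATH=$PATH:$SPARK_HOME/bin'
#   source ~/.bashrc
```

The single quotes in the example keep $PATH and $SPARK_HOME unexpanded, so they are written to the file literally and resolved each time a shell starts.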
5. Test Apache Spark Install on Ubuntu
With this, the Apache Spark installation on Ubuntu is complete. Now let's run one of the examples that ships with the Spark binary distribution.
Here I use the spark-submit command to run the org.apache.spark.examples.SparkPi example, which estimates the value of Pi. The trailing argument 10 is the number of partitions (slices) the work is split into, not the number of decimal places. You can find spark-submit in the $SPARK_HOME/bin directory.
# Run spark example
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 10
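The jar file name above embeds both the Scala version (2.12) and the Spark version (3.5.0), so it changes if you download a different release. As a rough sketch, you could look the jar up instead of hard-coding it; find_examples_jar is a hypothetical helper, and it assumes the standard layout of the binary distribution (examples/jars under the Spark home).

```shell
# Print the path of the bundled examples jar under a Spark home directory.
find_examples_jar() {
  local spark_home="$1"
  local jar
  for jar in "$spark_home"/examples/jars/spark-examples_*.jar; do
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no spark-examples jar under $spark_home/examples/jars" >&2
  return 1
}

# Example (not run automatically):
#   spark-submit --master "local[2]" \
#     --class org.apache.spark.examples.SparkPi \
#     "$(find_examples_jar "$SPARK_HOME")" 10
```

The --master "local[2]" option in the example runs the job locally with two worker threads, which is enough for a quick test on a single server.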
6. Spark Shell
The Apache Spark binary distribution comes with an interactive spark-shell. To start a shell for the Scala language, go to your $SPARK_HOME/bin directory and type "spark-shell". This command loads Spark and displays the version of Spark you are using.
Note: spark-shell runs Spark with Scala only. To run PySpark, open the pyspark shell by running $SPARK_HOME/bin/pyspark. Make sure you have Python installed before running the pyspark shell.
By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects for you to use.
spark-shell also starts a Spark context web UI, which by default you can access at http://ip-address:4040.
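To try the predefined spark and sc objects without typing into the shell, one option is to put a few Scala statements in a file and feed it to spark-shell on standard input. This is only a sketch: write_shell_demo and the file name /tmp/demo.scala are my own choices, and the Scala lines are a minimal illustration rather than a recommended workflow.

```shell
# Write a few Scala statements to the given file.
write_shell_demo() {
  cat > "$1" <<'EOF'
// `spark` (SparkSession) and `sc` (SparkContext) are predefined by spark-shell.
println(spark.version)
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())
val df = spark.range(1, 6).toDF("n")
df.show()
EOF
}

# Example (not run automatically; needs Spark installed and on the PATH):
#   write_shell_demo /tmp/demo.scala
#   spark-shell < /tmp/demo.scala
```

Reading the statements from standard input means the shell should exit once it reaches the end of the file instead of waiting for more input.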
7. Spark Web UI
Apache Spark provides a suite of web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and the Spark configuration. On the Spark web UI, you can see how Spark actions and transformations are executed. You can access it at http://ip-address:4040/ while an application is running; replace ip-address with your server IP.
8. Spark History server
The Spark History Server keeps a log of all completed Spark applications you submit with spark-submit or spark-shell.
Create the $SPARK_HOME/conf/spark-defaults.conf file and add the configurations below.
# Enable to store the event log
spark.eventLog.enabled true
# Location where to store event log
spark.eventLog.dir file:///tmp/spark-events
# Location from where history server to read event log
spark.history.fs.logDirectory file:///tmp/spark-events
Create the Spark event log directory. Spark writes the event logs of every application you submit to this directory.
sparkuser@sparknode:~$ mkdir /tmp/spark-events
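The configuration and directory steps above can also be done in one go from a script. The sketch below assumes an absolute path for the log directory (so that file://$log_dir yields a valid file:/// URI); configure_event_log is a hypothetical helper name.

```shell
# Create the event log directory and append the history server settings
# to spark-defaults.conf. log_dir must be an absolute path.
configure_event_log() {
  local spark_home="$1"
  local log_dir="$2"
  mkdir -p "$log_dir" "$spark_home/conf"
  cat >> "$spark_home/conf/spark-defaults.conf" <<EOF
spark.eventLog.enabled           true
spark.eventLog.dir               file://$log_dir
spark.history.fs.logDirectory    file://$log_dir
EOF
}

# Example (not run automatically):
#   configure_event_log "$SPARK_HOME" /tmp/spark-events
```

Note that this appends to the file each time it runs, so call it once or check the file afterwards for repeated entries.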
Run $SPARK_HOME/sbin/start-history-server.sh to start the history server.
sparkuser@sparknode:~$ $SPARK_HOME/sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/sparkuser/spark/logs/spark-sparkuser-org.apache.spark.deploy.history.HistoryServer-1-sparknode.out
By default, the history server runs on port 18080; you can access it at http://ip-address:18080.
Run the Pi example again with the spark-submit command, then refresh the History Server page, which should now show the recent run.
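Besides the web page, the history server exposes a REST API under /api/v1, which is handy if you want to confirm the run from the command line. The sketch below only builds the URL; history_apps_url is a made-up helper, and the host and port defaults are assumptions matching this article's setup.

```shell
# Build the REST endpoint that lists applications known to the history server.
history_apps_url() {
  local host="${1:-localhost}"
  local port="${2:-18080}"
  echo "http://${host}:${port}/api/v1/applications"
}

# Example (not run automatically; needs curl and a running history server):
#   curl -s "$(history_apps_url localhost 18080)"
```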
Conclusion
In summary, you have learned the steps to install Apache Spark on an Ubuntu server, and also how to start the History Server and access the web UI.