Spark Step-by-Step Setup on Hadoop Yarn Cluster

This post explains how to set up Apache Spark and run Spark applications on a Hadoop cluster with YARN as the cluster manager. The examples use yarn as the master with client deploy mode; you can also try running the Spark application in cluster mode.

Prerequisites:

If you don’t have Hadoop and YARN installed, please install and set up a Hadoop cluster and set up YARN on the cluster before proceeding with this article.

Spark Install and Setup

To install and set up Apache Spark on a Hadoop cluster, go to the Apache Spark Download site, find the Download Apache Spark section, and click the link at point 3. This takes you to a page with mirror URLs; copy the link from one of the mirror sites.

If you want to use a different version of Spark and Hadoop, select it from the drop-downs (points 1 and 2); the link at point 3 changes to the selected version and provides an updated download link.

1. Download the latest Apache Spark version.


wget http://apache.claz.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
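
Mirrors typically keep only the most recent releases, so if the link above no longer serves this version, the same file can be pulled from the Apache release archive (URL shown for 2.4.0 as an example):

wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz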

2. Once the download is complete, extract the archive using tar and rename the extracted folder to spark.


tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 spark
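
As a quick check that the extraction worked, list the new directory; a Spark binary distribution contains bin, conf, examples, jars, and sbin folders, among others.

ls spark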

3. Add the Spark environment variables to your .bashrc or .profile file. Open the file in the vi editor and add the variables below.


vi ~/.bashrc

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/home/ubuntu/spark
export PATH=$PATH:$SPARK_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

Now load the environment variables into the current session by running the command below.


source ~/.bashrc

If you added them to the .profile file, restart your session by logging out and logging in again.
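
To confirm the variables took effect, a quick sanity check from the same session:

echo $SPARK_HOME          # should print /home/ubuntu/spark
spark-submit --version    # should print the Spark 2.4.0 version banner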

4. Finally, edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn.


spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
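
These entries are only defaults; any of them can be overridden per application on the spark-submit command line. For example, to give a single job more memory than the defaults above (values are illustrative):

spark-submit --driver-memory 1g --executor-memory 1g --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10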

With this, the Spark setup on YARN is complete. Now let’s run a sample job that ships with the Spark binary distribution.

5. Run a sample Spark job


spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
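
As mentioned at the beginning, you can also try the same example in cluster mode, where the driver runs inside the YARN application master rather than on the machine you submit from:

spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10

In cluster mode the "Pi is roughly ..." output goes to the driver's container log instead of your console; retrieve it with yarn logs -applicationId <application id> after the job finishes.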

Spark History Server

1. Configure the history server

Edit the $SPARK_HOME/conf/spark-defaults.conf file and add the properties below.


spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://namenode:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
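
Before starting the history server, make sure the event log directory configured above actually exists on HDFS (one of the commenters below points this out as well); otherwise applications will fail when writing their event logs:

hadoop fs -mkdir -p /spark-logs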

2. Run the history server


$SPARK_HOME/sbin/start-history-server.sh

As per the configuration, the history server runs on port 18080.
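
To verify it is up, check for the HistoryServer process and then open the web UI, replacing the host name with the machine where you started it:

jps | grep HistoryServer
# UI: http://<history-server-host>:18080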

3. Run the Spark job again and access the History Server UI to check the logs and status of the job.


Conclusion

In this article, you learned how to set up Apache Spark on a Hadoop YARN cluster, run the sample SparkPi example, and run the History Server to monitor the application.


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium


This Post Has 5 Comments

  1. Gee

    Thanks for the article, very helpful. Just a small comment: you are saying at the very beginning of this article that the deployment mode is Cluster, but you are using Client mode in the spark-submit command.

    1. NNK

      Thanks for pointing it out. I have corrected it now.

  2. Anonymous

    how to share data on executors

  3. Nonjob

    also

    hadoop fs -mkdir /spark-logs