Spark History Server to Monitor Applications

Post author: Naveen Nelamali
Post category: Apache Spark / PySpark
Post last modified: October 5, 2023

The Spark History Server is a web UI for monitoring the metrics and performance of completed Spark applications. In this article, I will explain what the History Server is, how to enable event logging so it has data to show, how to start the server, and finally how to access and navigate the interface.

1. What is Spark History Server?

When you submit a Spark application, a SparkContext is created, and it serves the Spark Web UI (on port 4040 by default) so you can monitor the application while it runs. Monitoring includes the following.

Spark configurations used
Spark jobs, stages, and task details
DAG execution
Driver and executor resource utilization
Application logs, and more

When your application finishes processing, the SparkContext is terminated, and the Web UI goes away with it. If you want to review the monitoring data of an application that has already finished, you cannot do it from this UI.

This is where the Spark History Server comes into the picture. It keeps the history (event logs) of all completed applications along with their runtime information, which lets you review the metrics at a later time.

Historical metrics are very helpful when you are trying to improve the performance of an application, because you can compare the metrics of previous runs with those of the latest run.

The Spark History Server can keep the event log history for the following:

Applications submitted via spark-submit
Applications submitted via the REST API
Every spark-shell session you run
Every pyspark shell session you run
Applications submitted via notebooks

2. History Server Configurations

To store event logs for all submitted applications, Spark first needs to collect the information while the applications are running. By default, Spark doesn't collect event log information. You can enable it by setting the configs below in spark-defaults.conf.

Enable event logging by setting spark.eventLog.enabled to true.
Specify where applications write event logs with spark.eventLog.dir, and where the History Server reads them with spark.history.fs.logDirectory. Both default to file:///tmp/spark-events. You need to create the directory in advance.


# Enable event logging
spark.eventLog.enabled true

# Location where applications write event logs
spark.eventLog.dir file:///user/spark/spark-events

# Location from which the History Server reads event logs
spark.history.fs.logDirectory file:///user/spark/spark-events
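
If you launch an application from code rather than through spark-submit, the same settings can be passed when the session is built. The following is only a minimal sketch, assuming PySpark is installed; event_log_conf and build_session are hypothetical helper names, not part of Spark.

```python
from pathlib import Path


def event_log_conf(log_dir: Path) -> dict:
    """Return Spark settings that turn on event logging into log_dir.

    The directory is created first, because Spark expects it to exist.
    """
    log_dir.mkdir(parents=True, exist_ok=True)
    return {
        "spark.eventLog.enabled": "true",
        # A local path becomes a URI such as file:///tmp/spark-events
        "spark.eventLog.dir": log_dir.resolve().as_uri(),
    }


def build_session(app_name: str, log_dir: Path):
    """Create a local SparkSession with event logging enabled (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in event_log_conf(log_dir).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session("history-demo", Path("/tmp/spark-events")), running a job, and then calling spark.stop() should leave an event log behind, provided spark.history.fs.logDirectory on the History Server side points at the same directory.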

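Spark keeps a history of every application you run by writing a separate event log for each application under this directory. In recent Spark versions this is a single file per application by default, or a sub-directory per application when rolling event logs are enabled.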
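
To get a feel for what Spark records, note that an uncompressed event log is a text file with one JSON object per line, and each object's "Event" field names the listener event. As a rough sketch under that assumption, the hypothetical helper count_events below tallies the event types in a single log file; it will not work on compressed logs.

```python
import json
from collections import Counter
from pathlib import Path


def count_events(event_log: Path) -> Counter:
    """Count how often each event type appears in an uncompressed event log."""
    counts = Counter()
    with event_log.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            counts[event.get("Event", "<unknown>")] += 1
    return counts
```

Pass it the path of one application's log file inside the event log directory, and print the result to see which events dominate.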

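You can also point the location at an HDFS directory, which suits a cluster because every driver and the History Server can reach it. Set spark.history.fs.logDirectory to the same location so the History Server reads what the applications write.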


spark.eventLog.dir hdfs://namenode_host:namenode_port/user/spark/spark-events

3. Start the Spark History Server

Now, start the History Server on Linux or macOS by running the following command.


$SPARK_HOME/sbin/start-history-server.sh

If you are running Spark on Windows, you can start the History Server by running the command below.


%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
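
Once the server has started, it should be listening on its UI port, which is 18080 unless spark.history.ui.port says otherwise. A quick way to confirm this from a script is a plain TCP connection attempt. The sketch below uses only the Python standard library, and is_listening is a hypothetical helper name.

```python
import socket


def is_listening(host: str = "localhost", port: int = 18080, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable
        return False
```

A True result only shows that some process accepts connections on that port, so it is a sanity check rather than proof that the History Server is healthy.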

4. Monitor the Spark Application

By default, the History Server listens on port 18080, and you can access it from a browser at http://localhost:18080/

Clicking an App ID shows that application's job, stage, task, executor, and environment details.
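
The same information is also exposed as JSON through the History Server's REST API under /api/v1, which is handy when you want to compare runs from a script. The sketch below assumes a server at http://localhost:18080 and uses only the Python standard library; applications_url, summarize, and list_applications are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def applications_url(base_url: str = "http://localhost:18080", status: str = "") -> str:
    """Build the History Server REST URL that lists applications."""
    url = base_url.rstrip("/") + "/api/v1/applications"
    if status:  # "completed" or "running"
        url += "?" + urlencode({"status": status})
    return url


def summarize(applications: list) -> list:
    """Reduce the REST response to (id, name, number of attempts) tuples."""
    return [
        (app.get("id"), app.get("name"), len(app.get("attempts", [])))
        for app in applications
    ]


def list_applications(base_url: str = "http://localhost:18080", status: str = "") -> list:
    """Fetch and summarize applications; needs a running History Server."""
    with urlopen(applications_url(base_url, status), timeout=10) as response:
        return summarize(json.load(response))
```

With the server running, list_applications(status="completed") should return one tuple per finished application, and each id matches the App ID shown in the UI.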

5. Stop the Spark History Server

You can stop the history server by running the below command.


$SPARK_HOME/sbin/stop-history-server.sh

Conclusion

Using the History Server, you can keep track of all completed applications. Remember that you need to enable event logging for the history to be kept. These metrics come in handy when you are doing performance tuning.

Happy Learning !!

Reference

https://spark.apache.org/docs/latest/monitoring.html

This Post Has One Comment

Amara December 27, 2021

Hi,

Very nice articles you are writing. Thank you so much for your efforts; your content is different than other blogs. Keep it up. I have a question: how do I enable the history server for an application I am running in IntelliJ? I tried running with local[*], but it is not registered in the history server. Please help.