The Spark History Server is a user interface for monitoring the metrics and performance of completed Spark applications. In this article, I will explain what the History Server is, how to enable it to collect the event log, how to start the server, and finally how to access and navigate the interface.
1. What is Spark History Server?
When you submit a Spark application, a SparkContext is created, which provides the Spark Web UI for monitoring the execution of the application. Monitoring includes the following.
- Spark configurations used
- Spark Jobs, stages, and tasks details
- DAG execution
- Driver and Executor resource utilization
- Application logs and many more
When your application is done processing, the SparkContext is terminated and the Web UI goes away with it. If you then want to review the monitoring data of an application that has already finished, you cannot.
This is where the Spark History Server comes into the picture. It keeps the history (event logs) of all completed applications along with their runtime information, which allows you to review metrics and inspect an application later in time.
History metrics are very helpful when you are trying to improve the performance of an application, because you can compare the metrics of previous runs with those of the latest run.
The Spark History Server can keep the event log history for the following:
- Applications submitted via spark-submit
- Applications submitted via the REST API
- Every spark-shell session you run
- Every pyspark shell session you run
- Applications submitted via notebooks
2. History Server Configurations
In order to store event logs for all submitted applications, Spark first needs to collect the information while the applications are running. By default, Spark doesn't collect event log information. You can enable this by setting the configs below in spark-defaults.conf.
- Enable event logging by setting spark.eventLog.enabled to true.
- Specify where to store the event log history using spark.eventLog.dir, and where the History Server reads it from using spark.history.fs.logDirectory. By default the location is file:///tmp/spark-events. You need to create the directory in advance.
# Enable storing the event log
spark.eventLog.enabled true
# Location where applications write the event log
spark.eventLog.dir file:///user/spark/spark-events
# Location from which the History Server reads the event log
spark.history.fs.logDirectory file:///user/spark/spark-events
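Since the event log directory has to exist before applications start writing to it, it can help to create it from a small script. The following is a minimal sketch using only the Python standard library; the function names `local_path_from_uri` and `ensure_event_log_dir` are hypothetical helpers written for this article, not part of Spark, and the sketch assumes a local `file://` location only.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_path_from_uri(uri: str) -> Path:
    """Convert a file:// URI such as file:///tmp/spark-events to a local path."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        # Assumption: this helper only handles local directories, not HDFS or S3.
        raise ValueError(f"only file:// URIs are handled here, got: {uri}")
    return Path(url2pathname(parsed.path))


def ensure_event_log_dir(uri: str) -> Path:
    """Create the event log directory (and any missing parents) if it is absent."""
    path = local_path_from_uri(uri)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `ensure_event_log_dir("file:///user/spark/spark-events")` would create the directory used in the configuration above, provided the current user is allowed to write there.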
Spark keeps a history of every application you run by writing a separate event log for each application (a file or a sub-directory, depending on the Spark version and settings) under this location.
You can also set the location to an HDFS directory. In that case, spark.history.fs.logDirectory must point to the same directory so that the History Server can read the event logs.
spark.eventLog.dir hdfs://namenode_host:namenode_port/user/spark/spark-events
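A common mistake is to write event logs to one location while the History Server reads from another. The sketch below checks a spark-defaults.conf style text for that mismatch. It is a simplified illustration: the parser only understands whitespace-separated `key value` lines and `#` comments, the function names are hypothetical, and the default location is the one stated above.

```python
# Assumption: default location as described in this article.
DEFAULT_EVENT_DIR = "file:///tmp/spark-events"


def parse_spark_defaults(text: str) -> dict:
    """Parse 'key value' lines; blank lines and lines starting with '#' are skipped."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def check_history_conf(conf: dict) -> list:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    if conf.get("spark.eventLog.enabled", "false").lower() != "true":
        problems.append("spark.eventLog.enabled is not set to true")
    write_dir = conf.get("spark.eventLog.dir", DEFAULT_EVENT_DIR)
    read_dir = conf.get("spark.history.fs.logDirectory", DEFAULT_EVENT_DIR)
    # Ignore a trailing slash when comparing the two locations.
    if write_dir.rstrip("/") != read_dir.rstrip("/"):
        problems.append(
            f"spark.eventLog.dir ({write_dir}) differs from "
            f"spark.history.fs.logDirectory ({read_dir})"
        )
    return problems
```

Passing the contents of your spark-defaults.conf through `check_history_conf(parse_spark_defaults(text))` and printing the result is enough to spot a mismatch before starting the server.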
3. Start the Spark History Server
Now, start the History Server on Linux or Mac by running:
$SPARK_HOME/sbin/start-history-server.sh
If you are running Spark on Windows, you can start the History Server by running the command below.
%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
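After launching the server, it can take a few seconds before it accepts connections. The following sketch waits until a TCP port is reachable; it assumes the History Server runs on the same machine and uses its default port 18080. `wait_for_port` is a hypothetical helper written for this article, not a Spark utility.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on host:port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 18080)` right after the start command tells you whether the server came up. A successful connection only shows that something is listening on the port, not that the UI has finished loading the event logs.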
4. Monitor the Spark Application
By default, the History Server listens on port 18080, and you can access it from a browser at http://localhost:18080/
Clicking on an App ID shows the job, stage, task, executor, and environment details of that Spark application.
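Besides the web pages, the History Server also exposes the same information as JSON through its REST API under /api/v1. The sketch below lists the recorded applications using only the Python standard library. It assumes the server is reachable at http://localhost:18080; `fetch_applications` and `summarize` are hypothetical helper names, and the code relies on the `id`, `name`, and `attempts[].completed` fields of the response.

```python
import json
from urllib.request import urlopen


def fetch_applications(base_url: str = "http://localhost:18080") -> list:
    """GET /api/v1/applications and return the parsed JSON list."""
    with urlopen(f"{base_url}/api/v1/applications", timeout=10) as resp:
        return json.load(resp)


def summarize(apps: list) -> list:
    """Reduce each entry to (id, name, completed); completed is True only if every attempt finished."""
    rows = []
    for app in apps:
        attempts = app.get("attempts") or []
        completed = bool(attempts) and all(a.get("completed", False) for a in attempts)
        rows.append((app.get("id"), app.get("name"), completed))
    return rows
```

With the server running, `for row in summarize(fetch_applications()): print(row)` prints one line per application, which is handy for scripting comparisons between runs.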
5. Stop the Spark History Server
You can stop the History Server by running the command below.
$SPARK_HOME/sbin/stop-history-server.sh
Conclusion
Using the History Server, you can keep track of all completed applications. You need to enable event logging in order to keep the history. These metrics come in handy when you are doing performance tuning.
Happy Learning !!
Hi,
Very nice articles you are writing. Thank you so much for your efforts; your content is different than other blogs. Keep it up. I have a question: how do I enable the history server for an application I am running in IntelliJ? I tried running it with local[*], but it is not registered in the history server. Please help.