Apache Spark provides a suite of web user interfaces (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) for monitoring the status of your Spark/PySpark application, the resource consumption of the Spark cluster, and the Spark configuration.
This set of user interfaces comes in handy for understanding how Spark executes Spark/PySpark jobs. In this article, I will run a small application and explain how Spark executes it by walking through the different sections of the Spark Web UI.
Before going into the Spark UI, first learn about two concepts: transformations and actions.
In brief: your application code is a set of instructions that tells the driver to run a Spark job, and the driver decides how to carry it out with the help of executors.
The instructions given to the driver are called transformations, and an action triggers their execution.
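The split between transformations and actions can be illustrated with a toy, Spark-free analogy in plain Python. The `LazyDataset` class below is hypothetical and is not Spark's API; it only shows that a transformation records work, while an action triggers it.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)

    def map(self, fn):
        # "Transformation": only records the step and returns a new plan.
        return LazyDataset(self._source, self._steps + (fn,))

    def count(self):
        # "Action": runs every recorded step over every row, then counts.
        total = 0
        for row in self._source:
            for fn in self._steps:
                row = fn(row)
            total += 1
        return total


calls = []  # records every row the transformation actually touched


def double(row):
    calls.append(row)
    return row * 2


plan = LazyDataset([1, 2, 3]).map(double)  # nothing has run yet
calls_before_action = len(calls)
row_count = plan.count()                   # the action triggers execution
print(calls_before_action, row_count, len(calls))  # prints: 0 3 3
```

Until `count()` is called, `double` has not touched a single row; that is the same reason a Spark job only appears in the UI once an action runs.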
I wrote a small application that performs a transformation and an action.
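A minimal PySpark sketch of such an application, assuming a CSV file with a header row, might look like the following. The function name, the application name, and the `local[3]` master are illustrative assumptions, and `pyspark` must be installed for the function to run.

```python
def run_small_app(csv_path, master="local[3]"):
    """Read a CSV file and count its rows (a sketch, not the author's exact code)."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)                 # assumed: local mode with 3 threads
        .appName("spark-ui-demo")       # hypothetical application name
        .getOrCreate()
    )
    df = (
        spark.read
        .option("header", True)         # use the first line as column names
        .option("inferSchema", True)    # scan the data to infer column types
        .csv(csv_path)
    )
    row_count = df.count()              # action: triggers the count job
    return spark, row_count
```

Calling `spark, n = run_small_app("path/to/file.csv")` (the path is a placeholder) leaves the session open, so the UI stays reachable until `spark.stop()` is called.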
The Spark UI is separated into the tabs and views below: Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL.
If you are running the Spark application locally, the Spark UI can be accessed at http://localhost:4040/. The Spark UI runs on port 4040 by default; below are some additional UIs that are helpful for tracking a Spark application.
- Spark Application UI: http://localhost:4040/
- HDFS NameNode UI: http://localhost:9870
- YARN ResourceManager: http://localhost:8088/
- YARN NodeManager (node-specific info): http://localhost:8042/
Note: To access the Spark application UI, the Spark application must be running. If you want to access the Spark UI regardless of the application's status, you need to start the Spark History Server.
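The History Server reads event logs that applications write while they run. As a sketch, the snippet below renders the relevant properties in the `key value` form used by `conf/spark-defaults.conf`; the log directory is an assumed example path, and the helper `to_spark_defaults` is hypothetical.

```python
# Event-log settings read by the History Server.
# The directory below is an assumed example path; use any location Spark can write to.
HISTORY_CONF = {
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "file:///tmp/spark-events",
    "spark.history.fs.logDirectory": "file:///tmp/spark-events",
}


def to_spark_defaults(conf):
    """Render properties as aligned 'key  value' lines for spark-defaults.conf."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in conf.items())


print(to_spark_defaults(HISTORY_CONF))
```

With these properties in place, the server is started with `./sbin/start-history-server.sh` from the Spark installation directory and, by default, serves its UI at http://localhost:18080.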
1. Spark Jobs Tab
1.1 Scheduling Mode
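The Scheduling Mode field on the Jobs tab reports how jobs within the application are scheduled: FIFO (the default) or FAIR. This is separate from the cluster manager the application runs on, such as: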
- Standalone mode
- YARN mode
As I was running in a local machine, I tried using
1.2 Number of Spark Jobs
Always keep in mind that the number of Spark jobs equals the number of actions in the application, and each Spark job has at least one stage.
In the application above, three Spark jobs ran (0, 1, 2):
- Job 0: reads the CSV file.
- Job 1: infers the schema from the file.
- Job 2: performs the count.
The figure shows these 3 Spark jobs, the result of 3 actions.
1.3 Number of Stages
Each wide transformation results in a separate stage. In our case, Spark job 0 and Spark job 1 each have a single stage, but Spark job 2 has two stages, because the data is partitioned and the count has to be combined across partitions. The data is split into two partitions by default.
The Description column links to the complete details of the associated Spark job, such as the job status, DAG visualization, and completed stages.
I explain the Description details in the next section.
2. Stages Tab
We can navigate to the Stages tab in two ways.
- Select the Description of the respective Spark job (shows stages only for the selected Spark job).
- At the top of the Jobs tab, select the Stages option (shows all stages in the application).
In our application, we have a total of 4 Stages.
The Stages tab displays a summary page that shows the current state of all stages of all Spark jobs in the Spark application.
The number of tasks you see in each stage is the number of partitions that Spark is going to work on. Each task inside a stage performs the same work, but on a different partition of the data.
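The relationship between partitions and tasks can be sketched in plain Python. This is only an illustration with hypothetical records and helper names; Spark schedules tasks on executors rather than in a simple loop.

```python
def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists (illustrative only)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """The work one task performs: here, upper-casing every record."""
    return [record.upper() for record in partition]


records = ["a", "b", "c", "d", "e"]  # hypothetical input rows
partitions = split_into_partitions(records, 2)

# One task per partition: the same function, applied to a different slice of the data.
task_results = [task(partition) for partition in partitions]
print(len(task_results), task_results)  # prints: 2 [['A', 'C', 'E'], ['B', 'D']]
```

Two partitions produce two tasks, which is the count the Stages tab would report for such a stage.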
The stage details page shows the Directed Acyclic Graph (DAG) of the stage, where vertices represent the RDDs or DataFrames and edges represent the operations applied to them.
Let us analyze the operations in each stage.
Operations in Stage 0 are:
- FileScan: represents reading the data from a file. It is given FilePartitions, which are custom RDD partitions with PartitionedFiles (file blocks). In our scenario, the CSV file is read.
- MapPartitionsRDD: created when you use a mapPartitions transformation.
Operations in Stage 1 are:
As FileScan and MapPartitionsRDD are already explained, let us look at SQLExecutionRDD.
- SQLExecutionRDD: relates to tracking the multiple Spark jobs that together constitute a single structured query execution.
Operations in Stage 2 and Stage 3 are:
- WholeStageCodegen: a physical query optimization in Spark SQL that fuses multiple physical operators.
- Exchange: performed because of the count. As the data is divided into partitions and shared among executors, the counts from the individual partitions must be added together to get the total. Exchange represents the shuffle, i.e. data movement across the cluster (executors). It is the most expensive operation, and the more partitions there are, the more data is exchanged between executors.
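Why a count spans two stages can be sketched in plain Python: each partition is counted on its own, the partial results are brought together, and a final step adds them up. The data and function names here are hypothetical, and the list copy merely stands in for the shuffle.

```python
def partial_count(partition):
    """Stage before the Exchange: each task counts only its own partition."""
    return len(partition)


def final_count(partial_counts):
    """Stage after the Exchange: a single task adds up the partial counts."""
    return sum(partial_counts)


# Hypothetical data already split into two partitions.
partitions = [["r1", "r2", "r3"], ["r4", "r5"]]

partial_counts = [partial_count(p) for p in partitions]  # one value per partition
shuffled = list(partial_counts)   # stands in for the Exchange (shuffle) step
total = final_count(shuffled)
print(partial_counts, total)      # prints: [3, 2] 5
```

Only the small partial counts cross the stage boundary here, not the rows themselves.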
3. Tasks
Tasks are listed at the bottom of the respective stage page.
Key things to look at on the task page are:
1. Input Size: the input for the stage.
2. Shuffle Write: the output written by the stage.
4. Storage Tab
The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes and partitions of all RDDs, and the details page shows the sizes and using executors for all partitions in an RDD or DataFrame.
5. Environment Tab
This environment page has five parts. It is a useful place to check whether your properties have been set correctly.
- Runtime Information: simply contains the runtime properties like versions of Java and Scala.
- Spark Properties: lists the application properties, such as ‘spark.app.name’.
- Hadoop Properties: displays properties relative to Hadoop and YARN. Note: Properties like ‘spark.hadoop’ are shown not in this part but in ‘Spark Properties’.
- System Properties: shows more details about the JVM.
- Classpath Entries: lists the classes loaded from different sources, which is very useful to resolve class conflicts.
The Environment tab displays the values for the different environment and configuration variables, including JVM, Spark, and system properties.
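The same Spark properties can also be checked from code. The helpers below are a sketch with hypothetical names; they assume `spark` is an active `pyspark.sql.SparkSession` and rely on `SparkConf.getAll()`, which returns the configured properties as key/value pairs.

```python
def spark_properties(spark):
    """Return the application's Spark properties as a dict.

    `spark` is expected to be an active pyspark.sql.SparkSession.
    """
    return dict(spark.sparkContext.getConf().getAll())


def check_property(spark, key, expected):
    """True if the property `key` is set to `expected` (compared as strings)."""
    return spark_properties(spark).get(key) == str(expected)
```

For example, `check_property(spark, "spark.app.name", "spark-ui-demo")` confirms the application name, where `spark-ui-demo` is a hypothetical value.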
6. Executors Tab
The Executors tab displays summary information about the executors that were created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.
The Executors tab provides not only resource information like amount of memory, disk, and cores used by each executor but also performance information.
Number of cores = 3, as I set the master to local with 3 threads
Number of tasks = 4
7. SQL Tab
If the application executes Spark SQL queries then the SQL tab displays information, such as the duration, Spark jobs, and physical and logical plans for the queries.
In our application, we performed read and count operations on the file and the DataFrame, so both read and count are listed in the SQL tab.
Some of the material is gathered from https://spark.apache.org/; thanks for the information.
“…………….Keep learning and keep growing…………………”