Spark Web UI - Understanding Spark Execution

Apache Spark provides a suite of web user interfaces (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) for monitoring the status of your Spark/PySpark application, the resource consumption of the Spark cluster, and the Spark configuration.

1. Spark Jobs Tab

The details to be aware of in the Jobs tab are the scheduling mode, the number of Spark jobs, the number of stages each job has, and the description of each Spark job.

1.1 Scheduling Mode

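The Scheduling Mode shown on the Jobs tab describes how jobs inside one application share resources. There are two modes, set through the spark.scheduler.mode property.

FIFO (the default)
FAIR

This is separate from the cluster manager (Standalone, YARN, Mesos, or local mode), which decides where the application runs.

As I was running on a local machine with the default settings, the application ran in local mode.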

1.2 Number of Spark Jobs

As a rule of thumb, each action in the application triggers at least one Spark job, and each Spark job has at least one stage. Some operations that are not actions in the strict sense, such as reading a CSV file with a header or with schema inference, also launch jobs.
In the application above, 3 Spark jobs were run (0, 1, 2).

Job 0. Read the CSV file.
Job 1. Infer the schema from the file.
Job 2. Count check.

The figure shows these 3 Spark jobs, the result of the 3 operations above.
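
The application code itself is not shown in this post, so the following is only a minimal PySpark sketch of an application that would be expected to produce the three jobs listed above. The function names, the application name, and the local[3] master are assumptions made for illustration, not the author's exact code.

```python
def build_session(app_name="spark-ui-demo"):
    """Create a local session with 3 threads (assumed setup, for illustration only)."""
    # Imported lazily so that load_and_count can be used with any existing session.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[3]")
        .appName(app_name)
        .getOrCreate()
    )


def load_and_count(spark, csv_path):
    """Read a CSV with a header and schema inference, then count its rows."""
    df = (
        spark.read
        .option("header", True)       # reading the header line typically shows up as a job
        .option("inferSchema", True)  # schema inference scans the data, which is another job
        .csv(csv_path)
    )
    return df.count()                 # the count action triggers the final job
```

Calling load_and_count(build_session(), "path/to/file.csv") and opening http://localhost:4040 while the session is alive should show the jobs on the Jobs tab. The exact number of jobs can differ between Spark versions.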

1.3 Number of Stages

Each wide transformation (one that needs a shuffle) starts a new stage. In our case, Spark job 0 and Spark job 1 each have a single stage, but Spark job 2 has two stages. The count is computed in two steps: partial counts per partition in the first stage, then a shuffle that brings the partial counts together for the final total in the second stage.

1.4 Description

The Description column links to the complete details of the associated Spark job, such as the job status, the DAG visualization, and the completed stages.
These details are explained in the sections that follow.

2. Stages Tab

We can navigate into the Stage Tab in two ways.

Select the Description of the respective Spark job (shows stages only for the selected Spark job)
At the top of the Jobs tab, select the Stages option (shows all stages in the application)

In our application, we have a total of 4 Stages.

The Stages tab displays a summary page that shows the current state of all stages of all Spark jobs in the Spark application.

The number of tasks in each stage equals the number of partitions that Spark is going to work on. Each task in a stage does the same work, but on a different partition of the data.
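
One way to relate the task count of a stage to the data is to ask the DataFrame how many partitions it has. This is a minimal sketch, assuming df is a PySpark DataFrame. Both helper names are hypothetical.

```python
def partition_count(df):
    """Return the number of partitions backing a DataFrame.

    A stage that processes this DataFrame runs one task per partition.
    """
    return df.rdd.getNumPartitions()


def with_partitions(df, n):
    """Return a DataFrame redistributed into n partitions (this adds a shuffle)."""
    return df.repartition(n)
```

For example, after with_partitions(df, 8), the stage that reads the shuffled data should typically show 8 tasks in the Stages tab.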

Stage detail

The stage detail page shows the Directed Acyclic Graph (DAG) of the stage, where vertices represent the RDDs or DataFrames and edges represent the operations applied to them.

Let us analyze the operations in each stage.
Operations in Stage 0 are:
1. FileScanRDD
2. MapPartitionsRDD

FileScanRDD

FileScan represents reading the data from a file.
It produces FilePartitions, which are custom RDD partitions made up of PartitionedFiles (file blocks).
In our scenario, the CSV file is read.

MapPartitionsRDD

A MapPartitionsRDD is created by per-partition transformations such as mapPartitions, map, and filter.

Operations in Stage 1 are:
1. FileScanRDD
2. MapPartitionsRDD
3. SQLExecutionRDD

As FileScanRDD and MapPartitionsRDD are already explained, let us look at SQLExecutionRDD.

SQLExecutionRDD

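SQLExecutionRDD is a wrapper RDD that Spark SQL puts around the RDD of a structured query so that the session's SQL configuration is in effect while the query's tasks run. A related Spark property, spark.sql.execution.id, is used to track multiple Spark jobs that together constitute a single structured query execution.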

Operations in Stage 2 and Stage 3 are:
1. FileScanRDD
2. MapPartitionsRDD
3. WholeStageCodegen
4. Exchange

WholeStageCodegen

WholeStageCodegen is a physical query optimization in Spark SQL that fuses multiple physical operators into a single generated Java function.

Exchange

An Exchange appears here because of the count method.
As the data is divided into partitions and shared among executors, the counts from the individual partitions have to be added together to get the total count.

Exchange represents the shuffle, i.e. data movement across the cluster (executors).
A shuffle is one of the most expensive operations in Spark, and the more partitions there are, the more data is exchanged between executors.
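
The two-step nature of a count can be pictured with plain Python, no Spark involved. This is a conceptual sketch only, not Spark's implementation. The sample data and function names are made up for illustration.

```python
def partial_counts(partitions):
    """Stage before the Exchange: each task counts the rows of its own partition."""
    return [len(rows) for rows in partitions]


def final_count(partials):
    """Stage after the Exchange: the partial counts are combined into one total."""
    return sum(partials)


# Hypothetical data split across three partitions.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
partials = partial_counts(partitions)
total = final_count(partials)
print(partials, total)  # [2, 1, 3] 6
```

Only the small partial counts need to cross the Exchange, not the rows themselves, which is why a plain count shuffles very little data even on a large file.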

3. Tasks

The task list is located at the bottom of the respective stage's detail page.
Key things to look at in the task list are:
1. Input Size: the amount of input data read by the stage
2. Shuffle Write: the amount of output data the stage wrote for a later shuffle

4. Storage

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes, and partitions of all RDDs, and the details page shows the sizes and executors for all partitions in an RDD or DataFrame.
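
The Storage tab stays empty until something is actually persisted. The sketch below, assuming df is a PySpark DataFrame, marks it for caching and runs an action so that an entry appears. The helper name is hypothetical.

```python
def cache_and_materialize(df):
    """Mark a DataFrame for caching and run an action so the cache is populated."""
    cached = df.cache()  # lazy: only marks the DataFrame as persistent
    cached.count()       # an action forces the data to be computed and stored
    return cached
```

Calling unpersist() on the returned DataFrame removes it from the cache, and it should then disappear from the Storage tab.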

5. Environment Tab


This environment page has five parts. It is a useful place to check whether your properties have been set correctly.

Runtime Information: simply contains the runtime properties like versions of Java and Scala.
Spark Properties: lists the application properties like ‘spark.app.name’ and ‘spark.driver.memory’.
Hadoop Properties: displays properties related to Hadoop and YARN. Note: properties prefixed with ‘spark.hadoop’ are shown not in this part but in ‘Spark Properties’.
System Properties: shows more details about the JVM.
Classpath Entries: lists the classes loaded from different sources, which is very useful to resolve class conflicts.

The Environment tab displays the values for the different environment and configuration variables, including JVM, Spark, and system properties.
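
The same check can be done from code. The sketch below compares expected settings against (key, value) pairs. The helper name and the sample values are hypothetical. In a running application the pairs could come from spark.sparkContext.getConf().getAll(), which returns a list of (key, value) tuples.

```python
def find_mismatches(actual_pairs, expected):
    """Compare expected settings against (key, value) pairs from an application.

    Returns {key: (expected_value, actual_value)} for every setting that differs;
    actual_value is None when the key is not set at all.
    """
    actual = dict(actual_pairs)
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


# Hypothetical values standing in for a real application's configuration.
pairs = [("spark.app.name", "spark-ui-demo"), ("spark.master", "local[3]")]
expected = {"spark.app.name": "spark-ui-demo", "spark.driver.memory": "2g"}
print(find_mismatches(pairs, expected))  # {'spark.driver.memory': ('2g', None)}
```

An empty result means every expected property matches what the application is actually using.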

6. Executors Tab

The Executors tab displays summary information about the executors that were created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.

The Executors tab provides not only resource information like the amount of memory, disk, and cores used by each executor but also performance information.

In the Executors tab for this application:
Number of cores = 3, because the master was set to local with 3 threads
Number of tasks = 4

7. SQL Tab

If the application executes Spark SQL queries then the SQL tab displays information, such as the duration, Spark jobs, and physical and logical plans for the queries.

In our application, we performed read and count operations on the file and the DataFrame, so both read and count are listed in the SQL tab.
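
The plans shown in the SQL tab can also be printed from code. This is a minimal sketch, assuming df is a PySpark DataFrame. The helper name is hypothetical.

```python
def print_plans(df):
    """Print the logical and physical plans for a DataFrame query."""
    # extended=True prints the logical plans as well as the physical plan.
    df.explain(True)
```

Comparing this output with the query details in the SQL tab is a quick way to find operators such as Exchange in the physical plan.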

Some of the material here was gathered from https://spark.apache.org/; thanks for the information.

“…………….Keep learning and keep growing…………………”
