
What is a Spark job? A Spark (or PySpark) job is a set of tasks or computations that are executed in a distributed computing environment using the Apache Spark framework. In this article, we discuss the Spark job in detail and walk through some examples using the Spark Web UI.


1. What is a Spark Job?

A Spark job can be any task that needs to be performed on a large amount of data that is too big to be processed on a single machine.


In Apache Spark, a job is divided into stages, where each stage is a set of tasks that can be executed in parallel, with each task operating on one partition of the data. The data is divided into smaller chunks, called partitions, and processed in parallel across multiple nodes in a cluster, which allows for faster processing and scalability.
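
As a quick sketch of the relationship between partitions and tasks (assuming a SparkSession named spark is already available, for example in spark-shell or a Databricks notebook; the numbers are arbitrary):

// Distribute one million integers across 8 partitions.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)

println(rdd.getNumPartitions) // 8

// The action below creates a job whose stage runs one task per partition,
// so the 8 partitions are counted in parallel across the cluster.
println(rdd.count())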

A Spark job typically involves the following steps:

  • Loading data from a data source
  • Transforming or manipulating the data using Spark’s APIs such as map, reduce, join, etc.
  • Storing the processed data back to a data store.
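
The snippet below is a rough sketch of these three steps, assuming an existing SparkSession named spark; the input path, column names, and output path are hypothetical placeholders, not taken from the example later in this article:

import spark.implicits._

// 1. Load data from a data source
val orders = spark.read
  .option("header", true)
  .csv("/tmp/input/orders.csv")

// 2. Transform the data using DataFrame APIs
val countsByCustomer = orders
  .filter($"status" === "COMPLETE")
  .groupBy("customer_id")
  .count()

// 3. Store the processed data back to a data store
countsByCustomer.write
  .mode("overwrite")
  .parquet("/tmp/output/order_counts")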

Spark jobs can be written in several programming languages including Java, Scala, Python, and R, and can be executed on a variety of platforms including Hadoop, Kubernetes, and cloud-based services like Amazon EMR, Google Dataproc, and Microsoft Azure HDInsight.

2. When does a job get created in Spark?

In Apache Spark, a job is created when a Spark action is called on an RDD (Resilient Distributed Dataset) or a DataFrame. An action is an operation that triggers the processing of data and the computation of a result that is returned to the driver program or saved to an external storage system.

  • Spark’s lazy evaluation model defers computation until it is necessary, so transformations such as map, filter, and groupBy do not immediately trigger the execution of code. Instead, these transformations build up a logical execution plan called a DAG (Directed Acyclic Graph) that describes the computation to be performed on the data.
  • When an action is called, Spark examines the DAG and schedules the necessary transformations and computations to be executed on the distributed data. This process creates a job, which is a collection of tasks that are sent to the worker nodes in the cluster for execution.
  • Each task processes a subset of the data and produces intermediate results that are combined to produce the final result of the action. The number of tasks created depends on the size of the data and the number of partitions it is divided into.

Therefore, a job in Spark is created when an action is called on an RDD or DataFrame, which triggers the execution of the transformation operations defined on the data.
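
A minimal sketch of this lazy behavior, again assuming an existing SparkSession named spark (the data and partition count are arbitrary):

// Transformations only build up the DAG -- nothing is executed yet.
val numbers = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
val evens   = numbers.filter(_ % 2 == 0)  // transformation: lazy
val doubled = evens.map(_ * 2)            // transformation: lazy

// The action below triggers execution of the whole DAG and creates a job
// with one task per partition (4 here).
val total = doubled.reduce(_ + _)
println(total)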

3. Actions in Spark that can trigger the creation of a job

Here are some examples of actions in Spark that can trigger the creation of a job:

  1. count(): This action returns the number of elements in the RDD or DataFrame. When called, it triggers the computation of all the transformations leading up to the final count and creates a job to execute the computation.
  2. collect(): This action returns all the elements of the RDD or DataFrame as an array to the driver program. When collect() is called, it triggers the computation of all the transformations leading up to the final collection and creates a job to execute the computation.
  3. saveAsTextFile(): This action saves the contents of the RDD as a text file to a specified location. When called, it triggers the computation of all the transformations leading up to the final output and creates a job to execute the computation and write the output to the specified location.
  4. reduce(): This action applies a binary operator to the elements of the RDD and returns the result. When called, it triggers the computation of all the transformations leading up to the final reduce operation and creates a job to execute the computation.
  5. foreach(): This action applies a function to each element of the RDD or DataFrame, such as writing it to a database or sending it over the network. When called, it triggers the computation of all the transformations leading up to the final foreach operation and creates a job to execute the computation.

In general, any action that requires the computation of the transformations applied to an RDD or DataFrame will trigger the creation of a job in Spark.
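
As a quick illustrative sketch (assuming an existing SparkSession named spark; the output path is a made-up placeholder), each of the following calls triggers its own job, which you can watch appear in the Spark UI:

val words = spark.sparkContext.parallelize(Seq("spark", "job", "stage", "task"))

println(words.count())                   // count(): returns the number of elements
println(words.collect().mkString(", "))  // collect(): brings all elements to the driver
println(words.reduce(_ + " " + _))       // reduce(): combines elements with a binary operator
words.foreach(w => println(w))           // foreach(): runs on the executors, output goes to executor logs
words.saveAsTextFile("/tmp/words_out")   // saveAsTextFile(): writes the RDD out as text files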

4. Examples of Spark Job

Always keep in mind that the number of Spark jobs is equal to the number of actions in the application, and each Spark job has at least one stage.

Let us read a sample CSV file, call the count action on it to check the number of records in the file, and see how many jobs are created.


val df = spark.read  //JOB0: read the CSV file
              .option("inferSchema", true) //JOB1: Inferschema from the file
              .option("header", true) 
              .csv("dbfs:/databricks-datasets/sfo_customer_survey/2013_SFO_Customer_Survey.csv")

df.count //JOB2: Count the records in the File

Here we create a DataFrame by reading a .csv file and then check the count of the DataFrame. Let’s see how these jobs show up in the Spark UI.

If you are running the Spark application locally, the Spark UI can be accessed at http://localhost:4040/. The Spark UI runs on port 4040 by default. Since I am using Databricks to run this sample, the Spark UI is accessible under the cluster/command output.
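
If you need the UI on a specific port for a local run, spark.ui.port can be set when the session is built. A minimal local sketch (the app name is a placeholder); if the port is already taken, Spark falls back to 4041, 4042, and so on:

import org.apache.spark.sql.SparkSession

// Pin the Spark UI to port 4040 explicitly (4040 is also the default).
val spark = SparkSession.builder()
  .appName("SparkJobExample")
  .master("local[*]")
  .config("spark.ui.port", "4040")
  .getOrCreate()

// While the application is running, open http://localhost:4040/ to view the Jobs tab.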

Spark UI Jobs view (Jobs 0, 1, and 2)

In the application above, we have performed 3 Spark jobs (0, 1, 2):

  • Job 0: read the CSV file.
  • Job 1: infer the schema from the file.
  • Job 2: count the records.

If we look at the screenshot above, it clearly shows 3 Spark jobs, the result of 3 actions.

5. Conclusion

To summarize, in Apache Spark, a job is created when an action is called on an RDD or DataFrame. When an action is called, Spark examines the DAG and schedules the necessary transformations and computations to be executed on the distributed data. This process creates a job, which is a collection of tasks that are sent to the worker nodes in the cluster for execution. Each task processes a subset of the data and produces intermediate results that are combined to produce the final result of the action.
