The Lineage Graph is a directed acyclic graph (DAG) in Spark or PySpark that represents the dependencies between RDDs (Resilient Distributed Datasets) or DataFrames in a Spark application. In this article, we shall discuss in detail what the Lineage Graph in Spark/PySpark is, describe its properties, and illustrate the dependencies between RDDs with examples.
Table of contents
1. What is Spark Lineage Graph
- It tracks all the operations performed on the input data, including transformations and actions, and stores the metadata of the data transformation steps.
- It is a crucial component of Spark’s fault tolerance mechanism. Since RDDs are immutable, the Lineage Graph helps in reconstructing lost RDDs by recomputing their parent RDDs based on their transformations. This feature enables Spark to recover lost data in case of node failures or other issues in the cluster.
- This also helps in optimizing the execution plan of Spark applications. Spark uses the information in the Lineage Graph to optimize the DAG and perform transformations in a way that reduces data shuffling and increases parallelism. This optimization leads to faster execution times and efficient utilization of cluster resources.
- It can be visualized using the toDebugString() method in Spark. This method prints a string representation of the lineage graph, including the dependencies between RDDs and the transformations applied to each RDD.
Overall, the Lineage Graph is a powerful and important component of Spark that enables fault tolerance, optimization, and efficient use of cluster resources. Understanding the Lineage Graph is crucial for developing efficient and fault-tolerant Spark applications.
2. Properties of Lineage Graph
The Lineage Graph in Spark has the following properties:
- Directed Acyclic Graph (DAG): The Lineage Graph is a directed acyclic graph, which means that it is a graph of vertices (RDDs or DataFrames) connected by directed edges that represent dependencies. It is acyclic because there are no circular dependencies between the vertices.
- Immutable: RDDs are immutable, meaning that once an RDD is created, it cannot be modified. This property is reflected in the Lineage Graph, which stores the lineage of each RDD as a series of transformations that were applied to the original data.
- Fault Tolerance: The Lineage Graph is used for Spark’s fault-tolerance mechanism. Since RDDs are immutable, Spark can recreate any lost RDDs by tracing them back to their parent RDDs and recomputing the transformations that were applied to them.
- Optimization: The Lineage Graph helps Spark optimize the execution plan of transformations by reducing data shuffling and increasing parallelism. This optimization leads to faster execution times and efficient utilization of cluster resources.
- Metadata: The Lineage Graph stores metadata about each RDD, including its data type, partitioning scheme, and dependencies. This information is used by Spark to optimize the execution plan and ensure that the correct transformations are applied to each RDD.
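The dependency and metadata properties described above can be inspected programmatically. Below is a minimal sketch (assuming Spark is on the classpath and a local SparkSession; the object and variable names are illustrative, not from the article) that uses the RDD `dependencies` and `getNumPartitions` APIs to show how narrow and shuffle dependencies make up the lineage:

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.sql.SparkSession

object LineagePropertiesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LineagePropertiesSketch")
      .master("local[*]")
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(1 to 10, 2)
    // map() creates a narrow (one-to-one) dependency on nums
    val doubled = nums.map(_ * 2)
    // reduceByKey() introduces a shuffle dependency, i.e. a stage boundary
    val counts = doubled.map(n => (n % 3, n)).reduceByKey(_ + _)

    // Narrow transformations report a OneToOneDependency on the parent RDD
    println(doubled.dependencies)
    // The shuffled RDD's first dependency is a ShuffleDependency
    println(counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])
    // Partitioning metadata is carried along the lineage as well
    println(counts.getNumPartitions)

    spark.stop()
  }
}
```

Inspecting `dependencies` this way is how Spark itself distinguishes narrow from wide transformations when it splits the lineage into stages.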
3. Demo of Lineage Graph
Let us create a sample RDD from a text file containing the data below, apply word-count logic to it, and finally call the toDebugString method on the wordCounts RDD to get a representation of the lineage graph, including the dependencies between RDDs and the transformations applied to each RDD.
```
Hello world This is a test Test successful
```
We create an RDD lines by reading the above sample data from a text file. We then apply two transformations to this RDD to split the lines into words and count the occurrences of each word. Finally, we perform an action by calling collect() to trigger the execution of the RDD lineage.
```scala
// Imports
import org.apache.spark.sql.SparkSession

// Create a SparkSession
val spark = SparkSession.builder()
  .appName("LineageGraphExample")
  .master("local[*]")
  .getOrCreate()

// Create an RDD from a text file
val lines = spark.sparkContext.textFile("input.txt")

// Apply a transformation to the RDD
val words = lines.flatMap(line => line.split(" "))

// Apply another transformation to the RDD
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Perform an action to trigger execution of the RDD lineage
wordCounts.collect()

// Print the RDD lineage
println(wordCounts.toDebugString)
```
The output of the collect() action on wordCounts is:

```
res11: Array[(String, Int)] = Array((Test,1), (is,1), (Hello,1), (This,1), (successful,1), (test,1), (world,1), (a,1))
```
Now call the toDebugString method on the wordCounts RDD to print the lineage graph. This method prints a string representation of the lineage graph, including the dependencies between RDDs and the transformations applied to each RDD.
Running this program prints the toDebugString representation of the wordCounts lineage.
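A representative toDebugString output for this program is shown below; the exact RDD ids, partition counts, and call sites vary by environment and Spark version, so treat this as an illustration rather than a literal transcript:

```
(2) ShuffledRDD[4] at reduceByKey at <console>:26 []
 +-(2) MapPartitionsRDD[3] at map at <console>:25 []
    |  MapPartitionsRDD[2] at flatMap at <console>:24 []
    |  input.txt MapPartitionsRDD[1] at textFile at <console>:23 []
    |  input.txt HadoopRDD[0] at textFile at <console>:23 []
```

The indentation encodes the dependency structure: the `+-` marks the shuffle boundary introduced by reduceByKey, and the lines beneath it are the narrow (one-to-one) ancestors back to the file read.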
This output shows the lineage graph for the wordCounts RDD. We can see that it depends on a ShuffledRDD produced by the reduceByKey operation, which in turn depends on a MapPartitionsRDD. These RDDs were created by applying transformations to the lines RDD, which was itself created from the input text file.
The DAG of dependencies between RDD transformations and actions is also available as a visualization in the Spark UI. The Spark UI runs on port 4040 by default and can be accessed at http://localhost:4040/; its DAG visualization shows the lineage graph for the wordCounts job.
We can see from the Spark UI that each wide transformation results in a separate stage. In our case, Spark creates two stages: the first reads the input file and applies the narrow transformations (flatMap and map), producing MapPartitionsRDDs, and the second performs the shuffle for the reduceByKey operation. The wordCounts RDD depends on the ReduceByKey operation (the second stage), which in turn depends on the MapPartitionsRDD produced in the first stage.
In conclusion, the Lineage Graph is a key concept in Apache Spark that represents the dependencies between RDDs or DataFrames in a Spark application. The Lineage Graph helps in fault tolerance by reconstructing lost RDDs based on their parent RDDs and their transformations. It also optimizes the execution plan of Spark applications by reducing data shuffling and increasing parallelism.
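The conclusion mentions DataFrames as well as RDDs. For DataFrames, lineage is tracked through logical plans rather than raw RDD dependencies, and the Catalyst optimizer uses that plan to produce the physical execution. A minimal sketch (assuming a local SparkSession; the column and variable names are illustrative) that surfaces these plans with explain():

```scala
import org.apache.spark.sql.SparkSession

object DataFrameLineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameLineageSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Hello", 1), ("world", 2)).toDF("word", "n")
    val agg = df.filter($"n" > 0).groupBy("word").count()

    // explain(true) prints the parsed, analyzed, and optimized logical plans
    // plus the physical plan derived from the DataFrame's lineage.
    agg.explain(true)

    spark.stop()
  }
}
```

This is the DataFrame counterpart of toDebugString: the optimized logical plan shows how Catalyst rewrites the recorded lineage before execution.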