Why are Spark RDDs immutable?
Spark Resilient Distributed Datasets (RDDs) are the fundamental data structures in Spark that allow for distributed data processing. Spark RDDs are immutable and fault-tolerant collections of objects that can be processed in parallel across a cluster of machines. In this article, we shall discuss what a Spark RDD is, the reasons why Spark RDDs are immutable, and an example that showcases RDD immutability.
1. Create RDD
Spark RDD stands for Resilient Distributed Dataset, and it is a fundamental data structure in Apache Spark. To create an RDD in Spark Scala, you can use the SparkContext's parallelize function to parallelize an existing collection of data, or read data from a distributed file system. Here's an example of creating an RDD with product data:
// Spark Imports
import org.apache.spark.sql.SparkSession
// Initialize SparkSession
val spark = SparkSession.builder.appName("create RDD").getOrCreate()
// Case class representing our product data
case class Product(id: Int, name: String, price: Double)
// A Seq holding our product details
val products = Seq(
Product(1, "Apple", 1.99),
Product(2, "Banana", 0.99),
Product(3, "Orange", 2.49),
Product(4, "Grapes", 3.99),
Product(5, "Watermelon", 5.99)
)
// Create an RDD by passing the product Seq to the parallelize function
val rdd = spark.sparkContext.parallelize(products)
In this example,
- We define a case class Product that represents our product data.
- We then create a sequence of product objects, products.
- Finally, we parallelize the sequence using the sparkContext.parallelize function to create an RDD rdd that can be processed in parallel across a cluster of machines.
Note that when creating an RDD, it’s important to consider the data size and distribution to ensure optimal performance in distributed processing. Additionally, in newer versions of Spark, it’s recommended to use DataFrames and Datasets instead of RDDs for improved performance and ease of use.
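Two practical details worth knowing here, shown below as a minimal, hedged sketch that reuses the products Seq and the spark session from above: parallelize accepts an optional number of partitions to control how the data is distributed, and with spark.implicits._ in scope the same collection can be turned into a typed Dataset via toDS:
// Explicitly request 4 partitions (4 is an arbitrary choice for illustration)
val partitionedRdd = spark.sparkContext.parallelize(products, 4)
println(partitionedRdd.getNumPartitions) // 4
// The recommended alternative in newer Spark versions: a typed Dataset
import spark.implicits._
val productsDS = products.toDS()
productsDS.show()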
2. Reasons for Spark RDD immutability
Spark Scala RDDs are immutable, which means that once an RDD is created, it cannot be changed. This is by design and is a key characteristic of RDDs in Spark.
There are a few reasons why RDDs are immutable:
- Simplicity: By being immutable, RDDs are simple and easy to reason about. Since RDDs cannot be changed once they are created, you can be confident that the data in the RDD will not change unexpectedly.
- Parallelism: RDD immutability enables Spark to process data in parallel across multiple machines. Since RDDs are immutable, Spark can partition them across multiple nodes in a cluster and process them in parallel without worrying about data consistency issues that can arise with mutable data structures.
- Fault tolerance: Since RDDs are immutable, Spark can recover lost or damaged data by recomputing lost RDD partitions. This is done using lineage, which is a record of the transformations that were applied to the source data to produce the partition. By replaying this lineage, Spark can reconstruct the lost partition and continue processing without data loss (a short sketch of how to inspect lineage follows this list).
Overall, RDD immutability is a fundamental aspect of Spark’s architecture that enables its scalability, parallelism, and fault tolerance.
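To make the lineage idea concrete, here is a small, hedged sketch (reusing the rdd from section 1) that applies a couple of transformations and prints the lineage with toDebugString; the exact output format varies by Spark version:
// Build a small chain of transformations on the original RDD
val discounted = rdd.map(p => (p.name, p.price * 0.9)).filter(_._2 > 1.0)
// toDebugString prints the lineage: the recipe Spark keeps for recomputing lost partitions
println(discounted.toDebugString)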
3. Spark RDD immutability with Example
Spark RDDs are immutable, meaning they cannot be changed after they are created. Let's look at an example to understand RDD immutability in Spark.
Consider the following code, which uses the products RDD created above and attempts to modify it:
// Scene 1: Show the contents of the RDD
rdd.collect
// Scene 2: Apply a transformation on the RDD to double the price
rdd.map(x => (x.id, x.name, x.price * 2)).collect
// Scene 3: Show the contents of the RDD after the transformation
rdd.collect
// Scene 4: Apply the transformation and assign the result to a new RDD
val new_rdd = rdd.map(x => (x.id, x.name, x.price * 2))
new_rdd.collect
In this code, we reuse the RDD rdd created earlier with the parallelize() method, which holds each product's id, name, and price. We then apply the collect() method to print all elements of the RDD.
Scene 1: the output of collect on the RDD we created:
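Assuming the five products defined earlier, the collected output should look roughly like this (exact formatting depends on the Spark shell version):
Array(Product(1,Apple,1.99), Product(2,Banana,0.99), Product(3,Orange,2.49), Product(4,Grapes,3.99), Product(5,Watermelon,5.99))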
Next, in Scene 2 we apply the map method to the RDD to multiply the price of each product by 2.
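With the same data, the result of the map transformation should look roughly like this:
Array((1,Apple,3.98), (2,Banana,1.98), (3,Orange,4.98), (4,Grapes,7.98), (5,Watermelon,11.98))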
After applying the map, the price of each product is doubled, as shown above.
However, in Scene 3, where we apply collect again to print the original RDD, we see that its elements have not been modified.
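Collecting rdd again should return the same, unmodified contents as in Scene 1:
Array(Product(1,Apple,1.99), Product(2,Banana,0.99), Product(3,Orange,2.49), Product(4,Grapes,3.99), Product(5,Watermelon,5.99))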
This is because RDDs are immutable in Spark, which means that any operation that modifies an RDD actually creates a new RDD with the modified data. In this case, the map operation created a new RDD with the modified data, but the original RDD rdd remained unchanged.
To keep the modified data, we need to assign the result of the map operation to a new RDD, as we did in Scene 4.
In Scene 4, we create a new RDD new_rdd that contains the elements of rdd with the price of each product multiplied by 2. By creating a new RDD, we can see the modified data when we apply collect to new_rdd.
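Collecting new_rdd should then show the doubled prices, roughly:
Array((1,Apple,3.98), (2,Banana,1.98), (3,Orange,4.98), (4,Grapes,7.98), (5,Watermelon,11.98))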
This example illustrates how RDD immutability works in Spark: any operation that modifies an RDD actually creates a new RDD with the modified data, rather than modifying the original RDD. This is a fundamental characteristic of RDDs in Spark that enables their scalability, parallelism, and fault tolerance.
4. Conclusion
In conclusion, Spark RDDs are immutable for several reasons that are fundamental to Spark’s architecture. RDD immutability enables Spark to process data in parallel across multiple machines, which allows for scalability and efficient use of resources. Additionally, immutable RDDs make it easy to reason about the data in the RDD, since it will not change unexpectedly.
Finally, immutability enables Spark to provide fault tolerance, since lost or damaged partitions can be recovered by replaying the recorded lineage of transformations. Overall, RDD immutability is a key characteristic of Spark that enables its scalability, parallelism, and fault tolerance.
Related Articles
- Spark RDD vs DataFrame vs Dataset
- Get distinct values from Spark RDD
- Apache Spark RDD Tutorial | Learn with Scala Examples
- Spark Internal Execution plan
- Spark Web UI – Understanding Spark Execution
- Create Java DataFrame in Spark
- Spark – How to create an empty Dataset?
- Spark – How to create an empty DataFrame?
- Spark Query Table using JDBC
- Change Column Position in Spark DataFrame
- Filter Spark DataFrame Based on Date