Why are Spark RDDs immutable? Spark Resilient Distributed Datasets (RDDs) are the fundamental data structures in Spark that allow for distributed data processing. Spark RDDs are immutable, fault-tolerant collections of objects that can be processed in parallel across a cluster of machines. In this article, we shall discuss Spark RDDs, the reasons why Spark RDDs are immutable, and an example that showcases RDD immutability.

1. Create RDD

Spark RDD stands for Resilient Distributed Dataset, and it is a fundamental data structure in Apache Spark. To create an RDD in Spark Scala, you can use the SparkContext's parallelize function to parallelize an existing collection of data, or read data from a distributed file system. Here’s an example of creating an RDD with product data:


// Spark Imports
import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder.appName("create RDD").getOrCreate()

// Case class representing a product record
case class Product(id: Int, name: String, price: Double)

// Initialize a Seq holding the product details
val products = Seq(
  Product(1, "Apple", 1.99),
  Product(2, "Banana", 0.99),
  Product(3, "Orange", 2.49),
  Product(4, "Grapes", 3.99),
  Product(5, "Watermelon", 5.99)
)

// Create an RDD by passing the product Seq to the parallelize function
val rdd = spark.sparkContext.parallelize(products)

In this example,

  • We define a case class Product that represents our product data.
  • We then create a sequence of product objects, products.
  • Finally, we parallelize the sequence using spark.sparkContext.parallelize to create an RDD rdd that can be processed in parallel across a cluster of machines.
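
The bullets above cover parallelizing an in-memory collection; as mentioned earlier, you can also create an RDD by reading data from a distributed file system. Below is a minimal sketch using SparkContext's textFile method; the file path is hypothetical and only for illustration:

// Create an RDD by reading a text file from a distributed file system
// Note: the path below is a hypothetical placeholder, replace it with a real HDFS/S3/local path
val linesRdd = spark.sparkContext.textFile("hdfs://namenode:9000/data/products.txt")

// Each element of linesRdd is one line of the file
linesRdd.take(5).foreach(println)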

Note that when creating an RDD, it’s important to consider the data size and distribution to ensure optimal performance in distributed processing. Additionally, in newer versions of Spark, it’s recommended to use DataFrames and Datasets instead of RDDs for improved performance and ease of use.
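
For instance, you can control how the data is distributed by passing an explicit partition count to parallelize, and you can build a Dataset from the same collection when the higher-level API is preferred. This is a minimal sketch based on the standard Spark APIs; the partition count of 4 is just an illustrative value:

// Control the number of partitions (data distribution) explicitly
val partitionedRdd = spark.sparkContext.parallelize(products, numSlices = 4)
println(partitionedRdd.getNumPartitions)  // prints 4

// Recommended in newer Spark versions: create a Dataset from the same collection
import spark.implicits._
val productsDS = spark.createDataset(products)
productsDS.show()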

2. Reasons for Spark RDD immutability

Spark Scala RDDs are immutable, which means that once an RDD is created, it cannot be changed. This is by design and is a key characteristic of RDDs in Spark.

There are a few reasons why RDDs are immutable:

  1. Simplicity: By being immutable, RDDs are simple and easy to reason about. Since RDDs cannot be changed once they are created, you can be confident that the data in the RDD will not change unexpectedly.
  2. Parallelism: RDD immutability enables Spark to process data in parallel across multiple machines. Since RDDs are immutable, Spark can partition them across multiple nodes in a cluster and process them in parallel without worrying about data consistency issues that can arise with mutable data structures.
  3. Fault tolerance: Since RDDs are immutable, Spark can recover lost or damaged data by recomputing the lost RDD partitions. This is done using lineage, which is a record of the transformations that were applied to the source data to build the RDD. By replaying that lineage, Spark can reconstruct the lost partitions and continue processing without data loss.

Overall, RDD immutability is a fundamental aspect of Spark’s architecture that enables its scalability, parallelism, and fault tolerance.
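
You can inspect the lineage that Spark records for an RDD by calling toDebugString on it. The following is a minimal sketch using the products RDD from the first section; the exact output depends on your Spark version and cluster setup:

// Apply a transformation and print the lineage (the recorded chain of transformations)
val doubledPrices = rdd.map(x => (x.id, x.name, x.price * 2))
println(doubledPrices.toDebugString)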

3. Spark RDD immutability with an Example

Spark RDDs are immutable, meaning they cannot be changed after they are created. Let’s look at an example to understand RDD immutability in Spark.

Consider the following code, which uses the products RDD created above and attempts to modify it:


//Scene1: Show the contents of the RDD
rdd.collect

//Scene2: Apply a transformation on the RDD to double the price
rdd.map(x => (x.id, x.name, x.price * 2)).collect

//Scene3: Show the contents of the RDD after the transformation
rdd.collect

//Scene4: Apply the transformation and assign the result to a new RDD
val new_rdd = rdd.map(x => (x.id, x.name, x.price * 2))
new_rdd.collect

In this code, we use the RDD rdd created earlier with the parallelize() method, which holds each product's id, name, and price. We then apply the collect() method to print all elements of the RDD.

Scene1: The output of the RDD created:

[Screenshot: output of rdd.collect showing the five original Product records]

Next, in Scene2, we apply the map method to the RDD to multiply the price of each product by 2.

After applying the map, the prices of the products in the returned result are doubled.

However, in Scene3, where we apply collect again to print the original RDD, we see that its elements have not been modified.

This is because RDDs are immutable in Spark, which means that any operation that appears to modify an RDD actually creates a new RDD with the modified data. In this case, the map operation created a new RDD with the modified data, but the original RDD rdd remained unchanged.

To get the modified RDD, we need to assign the result of the map operation to a new RDD, like this:

[Screenshot: output of new_rdd.collect showing each product's price doubled]

In this code in Scene4, we create a new RDD new_rdd that contains the elements of rdd with the price of each product multiplied by 2. By creating a new RDD, we can see the modified data when we apply collect to new_rdd.

This example illustrates how Spark RDD immutability works in practice: any operation that modifies an RDD actually creates a new RDD with the modified data, rather than modifying the original RDD. This is a fundamental characteristic of RDDs in Spark that enables their scalability, parallelism, and fault tolerance.
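
As a quick sanity check, you can sum the prices in both RDDs to confirm that only the new RDD reflects the doubled values. This is a minimal sketch; the totals in the comments are simply derived from the five sample products created earlier:

// Sum of the original prices: 1.99 + 0.99 + 2.49 + 3.99 + 5.99 = 15.45
val originalTotal = rdd.map(_.price).sum()

// Sum of the doubled prices from the new RDD: 30.90
val doubledTotal = new_rdd.map(_._3).sum()

println(s"original = $originalTotal, doubled = $doubledTotal")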

4. Conclusion

In conclusion, Spark RDDs are immutable for several reasons that are fundamental to Spark’s architecture. RDD immutability enables Spark to process data in parallel across multiple machines, which allows for scalability and efficient use of resources. Additionally, immutable RDDs make it easy to reason about the data in the RDD, since it will not change unexpectedly.

Finally, immutability enables Spark to provide fault tolerance, since lost or damaged partitions can be recomputed by replaying the recorded lineage of transformations. Overall, RDD immutability is a key characteristic of Spark that enables its scalability, parallelism, and fault tolerance.
