
How do you get distinct values from a Spark RDD? When you need to remove duplicates from a Spark RDD, you can use its distinct() function to achieve this.

Quick Example


// Create an RDD with duplicate elements
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 1, 2, 4))

// Apply the distinct operation
val distinctRdd = rdd.distinct()

// Print the distinct elements
println(distinctRdd.collect().mkString(", "))

// Output (order may vary):
1, 2, 3, 4

1. What is a Spark RDD?

In Apache Spark, RDD stands for “Resilient Distributed Dataset”. An RDD is a fundamental data structure in Spark that represents an immutable, distributed collection of objects that can be processed in parallel across a cluster of machines.

  • RDDs are fault-tolerant and can be rebuilt if a node fails, making them resilient.
  • RDDs can be created in a variety of ways, such as by parallelizing an existing collection in your driver program or by loading data from an external storage system like Hadoop Distributed File System (HDFS), Azure Data Lake Storage (ADLS), Cassandra, or HBase.
  • RDDs can be transformed using various transformation operations like map, filter, groupByKey, and reduceByKey, among others. These operations create a new RDD as output, and the original RDD remains unchanged. RDDs can also be cached in memory for faster access during iterative computations.

In addition to transformation operations, RDDs also support actions, such as count, collect, reduce, and save, that trigger the computation and return results to the driver program or write output to external storage.
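The distinction between lazy transformations and eager actions can be sketched in a few lines. This is a minimal sketch, assuming a SparkSession named `spark` is already in scope, as in the examples in this article:

```scala
// Assumes a SparkSession named `spark` (e.g. inside spark-shell).
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 1, 2, 4))

// Transformations such as map() and filter() are lazy: they only record
// the operation in the RDD lineage and return a new RDD immediately.
val doubled = rdd.map(_ * 2)           // 2, 4, 6, 2, 4, 8
val large = doubled.filter(_ > 4)      // 6, 8

// Actions such as count() and collect() trigger the actual computation
// and return results to the driver.
println(large.count())                          // 2
println(large.collect().mkString(", "))
```

Nothing is computed until `count()` or `collect()` runs, which is why Spark can optimize and pipeline the whole chain of transformations at once.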

2. How to get distinct values from a Spark RDD?

In Apache Spark, the distinct() method is a transformation operation that can be applied to an RDD to remove duplicates from the data.

When the distinct() operation is applied to an RDD, Spark evaluates the unique values present in the RDD and returns a new RDD containing exactly one occurrence of each distinct element. Because distinct() shuffles the data across partitions, the order of elements in the resulting RDD is not guaranteed.

The distinct() operation can be applied to RDDs of any data type, including RDDs of integers, strings, tuples, and more. It can also be applied to RDDs of key-value pairs; in that case, the entire (key, value) pair must match for two elements to be considered duplicates. If you want to keep one element per key instead, use a key-based transformation such as reduceByKey.
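The behavior on key-value pairs is worth seeing concretely. This is a minimal sketch, assuming a SparkSession named `spark` is in scope as in the other examples:

```scala
// Assumes a SparkSession named `spark` (e.g. inside spark-shell).
val pairs = spark.sparkContext.parallelize(
  Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))

// distinct() compares the whole (key, value) tuple: ("a", 1) and ("a", 2)
// are both kept, and only the exact duplicate ("a", 1) is removed.
println(pairs.distinct().collect().sorted.mkString(", "))
// (a,1), (a,2), (b,3)

// To keep one value per key instead, reduce by key
// (here keeping the maximum value for each key).
println(pairs.reduceByKey(_ max _).collect().sorted.mkString(", "))
// (a,2), (b,3)
```

So distinct() deduplicates whole records, while reduceByKey (or similar pair-RDD operations) is the tool for deduplicating by key alone.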

Here is an example of how to use the distinct() operation in Spark:


// Create an RDD with duplicate elements
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 1, 2, 4))

// Apply the distinct operation
val distinctRdd = rdd.distinct()

// Print the distinct elements
println(distinctRdd.collect().mkString(", "))

// Output (order may vary):
1, 2, 3, 4

The output of the above code contains only the distinct elements 1, 2, 3, and 4 from the original RDD; because distinct() shuffles the data, the order of the elements may vary between runs.

It’s important to note that the distinct() transformation can be an expensive operation, especially if the RDD is large because it requires shuffling the data across the network to ensure that only unique elements are retained. Therefore, it’s best to use distinct() only when necessary and to avoid using it on very large RDDs if possible.
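One way to manage the shuffle cost is the optional numPartitions argument of distinct(), which controls how many partitions the shuffle produces. A minimal sketch, again assuming a SparkSession named `spark` is in scope:

```scala
// Assumes a SparkSession named `spark` (e.g. inside spark-shell).
// 100 numbers spread across 10 partitions, mapped to 10 distinct remainders.
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 10).map(_ % 10)

// distinct(numPartitions) controls the number of partitions after the
// shuffle; a small value is reasonable when few distinct values remain.
val distinctRdd = rdd.distinct(numPartitions = 4)

println(distinctRdd.getNumPartitions)  // 4
println(distinctRdd.count())           // 10 distinct remainders (0 through 9)
```

Reducing the post-shuffle partition count this way avoids scheduling many nearly empty tasks when the distinct result is much smaller than the input.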

3. Conclusion

In conclusion, to get the distinct values from a Spark RDD, use the distinct() transformation, which returns a new RDD containing only the distinct elements of the source RDD.

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.