Spark RDD join with Examples

Spark/PySpark RDD join supports all basic join types like INNER, LEFT, RIGHT and OUTER JOIN. Spark RDD joins are wider transformations that result in data shuffling over the network, hence they can cause serious performance issues when not designed with care.

In order to join the data, Spark needs it to be present on the same partition. The default join implementation in Spark is a shuffled hash join, which ensures that data on each partition has the same keys by partitioning the second dataset with the same default partitioner as the first.
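The co-partitioning idea can be sketched in plain Scala: a key is routed to partition hash(key) mod numPartitions, so applying the same rule to both datasets guarantees that matching keys land on the same partition and can be joined locally. This is only an illustrative sketch (partitionFor and numPartitions are made-up names here), not Spark's internal implementation.

```scala
// Illustrative sketch of hash co-partitioning (not Spark internals):
// both datasets route a key through the same partitioning rule, so
// matching keys always end up on the same partition index.
val numPartitions = 4
def partitionFor(key: String): Int =
  math.abs(key.hashCode % numPartitions)

val left  = Seq(("a", 55), ("b", 56), ("c", 57))
val right = Seq(("a", 60), ("b", 65), ("d", 61))

// Group each dataset by its target partition; key "a" from both
// datasets maps to the same index, so the two (a, _) records meet.
val leftByPartition  = left.groupBy  { case (k, _) => partitionFor(k) }
val rightByPartition = right.groupBy { case (k, _) => partitionFor(k) }
```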

1. RDD Join Types

Below is the list of all Spark RDD Join Types.

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Cartesian/Cross Join

Join is a transformation and it is available in the class org.apache.spark.rdd.PairRDDFunctions

2. RDD Inner Join


//Syntax Spark RDD Inner join 
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Spark inner join returns an RDD containing all pairs of elements with matching keys (k) in self and other.

Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

Example


//Spark RDD Inner join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect

//Output
res1: Array[(String, (Int, Int))] = Array((a,(55,60)), (b,(56,65)))

In the above example, we have created two RDDs (rdd1 and rdd2). rdd1 has keys a, b, and c with the respective values 55, 56, and 57. Similarly, rdd2 has keys a, b, and d with the respective values 60, 65, and 61. Both RDDs have the common keys a and b, so the inner join between them results in tuples for the matching keys (a and b),
i.e. (a,(55,60)) and (b,(56,65)).

Using the same RDDs, below we have the left outer, right outer, and cartesian/cross joins explained.

3. RDD Left Outer Join


//Syntax Spark RDD leftOuter join 
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k.

Example


//Spark RDD leftOuter join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.leftOuterJoin(rdd2)
joinrdd.collect
//output
res2: Array[(String, (Int, Option[Int]))] = Array((a,(55,Some(60))), (b,(56,Some(65))), (c,(57,None)))
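Since the right-side value comes back as an Option, you usually unwrap it before further processing, for example with getOrElse. A minimal sketch continuing the joinrdd from the example above (the default value 0 is an arbitrary choice for this illustration):

```scala
//Unwrap the Option, substituting 0 when the right side had no match (sketch)
val withDefaults = joinrdd.mapValues { case (v, wOpt) => (v, wOpt.getOrElse(0)) }
withDefaults.collect
//the unmatched key "c" becomes (c,(57,0))
```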

4. RDD Right Outer Join


//Syntax Spark RDD RightOuter join 
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

Perform a right outer join of self and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in self, or the pair (k, (None, w)) if no elements in self have key k.

Example


//Spark RDD RightOuter join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.rightOuterJoin(rdd2)
joinrdd.collect
//Output
res3: Array[(String, (Option[Int], Int))] = Array((a,(Some(55),60)), (b,(Some(56),65)), (d,(None,61)))

5. RDD Cartesian Join


//Syntax Spark RDD Cartesian join 
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Spark RDD cartesian returns the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

Example


//Spark RDD Cartesian join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.cartesian(rdd2)
joinrdd.collect
//Output
res4: Array[((String, Int), (String, Int))] = Array(((a,55),(a,60)), ((a,55),(b,65)), ((a,55),(d,61)), ((b,56),(a,60)), ((b,56),(b,65)), ((b,56),(d,61)), ((c,57),(a,60)), ((c,57),(b,65)), ((c,57),(d,61)))

6. Conclusion

In this tutorial, you have learned the Spark RDD join types INNER, LEFT OUTER, RIGHT OUTER, and CROSS joins, with syntax and examples in Scala.

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.
