Spark RDD join with Examples

Spark/PySpark RDD join supports all basic join types like INNER, LEFT, RIGHT and OUTER JOIN. Spark RDD joins are wider transformations that result in data shuffling over the network, hence they can cause serious performance issues when not designed with care.

In order to join the data, Spark needs it to be present on the same partition. The default join implementation in Spark is a shuffled hash join, which ensures that data on each partition has the same keys by partitioning the second dataset with the same default partitioner as the first.
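The co-partitioning idea can be sketched in plain Scala: a key is routed to partition hash(key) mod numPartitions, so applying the same rule to both datasets guarantees that matching keys land on the same partition and can be joined locally. This is only an illustrative sketch (partitionFor and numPartitions are made-up names here), not Spark's internal implementation.

```scala
// Illustrative sketch of hash co-partitioning (not Spark internals):
// both datasets route a key through the same partitioning rule, so
// matching keys always end up on the same partition index.
val numPartitions = 4
def partitionFor(key: String): Int =
  math.abs(key.hashCode % numPartitions)

val left  = Seq(("a", 55), ("b", 56), ("c", 57))
val right = Seq(("a", 60), ("b", 65), ("d", 61))

// Group each dataset by its target partition; key "a" from both
// datasets maps to the same index, so the two (a, _) records meet.
val leftByPartition  = left.groupBy  { case (k, _) => partitionFor(k) }
val rightByPartition = right.groupBy { case (k, _) => partitionFor(k) }
```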

1. RDD Join Types

Below is the list of all Spark RDD Join Types.

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Cartesian/Cross Join

Join is a transformation and it is available in the class org.apache.spark.rdd.PairRDDFunctions

2. RDD Inner Join


//Syntax Spark RDD Inner join 
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Spark inner join returns an RDD containing all pairs of elements with matching keys (k) in self and other.

Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

Example


//Spark RDD Inner join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect

//Output
res1: Array[(String, (Int, Int))] = Array((a,(55,60)), (b,(56,65)))

In the above example, we have created two RDDs (rdd1 and rdd2). rdd1 has keys a, b, and c with the respective values 55, 56, and 57. Similarly, rdd2 has keys a, b, and d with the respective values 60, 65, and 61. Both RDDs have the common keys a and b, so the inner join between them results in tuples for the matching keys (a and b),
i.e. (a,(55,60)) and (b,(56,65)).

Using the same RDDs, below we have the left outer, right outer, and cartesian/cross joins explained.

3. RDD Left Outer Join


//Syntax Spark RDD leftOuter join 
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k.

Example


//Spark RDD leftOuter join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.leftOuterJoin(rdd2)
joinrdd.collect
//output
res2: Array[(String, (Int, Option[Int]))] = Array((a,(55,Some(60))), (b,(56,Some(65))), (c,(57,None)))
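Since the right-side value comes back as an Option, you usually unwrap it before further processing, for example with getOrElse. A minimal sketch continuing the joinrdd from the example above (the default value 0 is an arbitrary choice for this illustration):

```scala
//Unwrap the Option, substituting 0 when the right side had no match (sketch)
val withDefaults = joinrdd.mapValues { case (v, wOpt) => (v, wOpt.getOrElse(0)) }
withDefaults.collect
//the unmatched key "c" becomes (c,(57,0))
```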

4. RDD Right Outer Join


//Syntax Spark RDD RightOuter join 
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

Perform a right outer join of self and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in self, or the pair (k, (None, w)) if no elements in self have key k.

Example


//Spark RDD RightOuter join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.rightOuterJoin(rdd2)
joinrdd.collect
//Output
res3: Array[(String, (Option[Int], Int))] = Array((a,(Some(55),60)), (b,(Some(56),65)), (d,(None,61)))

5. RDD Cartesian Join


//Syntax Spark RDD Cartesian join 
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Spark RDD cartesian returns the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

Example


//Spark RDD Cartesian join 
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.cartesian(rdd2)
joinrdd.collect
//Output
res4: Array[((String, Int), (String, Int))] = Array(((a,55),(a,60)), ((a,55),(b,65)), ((a,55),(d,61)), ((b,56),(a,60)), ((b,56),(b,65)), ((b,56),(d,61)), ((c,57),(a,60)), ((c,57),(b,65)), ((c,57),(d,61)))

6. Conclusion

In this tutorial, you have learned the Spark RDD join types INNER, LEFT OUTER, RIGHT OUTER, and CROSS joins, with syntax and examples in Scala.

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.
