Spark/PySpark RDD join supports all basic join types: INNER, LEFT, RIGHT, and OUTER JOIN. Spark RDD joins are wider transformations that result in data shuffling over the network; hence they can cause serious performance problems when not designed with care.
In order to join the data, Spark needs it to be present on the same partition. The default join process in Spark is called a shuffled hash join. The shuffled hash join ensures that data on each partition has the same keys by partitioning the second dataset with the same default partitioner as the first.
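As a sketch of this co-partitioning idea, you can pre-partition both RDDs with the same partitioner before joining, so matching keys already live on the same partitions. This is an illustrative example, not from the original text; the partitioner choice (a `HashPartitioner` with 4 partitions) is an assumption, and a live `SparkContext` (`sc`) is assumed to exist:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Assumption: `sc` is an existing SparkContext (e.g. spark.sparkContext).
// Partition both RDDs with the SAME partitioner so that equal keys
// land on the same partition number on both sides.
val partitioner = new HashPartitioner(4)

val rdd1: RDD[(String, Int)] =
  sc.parallelize(Seq(("a", 55), ("b", 56), ("c", 57))).partitionBy(partitioner)
val rdd2: RDD[(String, Int)] =
  sc.parallelize(Seq(("a", 60), ("b", 65), ("d", 61))).partitionBy(partitioner)

// Because both sides already share a partitioner, the join can reuse
// the existing partitioning instead of shuffling either dataset again.
val joined = rdd1.join(rdd2)
```

Pre-partitioning pays off when the same RDD participates in several joins, since the shuffle cost is incurred once at `partitionBy` rather than on every join.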
1. RDD Join Types
Below is the list of all Spark RDD Join Types.
- Inner Join
- Left Outer Join
- Right Outer Join
- Cartesian/Cross Join
Join is a transformation and it is available in the class org.apache.spark.rdd.PairRDDFunctions.
2. RDD Inner Join
//Syntax Spark RDD Inner join
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
A Spark inner join returns an RDD containing all pairs of elements with matching keys (k) in self and other.
Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
Example
//Spark RDD Inner join
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect
//Output
res1: Array[(String, (Int, Int))] = Array((a,(55,60)), (b,(56,65)))
In the above example, we have created two RDDs (rdd1 and rdd2). rdd1 has keys a, b, and c with respective values 55, 56, and 57. Similarly, rdd2 has keys a, b, and d with respective values 60, 65, and 61. Both RDDs share the keys a and b, so the inner join between them results in tuples for the matching keys (a and b), i.e., (a,(55,60)) and (b,(56,65)).
Using the same RDDs, the left outer, right outer, and cartesian/cross joins are explained below.
3. RDD Left Outer Join
//Syntax Spark RDD leftOuter join
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Example
//Spark RDD leftOuter join
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.leftOuterJoin(rdd2)
joinrdd.collect
//output
res2: Array[(String, (Int, Option[Int]))] = Array((a,(55,Some(60))), (b,(56,Some(65))), (c,(57,None)))
4. RDD Right Outer Join
//Syntax Spark RDD RightOuter join
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
Perform a right outer join of self and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in self, or the pair (k, (None, w)) if no elements in self have key k.
Example
//Spark RDD RightOuter join
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.rightOuterJoin(rdd2)
joinrdd.collect
//Output
res3: Array[(String, (Option[Int], Int))] = Array((a,(Some(55),60)), (b,(Some(56),65)), (d,(None,61)))
5. RDD Cartesian Join
//Syntax Spark RDD Cartesian join
def cartesian[U](other: RDD[U]): RDD[(T, U)]
Spark RDD cartesian returns the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b), where a is in self and b is in other. Note that cartesian is defined on any RDD, not just pair RDDs, so it pairs whole elements rather than matching on keys.
Example
//Spark RDD Cartesian join
val sc = spark.sparkContext
val rdd1 = sc.parallelize(Seq(("a",55),("b",56),("c",57)))
val rdd2 = sc.parallelize(Seq(("a",60),("b",65),("d",61)))
val joinrdd = rdd1.cartesian(rdd2)
joinrdd.collect
//Output
res4: Array[((String, Int), (String, Int))] = Array(((a,55),(a,60)), ((a,55),(b,65)), ((a,55),(d,61)), ((b,56),(a,60)), ((b,56),(b,65)), ((b,56),(d,61)), ((c,57),(a,60)), ((c,57),(b,65)), ((c,57),(d,61)))
6. Conclusion
In this tutorial, you have learned the Spark RDD join types INNER, LEFT OUTER, RIGHT OUTER, and CROSS, along with their syntax and examples in Scala.
Related Articles
- Apache Spark RDD Tutorial | Learn with Scala Examples
- Spark RDD Transformations with examples
- Spark RDD Actions with examples
- Different ways to create Spark RDD
- Create a Spark RDD using Parallelize