You are currently viewing Spark SQL Right Outer Join with Example

Spark SQL Right Outer Join returns all rows from the right DataFrame regardless of math found on the left DataFrame, when the join expression doesn’t match, it assigns null for that record and drops records from the left where match not found.

In this Spark article, I will explain how to do Right Outer Join (right, rightouter, right_outer) on two DataFrames with Scala Example.

Right Outer Join behaves exactly opposite to Left Join or Left Outer Join,

Before we jump into Spark Right Outer Join examples, first, let’s create an emp and dept DataFrame’s. here, column emp_id is unique on emp and dept_id is unique on the dept dataset’s and emp_dept_id from emp has a reference to dept_id on the dept dataset.


import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
       .appName("sparkbyexamples.com")
       .master("local")
       .getOrCreate()

val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
    (2,"Rose",1,"2010","20","M",4000),
    (3,"Williams",1,"2010","10","M",1000),
    (4,"Jones",2,"2005","10","F",2000),
    (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
  )
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
       "emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
  empDF.show(false)

val dept = Seq(("Finance",10),
    ("Marketing",20),
    ("Sales",30),
    ("IT",40)
  )

val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF("deptColumns")
deptDF.show(false)

This prints emp and dept DataFrame to the console.


#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

#Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Spark DataFrame Right Outer Join Example

Below is an example of Right Outer Join using Spark DataFrame


empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"right")
   .show(false)
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"rightouter")
   .show(false)

From our example, the right dataset dept_id 30 doesn’t have it on the left dataset emp hence, this record contains null on emp columns. and emp_dept_id 60 dropped as a match not found on left. Below is the result of the above Join expression.


+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|null  |null    |null           |null       |null       |null  |null  |Sales    |30     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

Using Spark SQL Right Outer Join

Let’s see how use Right Outer Join on Spark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.


empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

joinDF2 = spark.sql("SELECT e.* FROM EMP e RIGHT OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id") 
  .show(truncate=False)

This also returns the same output as above.

Conclusion

In this Spark article, you have learned Right Outer Join is the opposite of the Left Outer Join and is used to get all rows from the right dataset regardless of math found on the left dataset. When join expression doesn’t match, it assigns null for that record and drops records from left where match not found.

Hope you Like it !!

References:

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply