
PySpark SQL Right Outer Join returns all rows from the right DataFrame regardless of math found on the left DataFrame, when the join expression doesn’t match, it assigns null for that record and drops records from left where match not found.
In this PySpark article, I will explain how to do Right Outer Join (right, right outer) on two DataFrames with PySpark Example.
Right Outer Join behaves exactly opposite to Left Join or Left Outer Join,
Before we jump into PySpark Right Outer Join examples, first, let’s create an emp
and dept
DataFrame’s. here, column emp_id
is unique on emp and dept_id
is unique on the dept dataset’s and emp_dept_id from emp has a reference to dept_id on the dept dataset.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
This prints emp and dept DataFrame to console.
Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
PySpark DataFrame Right Outer Join Example
Below is an example of Right Outer Join using on PySpark DataFrame
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"right") \
.show(truncate=False)
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"rightouter") \
.show(truncate=False)
From our example, the right dataset dept_id
30 doesn’t have it on the left dataset emp
hence, this record contains null on emp
columns. and emp_dept_id
60 dropped as a match not found on left. Below is the result of the above Join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
|null |null |null |null |null |null |null |Sales |30 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
+------+--------+---------------+-----------+-----------+-----
Using PySpark SQL Right Outer Join
Let’s see how use Right Outer Join on PySpark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.
<pre><code class="language-python">
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
joinDF2 = spark.sql("SELECT e.* FROM EMP e RIGHT OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id") \
.show(truncate=False)
</code></pre>
This also returns same output as above.
Conclusion
In this PySpark article, you have learned Right Outer Join is opposite of the Left Outer Join and is used to get all rows from the right dataset regardless of math found on the left dataset, when join expression doesn’t match, it assigns null for that record and drops records from left where match not found.
Hope you Like it !!
Related Articles
- PySpark SQL Inner Join Dataframe Example
- PySpark SQL Left Outer Join Examples
- PySpark SQL Self Join Examples
- PySpark SQL Left Anti Join Examples
- PySpark SQL Full Outer Join with Example
- PySpark SQL Left Semi Join Example
- PySpark alias() Column & DataFrame Examples
- PySpark repartition() – Explained with Examples