PySpark SQL Left Outer Join (left, left outer, left_outer) returns all rows from the left DataFrame regardless of the match found on the right DataFrame. When the join expression doesn’t match, it assigns null for that record, and when a match is not found it drops records from the right DataFrame.
In this PySpark article, I will explain how to do Left Outer Join (left, leftouter, left_outer) on two DataFrames with Python Example.
Before we jump into PySpark Left Outer Join examples, first, let’s create an emp
and dept
DataFrames. here, column emp_id
is unique on emp and dept_id
is unique on the dept datasets and emp_dept_id from emp has a reference to dept_id on dept dataset.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
This prints emp and dept DataFrame to the console.
# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
PySpark DataFrame Left Outer Join Example
Below is an example of how to use Left Outer Join (left
, leftouter
, left_outer
) on PySpark DataFrame.
# Left outer join
empDF.join(deptDF,empDF.emp_dept_id === deptDF.dept_id,"left") \
.show(false)
# or using leftouter
empDF.join(deptDF,empDF.emp_dept_id === deptDF.dept_id,"leftouter") \
.show(false)
From our dataset, emp_dept_id
6o doesn’t have a record on dept
dataset hence, this record contains null on dept
columns (dept_name & dept_id). and dept_id
30 from dept
dataset dropped from the results. Below is the result of the above Join expression.
# Output:
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
|6 |Brown |2 |2010 |50 | |-1 |null |null |
+------+--------+---------------+-----------+-----------+------+
Using PySpark SQL Left Outer Join
Let’s see how to use Left Outer Join on PySpark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.
# Using SQL
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id") \
.show(truncate=False)
This also returns the same output as above.
Conclusion
In this PySpark article, you have learned Left Outer Join is used to get all rows from the left dataset regardless of the match found on the right dataset.
Hope you Like it !!
Related Articles
- PySpark SQL Inner Join Dataframe Example
- PySpark SQL Self Join Examples
- PySpark SQL Left Anti Join Examples
- PySpark SQL Full Outer Join with Example
- PySpark SQL Left Semi Join Example
- PySpark SQL Right Outer Join with Example
- PySpark MapType (Dict) Usage with Examples
- PySpark alias() Column & DataFrame Examples
- PySpark repartition() – Explained with Examples
- PySpark SQL Self Join With Example