In this Spark article, I will explain how to do a Full Outer Join (outer, full, fullouter, full_outer) on two DataFrames with a Scala example and with Spark SQL.
Before we jump into Spark Full Outer Join examples, first, let's create emp and dept DataFrames. Here, column emp_id is unique on emp and dept_id is unique on dept, and emp_dept_id from emp has a reference to dept_id on the dept dataset.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("sparkbyexamples.com")
.master("local")
.getOrCreate()
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
This prints the emp and dept DataFrames to the console.
#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
#Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Spark DataFrame Full Outer Join Example
In order to use Full Outer Join on a Spark SQL DataFrame, you can use either outer, full, fullouter, or full_outer as the join type.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter")
.show(false)
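The full_outer alias mentioned above works the same way and produces the same result; for example:
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full_outer")
.show(false)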
In our emp dataset, emp_dept_id with value 50 doesn't have a matching record in dept, hence the dept columns are null for that row, and dept_id 30 doesn't have a matching record in emp, hence you see nulls on the emp columns. The join expressions above yield the output below.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|6 |Brown |2 |2010 |50 | |-1 |null |null |
|null |null |null |null |null |null |null |Sales |30 |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
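As a quick sketch beyond the original example, you can isolate the unmatched rows on either side by filtering on the null join keys; fullJoinDF below is just an illustrative name for the join result shown above.
// fullJoinDF is an illustrative name for the full outer join result from the examples above
val fullJoinDF = empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "fullouter")
// emp rows with no matching dept (dept columns are null), e.g. emp_dept_id 50
fullJoinDF.filter(deptDF("dept_id").isNull).show(false)
// dept rows with no matching emp (emp columns are null), e.g. dept_id 30
fullJoinDF.filter(empDF("emp_id").isNull).show(false)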
Using Spark SQL Full Outer Join
Let's see how to use Outer, Full, and Full Outer Join in a Spark SQL expression. In order to do so, first, let's create temporary views for the EMP and DEPT tables.
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
val joinDF2 = spark.sql("SELECT e.*, d.* FROM EMP e FULL OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(false)
This also returns the same output as above.
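As a side note, Spark SQL also accepts FULL JOIN without the OUTER keyword; the query below is an equivalent sketch (joinDF3 is just an illustrative name).
val joinDF3 = spark.sql("SELECT e.*, d.* FROM EMP e FULL JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF3.show(false)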
Conclusion
In this Spark article, you have learned that Full Outer Join (outer, full, fullouter, full_outer) returns all rows from both DataFrames/Datasets regardless of whether a match is found; where the join expression doesn't match, it returns null on the respective record's columns.
Hope you like it!!
Related Articles
- Spark SQL Inner Join Dataframe Example
- Spark SQL Left Outer Join Examples
- Spark SQL Self Join Examples
- Spark SQL Left Anti Join Examples
- Spark SQL Right Outer Join with Example
- Spark SQL Left Semi Join Example
- Spark SQL like() Using Wildcard Example
- Spark SQL – Select Columns From DataFrame