Spark SQL Full Outer Join with Example

In this Spark article, I will explain how to do a Full Outer Join (outer, full, fullouter, full_outer) on two DataFrames, with Scala and Spark SQL examples.

Before we jump into Spark Full Outer Join examples, let’s first create the emp and dept DataFrames. Here, the column emp_id is unique in emp and dept_id is unique in dept, and emp_dept_id from emp references dept_id in dept.


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sparkbyexamples.com")
  .master("local")
  .getOrCreate()

// Implicit conversions such as Seq.toDF()
import spark.implicits._

val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
  (2,"Rose",1,"2010","20","M",4000),
  (3,"Williams",1,"2010","10","M",1000),
  (4,"Jones",2,"2005","10","F",2000),
  (5,"Brown",2,"2010","40","",-1),
  (6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
  "emp_dept_id","gender","salary")
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)

val dept = Seq(("Finance",10),
  ("Marketing",20),
  ("Sales",30),
  ("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)

This prints emp and dept DataFrames to the console.


#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

#Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Spark DataFrame Full Outer Join Example

To perform a Full Outer Join on Spark SQL DataFrames, pass any of outer, full, fullouter, or full_outer as the join type.


empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "outer")
  .show(false)
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "full")
  .show(false)
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "fullouter")
  .show(false)
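
The full_outer alias mentioned above behaves the same way:

// "full_outer" is another accepted alias for the same join type
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "full_outer")
  .show(false)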

In our emp dataset, emp_dept_id 50 doesn’t have a matching record in dept, hence the dept columns are null for that row; likewise, dept_id 30 doesn’t have a matching record in emp, hence the emp columns are null. The above join expressions yield the output below.


+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
|null  |null    |null           |null       |null       |null  |null  |Sales    |30     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
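
Because each unmatched row carries nulls on one side, it is often handy to merge the two key columns into one. Here is a minimal sketch of that idea (the dept_key column name is my own, not from the original example); a cast is needed because emp_dept_id is a string while dept_id is an integer.

import org.apache.spark.sql.functions.coalesce

empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "full")
  // Take emp_dept_id when present, otherwise fall back to dept_id
  .withColumn("dept_key", coalesce(empDF("emp_dept_id"), deptDF("dept_id").cast("string")))
  .show(false)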

Using Spark SQL Full Outer Join

Let’s see how to use a Full Outer Join in a Spark SQL expression. To do so, first create temporary views for the EMP and DEPT DataFrames.


empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

val joinDF2 = spark.sql("SELECT e.*, d.* FROM EMP e FULL OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(false)

This also returns the same output as above.
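
As a quick sanity check (a sketch I’m adding, not part of the original example), you can isolate the rows that had no match on one side by filtering on the null keys; this should return only the Brown/50 and Sales/30 rows:

spark.sql(
  """SELECT e.*, d.*
    |FROM EMP e FULL OUTER JOIN DEPT d
    |ON e.emp_dept_id == d.dept_id
    |WHERE e.emp_id IS NULL OR d.dept_id IS NULL""".stripMargin)
  .show(false)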

Conclusion

In this Spark article, you have learned that Full Outer Join (outer, full, fullouter, full_outer) returns all rows from both DataFrames/Datasets, whether or not a match is found; where the join expression doesn’t match, it returns nulls in the respective record’s columns.

Hope you like it!!


