PySpark SQL Full Outer Join with Example

A full outer join in PySpark SQL combines rows from two tables based on a matching condition, including all rows from both tables. If a row in one table has no corresponding match in the other table, null values are filled in for the missing columns. This ensures that no data is excluded from the result set.


First, let’s create emp and dept DataFrames; I will use these to explain the full outer join.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Employee dataset; emp_dept_id is the key we will join on
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
       (2, "Rose", 1, "2010", "20", "M", 4000),
       (3, "Williams", 1, "2010", "10", "M", 1000),
       (4, "Jones", 2, "2005", "10", "F", 2000),
       (5, "Brown", 2, "2010", "40", "", -1),
       (6, "Brown", 2, "2010", "50", "", -1)]
empColumns = ["emp_id", "name", "superior_emp_id", "year_joined",
              "emp_dept_id", "gender", "salary"]

empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.printSchema()
empDF.show(truncate=False)

# Department dataset; dept_id is the key we will join on
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

You should see the following output in the console.



# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

PySpark DataFrame Full Outer Join Example

Use the join() transformation method with the join type set to outer, full, or fullouter; all three values are aliases for the same full outer join.

  • PySpark SQL full outer join combines data from two DataFrames, ensuring that all rows from both tables are included in the result set, regardless of matching conditions.
  • Rows that do not have corresponding matches in the other DataFrame are still included in the result, with null values filled in for missing columns.
  • This join type enables users to access all available data from both DataFrames, facilitating comprehensive data analysis and insights generation.
  • As it includes all rows from both DataFrames, the result set of a full outer join can be larger than that of other join types, potentially leading to increased memory usage.

# All three join type strings below produce the same full outer join
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "outer") \
    .show(truncate=False)
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "full") \
    .show(truncate=False)
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "fullouter") \
    .show(truncate=False)
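
If you want to verify that the three join-type strings are truly equivalent, here is a minimal sketch (exceptAll() is the standard DataFrame set-difference method; outerDF and fullDF are my own variable names):


# The set difference between the "outer" and "fullouter" results is empty,
# confirming the two strings request the same join type.
outerDF = empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "outer")
fullDF = empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "fullouter")
print(outerDF.exceptAll(fullDF).count())  # prints 0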

In our “emp” dataset, the “emp_dept_id” value of 50 lacks a corresponding record in the “dept” dataset, resulting in null values in the “dept” columns. Similarly, “dept_id” 30 has no record in the “emp” dataset, leading to null values in the “emp” columns.

The full outer join results in the output below.


# Output:
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
|null  |null    |null           |null       |null       |null  |null  |Sales    |30     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
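
To isolate just the unmatched rows on either side, a small sketch building on the join above is to filter where one side’s key is null (fullDF is my own variable name):


# Rows where either key is null had no match in the other DataFrame
fullDF = empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "fullouter")
fullDF.filter(empDF.emp_dept_id.isNull() | deptDF.dept_id.isNull()) \
    .show(truncate=False)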

Using PySpark SQL Outer, Full, Full Outer Join

Similarly, you can get the same result using a SQL query instead of the DataFrame API.


empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

joinDF2 = spark.sql("SELECT e.*, d.* FROM EMP e FULL OUTER JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(truncate=False)

This also returns the same output as above.
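
If you prefer a single department-id column in the SQL result, here is a minimal sketch (the CAST handles emp_dept_id being stored as a string in this dataset; COALESCE returns its first non-null argument):


# Merge the two key columns into one; the Sales and emp_id 6 rows
# still appear, with their null side filled from the non-null key.
spark.sql("""
    SELECT e.emp_id, e.name, d.dept_name,
           COALESCE(CAST(e.emp_dept_id AS INT), d.dept_id) AS dept_id
    FROM EMP e FULL OUTER JOIN DEPT d
    ON e.emp_dept_id = d.dept_id
""").show(truncate=False)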

Conclusion

In conclusion, the PySpark SQL full outer join is a powerful tool for combining data from two tables while ensuring that no information is lost. It includes all rows from both tables in the result set, regardless of whether there is a match based on the join condition. Null values are used to fill in missing data where there is no corresponding row in the other table.

This join type facilitates comprehensive data analysis by allowing users to retain all available information from both tables. However, it may result in a larger result set and should be used consciously based on specific analytical requirements.

Hope you like it!!
