PySpark SQL Left Outer Join with Example

PySpark SQL Left Outer Join, also known as a left join, combines rows from two DataFrames based on a related column. All rows from the left DataFrame (the “left” side) are included in the result DataFrame, regardless of whether there is a matching row in the right DataFrame (the “right” side).
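
Before diving into the full example, the sketch below shows the general shape of the DataFrame join API. This is a minimal illustration, not part of the example that follows; the DataFrames, data, and appName here are hypothetical placeholders.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left-join-sketch").getOrCreate()

# Two tiny DataFrames purely for illustration (hypothetical data)
left_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
right_df = spark.createDataFrame([(1, "x")], ["id", "extra"])

# "left", "leftouter", and "left_outer" are interchangeable values for how
left_df.join(right_df, on=left_df.id == right_df.id, how="left").show()
# id=2 has no match in right_df, so its "extra" column is NULL in the result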


To better understand the PySpark left outer join, let’s first create emp and dept DataFrames. Each record in the “emp” dataset has a unique “emp_id”, while each record in the “dept” dataset has a unique “dept_id”. The “emp_dept_id” column in the “emp” dataset serves as a reference to the “dept_id” column in the “dept” dataset.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Employee dataset; emp_dept_id references dept_id in the dept dataset
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
       (2, "Rose", 1, "2010", "20", "M", 4000),
       (3, "Williams", 1, "2010", "10", "M", 1000),
       (4, "Jones", 2, "2005", "10", "F", 2000),
       (5, "Brown", 2, "2010", "40", "", -1),
       (6, "Brown", 2, "2010", "50", "", -1)]
empColumns = ["emp_id", "name", "superior_emp_id", "year_joined",
              "emp_dept_id", "gender", "salary"]

empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.printSchema()
empDF.show(truncate=False)

# Department dataset
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

This displays the emp and dept DataFrames to the console, as shown below.


# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

PySpark Left Outer Join Example

When you apply a left outer join on two DataFrames:

  • All rows from the left table are included in the result set.
  • Rows from the right table are included in the result set only if they have a matching value in the join column with the left table.
  • If there is no match found in the right table for a row in the left table, NULL values are filled in for the columns from the right table in the result set.

Below is an example of Left Outer Join on PySpark DataFrame.


# Left outer join
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "left") \
    .show(truncate=False)

# Using the equivalent "leftouter" join type
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftouter") \
    .show(truncate=False)

From our dataset, the “emp_dept_id” 50 does not have a corresponding record in the “dept” dataset. Consequently, this record contains NULL values in the “dept_name” and “dept_id” columns. Additionally, the “dept_id” 30 from the “dept” dataset has been omitted from the results. Below is the outcome of the above join expression.

Here, empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "left") performs a left join between the two DataFrames empDF and deptDF based on the equality condition between the “emp_dept_id” column from empDF and the “dept_id” column from deptDF.


# Output:
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
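
Because unmatched left-side rows carry NULLs in the right-side columns, a left outer join is also a convenient way to find rows that have no match. The snippet below is a small sketch building on the same empDF and deptDF: it filters the joined result down to employees whose “emp_dept_id” has no entry in the dept dataset.


from pyspark.sql.functions import col

# Employees whose emp_dept_id has no matching dept_id;
# after the left join, their dept columns are NULL
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "left") \
    .filter(col("dept_id").isNull()) \
    .select("emp_id", "name", "emp_dept_id") \
    .show(truncate=False)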

Using PySpark SQL Left Outer Join

To run a LEFT OUTER JOIN as an SQL query in PySpark, first create a temporary view from each DataFrame using createOrReplaceTempView(). Then, use the sql() function from SparkSession to run the query.


# Using SQL
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

joinDF2 = spark.sql("SELECT e.*, d.* FROM EMP e LEFT OUTER JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(truncate=False)

This returns the same output as above.
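
If you only need a subset of columns, you can name them explicitly in the SQL query instead of selecting everything. A minimal sketch using the same EMP and DEPT views:


# Select only the columns of interest from the joined result
spark.sql("""
    SELECT e.emp_id, e.name, d.dept_name
    FROM EMP e LEFT OUTER JOIN DEPT d
    ON e.emp_dept_id = d.dept_id
""").show(truncate=False)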

Conclusion

In conclusion, the left outer join operation in PySpark SQL offers a versatile method for combining data from two DataFrames: all rows from the left DataFrame are retained in the result set, and where no match is found in the right DataFrame, the right-side columns are filled with NULL values.

Hope you like it!!
