PySpark SQL Left Semi Join Example

A PySpark leftsemi join is similar to an inner join, with one difference: the left semi join returns all columns from the left DataFrame/Dataset and none from the right DataFrame. In other words, this join returns only the left dataset's columns for the records that match the right dataset on the join expression; records that do not match the join expression are dropped from the result.


Alternatively, you can achieve the same output as a left semi join by performing an inner join and then selecting only the left DataFrame's columns; however, the left semi join is more efficient because Spark never needs to carry the right DataFrame's columns through the query plan.

To explain the PySpark left semi join, I will first create emp and dept DataFrames. In these DataFrames, each value in the “emp_id” column is unique within the emp DataFrame, and each value in the “dept_id” column is unique within the dept DataFrame. Additionally, the “emp_dept_id” column in the emp DataFrame references the “dept_id” column in the dept DataFrame.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkbyexamples.com').getOrCreate()

# Emp data
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
       (2, "Rose", 1, "2010", "20", "M", 4000),
       (3, "Williams", 1, "2010", "10", "M", 1000),
       (4, "Jones", 2, "2005", "10", "F", 2000),
       (5, "Brown", 2, "2010", "40", "", -1),
       (6, "Brown", 2, "2010", "50", "", -1)]
empColumns = ["emp_id", "name", "superior_emp_id", "year_joined",
              "emp_dept_id", "gender", "salary"]

# Create emp DataFrame
empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.printSchema()
empDF.show(truncate=False)

# Dept data
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]

# Create dept DataFrame
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields the below output.


# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

PySpark DataFrame Left Semi Join Example

A left semi join is a type of join operation in SQL that returns all the rows from the left DataFrame (or table) where there is a match in the right DataFrame (or table) based on the specified join condition. However, it does not include any columns from the right DataFrame in the result.

Here’s how a left semi join works:

  1. Join Condition: You specify a join condition based on which the left DataFrame is matched with the right DataFrame.
  2. Matching Rows: The left semi join returns all rows from the left DataFrame where there is a match in the right DataFrame based on the specified join condition.
  3. Excluded Columns: Unlike other join types, a left semi join does not include any columns from the right DataFrame in the result. It only includes columns from the left DataFrame.

Left semi joins are useful for filtering rows from the left DataFrame based on the existence of matching rows in the right DataFrame, without including any additional columns from the right DataFrame in the result.

To perform a left semi join, pass "leftsemi" as the join type.


# Left Semi Join Example
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftsemi") \
    .show(truncate=False)

This join returns all the rows from the empDF DataFrame where there is a match in the deptDF DataFrame on the condition specified, which is the equality of the “emp_dept_id” column from empDF with the “dept_id” column from deptDF. However, it does not include any columns from the deptDF DataFrame in the result.


# Output:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

PySpark SQL Left Semi Join

Alternatively, you can run the left semi join query with PySpark SQL by registering the DataFrames as temporary views.


# Create temporary views
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

# SQL query
joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT SEMI JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(truncate=False)

This yields the same output as above.

Conclusion

In conclusion, the left semi join operation in PySpark provides a powerful mechanism for filtering rows from a DataFrame based on the existence of matching rows in another DataFrame, while excluding columns from the second DataFrame in the result. By utilizing the left semi join, analysts and data engineers can efficiently extract subsets of data from large datasets, retaining only the relevant rows that satisfy the specified join condition. This operation is particularly useful when there is a need to identify and retain records from one DataFrame that have corresponding matches in another DataFrame, without including any additional columns from the latter.

Hope you like it!
