PySpark leftsemi join is similar to an inner join, the difference being that a left semi join returns all columns from the left DataFrame/Dataset and ignores all columns from the right DataFrame. In other words, this join returns columns only from the left dataset for records that match the right dataset on the join expression; records that do not match the join expression are ignored from both the left and right datasets.
Alternatively, you can achieve the same output as a left semi join by selecting all columns from the left DataFrame on the result of an inner join; however, using the left semi join is more efficient.
To explain PySpark Left Semi Join, I will first create an emp DataFrame and a dept DataFrame. In these DataFrames, each value in the “emp_id” column is unique within the emp DataFrame, and each value in the “dept_id” column is unique within the dept DataFrame. Additionally, the “emp_dept_id” column in the emp DataFrame references the “dept_id” column in the dept DataFrame.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkbyexamples.com').getOrCreate()

# Emp Data
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
  (2, "Rose", 1, "2010", "20", "M", 4000),
  (3, "Williams", 1, "2010", "10", "M", 1000),
  (4, "Jones", 2, "2005", "10", "F", 2000),
  (5, "Brown", 2, "2010", "40", "", -1),
  (6, "Brown", 2, "2010", "50", "", -1)]
empColumns = ["emp_id", "name", "superior_emp_id", "year_joined",
  "emp_dept_id", "gender", "salary"]

# Create Emp DataFrame
empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.printSchema()
empDF.show(truncate=False)

# Dept Data
dept = [("Finance", 10),
  ("Marketing", 20),
  ("Sales", 30),
  ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]

# Create dept DataFrame
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
Yields below output.
# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
PySpark DataFrame Left Semi Join Example
A left semi join is a type of join operation in SQL that returns all the rows from the left DataFrame (or table) where there is a match in the right DataFrame (or table) based on the specified join condition. However, it does not include any columns from the right DataFrame in the result.
Here’s how a left semi join works:
- Join Condition: You specify a join condition based on which the left DataFrame is matched with the right DataFrame.
- Matching Rows: The left semi join returns all rows from the left DataFrame where there is a match in the right DataFrame based on the specified join condition.
- Excluded Columns: Unlike other join types, a left semi join does not include any columns from the right DataFrame in the result. It only includes columns from the left DataFrame.
Left semi joins are useful for filtering rows from the left DataFrame based on the existence of matching rows in the right DataFrame, without including any additional columns from the right DataFrame in the result.
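Conceptually, a left semi join is just an existence filter on the left-hand rows. The following plain-Python sketch (the row tuples and variable names here are illustrative, not the article's DataFrames) shows the same semantics: keep each left row whose join key appears in the set of right-side keys, and never attach any right-side columns:

```python
# Plain-Python sketch of left semi join semantics (illustrative data).
emp_rows = [(1, "Smith", "10"), (2, "Rose", "20"), (6, "Brown", "50")]
dept_ids = {"10", "20", "30", "40"}  # join keys present on the right side

# Keep left rows whose join key exists on the right; right columns are
# never added to the result, and unmatched left rows are dropped.
semi = [row for row in emp_rows if row[2] in dept_ids]
print(semi)  # [(1, 'Smith', '10'), (2, 'Rose', '20')]
```

The Brown row is dropped because its key "50" has no match on the right, which is exactly what the PySpark example below does with the unmatched emp_dept_id of 50.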
To perform a left semi join, use the leftsemi join type (Spark also accepts semi and left_semi as equivalent aliases).
# Left Semi Join Example
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"leftsemi") \
.show(truncate=False)
This join returns all the rows from the empDF DataFrame where there is a match in the deptDF DataFrame on the specified condition, which is the equality of the “emp_dept_id” column from empDF with the “dept_id” column from deptDF. However, it does not include any columns from the deptDF DataFrame in the result.
# Output:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
PySpark SQL Left Semi Join
Alternatively, you can run the left semi join query with PySpark SQL by registering the DataFrames as temporary views.
# Create temporary views
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
# SQL query
joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT SEMI JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(truncate=False)
This yields the same output as above.
Conclusion
In conclusion, the left semi join operation in PySpark provides a powerful mechanism for filtering rows from a DataFrame based on the existence of matching rows in another DataFrame, while excluding columns from the second DataFrame in the result. By utilizing the left semi join, analysts and data engineers can efficiently extract subsets of data from large datasets, retaining only the relevant rows that satisfy the specified join condition. This operation is particularly useful when there is a need to identify and retain records from one DataFrame that have corresponding matches in another DataFrame, without including any additional columns from the latter.
Hope you like it!!
Related Articles
- PySpark SQL Inner Join Dataframe Example
- PySpark SQL Left Outer Join Examples
- PySpark SQL Self Join Examples
- PySpark SQL Left Anti Join Examples
- PySpark SQL Full Outer Join with Example
- PySpark SQL Right Outer Join with Example
- PySpark SQL Types (DataType) with Examples
- PySpark SQL Date and Timestamp Functions