In PySpark SQL, a leftanti
join selects only rows from the left table that do not have a match in the right table. Here’s how the leftanti join works:
Advertisements
- It compares each row from the left table with every row from the right table based on the specified join condition.
- If a row from the left table does not have a matching row in the right table based on the join condition, it includes that row in the result set.
- The result set excludes rows from the left table that have a matching row in the right table.
leftanti
join does the exact opposite of the leftsemi
join.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
Yields below output.
# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
PySpark Left Anti Join (leftanti) Example
In order to use left anti join, you can use either anti
, leftanti
, left_anti
as a join type.
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"leftanti") \
.show(truncate=False)
Yields below output
+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6 |Brown|2 |2010 |50 | |-1 |
+------+-----+---------------+-----------+-----------+------+------+
PySpark SQL Left Anti Join
Following is PySpark SQL expression of Left Anti Join on
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT ANTI JOIN DEPT d ON e.emp_dept_id == d.dept_id") \
.show(truncate=False)
This also returns the same output as above.
Conclusion
In this PySpark article, you have learned left anti join which is used to get only columns from the left DataFrame for non-matched records.
Hope you Like it !!
Related Articles
- PySpark SQL Inner Join Dataframe Example
- PySpark SQL Left Outer Join Examples
- PySpark SQL Self Join Examples
- PySpark SQL Full Outer Join with Example
- PySpark SQL Left Semi Join Example
- PySpark SQL Right Outer Join with Example
- PySpark repartition() – Explained with Examples
- PySpark alias() Column & DataFrame Examples