When you join two DataFrames using a Left Anti Join (leftanti), the result contains only the rows from the left DataFrame that have no match in the right DataFrame, and only the left DataFrame's columns.
In this PySpark article, I will explain how to do a Left Anti Join (leftanti/left_anti) on two DataFrames, with PySpark and SQL query examples. The leftanti join does the exact opposite of the leftsemi join.
Before we jump into PySpark Left Anti Join examples, first, let's create emp and dept DataFrames. Here, column emp_id is unique in emp, dept_id is unique in dept, and emp_dept_id in emp references dept_id in dept.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
This prints the emp and dept DataFrames to the console.
# Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
# Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
PySpark Left Anti Join (leftanti) Example
To perform a left anti join, you can pass any of anti, leftanti, or left_anti as the join type.
This yields the output below.
+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6 |Brown|2 |2010 |50 | |-1 |
+------+-----+---------------+-----------+-----------+------+------+
Using PySpark SQL Left Anti Join
Let's see how to use a Left Anti Join in a PySpark SQL expression. To do so, first create temporary views for the EMP and DEPT tables.
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT ANTI JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(truncate=False)
This also returns the same output as above.
Conclusion
In this PySpark article, you have learned that a left anti join returns only the rows from the left DataFrame that have no match in the right DataFrame, keeping only the left DataFrame's columns.
Hope you like it!
Related Articles
- PySpark SQL Inner Join Dataframe Example
- PySpark SQL Left Outer Join Examples
- PySpark SQL Self Join Examples
- PySpark SQL Full Outer Join with Example
- PySpark SQL Left Semi Join Example
- PySpark SQL Right Outer Join with Example
- PySpark repartition() – Explained with Examples
- PySpark alias() Column & DataFrame Examples