You are currently viewing Spark SQL Left Anti Join with Example

When you join two Spark DataFrames using Left Anti Join (left, left anti, left_anti), it returns only columns from the left DataFrame for non-matched records.

In this Spark article, I will explain how to do Left Anti Join(left, leftanti, left_anti) on two DataFrames with Scala Example.

leftanti join does the exact opposite of the leftsemi join.

Before we jump into Spark Left Anti Join examples, first, let’s create an emp and dept DataFrame’s. here, column emp_id is unique on emp and dept_id is unique on the dept DataFrame and emp_dept_id from emp has a reference to dept_id on dept dataset.


import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
       .appName("sparkbyexamples.com")
       .master("local")
       .getOrCreate()

val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
    (2,"Rose",1,"2010","20","M",4000),
    (3,"Williams",1,"2010","10","M",1000),
    (4,"Jones",2,"2005","10","F",2000),
    (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
  )

val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
       "emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)

val dept = Seq(("Finance",10),
    ("Marketing",20),
    ("Sales",30),
    ("IT",40)
  )

val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF("deptColumns")
deptDF.show(false)

This prints emp and dept DataFrame to the console.


#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

#Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Spark Left Anti Join (leftanti) Example

In order to use left anti join, you can use either antileftantileft_anti as a join type.


empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"left")
    .show(false)

Yields below output


+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6     |Brown|2              |2010       |50         |      |-1    |
+------+-----+---------------+-----------+-----------+------+-

Using Spark SQL Left Anti Join

Let’s see how use Left Anti Join on Spark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.


empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT ANTI JOIN DEPT d ON e.emp_dept_id == d.dept_id") 
  .show(truncate=False)

This also returns same output as above.

Conclusion

In this Spark article, you have learned Left Anti Join which is used to get only columns from the left DataFrame for non-matched records.

Hope you Like it !!

References:

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply