Spark SQL Left Semi Join Example

Spark SQL Left Semi Join

Spark Left Semi Join (semi, leftsemi, left_semi) is similar to an inner join, with one difference: it returns all columns from the left DataFrame/Dataset and none from the right. In other words, it returns the left dataset's columns only for the records that have a match in the right dataset on the join expression; records that do not match are dropped, and no right-side data appears in the output.

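The same result can be achieved by selecting only the left-side columns from the result of an inner join (provided the join key is unique on the right side, otherwise the inner join repeats left rows). However, using a left semi join is usually more efficient.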
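To make the semantics concrete before bringing in Spark, here is a minimal sketch that models a left semi join with plain Scala collections. The tuples are a reduced, hypothetical subset of the data used later in this article (integer department ids, three columns only), and `empRows`, `deptRows`, `deptIds` and `leftSemi` are illustrative names, not Spark API.

```scala
// Plain Scala collections, no Spark required: a model of left semi join semantics.
// Left rows are (emp_id, name, emp_dept_id); right rows are (dept_name, dept_id).
val empRows  = Seq((1, "Smith", 10), (2, "Rose", 20), (5, "Brown", 40), (6, "Brown", 50))
val deptRows = Seq(("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40))

// Collect the join keys found on the right side. Using a Set means that
// duplicate keys on the right can never multiply rows on the left.
val deptIds: Set[Int] = deptRows.map(_._2).toSet

// Keep a left row if at least one right row matches; output left columns only.
val leftSemi = empRows.filter { case (_, _, deptId) => deptIds.contains(deptId) }

leftSemi.foreach(println)
// (1,Smith,10)
// (2,Rose,20)
// (5,Brown,40)
```

The employee in department 50 is dropped because no department has that id, and nothing from `deptRows` shows up in the output.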

In this Spark article, I will explain how to perform a Left Semi Join (semi, leftsemi, left_semi) on two Spark DataFrames, with a Scala example.

Before we jump into the Spark Left Semi Join examples, let’s first create the emp and dept DataFrames. Here, the column emp_id is unique in emp, dept_id is unique in dept, and emp_dept_id in emp references dept_id in dept.


  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder
    .appName("sparkbyexamples.com")
    .master("local")
    .getOrCreate()

  val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
    (2,"Rose",1,"2010","20","M",4000),
    (3,"Williams",1,"2010","10","M",1000),
    (4,"Jones",2,"2005","10","F",2000),
    (5,"Brown",2,"2010","40","",-1),
    (6,"Brown",2,"2010","50","",-1)
  )
  val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
       "emp_dept_id","gender","salary")
  import spark.sqlContext.implicits._
  val empDF = emp.toDF(empColumns:_*)
  empDF.show(false)

  val dept = Seq(("Finance",10),
    ("Marketing",20),
    ("Sales",30),
    ("IT",40)
  )

  val deptColumns = Seq("dept_name","dept_id")
  val deptDF = dept.toDF(deptColumns:_*)
  deptDF.show(false)

This prints the emp and dept DataFrames to the console.


Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Spark Left Semi Join Example

To perform a left semi join, use semi, leftsemi, or left_semi as the join type.


empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
    .show(false)

Below is the result of the above join expression.


leftsemi join
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+
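As a rough check of the earlier claim that an inner join followed by a select gives the same rows, the sketch below compares the two approaches. It assumes a local Spark installation and uses reduced, hypothetical versions of the DataFrames (three emp columns, integer department ids); the app name and the extra `deptDupDF` DataFrame are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("left-semi-vs-inner") // hypothetical app name
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// Reduced versions of the article's DataFrames (assumption: Int department ids).
val empDF = Seq((1, "Smith", 10), (2, "Rose", 20), (5, "Brown", 40), (6, "Brown", 50))
  .toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq(("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40))
  .toDF("dept_name", "dept_id")

val joinCond = empDF("emp_dept_id") === deptDF("dept_id")

// Left semi join: only empDF columns come back.
val semiDF = empDF.join(deptDF, joinCond, "leftsemi")

// Inner join, then keep only the columns that originated in empDF.
val innerDF = empDF.join(deptDF, joinCond, "inner")
  .select(empDF.columns.map(empDF(_)): _*)

// With unique dept_id values, both approaches return the same employees.
val semiIds  = semiDF.select("emp_id").as[Int].collect().sorted
val innerIds = innerDF.select("emp_id").as[Int].collect().sorted
println(semiIds.sameElements(innerIds)) // true

// If dept_id is NOT unique on the right, the inner join repeats left rows,
// while the left semi join still returns each matching left row once.
val deptDupDF = Seq(("Finance", 10), ("Finance-2", 10), ("Marketing", 20), ("IT", 40))
  .toDF("dept_name", "dept_id")
val dupCond    = empDF("emp_dept_id") === deptDupDF("dept_id")
val semiCount  = empDF.join(deptDupDF, dupCond, "leftsemi").count() // 3
val innerCount = empDF.join(deptDupDF, dupCond, "inner").count()    // 4 (Smith matches twice)
```

So the inner-join-plus-select shortcut is only a drop-in replacement when the join key is unique on the right side, as dept_id is in this article.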

Using Spark SQL Left Semi Join

Let’s see how to use a Left Semi Join in a Spark SQL expression. To do so, first create temporary views for the EMP and DEPT tables.

<pre><code class="language-Scala">
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

// Run the left semi join in SQL and print the result without truncating columns
spark.sql("SELECT e.* FROM EMP e LEFT SEMI JOIN DEPT d ON e.emp_dept_id == d.dept_id")
  .show(false)
</code></pre>

This returns the same output as the DataFrame example above.
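For readers more used to standard SQL, a left semi join expresses the same filter as a correlated `EXISTS` subquery. The sketch below compares the two on reduced, hypothetical versions of the EMP and DEPT views (integer department ids, made-up app name). Spark's optimizer generally rewrites such subqueries into a semi join, but that is worth confirming with `explain()` on your own version.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("left-semi-sql") // hypothetical app name
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// Reduced versions of the article's tables, registered as temporary views.
Seq((1, "Smith", 10), (2, "Rose", 20), (5, "Brown", 40), (6, "Brown", 50))
  .toDF("emp_id", "name", "emp_dept_id")
  .createOrReplaceTempView("EMP")
Seq(("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40))
  .toDF("dept_name", "dept_id")
  .createOrReplaceTempView("DEPT")

// Explicit LEFT SEMI JOIN syntax.
val semiSql = spark.sql(
  "SELECT e.* FROM EMP e LEFT SEMI JOIN DEPT d ON e.emp_dept_id = d.dept_id")

// The same filter written as a correlated EXISTS subquery.
val existsSql = spark.sql(
  """SELECT e.* FROM EMP e
    |WHERE EXISTS (SELECT 1 FROM DEPT d WHERE d.dept_id = e.emp_dept_id)""".stripMargin)

val semiSqlIds   = semiSql.select("emp_id").as[Int].collect().sorted
val existsSqlIds = existsSql.select("emp_id").as[Int].collect().sorted
println(semiSqlIds.sameElements(existsSqlIds)) // true

// Print the physical plan to see how Spark executes the EXISTS form.
existsSql.explain()
```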

Conclusion

In this article, you have learned that the Spark Left Semi Join (semi, leftsemi, left_semi) is similar to an inner join, the difference being that it returns all columns from the left dataset and ignores all columns from the right dataset.

Hope you like it!
