Spark Left Semi Join (semi, leftsemi, left_semi) is similar to an inner join, the difference being that a left semi join returns all columns from the left DataFrame/Dataset and ignores all columns from the right dataset. In other words, this join returns columns from only the left dataset for records that have a match in the right dataset on the join expression; records not matched on the join expression are ignored from both the left and right datasets.
The same result can be achieved by using select on the result of an inner join; however, using a semi join is more efficient.
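Conceptually, a left semi join acts as a filter on the left dataset rather than a combination of both. The sketch below (plain Scala collections, no Spark involved, with hypothetical sample data) illustrates the semantics, and why it differs from an inner join followed by a select:

```scala
// A minimal plain-Scala sketch of left semi join semantics (no Spark involved).
// Each left row is kept at most once, regardless of how many right rows match.
object SemiJoinSketch extends App {
  // Hypothetical sample data: (emp_id, name, emp_dept_id) and (dept_name, dept_id)
  val emp  = Seq((1, "Smith", 10), (2, "Rose", 20), (6, "Brown", 50))
  val dept = Seq(("Finance", 10), ("Marketing", 20), ("IT", 40))

  // Left semi join: keep a left row if ANY right row matches on the join key.
  val semi = emp.filter { case (_, _, deptId) => dept.exists(_._2 == deptId) }

  // Note: an inner join followed by selecting only the left columns can
  // produce duplicates when a left row matches several right rows; a semi
  // join never does, and it avoids materializing the right-side columns.
  println(semi) // Brown (dept 50) is dropped: no matching dept_id on the right
}
```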
In this Spark article, I will explain how to do a Left Semi Join (semi, leftsemi, left_semi) on two Spark DataFrames with Scala examples.
Before we jump into Spark Left Semi Join examples, first, let's create emp and dept DataFrames. Here, column emp_id is unique on emp, dept_id is unique on dept, and emp_dept_id from emp has a reference to dept_id on the dept dataset.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("sparkbyexamples.com")
.master("local")
.getOrCreate()
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
This prints emp and dept DataFrame to the console.
#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
#Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Spark Left Semi Join Example
In order to use a left semi join, you can use any of semi, leftsemi, or left_semi as the join type.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
.show(false)
Below is the result of the above join expression.
leftsemi join
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
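Notice that emp_id 6 (Brown, emp_dept_id 50) is the only row dropped: dept has no dept_id 50. Rows dropped by a left semi join are exactly the rows a left anti join would return; the two joins are complements. A plain-Scala sketch of this relationship (illustrative data, not Spark API):

```scala
// Sketch: left semi and left anti joins partition the left dataset.
object SemiVsAnti extends App {
  // Hypothetical sample rows: (emp_id, name, emp_dept_id)
  val emp     = Seq((5, "Brown", 40), (6, "Brown", 50))
  val deptIds = Set(10, 20, 30, 40) // dept_id values present on the right

  // Left semi: keep rows whose join key exists on the right.
  val semi = emp.filter { case (_, _, d) => deptIds.contains(d) }
  // Left anti: keep rows whose join key does NOT exist on the right.
  val anti = emp.filterNot { case (_, _, d) => deptIds.contains(d) }

  println(semi) // keeps (5, Brown, 40)
  println(anti) // keeps (6, Brown, 50)
}
```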
Using Spark SQL Left Semi Join
Let's see how to use a Left Semi Join in a Spark SQL expression. In order to do so, first let's create temporary views for the EMP and DEPT tables.
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
val joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT SEMI JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(false)
This also returns the same output as above.
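Spark SQL also supports an equivalent formulation using an EXISTS subquery, which some readers may find more familiar from other SQL dialects. A hedged sketch, assuming the same EMP and DEPT temporary views created above:

```sql
-- Equivalent to the LEFT SEMI JOIN above: keep EMP rows for which a
-- matching DEPT row exists; no DEPT columns appear in the output.
SELECT e.*
FROM EMP e
WHERE EXISTS (SELECT 1 FROM DEPT d WHERE e.emp_dept_id = d.dept_id)
```

Spark's optimizer rewrites such EXISTS subqueries into a left semi join internally, so both forms produce the same plan and result.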
Conclusion
In this article, you have learned that Spark Left Semi Join (semi, leftsemi, left_semi) is similar to an inner join, the difference being that a left semi join returns all columns from the left dataset and ignores all columns from the right dataset.
Hope you like it!!
Related Articles
- Spark SQL Inner Join Dataframe Example
- Spark SQL Left Outer Join Examples
- Spark SQL Self Join Examples
- Spark SQL Left Anti Join Examples
- Spark SQL Full Outer Join with Example
- Spark SQL Right Outer Join with Example
- Spark SQL like() Using Wildcard Example
- Spark SQL – Select Columns From DataFrame