In this article, I will explain Spark SQL Self Join (Joining DataFrame to itself) with Scala Example. Joins are not complete without a self join, though there is no self-join type available in Spark, it is still achievable using existing join types, all below examples use inner
self join.
In this Spark article, I will explain how to do Self Join (Self Join) on two DataFrames with Scala Example.
Before we jump into Spark Self Join examples, first, let’s create emp and dept DataFrame’s. here, column emp_id is unique on emp and dept_id
is unique on the dept dataset’s and emp_dept_id from emp has a reference to dept_id on the dept dataset.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("sparkbyexamples.com")
.master("local")
.getOrCreate()
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF("deptColumns")
deptDF.show(false)
This prints emp and dept DataFrame to console.
#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Spark DataFrame Self Join Example
Below is an example of how to use Self Join (Self
) on Spark DataFrame.
empDF.alias("emp1").join(empDF.alias("emp2"), \
col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \
.select(col("emp1.emp_id"),col("emp1.name"), \
col("emp2.emp_id").alias("superior_emp_id"), \
col("emp2.name").alias("superior_emp_name")) \
.show(truncate=False)
Here, we are joining emp
dataset with itself to find out superior emp_id
and name
for all employees.
+------+--------+---------------+-----------------+
|emp_id|name |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2 |Rose |1 |Smith |
|3 |Williams|1 |Smith |
|4 |Jones |2 |Rose |
|5 |Brown |2 |Rose |
|6 |Brown |2 |Rose |
+------+--------+---------------+-----------------+
Using Spark SQL Self Join
Let’s see how to use Self Join on Spark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
.show(truncate=False)
This also returns the same output as above.
Conclusion
In this Spark article, I have explain Spark SQL Self Join (Joining DataFrame to itself) with Scala Example. Though there is no self-join type available in Spark, it is still achievable using existing join types, all below examples use inner
self join.
Hope you Like it !!
Related Articles
- Spark SQL Inner Join Dataframe Example
- Spark SQL Left Outer Join Examples
- Spark SQL Left Anti Join Examples
- Spark SQL Full Outer Join with Example
- Spark SQL Left Semi Join Example
- Spark SQL Right Outer Join with Example
- Spark SQL – Select Columns From DataFrame
- Spark SQL like() Using Wildcard Example