Spark SQL Self Join With Example

In this article, I will explain Spark SQL Self Join (Joining DataFrame to itself) with Scala Example. Joins are not complete without a self join, though there is no self-join type available in Spark, it is still achievable using existing join types, all below examples use inner self join.

In this Spark article, I will explain how to do Self Join (Self Join) on two DataFrames with Scala Example.

Before we jump into Spark Self Join examples, first, let’s create emp and dept DataFrame’s. here, column emp_id is unique on emp and dept_id is unique on the dept dataset’s and emp_dept_id from emp has a reference to dept_id on the dept dataset.


import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
       .appName("sparkbyexamples.com")
       .master("local")
       .getOrCreate()

val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
    (2,"Rose",1,"2010","20","M",4000),
    (3,"Williams",1,"2010","10","M",1000),
    (4,"Jones",2,"2005","10","F",2000),
    (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
  )

val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
       "emp_dept_id","gender","salary")

import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)

val dept = Seq(("Finance",10),
    ("Marketing",20),
    ("Sales",30),
    ("IT",40)
  )

val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF("deptColumns")
deptDF.show(false)

This prints emp and dept DataFrame to console.


#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Spark DataFrame Self Join Example

Below is an example of how to use Self Join (Self) on Spark DataFrame.


empDF.alias("emp1").join(empDF.alias("emp2"), \
    col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \
    .select(col("emp1.emp_id"),col("emp1.name"), \
      col("emp2.emp_id").alias("superior_emp_id"), \
      col("emp2.name").alias("superior_emp_name")) \
   .show(truncate=False)

Here, we are joining emp dataset with itself to find out superior emp_id and name for all employees.


+------+--------+---------------+-----------------+
|emp_id|name    |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2     |Rose    |1              |Smith            |
|3     |Williams|1              |Smith            |
|4     |Jones   |2              |Rose             |
|5     |Brown   |2              |Rose             |
|6     |Brown   |2              |Rose             |
+------+--------+---------------+-----------------+

Using Spark SQL Self Join

Let’s see how to use Self Join on Spark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.


empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id") 
  .show(truncate=False)

This also returns the same output as above.

Conclusion

In this Spark article, I have explain Spark SQL Self Join (Joining DataFrame to itself) with Scala Example. Though there is no self-join type available in Spark, it is still achievable using existing join types, all below examples use inner self join.

Hope you Like it !!

References:

Naveen (NNK)

I am Naveen (NNK) working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passion about sharing my knowledge in Apache Spark, Hive, PySpark, R etc.

Leave a Reply

You are currently viewing Spark SQL Self Join With Example