Spark SQL Self Join With Example

Spark SQL Self Join

In this article, I will explain Spark SQL Self Join (Joining DataFrame to itself) with Scala Example. Joins are not complete without a self join, though there is no self-join type available in Spark, it is still achievable using existing join types, all below examples use inner self join.

In this Spark article, I will explain how to do Self Join (Self Join) on two DataFrames with Scala Example.

Before we jump into Spark Self Join examples, first, let’s create emp and dept DataFrame’s. here, column emp_id is unique on emp and dept_id is unique on the dept dataset’s and emp_dept_id from emp has a reference to dept_id on the dept dataset.

   import org.apache.spark.sql.SparkSession
     val spark = SparkSession.builder

  val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
  val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
  import spark.sqlContext.implicits._
  val empDF = emp.toDF(empColumns:_*)

  val dept = Seq(("Finance",10),

  val deptColumns = Seq("dept_name","dept_id")
  val deptDF = dept.toDF("deptColumns")

This prints emp and dept DataFrame to console.

Emp Dataset
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |

Dept Dataset
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |

Spark DataFrame Self Join Example

Below is an example of how to use Self Join (Self) on Spark DataFrame.

empDF.alias("emp1").join(empDF.alias("emp2"), \
    col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \
    .select(col("emp1.emp_id"),col(""), \
      col("emp2.emp_id").alias("superior_emp_id"), \
      col("").alias("superior_emp_name")) \

Here, we are joining emp dataset with itself to find out superior emp_id and name for all employees.

|emp_id|name    |superior_emp_id|superior_emp_name|
|2     |Rose    |1              |Smith            |
|3     |Williams|1              |Smith            |
|4     |Jones   |2              |Rose             |
|5     |Brown   |2              |Rose             |
|6     |Brown   |2              |Rose             |

Using Spark SQL Self Join

Let’s see how to use Self Join on Spark SQL expression, In order to do so first let’s create a temporary view for EMP and DEPT tables.


joinDF2 = spark.sql("SELECT e.* FROM EMP e LEFT OUTER JOIN DEPT d ON e.emp_dept_id == d.dept_id") 

This also returns same output as above.


In this Spark article,Joins are not complete without a self join, Though there is no self-join type available, we can use any of the above-explained join types to join DataFrame to itself

Hope you Like it !!


Leave a Reply