Spark SQL Inner join is the default join type in Spark and the most commonly used one. It joins two DataFrames/Datasets on key columns, and rows whose keys don't match are dropped from both datasets.
In this Spark article, I will explain how to do an Inner Join on two DataFrames with a Scala example.
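For reference, the DataFrame join method takes the right-hand Dataset, a join expression, and an optional join type string; when the join type is omitted, "inner" is used by default. A minimal sketch, using hypothetical DataFrames df1 and df2 that share a key column:

// Inner join is the default join type: these two calls are equivalent
val joined1 = df1.join(df2, df1("key") === df2("key"), "inner")
val joined2 = df1.join(df2, df1("key") === df2("key"))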
Before we jump into Spark Inner Join examples, first, let's create emp and dept DataFrames. Here, column emp_id is unique on emp and dept_id is unique on the dept DataFrame, and emp_dept_id from emp has a reference to dept_id on the dept dataset.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("sparkbyexamples.com")
.master("local")
.getOrCreate()
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
This prints the emp and dept DataFrames to the console.
#Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
#Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Spark DataFrame Inner Join Example
Below is a Spark DataFrame Inner Join example using "inner" as the join type.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
When we apply an Inner join on our datasets, it drops the emp row with emp_dept_id 50 (there is no matching dept_id in dept) and the dept row with dept_id 30 (Sales, which has no matching emp_dept_id in emp). Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
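If you want to see exactly which rows the inner join dropped, one option (not part of the original example, just a quick check) is a "left_anti" join, which returns the rows from the left side that have no match on the right:

// emp rows with no matching dept (here: emp_id 6, Brown, emp_dept_id 50)
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "left_anti")
  .show(false)

// dept rows with no matching emp (here: dept_id 30, Sales)
deptDF.join(empDF, deptDF("dept_id") === empDF("emp_dept_id"), "left_anti")
  .show(false)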
Using Spark SQL Inner Join
Let's see how to use an Inner Join with a Spark SQL expression. In order to do so, first let's create temporary views for the EMP and DEPT tables.
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
val joinDF2 = spark.sql("SELECT e.* FROM EMP e INNER JOIN DEPT d ON e.emp_dept_id = d.dept_id")
joinDF2.show(false)
This returns the same matching rows as above, though only the EMP columns, since the query selects e.*.
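If you also want the dept columns in the SQL result, you can select from both table aliases; this is a minor variation on the same query, not a different join:

spark.sql("SELECT e.*, d.dept_name, d.dept_id FROM EMP e INNER JOIN DEPT d ON e.emp_dept_id = d.dept_id")
  .show(false)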
Conclusion
In this Spark article, you have learned that Inner join is the default join in Spark and the most commonly used one. It joins two datasets on key columns, and rows whose keys don't match are dropped from both datasets (emp and dept).
Hope you like it!!
Related Articles
- Spark SQL Left Outer Join Examples
- Spark SQL Self Join Examples
- Spark SQL Left Anti Join Examples
- Spark SQL Full Outer Join with Example
- Spark SQL Right Outer Join with Example
- Spark SQL Left Semi Join With Example
- Spark SQL – Select Columns From DataFrame
- Spark SQL like() Using Wildcard Example