You are currently viewing Subtracting two DataFrames in Spark?

Subtracting two DataFrames in Spark using Scala means taking the difference between the rows in the first DataFrame and the rows in the second DataFrame. The result of the subtraction operation is a new DataFrame containing only the rows that are present in the first DataFrame but not present in the second DataFrame. In this article, we shall discuss the different ways to subtract data frames.

Quick Example

Following is the quick example of substracting two DataFrames or Datasets


// Imports
import org.apache.spark.sql.functions._

// create first DataFrame
val df1 = Seq(
  (1, "apple", 100),
  (2, "banana", 200),
  (3, "orange", 150)
).toDF("id", "fruit", "price")

// create second DataFrame
val df2 = Seq(
  (1, "apple", 100),
  (2, "banana", 250)
).toDF("id", "fruit", "price")

// subtract df2 from df1
val result1 = df1.subtract(df2)

// subtract df2 from df1 based on columns "id" and "fruit"
val result2 = df1.except(df2.select("id", "fruit"))

For Subtracting two DataFrames in Spark using Scala, you can use the subtract function. The subtract function returns a new DataFrame that contains the rows in the first DataFrame that are not present in the second DataFrame. Here’s an example:


// Imports
import org.apache.spark.sql.functions._

// create first DataFrame
val df1 = Seq(
  (1, "apple"),
  (2, "banana"),
  (3, "orange")
).toDF("id", "fruit")

// create second DataFrame
val df2 = Seq(
  (1, "apple"),
  (2, "banana")
).toDF("id", "fruit")

// subtract df2 from df1
val result = df1.subtract(df2)

// display result
result.show()

Yields the below output.


//Output
+---+------+
| id| fruit|
+---+------+
|  3|orange|
+---+------+

In this example, we created two DataFrames (df1 and df2) containing some fruits and their IDs. We then subtracted df2 from df1 to get the fruits that are present in df1 but not in df2. Finally, we displayed the result using the show function. As you can see, the output only contains the row with id equal to 3 and fruit equal to orange, which is not present in df2.

2. Using Except Function

If you want to subtract two DataFrames in Spark using Scala such that only specific columns are compared, you can use the except() function instead of subtract(). The except function also returns a new DataFrame that contains the rows in the first DataFrame that are not present in the second DataFrame, considering only the specified columns. Here’s an example:


// Imports
import org.apache.spark.sql.functions._

// create first DataFrame
val df1 = Seq(
  (1, "apple", 100),
  (2, "banana", 200),
  (3, "orange", 150)
).toDF("id", "fruit", "price")

// create second DataFrame
val df2 = Seq(
  (1, "apple", 100),
  (2, "banana", 250)
).toDF("id", "fruit", "price")

// subtract df2 from df1 based on columns "id" and "fruit"
val result = df1.except(df2.select("id", "fruit"))

// display result
result.show()

Yields below output.


//Output
+---+------+
| id| fruit|
+---+------+
|  3|orange|
+---+------+

In this example, we created two DataFrames (df1 and df2) containing some fruits, their IDs, and their prices. We then subtracted df2 from df1 based on columns “id” and “fruit” using the except function. Finally, we displayed the result using the show function.

As you can see, the output only contains the row with id equal to 3 and fruit equal to orange, which is not present in df2 based on columns “id” and “fruit”. The row with id equal to 2 is not present in the output because it has a different price value in df2.

3. Conclusion

Subtracting two DataFrames in Spark can be useful for various use cases, such as

  • finding the difference between two datasets,
  • filtering out unwanted data from a DataFrame, or
  • identifying records that are present in one DataFrame but not in another.

The subtract function is used to return a new DataFrame that contains the rows in the first DataFrame that are not present in the second DataFrame, considering all columns of the DataFrames.

The except function, on the other hand, is used to return a new DataFrame that contains the rows in the first DataFrame that are not present in the second DataFrame, considering only the specified columns of the DataFrames.

Both subtract and except functions return a new DataFrame and do not modify the original DataFrames.

Related Articles

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.