PySpark distinct vs dropDuplicates

Naveen Nelamali

2 years ago

What is the difference between PySpark distinct() vs dropDuplicates() methods? Both these methods are used to drop duplicate rows from the DataFrame and return DataFrame with unique values. The main difference is distinct() performs on all columns whereas dropDuplicates() is used on selected columns.

PySpark distinct()
PySpark dropDuplicates()

1. Differences Between PySpark distinct vs dropDuplicates

The main difference between distinct() vs dropDuplicates() functions in PySpark are the former is used to select distinct rows from all columns of the DataFrame and the latter is used select distinct on selected columns.

Let’s create a DataFrame and run some examples to understand the differences.


# Import
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Prepare data
data = [("James", "Sales", 3000), \
    ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4600), \
    ("James", "Sales", 3000)
  ]
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

Yields below output.

2. PySpark distinct()

pyspark.sql.DataFrame.distinct() is used to get the unique rows from all the columns from DataFrame. This function doesn’t take any argument and by default applies distinct on all columns.

2.1 distinct Syntax

Following is the syntax on PySpark distinct. Returns a new DataFrame containing the distinct rows in this DataFrame


# Syntax
DataFrame.distinct()

2.2 distinct Example

Let’s see an example


# Using distinct()
distinctDF = df.distinct()
distinctDF.show(truncate=False)

3. PySpark dropDuplicates

pyspark.sql.DataFrame.dropDuplicates() method is used to drop the duplicate rows from the single or multiple columns. It returns a new DataFrame with duplicate rows removed, when columns are used as arguments, it only considers the selected columns.

3.1 dropDuplicate Syntax

drop_duplicates() is an alias for dropDuplicates().


# Syntax
DataFrame.dropDuplicates(subset=None)

3.2 dropDuplicates() Example

Let’s see an example.


# Using dropDuplicates on multiple columns
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)

# Using dropDuplicates on single column
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(dropDisDF.collect())

4. Complete Example of distinct vs dropDuplicates

Following is a complete example of demonstrating the difference between distinct vs dropDuplicates functions.


# Difference between distinct() vs dropDuplicates()

# Import
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Prepare data
data = [("James", "Sales", 3000), \
    ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4600), \
    ("James", "Sales", 3000)
  ]
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using distinct()
distinctDF = df.distinct()
distinctDF.show(truncate=False)

# Using dropDuplicates()
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)

# Using dropDuplicates() on single column
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(dropDisDF.collect())

5. Conclusion

In this article, you have learned what is the difference between PySpark distinct and dropDuplicate functions, both these functions are from DataFrame class and return a DataFrame after eliminating duplicate rows.