
What is the difference between the PySpark distinct() and dropDuplicates() methods? Both methods remove duplicate rows from a DataFrame and return a DataFrame containing only unique rows. The main difference is that distinct() always considers all columns, whereas dropDuplicates() can be restricted to a selected subset of columns.

1. Differences Between PySpark distinct vs dropDuplicates

The main difference between the distinct() and dropDuplicates() functions in PySpark is that the former selects distinct rows across all columns of the DataFrame, while the latter selects distinct rows based only on the specified columns.

Let’s create a DataFrame and run some examples to understand the differences.


# Import
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Prepare data
data = [("James", "Sales", 3000), \
    ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4600), \
    ("James", "Sales", 3000)
  ]
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

Yields the below output.
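
With the sample data above, the printed schema and DataFrame should look roughly like this (the salary column is inferred as a long type; exact formatting may vary slightly across Spark versions):


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4600  |
|James        |Sales     |3000  |
+-------------+----------+------+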

2. PySpark distinct()

pyspark.sql.DataFrame.distinct() is used to get the unique rows, considering all columns of the DataFrame. This function takes no arguments and always applies distinct across all columns.

2.1 distinct Syntax

Following is the syntax of PySpark distinct(). It returns a new DataFrame containing only the distinct rows of this DataFrame.


# Syntax
DataFrame.distinct()
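
Because distinct() takes no arguments, it cannot be limited to specific columns directly. One option (a small sketch) is to select the columns of interest first and then apply distinct(); note that, unlike dropDuplicates(), this drops the remaining columns from the result.


# Sketch: distinct values of selected columns only
df.select("department", "salary").distinct().show(truncate=False)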

2.2 distinct Example

Let’s see an example.


# Using distinct()
distinctDF = df.distinct()
distinctDF.show(truncate=False)
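
On the sample data this removes the duplicate ("James", "Sales", 3000) row, leaving three unique rows. The output should look roughly like this (row order may vary, since distinct() involves a shuffle):


+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4600  |
+-------------+----------+------+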

3. PySpark dropDuplicates()

The pyspark.sql.DataFrame.dropDuplicates() method is used to drop duplicate rows based on one or more columns. It returns a new DataFrame with the duplicate rows removed; when column names are passed as arguments, only those columns are considered when identifying duplicates.

3.1 dropDuplicates Syntax

Following is the syntax of dropDuplicates(). Note that drop_duplicates() is an alias for dropDuplicates().


# Syntax
DataFrame.dropDuplicates(subset=None)
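
When subset is None, i.e. dropDuplicates() is called without arguments, all columns are considered and the result is the same as distinct(). A minimal sketch:


# dropDuplicates() with no subset considers all columns, like distinct()
df.dropDuplicates().show(truncate=False)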

3.2 dropDuplicates() Example

Let’s see an example.


# Using dropDuplicates on multiple columns
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)

# Using dropDuplicates on single column
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(dropDisDF.collect())
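
With the sample data, dropDuplicates(["department","salary"]) keeps one row per (department, salary) pair, and dropDuplicates(["salary"]) keeps one row per distinct salary. The results should look roughly like the following (row order, and which employee_name is retained for each key, are not guaranteed):


+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
+-------------+----------+------+

+------+
|salary|
+------+
|3000  |
|4600  |
+------+

[Row(salary=3000), Row(salary=4600)]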

4. Complete Example of distinct vs dropDuplicates

Following is a complete example demonstrating the difference between the distinct() and dropDuplicates() functions.


# Difference between distinct() vs dropDuplicates()

# Import
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Prepare data
data = [("James", "Sales", 3000), \
    ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4600), \
    ("James", "Sales", 3000)
  ]
columns= ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using distinct()
distinctDF = df.distinct()
distinctDF.show(truncate=False)

# Using dropDuplicates()
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)

# Using dropDuplicates() on single column
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(dropDisDF.collect())

5. Conclusion

In this article, you have learned the difference between the PySpark distinct() and dropDuplicates() functions. Both are methods of the DataFrame class and return a new DataFrame after eliminating duplicate rows; distinct() always considers all columns, while dropDuplicates() can be limited to a subset of columns.
