What is the difference between the PySpark distinct() and dropDuplicates() methods? Both methods drop duplicate rows from a DataFrame and return a new DataFrame containing only unique rows. The main difference is that distinct() always considers all columns, whereas dropDuplicates() can be restricted to selected columns.
1. Differences Between PySpark distinct() vs dropDuplicates()
The main difference between the distinct() and dropDuplicates() functions in PySpark is that the former selects distinct rows based on all columns of the DataFrame, while the latter selects distinct rows based only on the columns you pass to it.
Let’s create a DataFrame and run some examples to understand the differences.
# Import
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Prepare data
data = [("James", "Sales", 3000), \
("Michael", "Sales", 4600), \
("Robert", "Sales", 4600), \
("James", "Sales", 3000)
]
columns= ["employee_name", "department", "salary"]
# Create DataFrame
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
Yields the output below.
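Given the four input rows (which include a duplicate of the "James" record), the schema and show() output should look roughly like this:
root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4600  |
|James        |Sales     |3000  |
+-------------+----------+------+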
2. PySpark distinct()
pyspark.sql.DataFrame.distinct() is used to get the unique rows, considering all columns of the DataFrame. It takes no arguments and always applies the distinct operation across every column.
2.1 distinct() Syntax
Following is the syntax of PySpark distinct(). It returns a new DataFrame containing only the distinct rows of this DataFrame.
# Syntax
DataFrame.distinct()
2.2 distinct() Example
Let’s see an example.
# Using distinct()
distinctDF = df.distinct()
distinctDF.show(truncate=False)
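With the sample data, distinct() removes the duplicate ("James", "Sales", 3000) row and returns three unique rows (row order may vary):
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4600  |
+-------------+----------+------+
Note that distinct() cannot be limited to a subset of columns. If you only need unique values of certain columns, one option (a minimal sketch using the DataFrame above) is to select() those columns first, which also drops the other columns from the result:
# distinct() on selected columns: select the columns first,
# then apply distinct(); other columns are not returned
df.select("department", "salary").distinct().show(truncate=False)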
3. PySpark dropDuplicates()
pyspark.sql.DataFrame.dropDuplicates() is used to drop duplicate rows based on a single column or multiple columns. It returns a new DataFrame with the duplicate rows removed; when column names are passed as arguments, only those columns are considered when identifying duplicates.
3.1 dropDuplicates() Syntax
drop_duplicates() is an alias for dropDuplicates().
# Syntax
DataFrame.dropDuplicates(subset=None)
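As a quick sketch using the DataFrame created above: calling dropDuplicates() with no subset compares all columns, which makes it behave like distinct(), and the drop_duplicates() alias accepts the same arguments:
# No subset given: duplicates are detected across all columns,
# equivalent to df.distinct()
df.dropDuplicates().show(truncate=False)
# drop_duplicates() is an alias and takes the same subset argument
df.drop_duplicates(["department", "salary"]).show(truncate=False)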
3.2 dropDuplicates() Example
Let’s see an example.
# Using dropDuplicates on multiple columns
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)
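Because only the department and salary columns are compared here, one row is kept per (department, salary) pair, giving two rows. Which of the duplicate rows is retained (for example "Michael" vs "Robert" for the 4600 group) is not guaranteed; the output looks something like:
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
+-------------+----------+------+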
# Using dropDuplicates on single column
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(dropDisDF.collect())
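This keeps one row per unique salary and then selects only the salary column, so the expected result is the two distinct salary values (order may vary), with collect() returning them as Row objects:
+------+
|salary|
+------+
|3000  |
|4600  |
+------+

[Row(salary=3000), Row(salary=4600)]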
4. Complete Example of distinct() vs dropDuplicates()
Following is a complete example demonstrating the difference between the distinct() and dropDuplicates() functions.
# Difference between distinct() and dropDuplicates()
# Import
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Prepare data
data = [("James", "Sales", 3000), \
("Michael", "Sales", 4600), \
("Robert", "Sales", 4600), \
("James", "Sales", 3000)
]
columns= ["employee_name", "department", "salary"]
# Create DataFrame
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
# Using distinct()
distinctDF = df.distinct()
distinctDF.show(truncate=False)
# Using dropDuplicates()
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)
# Using dropDuplicates() on single column
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(dropDisDF.collect())
5. Conclusion
In this article, you have learned the difference between the PySpark distinct() and dropDuplicates() functions. Both are DataFrame methods that return a new DataFrame after eliminating duplicate rows; the difference is that dropDuplicates() additionally lets you restrict the comparison to selected columns.