How does PySpark select distinct works? In order to perform select distinct/unique rows from all columns use the distinct() method and to perform on a single column or multiple selected columns use dropDuplicates(). In this article, I will explain different examples of how to select distinct values of a column from DataFrame.
1. Quick Examples
Following are quick examples of selecting distinct rows values of column
# Quick examples of select distinct values
Let’s create a DataFrame, run these above examples and explore the output.
# Import
from pyspark.sql import SparkSession
# Create SparrkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Prepare Data
data = [("James", "Sales", 3000), \
("Michael", "Sales", 4600), \
("Robert", "Sales", 4100), \
("James", "Sales", 3000)
]
# Create DataFrame
columns= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)
Yields below output
2. Pyspark Select Distinct Rows
Use pyspark distinct() to select unique rows from all columns. It returns a new DataFrame after selecting only distinct column values, when it finds any rows having unique values on all columns it will be eliminated from the results.
From the above dataframe employee_name
with James
has the same values on all columns.
If you are using pandas API on PySpark refer to pandas get unique values from column
# Select distinct rows
distinctDF = df.distinct()
distinctDF.show(truncate=False)
Yields below output.
3. PySpark Select Distinct Multiple Columns
To select distinct on multiple columns using the dropDuplicates(). This function takes columns where you wanted to select distinct values and returns a new DataFrame with unique values on selected columns. When no argument is used it behaves exactly the same as a distinct() function.
The following example selects distinct columns department
and salary
, after eliminating duplicates it returns all columns.
# Select distinct multiple columns
df2 = df.dropDuplicates("department","salary")
df2.show(truncate=False)
Yields below output.
4. Select Unique Values from Column
To select unique values from a specific single column use dropDuplicates(), since this function returns all columns, use the select() method to get the single column.
Once you have the distinct unique values from columns you can also convert them to a list by collecting the data. When collecting data, be careful as it collects the data to the driver’s memory and if your data doesn’t fit in driver’s memory you will get an exception.
# select distinct on multiple columns
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)
print(df.dropDuplicates(["salary"]).select("salary").collect())
Yields below output.
# Output
+------+
|salary|
+------+
|3000 |
|4600 |
+------+
[Row(salary=3000), Row(salary=4600)]
4. SQL Select Distinct
One of the biggest advantages of PySpark is that it support SQL queries to run on DataFrame data so let’s see how to select distinct rows on single or multiple columns by using SQL queries.
In order to use SQL, make sure you create a temporary view using createOrReplaceTempView(), Since it is a temporary view, the lifetime of the table/view is tied to the current SparkSession. Hence, It will be automatically removed when your spark session ends.
# Create temporary view
df.createOrReplaceTempView("TAB")
spark.sql("SELECT DISTINCT * FROM TAB").show()
spark.sql("SELECT DISTINCT DEPARTMENT, SALARY FROM TAB").show()
Yields below output.
6. Conclusion
In this article, you have learned how to perform PySpark select distinct rows from DataFrame, also learned how to select unique values from single column and multiple columns, and finally learned to use PySpark SQL.
Related Articles
- PySpark count() – Different Methods Explained
- How to Convert PySpark Column to List?
- PySpark – Distinct to Drop Duplicate Rows
- PySpark Union and UnionAll Explained
- PySpark – Drop One or Multiple Columns From DataFrame
- PySpark Groupby Count Distinct
- PySpark Count Distinct from DataFrame
- PySpark createOrReplaceTempView() Explained
You can find the complete example at GitHub project
Hi, I noticed there is a small error in the code:
df2 = df.dropDuplicates(“department”,”salary”)
should be:
df2 = df.dropDuplicates([“department”,”salary”])