• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:8 mins read
You are currently viewing Pyspark Select Distinct Rows

How does PySpark select distinct works? In order to perform select distinct/unique rows from all columns use the distinct() method and to perform on a single column or multiple selected columns use dropDuplicates(). In this article, I will explain different examples of how to select distinct values of a column from DataFrame.

1. Quick Examples

Following are quick examples of selecting distinct rows values of column


# Quick examples of select distinct values

Let’s create a DataFrame, run these above examples and explore the output.


# Import
from pyspark.sql import SparkSession

# Create SparrkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Prepare Data
data = [("James", "Sales", 3000), \
    ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4100), \
    ("James", "Sales", 3000)
  ]

# Create DataFrame
columns= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

Yields below output

pyspark select distinct

2. Pyspark Select Distinct Rows

Use pyspark distinct() to select unique rows from all columns. It returns a new DataFrame after selecting only distinct column values, when it finds any rows having unique values on all columns it will be eliminated from the results.

From the above dataframe employee_name with James has the same values on all columns.

If you are using pandas API on PySpark refer to pandas get unique values from column


# Select distinct rows
distinctDF = df.distinct()
distinctDF.show(truncate=False)

Yields below output.

pyspark select distinct rows

3. PySpark Select Distinct Multiple Columns

To select distinct on multiple columns using the dropDuplicates(). This function takes columns where you wanted to select distinct values and returns a new DataFrame with unique values on selected columns. When no argument is used it behaves exactly the same as a distinct() function.

The following example selects distinct columns department and salary, after eliminating duplicates it returns all columns.


# Select distinct multiple columns
df2 = df.dropDuplicates("department","salary")
df2.show(truncate=False)

Yields below output.

pyspark select distinct on multiple columns

4. Select Unique Values from Column

To select unique values from a specific single column use dropDuplicates(), since this function returns all columns, use the select() method to get the single column.

Once you have the distinct unique values from columns you can also convert them to a list by collecting the data. When collecting data, be careful as it collects the data to the driver’s memory and if your data doesn’t fit in driver’s memory you will get an exception.


# select distinct on multiple columns
dropDisDF = df.dropDuplicates(["salary"]).select("salary")
dropDisDF.show(truncate=False)

print(df.dropDuplicates(["salary"]).select("salary").collect())

Yields below output.


# Output
+------+
|salary|
+------+
|3000  |
|4600  |
+------+

[Row(salary=3000), Row(salary=4600)]

4. SQL Select Distinct

One of the biggest advantages of PySpark is that it support SQL queries to run on DataFrame data so let’s see how to select distinct rows on single or multiple columns by using SQL queries.

In order to use SQL, make sure you create a temporary view using createOrReplaceTempView(), Since it is a temporary view, the lifetime of the table/view is tied to the current SparkSession. Hence, It will be automatically removed when your spark session ends.


# Create temporary view
df.createOrReplaceTempView("TAB")
spark.sql("SELECT DISTINCT * FROM TAB").show()
spark.sql("SELECT DISTINCT DEPARTMENT, SALARY FROM TAB").show()

Yields below output.

pyspark sql select distinct

6. Conclusion

In this article, you have learned how to perform PySpark select distinct rows from DataFrame, also learned how to select unique values from single column and multiple columns, and finally learned to use PySpark SQL.

You can find the complete example at GitHub project

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

This Post Has One Comment

  1. Anonymous

    Hi, I noticed there is a small error in the code:

    df2 = df.dropDuplicates(“department”,”salary”)

    should be:

    df2 = df.dropDuplicates([“department”,”salary”])

Comments are closed.