In PySpark, the dense_rank() window function assigns ranks to rows within a partition of a DataFrame based on specified order criteria. When multiple rows have the same value in the ordering column, they receive the same rank, and unlike <a href="https://sparkbyexamples.com/pyspark/pyspark-rank-function-with-examples/">rank()</a>, it does not skip the subsequent rank. This function is helpful when you need a consistent, gap-free ranking within groups.
In this article, you’ll learn how to use the dense_rank() function with partitionBy() and orderBy() to group and rank data in a DataFrame.
Key Points
- dense_rank() is a PySpark window function for ranking rows.
- Duplicate values receive the same rank.
- No skipped ranks for ties (gapless ranking).
- Requires a window specification: Window.partitionBy().orderBy().
- If partitionBy() is omitted, the entire DataFrame is treated as a single group.
- Supports both ascending and descending order.
- Suitable for analytics, leaderboard generation, and grouped ranking tasks.
- Ideal for “top-N by group” scenarios without rank gaps.
- dense_rank() vs rank(): dense_rank() does not skip ranks on ties.
- row_number() assigns a unique sequential number to each row.
PySpark dense_rank()
The dense_rank() function ranks rows within a partition based on the specified order. Rows with the same order value receive the same rank, but the next rank is not skipped.
Syntax
The following is the syntax of the dense_rank() function.
# Syntax of the dense_rank()
from pyspark.sql.functions import dense_rank
pyspark.sql.functions.dense_rank()
Parameters
- It has no direct parameters.
- Needs a Window specification to work.
Return Value
Returns a column of type IntegerType, assigning ranks without skipping numbers after ties.
PySpark dense_rank Partition By
You can use the dense_rank() function to add a new column based on a specified window. To do this, apply dense_rank() over a partition with a defined ordering. The function assigns a rank to each row based on that order; if two or more rows have the same value, they get the same rank, but unlike rank(), dense_rank() does not skip the next numbers in the ranking.
# Add a new column using dense_rank() over the specified window
# Applying partitionBy() and orderBy()
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, rank, dense_rank, col
from pyspark.sql.window import Window
# Create SparkSession
spark = SparkSession.builder.appName("Sparkbyexamples").getOrCreate()
# Sample data
data = [
("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
window_spec = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
# Output:
# +-------------+----------+------+----------+
# |employee_name|department|salary|dense_rank|
# +-------------+----------+------+----------+
# | Maria| Finance| 3000| 1|
# | Scott| Finance| 3300| 2|
# | Jen| Finance| 3900| 3|
# | Kumar| Marketing| 2000| 1|
# | Jeff| Marketing| 3000| 2|
# | James| Sales| 3000| 1|
# | Robert| Sales| 4100| 2|
# | Saif| Sales| 4100| 2|
# | Michael| Sales| 4600| 3|
# +-------------+----------+------+----------+
PySpark dense_rank Without Partition
You can also use the dense_rank() function without partitioning by specifying only orderBy(). It treats the whole DataFrame as a single group and assigns ranks based on the global order of the specified column.
# Add the rank to each row without partition
global_window = Window.orderBy(col("salary").desc())
df.withColumn("dense_rank", dense_rank().over(global_window)).show()
# Output:
# +-------------+----------+------+----------+
# |employee_name|department|salary|dense_rank|
# +-------------+----------+------+----------+
# | Michael| Sales| 4600| 1|
# | Robert| Sales| 4100| 2|
# | Saif| Sales| 4100| 2|
# | Jen| Finance| 3900| 3|
# | Scott| Finance| 3300| 4|
# | James| Sales| 3000| 5|
# | Maria| Finance| 3000| 5|
# | Jeff| Marketing| 3000| 5|
# | Kumar| Marketing| 2000| 6|
# +-------------+----------+------+----------+
PySpark dense_rank Order by Desc
To add ranks within each group in descending order, use the dense_rank() function with a window specification that orders the partition by the column in descending order.
# Add the rank to each row within a partition by descending order
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
# Output:
# +-------------+----------+------+----------+
# |employee_name|department|salary|dense_rank|
# +-------------+----------+------+----------+
# | Jen| Finance| 3900| 1|
# | Scott| Finance| 3300| 2|
# | Maria| Finance| 3000| 3|
# | Jeff| Marketing| 3000| 1|
# | Kumar| Marketing| 2000| 2|
# | Michael| Sales| 4600| 1|
# | Robert| Sales| 4100| 2|
# | Saif| Sales| 4100| 2|
# | James| Sales| 3000| 3|
# +-------------+----------+------+----------+
dense_rank() vs rank()
This comparison demonstrates how each function behaves when multiple rows have identical values.
# Difference between PySpark rank() and dense_rank()
window_spec = Window.partitionBy("department").orderBy(col("salary"))
result_df = df.withColumn("rank", rank().over(window_spec)) \
.withColumn("dense_rank", dense_rank().over(window_spec))
result_df.show()
# Output:
# +-------------+----------+------+----+----------+
# |employee_name|department|salary|rank|dense_rank|
# +-------------+----------+------+----+----------+
# | Maria| Finance| 3000| 1| 1|
# | Scott| Finance| 3300| 2| 2|
# | Jen| Finance| 3900| 3| 3|
# | Kumar| Marketing| 2000| 1| 1|
# | Jeff| Marketing| 3000| 2| 2|
# | James| Sales| 3000| 1| 1|
# | Robert| Sales| 4100| 2| 2|
# | Saif| Sales| 4100| 2| 2|
# | Michael| Sales| 4600| 4| 3|
# +-------------+----------+------+----+----------+
PySpark rank() vs dense_rank() vs row_number()
This example shows the differences between the rank(), dense_rank(), and row_number() functions in PySpark with a window partition. We’ll apply these functions to a DataFrame to add columns that represent row rankings based on the specified partition.
# Complete example of Difference between PySpark rank(), dense_rank(),and row_number()
# Applying partitionBy() and orderBy()
window_spec = Window.partitionBy("department").orderBy(col("salary"))
result_df = df.withColumn("row_number", row_number().over(window_spec))\
.withColumn("rank", rank().over(window_spec)) \
.withColumn("dense_rank", dense_rank().over(window_spec))
# Show the result
result_df.show()
# Output:
# +-------------+----------+------+----------+----+----------+
# |employee_name|department|salary|row_number|rank|dense_rank|
# +-------------+----------+------+----------+----+----------+
# | Maria| Finance| 3000| 1| 1| 1|
# | Scott| Finance| 3300| 2| 2| 2|
# | Jen| Finance| 3900| 3| 3| 3|
# | Kumar| Marketing| 2000| 1| 1| 1|
# | Jeff| Marketing| 3000| 2| 2| 2|
# | James| Sales| 3000| 1| 1| 1|
# | Robert| Sales| 4100| 2| 2| 2|
# | Saif| Sales| 4100| 3| 2| 2|
# | Michael| Sales| 4600| 4| 4| 3|
# +-------------+----------+------+----------+----+----------+
Frequently Asked Questions of PySpark dense_rank() Function
What does the dense_rank() function do in PySpark?
The dense_rank() function assigns ranks to rows within a partition based on a specified order. When multiple rows have the same value, they are given the same rank, and no ranks are skipped afterward.
What is the difference between dense_rank(), rank(), and row_number()?
dense_rank(): Assigns the same rank to duplicate values without skipping the next rank (e.g., 1, 2, 2, 3). rank(): Assigns the same rank to duplicates but skips subsequent ranks (e.g., 1, 2, 2, 4). row_number(): Assigns a unique sequential number to each row, even if the values are the same (e.g., 1, 2, 3, 4).
What happens if I omit partitionBy()?
If you use dense_rank() with only orderBy(), the entire DataFrame is considered a single group, and rows are ranked globally based on the specified column.
How are rows with equal values ranked?
They receive the same rank. Unlike rank(), the next rank is not skipped. For example, two rows tied at rank 2 will be followed by rank 3.
What does dense_rank() return?
The function returns a column of type IntegerType, where each row has a numeric rank based on the sort and partition criteria.
How do I select the top N rows per group?
You can use dense_rank() with Window.partitionBy() and orderBy(), and then filter rows where the rank is less than or equal to N. This approach is ideal for selecting top performers per group.
Conclusion
In this article, you have learned how to use the dense_rank() function in PySpark with and without partitions. You also saw how it differs from rank() and row_number() and when to use each. dense_rank() is ideal for scenarios that require gap-free ranking, such as leaderboards and top-N selections by group.
Happy Learning!!
Related Articles
- PySpark Select Top N Rows From Each Group
- PySpark Find Maximum Row per Group in DataFrame
- PySpark Select First Row of Each Group?
- Pyspark Select Distinct Rows
- PySpark Distinct to Drop Duplicate Rows
- Explain PySpark repartition() vs partitionBy() functions
- How to Add Row Number without partition in PySpark?
- How to add a Column with row number in PySpark?
- Explain PySpark row_number() function with examples