DENSE_RANK and ROW_NUMBER are window functions that are used to retrieve an increasing integer value in Spark however there are some differences between these two. They start with a value based on the condition imposed by the ORDER BY clause. In case of partitioned data, the integer counter is reset to 1 for each partition.

Before jumping into DENSE_RANK and ROW_NUMBER differences, we should be knowing about RANK.

Let’s Create a DataFrame to discuss further. Here, I am using Azure Databricks as my environment hence, I don’t have to create a SparkSession as the Databricks environment provides the `spark`

object.

The above sample data has MAX `power`

as `8000`

for `BWM`

and Least `power`

`1300`

for `Toyota`

.

## 1. Rank Function

The RANK is also a window function in Spark that is used to retrieve ranked rows based on the condition of the ORDER BY clause. For example, if you want to find the name of the car with third highest power, you can use `RANK`

Function.

The `PowerRank`

column in the above table contains the rank of the cars ordered by descending order of their `power`

. An interesting thing about the `RANK`

function is that if there is a tie between N previous records for the value in the `ORDER BY`

column, the `RANK`

functions skips the next N-1 positions before incrementing the counter.

For instance, in the above result, there is a tie for the values in the power column between the 1st and 2nd rows, therefore the `RANK`

function skips the next (2-1 = 1) one record and jumps directly to the 3rd row.

The `RANK`

function can be used in combination with the PARTITION BY clause. In that case, the rank will be reset for each new partition. Take a look at the following script:

In the script above, we partition the results by `company`

column. Now for each company, the rank will be reset to 1 as shown above.

## 2. DENSE_RANK Function

The `DENSE_RANK`

function is similar to `RANK`

function however the `DENSE_RANK`

function does not skip any ranks if there is a tie between the ranks of the preceding records. Take a look at the following script.

You can see from the `DensePowerRank`

output that despite there being a tie between the ranks of the first two rows, the next rank is not skipped and has been assigned a value of 2 instead of 3.

`PARTITION BY`

clause can also be used with the `DENSE_RANK`

.

## 3. ROW_NUMBER Function

Unlike the `RANK`

and `DENSE_RANK`

functions, the `ROW_NUMBER`

function simply returns the row number of the sorted records starting with 1. For example, if `RANK`

and `DENSE_RANK`

functions of the first two records in the `ORDER BY`

column are equal, both of them are assigned 1 as their `RANK`

and `DENSE_RANK`

. However, the `ROW_NUMBER`

function will assign values 1 and 2 to those rows without taking the fact that they are equally into account. Execute the following script to see the `ROW_NUMBER`

function in action.

From the output, you can see that `ROW_NUMBER`

function simply assigns a new row number to each record irrespective of its value.

The `PARTITION BY`

clause can also be used with `ROW_NUMBER`

function.

### 4. Similarities between RANK, DENSE_RANK, and ROW_NUMBER Functions

The `RANK`

, `DENSE_RANK`

and `ROW_NUMBER`

functions in Spark DataFrame or Spark SQL have the following similarities besides the differences

- All of them require an order by clause.
- All of them return an increasing integer with a base value of 1.
- When combined with a
`PARTITION BY`

clause, all of these functions reset the returned integer value to 1 as we have seen. - If there are no duplicated values in the column used by the
`ORDER BY`

clause, these functions return the same output.

### 5. Difference between RANK, DENSE_RANK and ROW_NUMBER functions

The following are differences between `RANK`

, `DENSE_RANK`

, and `ROW_NUMBER`

functions in Spark

- The
`RANK()`

function skips the next N-1 ranks if there is a tie between N previous ranks. `DENSE_RANK()`

function does not skip ranks if there is a tie between ranks`ROW_NUMBER()`

function has no concern with ranking. It simply returns the row number of the sorted records. Even if there are duplicate records in the column used in the`ORDER BY`

clause, the`ROW_NUMBER`

function will not return duplicate values. Instead, it will continue to increment irrespective of the duplicate values.

## 6. Comple Example of DENSE_RANK and ROW_NUMBER functions

```
// Create DataFrame
val data = Seq((1, “Corrolla”, “Toyota”, 1800),
(2, “City”, “Honda”, 1500),
(3, “C200”, “Mercedez”, 2000),
(4, “Vitz”, “Toyota”, 1300),
(5, “Baleno”, “Suzuki”, 1500),
(6, “C500”, “Mercedez”, 5000),
(7, “800”, “BMW”, 8000),
(8, “Mustang”, “Ford”, 5000),
(9, “208”, “Peugeot”, 5400),
(10, “Prius”, “Toyota”, 3200),
(11, “Atlas”, “Volkswagen”, 5000),
(12, “110”, “Bugatti”, 8000),
(13, “Landcruiser”, “Toyota”, 3000),
(14, “Civic”, “Honda”, 1800),
(15, “Accord”, “Honda”, 2000))
val columns = Seq(“id”,”name”, “company”, “power”)
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF(columns:_*)
df.createOrReplaceTempView(“cars”)
// SQL to get dense_rank(), rank() and row_number()
%sql
SELECT company,CONCAT(company, “-“,name) as name, power,
RANK() OVER(ORDER BY power DESC) AS PowerRank,
DENSE_RANK() OVER(ORDER BY power DESC) AS DensePowerRank,
ROW_NUMBER() OVER(ORDER BY power DESC) AS RowRank
FROM cars;
```

## 7. Conclusion

In this article, I have explained similarities and differences between rank(), dense_rank() and row_number() functions in Spark. rank(), dense_rank() and row_number() functions are used to retrieve an increasing integer value.

## 8. Related Articles

- Spark SQL – When Otherwise
- Spark Write DataFrame to CSV File
- Spark Repartition() vs Coalesce()
- Apache Spark Interview Questions
- Spark map() vs flatMap() with Examples
- Spark Shell Command Usage with Examples
- Spark Get the Current SparkContext Settings
- Spark Get Current Number of Partitions of DataFrame