DENSE_RANK and ROW_NUMBER are window functions that are used to retrieve an increasing integer value in Spark however there are some differences between these two. They start with a value based on the condition imposed by the ORDER BY clause. In case of partitioned data, the integer counter is reset to 1 for each partition.
Before jumping into DENSE_RANK and ROW_NUMBER differences, we should be knowing about RANK.
The above sample data has MAX
BWM and Least
1. Rank Function
The RANK is also a window function in Spark that is used to retrieve ranked rows based on the condition of the ORDER BY clause. For example, if you want to find the name of the car with third highest power, you can use
PowerRank column in the above table contains the rank of the cars ordered by descending order of their
power. An interesting thing about the
RANK function is that if there is a tie between N previous records for the value in the
ORDER BY column, the
RANK functions skips the next N-1 positions before incrementing the counter.
For instance, in the above result, there is a tie for the values in the power column between the 1st and 2nd rows, therefore the
RANK function skips the next (2-1 = 1) one record and jumps directly to the 3rd row.
RANK function can be used in combination with the PARTITION BY clause. In that case, the rank will be reset for each new partition. Take a look at the following script:
In the script above, we partition the results by
company column. Now for each company, the rank will be reset to 1 as shown above.
2. DENSE_RANK Function
DENSE_RANK function is similar to
RANK function however the
DENSE_RANK function does not skip any ranks if there is a tie between the ranks of the preceding records. Take a look at the following script.
You can see from the
DensePowerRank output that despite there being a tie between the ranks of the first two rows, the next rank is not skipped and has been assigned a value of 2 instead of 3.
PARTITION BY clause can also be used with the
3. ROW_NUMBER Function
DENSE_RANK functions, the
ROW_NUMBER function simply returns the row number of the sorted records starting with 1. For example, if
DENSE_RANK functions of the first two records in the
ORDER BY column are equal, both of them are assigned 1 as their
DENSE_RANK. However, the
ROW_NUMBER function will assign values 1 and 2 to those rows without taking the fact that they are equally into account. Execute the following script to see the
ROW_NUMBER function in action.
From the output, you can see that
ROW_NUMBER function simply assigns a new row number to each record irrespective of its value.
PARTITION BY clause can also be used with
4. Similarities between RANK, DENSE_RANK, and ROW_NUMBER Functions
ROW_NUMBER functions in Spark DataFrame or Spark SQL have the following similarities besides the differences
- All of them require an order by clause.
- All of them return an increasing integer with a base value of 1.
- When combined with a
PARTITION BYclause, all of these functions reset the returned integer value to 1 as we have seen.
- If there are no duplicated values in the column used by the
ORDER BYclause, these functions return the same output.
5. Difference between RANK, DENSE_RANK and ROW_NUMBER functions
The following are differences between
ROW_NUMBER functions in Spark
RANK()function skips the next N-1 ranks if there is a tie between N previous ranks.
DENSE_RANK()function does not skip ranks if there is a tie between ranks
ROW_NUMBER()function has no concern with ranking. It simply returns the row number of the sorted records. Even if there are duplicate records in the column used in the
ORDER BYclause, the
ROW_NUMBERfunction will not return duplicate values. Instead, it will continue to increment irrespective of the duplicate values.
6. Comple Example of DENSE_RANK and ROW_NUMBER functions
# Create DataFrame val data = Seq((1, “Corrolla”, “Toyota”, 1800), (2, “City”, “Honda”, 1500), (3, “C200”, “Mercedez”, 2000), (4, “Vitz”, “Toyota”, 1300), (5, “Baleno”, “Suzuki”, 1500), (6, “C500”, “Mercedez”, 5000), (7, “800”, “BMW”, 8000), (8, “Mustang”, “Ford”, 5000), (9, “208”, “Peugeot”, 5400), (10, “Prius”, “Toyota”, 3200), (11, “Atlas”, “Volkswagen”, 5000), (12, “110”, “Bugatti”, 8000), (13, “Landcruiser”, “Toyota”, 3000), (14, “Civic”, “Honda”, 1800), (15, “Accord”, “Honda”, 2000)) val columns = Seq(“id”,”name”, “company”, “power”) val rdd = spark.sparkContext.parallelize(data) val df = rdd.toDF(columns:_*) df.createOrReplaceTempView(“cars”) # SQL to get dense_rank(), rank() and row_number() %sql SELECT company,CONCAT(company, “-“,name) as name, power, RANK() OVER(ORDER BY power DESC) AS PowerRank, DENSE_RANK() OVER(ORDER BY power DESC) AS DensePowerRank, ROW_NUMBER() OVER(ORDER BY power DESC) AS RowRank FROM cars;
In this article, I have explained similarities and differences between rank(), dense_rank() and row_number() functions in Spark. rank(), dense_rank() and row_number() functions are used to retrieve an increasing integer value.