How do you calculate the cumulative sum in PySpark? You can use a Window specification along with aggregate functions like sum() to calculate the cumulative sum in PySpark. This is a common operation, especially when working with time-series or grouped data, and PySpark's window functions make it easy to calculate cumulative sums both globally and within groups. In this article, you'll learn:
- What a cumulative sum is
- How to implement it in PySpark
- Real-world examples with and without groupings
- How to apply cumulative logic across multiple columns
- Best practices and performance tips
Key Points
- A cumulative sum is a running total that adds the current value to all previous values in a sequence typically ordered by date or time.
- You need to define a Window using orderBy() to maintain the order of rows. Optionally, use partitionBy() to group data.
- Use .rowsBetween(Window.unboundedPreceding, Window.currentRow) to compute the sum from the beginning to the current row.
- You can calculate cumulative sums across the entire DataFrame or within specific groups.
- Use a loop to apply the cumulative logic to multiple columns in one go.
What is a Cumulative Sum?
A cumulative sum refers to the running total obtained by summing the current value with all preceding values in a sequence, commonly ordered by time or date. For example, if daily sales are 100, 150, and 50, the cumulative sums are 100, 250, and 300.
PySpark Cumulative Sum Syntax
The following is the syntax for calculating a cumulative sum.
# Imports
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
# Define window for calculating cumulative sum
window_spec = Window.partitionBy("<group_col>").orderBy("<order_col>")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("cumulative_col", sum("<value_col>").over(window_spec))
Let’s create a sample DataFrame that we’ll use to demonstrate the following examples.
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
spark = SparkSession.builder.appName("CumulativeSum").getOrCreate()
# Sample data
data = [
("Geetha", "2024-01-01", 100),
("Geetha", "2024-01-02", 150),
("Geetha", "2024-01-03", 50),
("Sai", "2024-01-01", 200),
("Sai", "2024-01-03", 300)
]
columns = ["customer", "date", "sales"]
df = spark.createDataFrame(data, columns)
df.show()
This displays the sample DataFrame contents.
Calculate the Cumulative Sum in PySpark
To compute a cumulative sum of the entire DataFrame, you can specify a window that sorts the rows using orderBy() and specifies the range using rowsBetween(), without applying partitionBy(). This creates a running total from the first row to the current row.
# Calculate Cumulative Sum of the Entire DataFrame
# Define Window
windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Calculate cumulative sum
df = df.withColumn("cumulative_sales", sum("sales").over(windowSpec))
df.show()
This produces a running total over all rows, ordered by date.

Calculate Cumulative Sum of the Grouped Column
Alternatively, you can calculate the cumulative sum of a grouped column, using a window specification with partitionBy() and orderBy() along with the rowsBetween() method. In this case, we are calculating a cumulative sum of the sales column for each customer ordered by the date.
# Calculate Cumulative Sum by Grouped Column
# Define Window
windowSpec = Window.partitionBy("customer").orderBy("date")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Calculate cumulative sum
df = df.withColumn("cumulative_sales", sum("sales").over(windowSpec))
df.show()
Yields the output below.
# Output:
+--------+----------+-----+----------------+
|customer|      date|sales|cumulative_sales|
+--------+----------+-----+----------------+
|  Geetha|2024-01-01|  100|             100|
|  Geetha|2024-01-02|  150|             250|
|  Geetha|2024-01-03|   50|             300|
|     Sai|2024-01-01|  200|             200|
|     Sai|2024-01-03|  300|             500|
+--------+----------+-----+----------------+
Cumulative Sum for Multiple Columns in PySpark
So far, we've explored how to calculate the cumulative sum for an entire DataFrame and within groups, both with and without partitionBy(). Now, let's look at how to compute cumulative sums for multiple columns in a DataFrame at once.
You can achieve this by defining a window that uses both partitionBy() and orderBy() to group and sort the data. Then, by looping through the target columns, you can efficiently apply the cumulative sum logic to each one.
# Cumulative Sum for Multiple Columns
spark = SparkSession.builder.appName("CumulativeSum").getOrCreate()
# Sample data
data = [
("Geetha", "2024-01-01", 100, 10),
("Geetha", "2024-01-02", 150, 20),
("Geetha", "2024-01-03", 50, 30),
("Sai", "2024-01-01", 200, 5),
("Sai", "2024-01-03", 300, 15)
]
columns = ["customer", "date", "sales", "transactions"]
df = spark.createDataFrame(data, columns)
df.show()
# Define Window
windowSpec = Window.partitionBy("customer").orderBy("date")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Calculate cumulative sum for multiple columns
columns_to_sum = ["sales", "transactions"]
for col_name in columns_to_sum:
    df = df.withColumn(f"cumulative_{col_name}", sum(col_name).over(windowSpec))
df.show()
Yields the output below.
# Output:
+--------+----------+-----+------------+
|customer|      date|sales|transactions|
+--------+----------+-----+------------+
|  Geetha|2024-01-01|  100|          10|
|  Geetha|2024-01-02|  150|          20|
|  Geetha|2024-01-03|   50|          30|
|     Sai|2024-01-01|  200|           5|
|     Sai|2024-01-03|  300|          15|
+--------+----------+-----+------------+
+--------+----------+-----+------------+----------------+-----------------------+
|customer|      date|sales|transactions|cumulative_sales|cumulative_transactions|
+--------+----------+-----+------------+----------------+-----------------------+
|  Geetha|2024-01-01|  100|          10|             100|                     10|
|  Geetha|2024-01-02|  150|          20|             250|                     30|
|  Geetha|2024-01-03|   50|          30|             300|                     60|
|     Sai|2024-01-01|  200|           5|             200|                      5|
|     Sai|2024-01-03|  300|          15|             500|                     20|
+--------+----------+-----+------------+----------------+-----------------------+
Frequently Asked Questions on Cumulative Sum in PySpark
What happens if I don't use partitionBy()?
If you don't use partitionBy(), the cumulative sum is calculated over the entire DataFrame, treating it as a single group.
Is it mandatory to use orderBy()?
Yes, orderBy() is mandatory for cumulative sums to maintain the sequence of rows. Without it, the order of calculation is undefined.
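As a quick illustration, here is a minimal sketch (reusing df and the imports from the examples above): without orderBy(), the default window frame spans the whole partition, so every row gets the group total rather than a running total.
# Sketch: what happens without orderBy()
no_order_window = Window.partitionBy("customer")
df.withColumn("total_sales", sum("sales").over(no_order_window)).show()
# Every Geetha row shows 300 and every Sai row shows 500 (group totals),
# not running totals, because no ordering was defined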
Can I calculate a time-based running total?
Yes. You can order the window by a timestamp column to calculate time-based running totals.
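For example, here is a minimal sketch (reusing the sample df; event_ts is an illustrative column name) that casts the string date to a timestamp and orders the window by it.
# Sketch: time-based running total ordered by a timestamp column
from pyspark.sql.functions import to_timestamp
df_ts = df.withColumn("event_ts", to_timestamp("date", "yyyy-MM-dd"))
ts_window = Window.partitionBy("customer").orderBy("event_ts")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_ts.withColumn("cumulative_sales", sum("sales").over(ts_window)).show()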
What is the difference between rowsBetween() and rangeBetween()?
rowsBetween() defines the frame by row positions, while rangeBetween() defines it by the value range of the ordering column (which only works on numeric or timestamp columns).
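Here is a minimal sketch to contrast the two (day_no is an illustrative numeric column, not part of the article's sample data).
# Sketch: rowsBetween vs rangeBetween on a numeric ordering column
# Reuses the spark session and imports from the examples above
data2 = [("Geetha", 1, 100), ("Geetha", 1, 150), ("Geetha", 2, 50)]
df2 = spark.createDataFrame(data2, ["customer", "day_no", "sales"])
rows_win = Window.partitionBy("customer").orderBy("day_no")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
range_win = Window.partitionBy("customer").orderBy("day_no")\
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)
df2.withColumn("rows_sum", sum("sales").over(rows_win))\
    .withColumn("range_sum", sum("sales").over(range_win))\
    .show()
# rows_sum advances row by row (the order of the two day_no=1 ties is not
# guaranteed), while range_sum gives both day_no=1 rows the same value (250)
# because the frame is based on the ordering value, not the row position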
How do I calculate cumulative sums within sub-groups?
Use partitionBy() with an additional column to segment your data logically.
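For instance, here is a minimal sketch assuming a hypothetical DataFrame df_region that also contains a region column; the running total restarts for every customer and region combination.
# Sketch: cumulative sum per (customer, region) group
# df_region is a hypothetical DataFrame with customer, region, date, and sales
multi_window = Window.partitionBy("customer", "region").orderBy("date")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_region = df_region.withColumn("cumulative_sales", sum("sales").over(multi_window))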
Can I compute cumulative sums for multiple columns?
Yes. You can apply .withColumn() multiple times or use a loop to compute cumulative sums for multiple columns.
Are window functions expensive?
Window functions can be expensive on large datasets. Always filter, repartition, and cache DataFrames wisely before applying them.
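As a rough, illustrative sketch only (reusing df and windowSpec from the grouped example above), you might narrow the data first, repartition by the window's partition key, and cache the result if it is reused.
# Sketch: prepare the DataFrame before applying the window function
filtered = df.filter(df.date >= "2024-01-01")        # keep only the rows you need
prepared = filtered.repartition("customer").cache()  # co-locate rows by partition key
result = prepared.withColumn("cumulative_sales", sum("sales").over(windowSpec))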
Conclusion
In this article, you have learned how to calculate the cumulative sum in PySpark using window functions, both across the entire dataset and within specific groups. We also explored how to efficiently compute cumulative sums for multiple columns in a DataFrame with well-defined examples.
Happy Learning!!
Related Articles
- PySpark rank() vs dense_rank() vs row_number()
- Explain PySpark repartition() vs partitionBy() functions
- How to add row Number without partition in PySpark?
- How to add column with row number in the DataFrame by Partition?
- Explain the rank() function in PySpark
- Explain the dense_rank() function in PySpark
- How to Reorder Columns in PySpark DataFrame?
- How to Remove Specific Characters from Strings PySpark?