How to Calculate Cumulative Sum in PySpark?

How do you calculate the cumulative sum in PySpark? You can use a window specification along with aggregate functions like sum() to calculate the cumulative sum in PySpark. This is a common operation, especially when working with time-series or grouped data. PySpark offers powerful window functions that make it easy to calculate cumulative sums both globally and within groups. In this article, you’ll learn:

  • What a cumulative sum is
  • How to implement it in PySpark
  • Real-world examples with and without groupings
  • How to apply cumulative logic across multiple columns
  • Best practices and performance tips

Key Points-

  • A cumulative sum is a running total that adds the current value to all previous values in a sequence, typically ordered by date or time.
  • You need to define a Window using orderBy() to maintain the order of rows. Optionally, use partitionBy() to group data.
  • Use .rowsBetween(Window.unboundedPreceding, Window.currentRow) to compute the sum from the beginning to the current row.
  • You can calculate cumulative sums across the entire DataFrame or within specific groups.
  • Use a loop to apply the cumulative logic to multiple columns in one go.

What is a Cumulative Sum?

A cumulative sum refers to the running total obtained by summing the current value with all preceding values in a sequence, commonly ordered by time or date. For example, for the daily sales 100, 150, and 50, the cumulative sums are 100, 250, and 300.

PySpark Cumulative Sum Syntax

The following is the syntax for calculating a cumulative sum.


# Imports
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# Define window for calculating cumulative sum
window_spec = Window.partitionBy("<group_col>").orderBy("<order_col>")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df = df.withColumn("cumulative_col", sum("<value_col>").over(window_spec))

Let’s create a sample DataFrame that we’ll use in the following examples.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName("CumulativeSum").getOrCreate()

# Sample data
data = [
    ("Geetha", "2024-01-01", 100),
    ("Geetha", "2024-01-02", 150),
    ("Geetha", "2024-01-03", 50),
    ("Sai", "2024-01-01", 200),
    ("Sai", "2024-01-03", 300)
]

columns = ["customer", "date", "sales"]
df = spark.createDataFrame(data, columns)
df.show()

Yields the output below.

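# Output:
+--------+----------+-----+
|customer|      date|sales|
+--------+----------+-----+
|  Geetha|2024-01-01|  100|
|  Geetha|2024-01-02|  150|
|  Geetha|2024-01-03|   50|
|     Sai|2024-01-01|  200|
|     Sai|2024-01-03|  300|
+--------+----------+-----+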

Calculate the Cumulative Sum in PySpark

To compute a cumulative sum over the entire DataFrame, define a window that orders the rows using orderBy() and sets the frame using rowsBetween(), without applying partitionBy(). This creates a running total from the first row to the current row.


# Calculate cumulative sum 
# Define Window
windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Calculate cumulative sum
df = df.withColumn("cumulative_sales", sum("sales").over(windowSpec))
df.show()

Yields the output below.

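Since no partitionBy() is used, the running total spans all customers. Rows that share the same date may appear in either order; assuming they keep their original order, the expected output looks like this:

# Output (rows with the same date may appear in either order):
+--------+----------+-----+----------------+
|customer|      date|sales|cumulative_sales|
+--------+----------+-----+----------------+
|  Geetha|2024-01-01|  100|             100|
|     Sai|2024-01-01|  200|             300|
|  Geetha|2024-01-02|  150|             450|
|  Geetha|2024-01-03|   50|             500|
|     Sai|2024-01-03|  300|             800|
+--------+----------+-----+----------------+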

Calculate Cumulative Sum of the Grouped Column

Alternatively, you can calculate the cumulative sum within a group by using a window specification with partitionBy() and orderBy() along with the rowsBetween() method. In this case, we calculate a cumulative sum of the sales column for each customer, ordered by date.


# Calculate Cumulative Sum by Grouped Column
# Define Window
windowSpec = Window.partitionBy("customer").orderBy("date")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Calculate cumulative sum
df = df.withColumn("cumulative_sales", sum("sales").over(windowSpec))
df.show()

Yields the output below.


# Output:
+--------+----------+-----+----------------+
|customer|      date|sales|cumulative_sales|
+--------+----------+-----+----------------+
|  Geetha|2024-01-01|  100|             100|
|  Geetha|2024-01-02|  150|             250|
|  Geetha|2024-01-03|   50|             300|
|     Sai|2024-01-01|  200|             200|
|     Sai|2024-01-03|  300|             500|
+--------+----------+-----+----------------+

Cumulative Sum for Multiple Columns in PySpark

So far, we’ve explored how to calculate the cumulative sum for an entire DataFrame and within groups, with and without partitionBy(). Now, let’s look at how to compute cumulative sums for multiple columns in a DataFrame at once.

You can achieve this by defining a window that uses both partitionBy() and orderBy() to group and sort the data. Then, by looping through the target columns, you can efficiently apply the cumulative sum logic to each one.


# Cumulative Sum for Multiple Columns 
spark = SparkSession.builder.appName("CumulativeSum").getOrCreate()

# Sample data
data = [
    ("Geetha", "2024-01-01", 100, 10),
    ("Geetha", "2024-01-02", 150, 20),
    ("Geetha", "2024-01-03", 50, 30),
    ("Sai", "2024-01-01", 200, 5),
    ("Sai", "2024-01-03", 300, 15)
]
columns = ["customer", "date", "sales", "transactions"]
df = spark.createDataFrame(data, columns)
df.show()

# Define Window
windowSpec = Window.partitionBy("customer").orderBy("date")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Calculate cumulative sum for multiple columns
columns_to_sum = ["sales", "transactions"]
for col_name in columns_to_sum:
    df = df.withColumn(f"cumulative_{col_name}", sum(col_name).over(windowSpec))

df.show()

Yields the output below.


# Output:
+--------+----------+-----+------------+
|customer|      date|sales|transactions|
+--------+----------+-----+------------+
|  Geetha|2024-01-01|  100|          10|
|  Geetha|2024-01-02|  150|          20|
|  Geetha|2024-01-03|   50|          30|
|     Sai|2024-01-01|  200|           5|
|     Sai|2024-01-03|  300|          15|
+--------+----------+-----+------------+

+--------+----------+-----+------------+----------------+-----------------------+
|customer|      date|sales|transactions|cumulative_sales|cumulative_transactions|
+--------+----------+-----+------------+----------------+-----------------------+
|  Geetha|2024-01-01|  100|          10|             100|                     10|
|  Geetha|2024-01-02|  150|          20|             250|                     30|
|  Geetha|2024-01-03|   50|          30|             300|                     60|
|     Sai|2024-01-01|  200|           5|             200|                      5|
|     Sai|2024-01-03|  300|          15|             500|                     20|
+--------+----------+-----+------------+----------------+-----------------------+

Frequently Asked Questions on Cumulative Sum in PySpark

How can I calculate the cumulative sum without a partition?

If you don’t use partitionBy(), the cumulative sum will be calculated over the entire DataFrame, treating it as a single group.

What happens if I don’t specify orderBy()?

orderBy() is mandatory for cumulative sums to maintain sequence. Without it, the order of calculation is undefined.

How can I use cumulative sum on timestamp columns?

You can order by timestamp columns to calculate time-based running totals.
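For example, here is a minimal sketch that converts a hypothetical event_time string column to a proper timestamp and orders the window by it:


# Sketch: cumulative sum ordered by a timestamp column
# (the "event_time" column is hypothetical, holding strings like "2024-01-01 09:30:00")
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, to_timestamp

df = df.withColumn("event_ts", to_timestamp("event_time", "yyyy-MM-dd HH:mm:ss"))

# Order the window by the timestamp to get a time-based running total
windowSpec = Window.partitionBy("customer").orderBy("event_ts")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df = df.withColumn("cumulative_sales", sum("sales").over(windowSpec))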

What’s the difference between rowsBetween and rangeBetween?

rowsBetween() frames the window by physical row positions, so rows that tie on the ordering column can still receive different running totals.
rangeBetween() frames the window by the values of the ordering column, so tied rows are treated as peers and receive the same total (value-based offsets only work on numeric or timestamp ordering columns).
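To illustrate the difference, here is a minimal sketch that applies both frame types to the sample DataFrame:


# Sketch: rowsBetween vs rangeBetween on the sample data
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# ROWS frame: counts physical rows, so rows sharing a date get different totals
rows_spec = Window.orderBy("date")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# RANGE frame: groups rows by the ordering value, so rows sharing a date
# are peers and receive the same total
range_spec = Window.orderBy("date")\
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)

df = df.withColumn("rows_total", sum("sales").over(rows_spec))\
       .withColumn("range_total", sum("sales").over(range_spec))
df.show()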

How can I reset the cumulative sum based on conditions?

Use partitionBy() with an additional column that marks where the running total should reset, so each segment starts its own cumulative sum.
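For example, a minimal sketch that resets the running total for each customer every month, using a month column derived from date purely for illustration:


# Sketch: reset the running total for each customer every month
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, substring

# Derive a "month" column (e.g. "2024-01") to mark where the total should reset
df = df.withColumn("month", substring("date", 1, 7))

# Partitioning by customer AND month restarts the running total each month
windowSpec = Window.partitionBy("customer", "month").orderBy("date")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df = df.withColumn("monthly_cumulative_sales", sum("sales").over(windowSpec))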

How can I do cumulative sum for multiple columns at once?

You can apply .withColumn() multiple times or use a loop to compute cumulative sums for multiple columns.

What is the performance impact of using cumulative sum?

Window functions can be expensive on large datasets. Always filter, repartition, and cache DataFrames wisely before applying them.
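For example, a rough sketch of narrowing and caching the data before applying the window function (the date filter below is only illustrative):


# Sketch: reduce and cache the data before applying the window function
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# Filter early (illustrative condition), align partitioning with the window key,
# and cache the result so the shuffle is not repeated
df_small = df.filter(df.date >= "2024-01-01")\
             .repartition("customer")\
             .cache()

windowSpec = Window.partitionBy("customer").orderBy("date")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_small = df_small.withColumn("cumulative_sales", sum("sales").over(windowSpec))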


Conclusion

In this article, you learned how to calculate the cumulative sum in PySpark using window functions, both across the entire dataset and within specific groups. We also explored how to efficiently compute cumulative sums for multiple columns in a DataFrame with well-defined examples.

Happy Learning!!