  • Post category:PySpark
  • Post last modified:July 14, 2025
  • Reading time:14 mins read

In PySpark, the coalesce() function serves two primary purposes. First, it is commonly used as a transformation to reduce the number of partitions in a DataFrame to a specified number. It is particularly efficient when reducing to a single partition (e.g., coalesce(1)), as it avoids a full data shuffle, making it more performant than repartition(). Second, coalesce() is a powerful column-level function used to handle missing data by returning the first non-null value from a list of specified columns for each row. This is especially useful when working with datasets that have multiple columns containing potential null values, allowing you to consolidate them into a single meaningful column.


In this article, I will explain how to use pyspark.sql.functions.coalesce() to combine multiple columns into one, and how to handle null values in the new column by assigning a default value using the lit() function.

PySpark coalesce() Function

In PySpark, the coalesce() function is used to reduce the number of partitions in a DataFrame to a specified number. It is the preferred option when decreasing partitions, especially when you want to reduce them to a single partition (numPartitions = 1), as it avoids a full shuffle and is more efficient than repartition().

Syntax of coalesce() Function

The following is the syntax of the coalesce() function.


# Syntax of the coalesce() function
pyspark.sql.functions.coalesce(*cols)

Parameters

cols: one or more column names or Column objects to evaluate in order.

Return Value

It returns the first non-NULL value. If all expressions are NULL, it returns NULL.

Sample DataFrame

First, let’s create a sample DataFrame.


# Create DataFrame
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, lit

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Create data
data = [[None, "blue", None], 
        ["red", None, None], 
        [None, None, "green"], 
        [None, "yellow", "black"], 
        [None, None, None]]

# Create column names
columns = ['primary_color', 'secondary_color', 'backup_color']

# Create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

# View original DataFrame
df.show()

Yields the output below.


# Output:
+-------------+---------------+------------+
|primary_color|secondary_color|backup_color|
+-------------+---------------+------------+
|         NULL|           blue|        NULL|
|          red|           NULL|        NULL|
|         NULL|           NULL|       green|
|         NULL|         yellow|       black|
|         NULL|           NULL|        NULL|
+-------------+---------------+------------+

PySpark Coalesce Multiple Columns

You can use the PySpark coalesce() function to add a new column to a DataFrame containing the first non-null value from a specified list of columns. Pass the desired columns into the function to get a new column that holds the first available non-null value from those columns for each row.


# Add a new column using coalesce from multiple columns
df = df.withColumn("final_color", coalesce("primary_color", "secondary_color", "backup_color"))

# View updated DataFrame
df.show()

Yields the output below.


# Output:
+-------------+---------------+------------+-----------+
|primary_color|secondary_color|backup_color|final_color|
+-------------+---------------+------------+-----------+
|         NULL|           blue|        NULL|       blue|
|          red|           NULL|        NULL|        red|
|         NULL|           NULL|       green|      green|
|         NULL|         yellow|       black|     yellow|
|         NULL|           NULL|        NULL|       NULL|
+-------------+---------------+------------+-----------+

Coalesce two Columns in a PySpark DataFrame

You can also use coalesce() to combine two columns into one by returning the first non-null value from the specified columns for each row.


# Add a new column using coalesce from two columns
df = df.withColumn("final_color", coalesce("primary_color", "secondary_color"))

# View updated DataFrame
df.show()

Yields the output below.


# Output:
+-------------+---------------+------------+-----------+
|primary_color|secondary_color|backup_color|final_color|
+-------------+---------------+------------+-----------+
|         NULL|           blue|        NULL|       blue|
|          red|           NULL|        NULL|        red|
|         NULL|           NULL|       green|       NULL|
|         NULL|         yellow|       black|     yellow|
|         NULL|           NULL|        NULL|       NULL|
+-------------+---------------+------------+-----------+

Coalesce with Literal Values

If you want to avoid null values in the new column, you can use coalesce() along with a default fallback value using lit(). Simply pass the columns into the function along with a literal fallback such as lit("Missing"); if all column values are null, coalesce() returns the fallback value, otherwise it returns the first non-null value from the given columns.


# Add a new column using coalesce with a default fallback value "Missing"
df = df.withColumn(
    "final_value_with_default",
    coalesce("primary_color", "secondary_color", "backup_color", lit("Missing"))
)

# View updated DataFrame
df.show()

Yields the output below.


# Output:
+-------------+---------------+------------+-----------+------------------------+
|primary_color|secondary_color|backup_color|final_color|final_value_with_default|
+-------------+---------------+------------+-----------+------------------------+
|         NULL|           blue|        NULL|       blue|                    blue|
|          red|           NULL|        NULL|        red|                     red|
|         NULL|           NULL|       green|       NULL|                   green|
|         NULL|         yellow|       black|     yellow|                  yellow|
|         NULL|           NULL|        NULL|       NULL|                 Missing|
+-------------+---------------+------------+-----------+------------------------+

PySpark Coalesce Empty String

Note that coalesce() does not treat empty strings as nulls. If your data uses empty strings to represent missing values, convert them to null explicitly using when() and otherwise() before calling coalesce().


# Coalesce empty string
from pyspark.sql.functions import when, col
from pyspark.sql import SparkSession


# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Define data
data = [["", "blue", None], 
        ["red", "", None], 
        [None, None, "green"], 
        [None, "yellow", "black"], 
        [None, None, None]]

# Define column names
columns = ['primary_color', 'secondary_color', 'backup_color']

# Create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

# Convert empty strings in primary_color to null
df = df.withColumn("primary_color", when(col("primary_color") == "", None).otherwise(col("primary_color")))
df.show()

Yields the output below.


# Output:
+-------------+---------------+------------+
|primary_color|secondary_color|backup_color|
+-------------+---------------+------------+
|         NULL|           blue|        NULL|
|          red|               |        NULL|
|         NULL|           NULL|       green|
|         NULL|         yellow|       black|
|         NULL|           NULL|        NULL|
+-------------+---------------+------------+

PySpark coalesce Partition

Alternatively, you can use coalesce() as a DataFrame method to reduce the number of partitions in a DataFrame.


# Use PySpark to reduce the Partition of the DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CoalesceExample") \
    .getOrCreate()

# Create a DataFrame with 8 partitions
df = spark.range(0, 100).repartition(8)
print("Original number of partitions:", df.rdd.getNumPartitions())

# Reduce to 2 partitions
df_reduced = df.coalesce(2)
print("Reduced number of partitions:", df_reduced.rdd.getNumPartitions())

# Output:
# Original number of partitions: 8
# Reduced number of partitions: 2

PySpark coalesce Null to 0

To replace null values in a column with 0, pass the column together with lit(0) as a default value to PySpark’s coalesce() function.


# Replace Null values with 0
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, lit

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Sample data with nulls
data = [(1, None),
        (2, 50),
        (3, None),
        (4, 100)]

# Define columns
columns = ["id", "numeric_column"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Replace nulls with 0 using coalesce
df = df.withColumn(
    "numeric_column_filled",
    coalesce("numeric_column", lit(0))
)

# Show result
df.show()

Yields the output below.


# Output:
+---+--------------+---------------------+
| id|numeric_column|numeric_column_filled|
+---+--------------+---------------------+
|  1|          NULL|                    0|
|  2|            50|                   50|
|  3|          NULL|                    0|
|  4|           100|                  100|
+---+--------------+---------------------+

PySpark coalesce(1)

You can call coalesce(1) on a DataFrame (the DataFrame method, not the column function) to reduce all partitions to one.


# Coalesce to 1 partition 
df_single = df_reduced.coalesce(1)
print("Reduced number of partitions:", df_single.rdd.getNumPartitions())

# Output:
# Reduced number of partitions: 1

Frequently Asked Questions about PySpark coalesce()

What is the purpose of coalesce() in PySpark?

It returns the first non-null value from a list of columns for each row, commonly used to handle missing data.

How is coalesce() used to merge multiple columns into one?

By passing multiple columns to coalesce(), it picks the first non-null value row-wise and creates a consolidated column.

What’s the difference between coalesce() and fillna()?

fillna() replaces all null values with a specified constant, while coalesce() dynamically selects the first non-null value across multiple columns.

How to provide a default value when all columns are null?

Append a literal value using lit() at the end of the argument list.
For example: coalesce("col1", "col2", lit("Missing"))

How to use coalesce() in PySpark SQL queries?

Use the COALESCE(col1, col2, ...) syntax within SQL strings passed to spark.sql().

What output is returned if all columns in coalesce() are null?

The result will be null for that row unless a fallback value (literal) is provided.

Conclusion

In this article, I have explained how to use the PySpark coalesce() function to combine two or more columns into a single column by returning the first non-null value from the specified columns for each row. I also demonstrated how to handle null values by setting a default value using the lit() function.

We explored the use of coalesce() as:

  • A column-level function, which is particularly useful for handling missing data and consolidating values across multiple columns.
  • A DataFrame method to reduce the number of partitions efficiently without triggering a full shuffle — an important optimization technique for performance tuning in distributed data processing.

Understanding both usages of coalesce() helps in writing cleaner, more efficient PySpark code when working with incomplete data and optimizing resource usage.

Happy Learning!!