In PySpark, coalesce() refers to two different things. First, DataFrame.coalesce() is a transformation that reduces the number of partitions in a DataFrame to a specified number. Because it avoids a full data shuffle, it is more efficient than repartition() when decreasing partitions, especially down to a single partition with coalesce(1). Second, pyspark.sql.functions.coalesce() is a column-level function that handles missing data by returning the first non-null value from a list of columns for each row. This is especially useful when a dataset spreads potentially null values across several columns and you want to consolidate them into a single meaningful column.
In this article, I will explain how to use pyspark.sql.functions.coalesce() to combine multiple columns into one, and how to handle null values in the new column by assigning a default value using the lit() function.
PySpark coalesce() Function
In PySpark, the pyspark.sql.functions.coalesce() function evaluates the given columns in order and, for each row, returns the value of the first column that is not null. It is the standard tool for consolidating values that are scattered across several nullable columns.
Syntax of coalesce() Function
The following is the syntax of the coalesce() function.
# Syntax of the coalesce() function
pyspark.sql.functions.coalesce(*cols)
Parameters
cols: one or more columns or column names to evaluate in order.
Return Value
It returns the first non-NULL value. If all expressions are NULL, it returns NULL.
Sample DataFrame
First, let's create a sample DataFrame.
# Create DataFrame
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, lit

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Create data
data = [[None, "blue", None],
        ["red", None, None],
        [None, None, "green"],
        [None, "yellow", "black"],
        [None, None, None]]

# Create column names
columns = ['primary_color', 'secondary_color', 'backup_color']

# Create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

# View original DataFrame
df.show()
Yields the output below.
# Output:
+-------------+---------------+------------+
|primary_color|secondary_color|backup_color|
+-------------+---------------+------------+
|         NULL|           blue|        NULL|
|          red|           NULL|        NULL|
|         NULL|           NULL|       green|
|         NULL|         yellow|       black|
|         NULL|           NULL|        NULL|
+-------------+---------------+------------+
Pyspark Coalesce Multiple Columns
You can use the PySpark coalesce() function to add a new column to a DataFrame that holds the first non-null value from a list of columns. Passing several columns into the function yields a new column containing, for each row, the first available non-null value among them.
# Add a new column using coalesce from multiple columns
df = df.withColumn("final_color", coalesce("primary_color", "secondary_color", "backup_color"))
# View updated DataFrame
df.show()
Yields the output below.
# Output:
+-------------+---------------+------------+-----------+
|primary_color|secondary_color|backup_color|final_color|
+-------------+---------------+------------+-----------+
|         NULL|           blue|        NULL|       blue|
|          red|           NULL|        NULL|        red|
|         NULL|           NULL|       green|      green|
|         NULL|         yellow|       black|     yellow|
|         NULL|           NULL|        NULL|       NULL|
+-------------+---------------+------------+-----------+
Coalesce two Columns in a PySpark DataFrame
You can also use coalesce() to combine two columns into one by returning the first non-null value from the specified columns for each row.
# Add a new column using coalesce from two columns
df = df.withColumn("final_color", coalesce("primary_color", "secondary_color"))
# View updated DataFrame
df.show()
Yields the output below. Note that the third row is now NULL, because backup_color is no longer consulted.
# Output:
+-------------+---------------+------------+-----------+
|primary_color|secondary_color|backup_color|final_color|
+-------------+---------------+------------+-----------+
|         NULL|           blue|        NULL|       blue|
|          red|           NULL|        NULL|        red|
|         NULL|           NULL|       green|       NULL|
|         NULL|         yellow|       black|     yellow|
|         NULL|           NULL|        NULL|       NULL|
+-------------+---------------+------------+-----------+
Coalesce with Literal Values
If you want to avoid null values in the new column, pass a default fallback value as the last argument using lit() (imported from pyspark.sql.functions). For a row where all of the listed columns are null, coalesce() returns the default; otherwise it returns the first non-null value as before.
# Add a new column using coalesce with a default fallback value "Missing"
df = df.withColumn(
    "final_value_with_default",
    coalesce("primary_color", "secondary_color", "backup_color", lit("Missing"))
)
# View updated DataFrame
df.show()
Yields the output below. The final_color column still reflects the two-column coalesce from the previous step, while final_value_with_default checks all three columns before falling back to "Missing".
# Output:
+-------------+---------------+------------+-----------+------------------------+
|primary_color|secondary_color|backup_color|final_color|final_value_with_default|
+-------------+---------------+------------+-----------+------------------------+
|         NULL|           blue|        NULL|       blue|                    blue|
|          red|           NULL|        NULL|        red|                     red|
|         NULL|           NULL|       green|       NULL|                   green|
|         NULL|         yellow|       black|     yellow|                  yellow|
|         NULL|           NULL|        NULL|       NULL|                 Missing|
+-------------+---------------+------------+-----------+------------------------+
PySpark Coalesce Empty String
Note that coalesce() treats empty strings as ordinary values, not as nulls. If your data marks missing values with empty strings, first convert them to null using when() and otherwise().
# Coalesce empty string
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Define data
data = [["", "blue", None],
        ["red", "", None],
        [None, None, "green"],
        [None, "yellow", "black"],
        [None, None, None]]

# Define column names
columns = ['primary_color', 'secondary_color', 'backup_color']

# Create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

# Convert empty strings in primary_color to null so coalesce() can skip them
df = df.withColumn("primary_color",
                   when(col("primary_color") == "", None).otherwise(col("primary_color")))
df.show()
Yields the output below. Only primary_color was converted, so the empty string in secondary_color remains.
# Output:
+-------------+---------------+------------+
|primary_color|secondary_color|backup_color|
+-------------+---------------+------------+
|         NULL|           blue|        NULL|
|          red|               |        NULL|
|         NULL|           NULL|       green|
|         NULL|         yellow|       black|
|         NULL|           NULL|        NULL|
+-------------+---------------+------------+
PySpark coalesce Partition
Alternatively, coalesce() is also available as a DataFrame method, which reduces the number of partitions in a DataFrame.
# Use the DataFrame coalesce() method to reduce partitions
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CoalesceExample") \
    .getOrCreate()
# Create a DataFrame with 8 partitions
df = spark.range(0, 100).repartition(8)
print("Original number of partitions:", df.rdd.getNumPartitions())
# Reduce to 2 partitions
df_reduced = df.coalesce(2)
print("Reduced number of partitions:", df_reduced.rdd.getNumPartitions())
# Output:
# Original number of partitions: 8
# Reduced number of partitions: 2
PySpark coalesce Null to 0
To replace null values with 0 in a new column, pass lit(0) as the fallback value to PySpark's coalesce() function.
# Replace Null values with 0
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, lit

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Sample data with nulls
data = [(1, None),
        (2, 50),
        (3, None),
        (4, 100)]

# Define columns
columns = ["id", "numeric_column"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Replace nulls with 0 using coalesce
df = df.withColumn(
    "numeric_column_filled",
    coalesce("numeric_column", lit(0))
)
# Show result
df.show()
Yields the output below.
# Output:
+---+--------------+---------------------+
| id|numeric_column|numeric_column_filled|
+---+--------------+---------------------+
|  1|          NULL|                    0|
|  2|            50|                   50|
|  3|          NULL|                    0|
|  4|           100|                  100|
+---+--------------+---------------------+
PySpark coalesce(1)
You can call the DataFrame method coalesce(1) (not the column function) to merge all partitions into one.
# Coalesce to 1 partition
df_single = df_reduced.coalesce(1)
print("Reduced number of partitions:", df_single.rdd.getNumPartitions())
# Output:
# Reduced number of partitions: 1
Frequently Asked Questions of PySpark coalesce()

What does the coalesce() function return?
It returns the first non-null value from a list of columns for each row, and is commonly used to handle missing data.

How do I coalesce multiple columns into one?
Pass multiple columns to coalesce(); it picks the first non-null value row-wise and produces a consolidated column.

How is coalesce() different from fillna()?
fillna() replaces all null values with a specified constant, while coalesce() dynamically selects the first non-null value across multiple columns.

How do I set a default when every column is null?
Include a literal value using F.lit() at the end of the argument list. For example: F.coalesce("col1", "col2", F.lit("Missing"))

Can I use COALESCE in Spark SQL?
Yes, use the COALESCE(col1, col2, ...) syntax within SQL strings passed to spark.sql().

What happens if all the columns are null?
The result will be null for that row unless a fallback value (literal) is provided.
Conclusion
In this article, I have explained how to use the PySpark coalesce() function to combine two or more columns into a single column by returning the first non-null value from the specified columns for each row. I also demonstrated how to handle null values by setting a default value using the lit() function.
We explored the use of coalesce() as:
- A column-level function, which is particularly useful for handling missing data and consolidating values across multiple columns.
- A DataFrame method to reduce the number of partitions efficiently without triggering a full shuffle, an important optimization technique for performance tuning in distributed data processing.
Understanding both usages of coalesce() helps in writing cleaner, more efficient PySpark code when working with incomplete data and optimizing resource usage.
Happy Learning!!