
In PySpark, when working with Parquet or Delta files, data often evolves: new columns may appear, data types may change, or files may arrive with slightly different structures. Managing these evolving schemas efficiently is crucial to maintaining data consistency and preventing data loss.


The PySpark mergeSchema option plays a key role in handling schema evolution and schema drift in data lakes. It allows Spark to merge multiple file schemas into a single unified schema, ensuring that all columns are retained when reading or writing Parquet and Delta data.

In this article, you’ll learn everything about PySpark mergeSchema, including syntax, parameters, examples, how it differs from overwriteSchema, and when to use it for efficient schema management.

Key Points

  • mergeSchema merges columns from multiple files into one unified schema.
  • Missing columns are automatically filled with null.
  • It applies only to Parquet and Delta file formats.
  • It helps handle schema drift in data lakes or evolving datasets.
  • Use it carefully in production, as it may impact performance.
  • For Delta tables, pass .option("mergeSchema", True) on writes to enable schema evolution.

What is mergeSchema in PySpark?

The mergeSchema option in PySpark tells Spark to merge multiple schema definitions across files within a folder.

When you read Parquet or Delta data:

  • By default, Spark picks one schema (usually from the first file).
  • When you enable mergeSchema=True, Spark merges all unique columns into a unified schema and fills missing columns with null.

Syntax


# Syntax of mergeSchema
spark.read.option("mergeSchema", "true").parquet("path_to_directory")

Parameter:

  • mergeSchema (bool): If set to True, Spark merges all unique schemas across files.
  • Default: False
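If you prefer not to pass the option on every read, Spark also exposes an equivalent session-level configuration, spark.sql.parquet.mergeSchema. A minimal sketch:


# Enable schema merging for every Parquet read in this session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

# A plain read now merges schemas without the per-read option
df = spark.read.parquet("path_to_directory")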

Why mergeSchema is Important

In real-world data engineering pipelines, new columns may appear in later loads or partitions. Without merging, Spark would ignore these columns, leading to data inconsistency or data loss.

Using PySpark mergeSchema ensures that all versions of your schema are unified and all columns are preserved during reading or writing.

PySpark mergeSchema Example with Parquet Files

Let’s see how mergeSchema in PySpark works with Parquet data.

First, create two Parquet datasets with different schemas to see how mergeSchema behaves.


# Create two DataFrames
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create SparkSession
spark = SparkSession.builder.appName("MergeSchemaExample").getOrCreate()

# Dataset 1
data1 = [(1, "John"), (2, "Sara")]
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])
df1 = spark.createDataFrame(data1, schema1)
df1.write.mode("overwrite").parquet("data/customers/part1")

# Dataset 2 with extra column
data2 = [(3, "Mike", "NY"), (4, "Emma", "LA")]
schema2 = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("city", StringType(), True)
])
df2 = spark.createDataFrame(data2, schema2)
df2.write.mode("overwrite").parquet("data/customers/part2")

Read Parquet Files Without mergeSchema

When you read Parquet files with different schemas in PySpark without the mergeSchema option, Spark applies the schema from only one file (typically the first it encounters) and ignores any additional columns present in other files.


# Read Parquet Files Without mergeSchema
df = spark.read.parquet("data/customers/")
df.printSchema()

Yields the output below.

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

Here, the city column from the second dataset is missing because mergeSchema=False.

Set PySpark mergeSchema as True

Alternatively, set the mergeSchema option to True to read Parquet files with evolving schemas. When you read a directory containing Parquet files with slightly different structures, mergeSchema=True instructs Spark to infer a unified schema that includes every column present across those files.


# Read Parquet Files With mergeSchema
df_merged = spark.read.option("mergeSchema", True).parquet("data/customers/")
df_merged.printSchema()

Yields the output below.

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)

With PySpark mergeSchema=True, all columns are included, and missing values are filled with null.
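To see the null filling in action, display the merged DataFrame. A small sketch; the exact row order and null rendering may differ slightly across Spark versions:


# Show merged data; rows from files without a city column get null
df_merged.show()

# +---+----+----+
# | id|name|city|
# +---+----+----+
# |  1|John|null|
# |  2|Sara|null|
# |  3|Mike|  NY|
# |  4|Emma|  LA|
# +---+----+----+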

PySpark mergeSchema with Delta Tables

You can also use mergeSchema during Delta table writes to automatically merge new columns into the table’s existing schema.


# Writing to Delta with schema merge
df2.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", True) \
    .save("data/delta/customers")

Here, Spark merges the new columns in df2 with the existing schema stored in the Delta table, ensuring smooth schema evolution.
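If you want this behavior for a whole session rather than per write, Delta Lake also supports a session-level setting for automatic schema evolution; a minimal sketch, assuming a recent Delta Lake version:


# Session-wide alternative to passing mergeSchema on every Delta write
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

df2.write.format("delta").mode("append").save("data/delta/customers")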

PySpark saveAsTable mergeSchema Example

You can also apply mergeSchema while using saveAsTable for Delta tables.


# Writing to Delta with schema merge
df2.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", True) \
    .saveAsTable("default.customers")

This automatically merges the schema changes when appending new data to an existing table.
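You can verify that the new column was merged by reading the table schema back:


# Confirm the merged schema on the Delta table
spark.table("default.customers").printSchema()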

When to Use mergeSchema

Use mergeSchema=True when:

  • You’re reading Parquet or Delta files that have different or evolving schemas.
  • You want to unify all columns across multiple partitions or files.
  • You’re working in data lake environments where schema drift is common.

Avoid mergeSchema when:

  • You need high-performance reads (schema merging requires scanning all file schemas).
  • You want strict schema enforcement in production systems.

Best Practices

  • For Delta tables, combine with .option("mergeSchema", True) during write operations for smooth schema evolution.
  • Use mergeSchema sparingly in production, as it adds overhead.
  • For static or stable schemas, define schemas manually for faster reads (see the sketch after this list).
  • Use it primarily for data lake ingestion or historical data merging.
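To illustrate the manual-schema approach, here is a minimal sketch that reuses the field definitions from the earlier example; the read skips schema merging entirely:


# Explicit schema: no footer scanning, no merging, predictable columns
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

explicit_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("city", StringType(), True)
])

df_fixed = spark.read.schema(explicit_schema).parquet("data/customers/")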

PySpark mergeSchema vs overwriteSchema

mergeSchema and overwriteSchema both support schema evolution, but they behave differently.

Property        mergeSchema                      overwriteSchema
Operation       Reading/Writing                  Writing only
Behavior        Merges new columns               Replaces existing schema
Common Usage    Handle evolving data             Redefine schema completely
Example         .option("mergeSchema", True)     .option("overwriteSchema", True)
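For comparison, here is a sketch of overwriteSchema applied to the Delta table from the earlier example; note that it only takes effect with overwrite mode:


# overwriteSchema replaces the table's stored schema entirely
df2.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", True) \
    .save("data/delta/customers")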

Frequently Asked Questions about PySpark mergeSchema

What is mergeSchema in PySpark?

The mergeSchema option allows Spark to combine multiple file schemas when reading Parquet or Delta data. It ensures that all columns across files are included in the final DataFrame.

How does mergeSchema work internally?

Spark scans metadata (footers) of each Parquet or Delta file, collects all field definitions, and creates a unified schema by combining all unique columns. Missing columns are filled with null.
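One way to see what Spark merges is to print each file group’s schema separately; a small sketch using the two directories created earlier:


# Inspect the individual schemas that Spark merges
for path in ["data/customers/part1", "data/customers/part2"]:
    print(path)
    spark.read.parquet(path).printSchema()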

Why do we need mergeSchema in PySpark?

In real-world data lakes, schemas often evolve as new fields are added. Without mergeSchema, Spark would ignore new columns, leading to data loss or incomplete reads.

What happens if we don’t use mergeSchema?

Spark picks one file’s schema (usually the first) and ignores any extra columns in other files, potentially losing data.

How is mergeSchema different from overwriteSchema?

mergeSchema: Combines schemas while reading Parquet files or appending to Delta tables.
overwriteSchema: Replaces a Delta table’s stored schema when writing in overwrite mode.

Why should mergeSchema be used selectively?

Spark needs to scan all file schemas, which adds overhead and increases read time, especially for large datasets.

Conclusion

The mergeSchema option in PySpark is essential for handling evolving datasets where schemas change over time. It ensures that all columns across multiple files are included, making your data processing more resilient and complete.

However, because schema merging can impact performance, it’s best used selectively — especially during data ingestion or exploration phases. For production pipelines, defining schemas manually remains the most efficient approach.

Happy Learning!!
