Spark or PySpark Write Modes Explained

In this article, I will explain the different save or write modes in Spark or PySpark with examples. These write modes are used when writing a Spark DataFrame as JSON, CSV, Parquet, Avro, ORC, or text files, and also when writing to Hive tables and JDBC tables such as MySQL, SQL Server, etc.

Key Points of Spark Write Modes

  • Save or write modes are optional.
  • They specify how to handle existing data, if present.
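  • Use the mode() function to specify the save or write mode. Passing "mode" through option() does not set it, because DataFrameWriter keeps the save mode separately from the data source options.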
  • With the overwrite mode, Spark drops the existing table before saving.
  • If the existing table has indexes, you need to re-create them after overwriting.
  • The truncate option can be used to truncate the table instead of dropping it; in that case the indexes do not need to be re-created.

1. Write Modes in Spark or PySpark

Use Spark/PySpark DataFrameWriter.mode() to specify the save mode; the argument is either one of the strings below or, in Scala and Java, a constant from the SaveMode class.

Spark Write Mode | Description
overwrite | Overwrites the existing file. Alternatively, you can use SaveMode.Overwrite.
append | Adds the data to the existing file. Alternatively, you can use SaveMode.Append.
ignore | Ignores the write operation when the file already exists. Alternatively, you can use SaveMode.Ignore.
errorifexists or error | The default mode. Returns an error when the file already exists. Alternatively, you can use SaveMode.ErrorIfExists.
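
The table above can be read as a decision on two inputs: the mode, and whether the target already exists. The following is a minimal pure-Python sketch of that decision, not Spark's implementation. The `store` dictionary and the `save` function are hypothetical names standing in for the storage location and the writer.

```python
# Toy model of the four save modes; `store` stands in for storage (hypothetical).

def save(store, path, rows, mode="errorifexists"):
    """Apply `rows` to `store[path]` the way each save mode would."""
    exists = path in store
    if mode in ("error", "errorifexists"):
        # Default mode: refuse to touch existing data.
        if exists:
            raise FileExistsError(f"path {path} already exists.")
        store[path] = list(rows)
    elif mode == "overwrite":
        # Existing data is discarded and replaced.
        store[path] = list(rows)
    elif mode == "append":
        # New rows are added to whatever is already there.
        store.setdefault(path, []).extend(rows)
    elif mode == "ignore":
        # Write only when nothing exists; otherwise silently do nothing.
        if not exists:
            store[path] = list(rows)
    else:
        raise ValueError(f"Unknown save mode: {mode}")
    return store


store = {}
save(store, "/tmp/person", ["a"])                    # default mode, new path
save(store, "/tmp/person", ["b"], mode="append")     # -> ["a", "b"]
save(store, "/tmp/person", ["c"], mode="ignore")     # unchanged
save(store, "/tmp/person", ["d"], mode="overwrite")  # -> ["d"]
```

Calling `save` again with the default mode on the same path raises an error, which mirrors the errorifexists row of the table.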

2. Errorifexists or error Write Mode

errorifexists (or error) is the default write mode in Spark. The example below writes personDF as JSON into the specified directory. If the person directory already exists at that path, the write fails with the error: pyspark.sql.utils.AnalysisException: path /path/to/write/person already exists.;


//Using string
personDF.write.mode("error").json("/path/to/write/person")

//Using SaveMode class (Scala/Java API; requires import org.apache.spark.sql.SaveMode)
personDF.write.mode(SaveMode.ErrorIfExists).json("/path/to/write/person")

//Relying on the default (no mode() call means errorifexists)
personDF.write.json("/path/to/write/person")

All the above examples have the same behavior.
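
The SaveMode constants are not available in PySpark, so PySpark code passes the mode as a string. Below is a minimal sketch, assuming `person_df` is an existing PySpark DataFrame. `write_json` and `VALID_MODES` are hypothetical helper names, and the check only serves to fail fast on a misspelled mode before any Spark job starts.

```python
# Mode strings covered in this article (hypothetical helper constant).
VALID_MODES = {"error", "errorifexists", "overwrite", "append", "ignore"}


def write_json(df, path, mode="errorifexists"):
    """Write `df` as JSON to `path` using the given save mode string."""
    mode = mode.lower()
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown save mode: {mode}")
    # In PySpark, DataFrameWriter.mode() takes the mode as a string.
    df.write.mode(mode).json(path)


# Usage (assumes an active SparkSession and a DataFrame named person_df):
# write_json(person_df, "/path/to/write/person")            # fails if the path exists
# write_json(person_df, "/path/to/write/person", "append")  # adds to existing data
```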

3. Spark Write in Overwrite Mode

The overwrite mode overwrites the existing file; alternatively, you can use SaveMode.Overwrite. With this write mode, Spark deletes the existing file or drops the existing table before writing.

When working with JDBC, be careful with this mode, because dropping the table also drops any indexes defined on it. To avoid this, set the truncate write option; Spark then truncates the table instead of dropping it, so the indexes are kept.


//Using overwrite
personDF.write.mode("overwrite").json("/path/to/write/person")

//Using SaveMode class (Scala/Java API, not PySpark)
personDF.write.mode(SaveMode.Overwrite).json("/path/to/write/person")

Using the truncate option with overwrite mode:


//Using overwrite with truncate
personDF.write.mode("overwrite")
    .format("jdbc")
    .option("driver","com.mysql.cj.jdbc.Driver")
    .option("url", "jdbc:mysql://localhost:3306/emp")
    .option("dbtable","employee")
    .option("truncate","true")
    .option("user", "root")
    .option("password", "root")
    .save()
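
For PySpark users, the same JDBC overwrite with truncate can be sketched as follows. The connection values are the placeholders from the example above. `jdbc_truncate_options` and `overwrite_table` are hypothetical helper names, and `person_df` is assumed to be an existing DataFrame. The chain ends with save(), since this is a write operation.

```python
def jdbc_truncate_options(url, table, user, password,
                          driver="com.mysql.cj.jdbc.Driver"):
    """Build JDBC writer options that truncate instead of dropping the table."""
    return {
        "driver": driver,
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # With overwrite mode, keep the table definition and only empty it.
        "truncate": "true",
    }


def overwrite_table(df, options):
    """Overwrite the JDBC table described by `options` with the rows of `df`."""
    df.write.mode("overwrite").format("jdbc").options(**options).save()


opts = jdbc_truncate_options(
    "jdbc:mysql://localhost:3306/emp", "employee", "root", "root"
)
# overwrite_table(person_df, opts)  # needs a reachable database and the JDBC driver
```

Keeping the options in a dictionary makes it easy to reuse the same connection settings for several tables and to switch truncate off if the table definition has to change.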

4. Append Write Mode

Use the append string or SaveMode.Append to add the data to the existing file, or to add the data as rows to the existing table.


//Using append
personDF.write.mode("append").json("/path/to/write/person")

//Using SaveMode class (Scala/Java API, not PySpark)
personDF.write.mode(SaveMode.Append).json("/path/to/write/person")

5. Ignore Write Mode

The ignore mode, or SaveMode.Ignore, skips the write operation when the data/table already exists and writes the data when it does not. This is similar to CREATE TABLE IF NOT EXISTS in SQL.


//Using ignore
personDF.write.mode("ignore").json("/path/to/write/person")

//Using SaveMode class (Scala/Java API, not PySpark)
personDF.write.mode(SaveMode.Ignore).json("/path/to/write/person")

Conclusion

In this article, you have learned the Spark or PySpark save or write modes with examples. Use DataFrameWriter.mode() to specify the save mode; the argument is either one of the mode strings (overwrite, append, ignore, errorifexists or error) or, in Scala and Java, a constant from the SaveMode class.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.