Spark – Overwrite the output directory

Spark/PySpark by default doesn’t overwrite the output directory on S3, HDFS, or any other file systems, when you try to write the DataFrame contents (JSON, CSV, Avro, Parquet, ORC) to an existing directory, Spark returns runtime error hence, to overcome this you should use mode("overwrite").

If you are using Spark with Scala you can use an enumeration org.apache.spark.sql.SaveMode, this contains a field SaveMode.Overwrite to replace the contents on an existing folder.

You should be very sure when using overwrite mode, unknowingly using this mode will result in loss of data.

You need to use this Overwrite as an argument to mode() function of the DataFrameWrite class, for example. Note that this is not supported in PySpark.


df.write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")

For PySpark use overwrite string. This option can also be used with Scala.


df.write.mode("overwrite").csv("/tmp/out/foldername")

Besides Overwrite, SaveMode also offers other modes like SaveMode.Append, SaveMode.ErrorIfExists and SaveMode.Ignore.

For older versions of Spark/PySpark, you can use the following to overwrite the output directory with the RDD contents.


sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = SparkContext(sparkConf)

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply