Spark – Overwrite the output directory

By default, Spark/PySpark does not overwrite the output directory on S3, HDFS, or any other file system. When you try to write DataFrame contents (JSON, CSV, Avro, Parquet, ORC) to a directory that already exists, Spark throws a runtime error. To overcome this, use mode("overwrite").
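For example, with the default mode (errorIfExists) the second write to the same path fails; this is a minimal Scala sketch, assuming a DataFrame named df:


df.write.csv("/tmp/out/foldername")
// Second write to the same path fails with an AnalysisException ("path ... already exists")
df.write.csv("/tmp/out/foldername")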


If you are using Spark with Scala, you can use the enumeration org.apache.spark.sql.SaveMode, which contains a field SaveMode.Overwrite to replace the contents of an existing folder.

Be very sure before using overwrite mode; using it unknowingly will result in loss of data.

You need to pass SaveMode.Overwrite as an argument to the mode() function of the DataFrameWriter class, for example. Note that the SaveMode enumeration is not available in PySpark.


import org.apache.spark.sql.SaveMode
df.write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")

For PySpark, use the "overwrite" string. This string option also works with Scala.


df.write.mode("overwrite").csv("/tmp/out/foldername")

Besides Overwrite, SaveMode also offers other modes, such as SaveMode.Append, SaveMode.ErrorIfExists, and SaveMode.Ignore.
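
As a quick illustration of these modes (a minimal Scala sketch, assuming the same df and output path as above):


import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).csv("/tmp/out/foldername")        // adds new files alongside the existing output
df.write.mode(SaveMode.Ignore).csv("/tmp/out/foldername")        // silently skips the write if the path already exists
df.write.mode(SaveMode.ErrorIfExists).csv("/tmp/out/foldername") // default behavior: fails if the path already exists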

For older versions of Spark/PySpark, you can use the following to overwrite the output directory with the RDD contents.


// Skip Hadoop's check that the output directory does not already exist
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
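
With that setting in place, the output-directory check is skipped, so an RDD save into an existing directory no longer fails; a minimal sketch, assuming a small sample RDD:


val rdd = sparkContext.parallelize(Seq("a", "b", "c"))
rdd.saveAsTextFile("/tmp/out/foldername") // does not fail even if the directory already exists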

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen on LinkedIn and Medium.