Spark/PySpark by default does not overwrite the output directory on S3, HDFS, or any other file system. When you try to write DataFrame contents (JSON, CSV, Avro, Parquet, ORC) to an existing directory, Spark returns a runtime error. To overcome this, use one of the approaches below.
If you are using Spark with Scala, you can use the org.apache.spark.sql.SaveMode enumeration, which contains a SaveMode.Overwrite field to replace the contents of an existing folder.
Be very careful when using overwrite mode; using it unknowingly will result in loss of data.
You need to pass Overwrite as an argument to the mode() function of the DataFrameWriter class, as shown in the sketch below. Note that the SaveMode enumeration is not available in PySpark.
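A minimal sketch of the Scala approach; the app name and input/output paths are illustrative placeholders:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("OverwriteExample").getOrCreate()
val df = spark.read.json("/tmp/in/people.json") // hypothetical input path

// SaveMode.Overwrite replaces the contents of the target directory if it already exists
df.write.mode(SaveMode.Overwrite).csv("/tmp/out/people")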
For PySpark, pass the "overwrite" string to mode() instead. The string form can also be used with Scala.
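Since the string form works in Scala as well, here is a sketch of it in Scala (same hypothetical paths as above); in PySpark the call is written identically, df.write.mode("overwrite"), on a Python DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OverwriteStringExample").getOrCreate()
val df = spark.read.json("/tmp/in/people.json") // hypothetical input path

// The string "overwrite" is equivalent to SaveMode.Overwrite, with no extra import needed
df.write.mode("overwrite").parquet("/tmp/out/people")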
Besides Overwrite, SaveMode also offers other modes: Append (adds the new data to the existing contents), ErrorIfExists (the default, which fails when the directory already exists), and Ignore (skips the write entirely when the directory already exists). A short sketch of these follows.
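Switching modes only changes the argument to mode(); this sketch assumes the same df and imports as the first example above:

// Append adds new rows alongside whatever is already in the directory
df.write.mode(SaveMode.Append).parquet("/tmp/out/people")

// Ignore leaves the existing directory untouched and silently skips the write
df.write.mode(SaveMode.Ignore).parquet("/tmp/out/people")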
For older versions of Spark/PySpark, you can use the following setting to write the RDD contents into an existing output directory:
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("RDDOverwriteExample")
// Disable Hadoop's check that the output directory does not already exist
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
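With validation disabled, saveAsTextFile can write into a directory that already exists; the data and path here are illustrative. Note that this only skips the safety check rather than cleaning the directory, so stale files from earlier runs may remain.

val rdd = sparkContext.parallelize(Seq("a", "b", "c"))
// Without the setting above, this would fail if /tmp/out/rdd already exists
rdd.saveAsTextFile("/tmp/out/rdd")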
Happy Learning !!