In Spark, you can save (write) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"). Using the same API, you can also write a DataFrame to AWS S3, Azure Blob, HDFS, or any other Spark-supported file system.
In this article I will explain how to write a Spark DataFrame as a CSV file to disk, S3, and HDFS, with or without a header. I will also cover several options like compression, delimiter, quote, escape, etc., and finally the different save mode options. Since Spark 2.0.0, CSV is natively supported without any external dependencies; if you are using an older version you would need the databricks spark-csv library. Most of the examples and concepts explained here can also be used to write Parquet, Avro, JSON, text, ORC, and any other Spark-supported file format; all you need to do is replace csv() with parquet(), json(), text(), or orc() respectively (for Avro, use format("avro")).
In order to explain, first let’s create a DataFrame, either from in-memory data or by reading a CSV file into a Spark DataFrame.
val data=Seq(("James","Smith","USA","CA"),("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)
1. Spark Write DataFrame as CSV with Header
The Spark DataFrameWriter class provides the method csv() to save or write a DataFrame at a specified path on disk. This method takes the file path where you want to write the file; by default, it doesn’t write a header with column names.
//Write DataFrame data to CSV file
df.write.csv("/tmp/spark_output/datacsv")
// You can also use below
df.write.format("csv").save("/tmp/spark_output/datacsv")
In order to write the DataFrame to CSV with a header, you should use option(); the Spark CSV data source provides several options which we will see in the next section.
df.write.option("header",true)
.csv("/tmp/spark_output/datacsv")
The DataFrame in this example has 3 partitions, hence Spark creates 3 part files when you save it to the file system.

Note: Depending on the number of partitions your DataFrame has, Spark writes the same number of part files into the directory specified as the path. You can get the partition count by using the snippet below. For more details on partitions, refer to Spark Partitioning. If you want to write a single CSV file, refer to Spark Write Single CSV File.
df.rdd.getNumPartitions
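One simple approach, shown here only as a sketch (the linked article covers this in depth), is to coalesce the DataFrame into a single partition before writing; note that this funnels all data through one task, so it only suits small outputs.
// A minimal sketch: collapse to one partition so Spark writes a single part file.
// Suitable only for small DataFrames, since one task writes all the data.
df.coalesce(1)
  .write.option("header", true)
  .csv("/tmp/spark_output/single_csv")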
Use partitionBy() if you want to save the output partitioned into sub-directories, meaning each sub-directory contains records for a single partition value. This speeds up further reads if you query based on the partition column. The example below creates three sub-directories (state=CA, state=NY, state=FL).
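A minimal sketch using the df created above (the output path is illustrative):
// Partition output by the state column; Spark creates one sub-directory per state value
df.write.option("header", true)
  .partitionBy("state")
  .csv("/tmp/spark_output/partitioned_csv")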
2. Save CSV File Using Options
Spark DataFrameWriter provides option(key, value) to set a single option; to set multiple options, you can either chain option() calls or use options(options: Map[String, String]). Note that both of these methods return DataFrameWriter, so the calls can be chained.
//Chain option() to set multiple options
df.write.option("header", true)
  .option("delimiter","|")
  .csv("file-path")
//Example to set multiple options
df.write.options(Map("header"->"true", "delimiter"->","))
.csv("file-path")
header
As explained above, use the header option to save a Spark DataFrame to CSV with the column names as a header on the first line. By default, this option is set to false, meaning the header is not written.
delimiter
Use the delimiter option to specify the delimiter for the CSV output file (the delimiter is a single character used as a separator between fields). The most used delimiters are comma (default), pipe, tab, etc. Refer above for a sample example.
quote
If the separator or delimiter can appear inside your values, use the quote option to set the single character used for quoting such values. Defaults to ".
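For example, a sketch that quotes values with a single quote instead of the default double quote (the path is illustrative):
// Use a single quote as the quoting character for values containing the delimiter
df.write.option("header", true)
  .option("quote", "'")
  .csv("/tmp/spark_output/quoted_csv")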
compression
Use the compression option when you want to compress the CSV file while writing to disk, to reduce disk space. It supports the following values: none, bzip2, gzip, lz4, snappy, and deflate. Defaults to none (no compression).
df.write.option("header", true)
  .option("compression","gzip")
  .csv("file-path")
escape
Use the escape option to set the single character used for escaping quotes inside an already quoted value. Defaults to \.
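A small sketch setting the escape character explicitly (the path is illustrative):
// Escape embedded quote characters with a backslash inside quoted values
df.write.option("header", true)
  .option("escape", "\\")
  .csv("/tmp/spark_output/escaped_csv")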
nullValue
Use the nullValue option to set the string representation of a null value in the output. By default it is an empty string, so null values in the DataFrame are written as empty fields. Change this if you want nulls written as some other value.
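For instance, a sketch that writes nulls as the literal string NULL (the path is illustrative):
// Write null values as the literal string "NULL" instead of an empty field
df.write.option("header", true)
  .option("nullValue", "NULL")
  .csv("/tmp/spark_output/nulls_csv")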
dateFormat
By default, Spark CSV writes date columns (Spark DateType) in yyyy-MM-dd format. If you want to change this to a custom format, use the dateFormat option (default yyyy-MM-dd). Custom date formats follow the formats at Datetime Patterns.
timestampFormat
By default, Spark CSV writes timestamp columns (Spark TimestampType) in yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] format. If you want to change this to a custom format, use the timestampFormat option (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]). Custom timestamp formats also follow the formats at Datetime Patterns.
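A sketch applying custom patterns, assuming a DataFrame dateDf that contains DateType and TimestampType columns (the name and path are illustrative):
// Write dates as dd-MM-yyyy and timestamps as dd-MM-yyyy HH:mm
dateDf.write.option("header", true)
  .option("dateFormat", "dd-MM-yyyy")
  .option("timestampFormat", "dd-MM-yyyy HH:mm")
  .csv("/tmp/spark_output/dates_csv")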
encoding
Use the encoding option to specify the encoding (charset) of the saved CSV files. By default it is not set, and the UTF-8 charset is used.
lineSep
By default, each line in the CSV file written to disk is separated with \n. If you want to use another character, use the lineSep option (line separator). The maximum length is 1 character.
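For example, a sketch using a carriage return as the line separator, assuming a Spark version that supports lineSep on write (Spark 3.0+); the path is illustrative:
// Separate output lines with \r instead of the default \n
df.write.option("header", true)
  .option("lineSep", "\r")
  .csv("/tmp/spark_output/linesep_csv")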
Note: Besides these, the Spark CSV data source also supports several other options; please refer to the complete list.
3. Save DataFrame as CSV to S3
In order to save a DataFrame to an Amazon S3 bucket, first you need to have an S3 bucket created, and you need to collect the AWS access and secret keys from your account and set them in the Spark configuration. For more details, refer to How to Read and Write from S3.
// Replace with your AWS access key (you can find this in the IAM service)
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.access.key", "awsaccesskey value")
// Replace with your AWS secret key (you can find this in the IAM service)
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.secret.key", "aws secretkey value")
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
//Write DataFrame to CSV file on S3
df.write.csv("s3a://sparkbyexamples/csv/datacsv")
4. Save DataFrame as CSV to HDFS
If you have Spark running on YARN on Hadoop, you can write a DataFrame as a CSV file to HDFS similar to writing to a local disk. All you need to do is specify the Hadoop name node path; you can find this in the fs.defaultFS property of the Hadoop core-site.xml file under the Hadoop configuration folder.
df.write.option("header","true")
.csv("hdfs://nn1home:8020/csvfile")
The above example writes data from the DataFrame to a CSV file with a header at the HDFS location.
5. Save Modes
The Spark DataFrameWriter also has a method mode() to specify the SaveMode; the argument to this method takes either one of the strings below or a constant from the SaveMode class.
overwrite – overwrites the existing file; alternatively, you can use SaveMode.Overwrite.
append – adds the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists or error – this is the default; when the file already exists, it returns an error. Alternatively, you can use SaveMode.ErrorIfExists.
import org.apache.spark.sql.SaveMode
df.write.mode(SaveMode.Overwrite)
  .csv("/tmp/spark_output/datacsv")
6. Conclusion
I hope you have learned the basics of how to save a Spark DataFrame to a CSV file with a header, save it to S3 and HDFS, and use the various options and save modes.
Happy Learning !!