In Spark, you can save (write/extract) a DataFrame to a CSV file on disk by using
dataframeObj.write.csv("path"), using this you can also write DataFrame to AWS S3, Azure Blob, HDFS, or any Spark supported file systems.
In this article I will explain how to write a Spark DataFrame as a CSV file to disk, S3, HDFS with or without header, I will also cover several options like compressed, delimiter, quote, escape e.t.c and finally using different save mode options. Since Spark 2.0.0 version CSV is natively supported without any external dependencies, if you are using an older version you would need to use databricks spark-csv library. Most of the examples and concepts explained here can also be used to write Parquet, Avro, JSON, text, ORC, and any Spark supported file formats, all you need is just replace
In order to explain first let’s create a DataFrame either from data or by reading a CSV file into Spark DataFrame.
val data=Seq(("James","Smith","USA","CA"),("Michael","Rose","USA","NY"), ("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL") ) val columns = Seq("firstname","lastname","country","state") import spark.implicits._ val df = data.toDF(columns:_*)
1. Spark Write DataFrame as CSV with Header
DataFrameWriter class provides a method
csv() to save or write a DataFrame at a specified path on disk, this method takes a file path where you wanted to write a file and by default, it doesn’t write a header or column names.
//Write DataFrame data to CSV file df.write.csv("/tmp/spark_output/datacsv") // You can also use below df.write.format("csv").save("/tmp/spark_output/datacsv")
In order to write DataFrame to CSV with a header, you should use option(), Spark CSV data-source provides several options which we will see in the next section.
I have 3 partitions on DataFrame hence it created 3 part files when you save it to the file system.
Note: Depending on the number of partitions you have for DataFrame, it writes the same number of part files in a directory specified as a path. You can get the partition size by using the below snippet. For more details on partitions refer to Spark Partitioning. If you wanted to write as a single CSV file, refer to Spark Write Single CSV File.
partitionBy() If you want to save a file partition by sub-directories meaning each sub-directory contains records about a single partition. This speeds up further reads if you query based on partition. The below example creates three sub-directories (
2. Save CSV File Using Options
option(key,value) to set a single option, to set multiple options either you can chain option() method or use
options(options: Map[String, String]). Note that both these option methods return
//Chain option to set multiple df.write.option("header",True) .option("delimiter","|") .csv("file-path") //Example to set multiple options df.write.options(Map("header"->"true", "delimiter"->",")) .csv("file-path")
As explained above, use
header option to save a Spark DataFrame to CSV along with column names as a header on the first line. By default, this option is set to false meaning does not write the header.
delimiter option to specify the delimiter on the CSV output file (delimiter is a single character as a separator for each field and value). Most used delimiters are comma (default), pipe, tab e.t.c.
Refer above for a sample example.
If you have a separator or delimiter part of your value, use the
quote to set a single character used for escaping quoted values. defaults to \.
Use the compression codec option when you want to compress a CSV file while writing to disk to reduce disk space. It supports the following values
deflate. Default’s to null.
df2.write.option("header",true) .option("compression","gzip") .csv("file-path")
escape to sets a single character used for escaping quotes inside an already quoted value.
When you have an empty string/value on DataFrame while writing to DataFrame it writes it as NULL as the
nullValue option set to empty by default. Change this if you wanted to set any value as NULL.
Spark CSV default writes the date (columns with Spark DateType) in
yyyy-MM-dd format, if you want to change it to custom format use
yyyy-MM-dd). Custom date formats follow the formats at Datetime Patterns.
Spark CSV default writes the timestamp (columns with Spark TimestampType) in
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] format, if you want to change it to custom format use
default <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]). Custom timestamp formats also follow the formats at Datetime Patterns.
encoding (by default it is not set): specifies encoding (charset) of saved CSV files. If it is not set, the UTF-8 charset will be used.
By default CSV file written to disk is separated with \n for each line. If you wanted to change this and use another character use
lineSep option (line separator). The maximum length is 1 character.
Note: Besides these, Spark CSV data-source also supports several other options, please refer to complete list.
3. Save DataFrame as CSV to S3
In order to save DataFrame to Amazon S3 bucket, first, you need to have an S3 bucket created and you need to collect all AWS access and secret keys from your account and set it to Spark configurations. For more details refer to How to Read and Write from S3.
// Replace Key with your AWS account key (You can find this on IAM spark.sparkContext .hadoopConfiguration.set("fs.s3a.access.key", "awsaccesskey value") service) // Replace Key with your AWS secret key (You can find this on IAM spark.sparkContext .hadoopConfiguration.set("fs.s3a.secret.key", "aws secretkey value") spark.sparkContext .hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com") //Write to CSV file df.write.parquet("s3a://sparkbyexamples/csv/datacsv")
4. Save DataFrame as CSV to HDFS
If you have Spark running on YARN on Hadoop, you can write DataFrame as CSV file to HDFS similar to writing to a local disk. All you need is to specify the Hadoop name node path. Hadoop name node path, you can find this on
fs.defaultFS of Hadoop
core-site.xml file under the Hadoop configuration folder.
The above example writes data from DataFrame to CSV file with a header on HDFS location.
5. Save Modes
Spark DataFrameWriter also has a method
mode() to specify SaveMode; the argument to this method either takes below string or a constant from
overwrite – mode is used to overwrite the existing file, alternatively, you can use
append – To add the data to the existing file, alternatively, you can use
ignore – Ignores write operation when the file already exists, alternatively you can use
error – This is a default option when the file already exists, it returns an error, alternatively, you can use
I hope you have learned some basic points about how to save a Spark DataFrame to CSV file with header, save to S3, HDFS and use multiple options and save modes.
Happy Learning !!