Spark Write DataFrame to CSV File

In Spark, you can save (write/extract) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"), using this you can also write DataFrame to AWS S3, Azure Blob, HDFS, or any Spark supported file systems.

In this article I will explain how to write a Spark DataFrame as a CSV file to disk, S3, HDFS with or without header, I will also cover several options like compressed, delimiter, quote, escape e.t.c and finally using different save mode options. Since Spark 2.0.0 version CSV is natively supported without any external dependencies, if you are using an older version you would need to use databricks spark-csv library. Most of the examples and concepts explained here can also be used to write Parquet, Avro, JSON, text, ORC, and any Spark supported file formats, all you need is just replace csv() with parquet(), avro(), json(), text(), orc() respectively.

In order to explain first let’s create a DataFrame either from data or by reading a CSV file into Spark DataFrame.


val data=Seq(("James","Smith","USA","CA"),("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL")
  )
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)

1. Spark Write DataFrame as CSV with Header

Spark DataFrameWriter class provides a method csv() to save or write a DataFrame at a specified path on disk, this method takes a file path where you wanted to write a file and by default, it doesn’t write a header or column names.


//Write DataFrame data to CSV file
df.write.csv("/tmp/spark_output/datacsv")

// You can also use below
df.write.format("csv").save("/tmp/spark_output/datacsv")

In order to write DataFrame to CSV with a header, you should use option(), Spark CSV data-source provides several options which we will see in the next section.


df.write.option("header",true)
   .csv("/tmp/spark_output/datacsv")

I have 3 partitions on DataFrame hence it created 3 part files when you save it to the file system.

spark write csv

Note: Depending on the number of partitions you have for DataFrame, it writes the same number of part files in a directory specified as a path. You can get the partition size by using the below snippet. For more details on partitions refer to Spark Partitioning. If you wanted to write as a single CSV file, refer to Spark Write Single CSV File.


df.rdd.getNumPartitions

Use partitionBy() If you want to save a file partition by sub-directories meaning each sub-directory contains records about a single partition. This speeds up further reads if you query based on partition. The below example creates three sub-directories (state=CA, state=NY, state=FL)

df.write.option(“header”,True) \ .partitionBy(“state”) \ .csv(“/tmp/spark_output/datacsv”)

2. Save CSV File Using Options

Spark DataFrameWriter provides option(key,value) to set a single option, to set multiple options either you can chain option() method or use options(options: Map[String, String]). Note that both these option methods return DataFrameWriter.


//Chain option to set multiple
df.write.option("header",True)
        .option("delimiter","|")
        .csv("file-path")

//Example to set multiple options
df.write.options(Map("header"->"true", "delimiter"->","))
        .csv("file-path")

As explained above, use header option to save a Spark DataFrame to CSV along with column names as a header on the first line. By default, this option is set to false meaning does not write the header.

delimiter

Use delimiter option to specify the delimiter on the CSV output file (delimiter is a single character as a separator for each field and value). Most used delimiters are comma (default), pipe, tab e.t.c.

Refer above for a sample example.

quote

If you have a separator or delimiter part of your value, use the quote to set a single character used for escaping quoted values. defaults to \.

compression

Use the compression codec option when you want to compress a CSV file while writing to disk to reduce disk space. It supports the following values nonebzip2gziplz4snappy and deflate. Default’s to null.


df2.write.option("header",true)
         .option("compression","gzip")
         .csv("file-path")

escape

Use escape to sets a single character used for escaping quotes inside an already quoted value.

nullValue

When you have an empty string/value on DataFrame while writing to DataFrame it writes it as NULL as the nullValue  option set to empty by default. Change this if you wanted to set any value as NULL.

dateFormat

Spark CSV default writes the date (columns with Spark DateType) in yyyy-MM-dd format, if you want to change it to custom format use dateFormat (default yyyy-MM-dd). Custom date formats follow the formats at Datetime Patterns.

timestampFormat

Spark CSV default writes the timestamp (columns with Spark TimestampType) in yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] format, if you want to change it to custom format use <code>timestampFormat (default <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]). Custom timestamp formats also follow the formats at Datetime Patterns.

encoding

encoding (by default it is not set): specifies encoding (charset) of saved CSV files. If it is not set, the UTF-8 charset will be used.

lineSep

By default CSV file written to disk is separated with \n for each line. If you wanted to change this and use another character use lineSep option (line separator). The maximum length is 1 character.

Note: Besides these, Spark CSV data-source also supports several other options, please refer to complete list.

3. Save DataFrame as CSV to S3

In order to save DataFrame to Amazon S3 bucket, first, you need to have an S3 bucket created and you need to collect all AWS access and secret keys from your account and set it to Spark configurations. For more details refer to How to Read and Write from S3.


// Replace Key with your AWS account key (You can find this on IAM 
spark.sparkContext
     .hadoopConfiguration.set("fs.s3a.access.key", "awsaccesskey value")
service)
// Replace Key with your AWS secret key (You can find this on IAM 
spark.sparkContext
     .hadoopConfiguration.set("fs.s3a.secret.key", "aws secretkey value")
spark.sparkContext
      .hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
//Write to CSV file
df.write.parquet("s3a://sparkbyexamples/csv/datacsv")

4. Save DataFrame as CSV to HDFS

If you have Spark running on YARN on Hadoop, you can write DataFrame as CSV file to HDFS similar to writing to a local disk. All you need is to specify the Hadoop name node path. Hadoop name node path, you can find this on fs.defaultFS of Hadoop core-site.xml file under the Hadoop configuration folder.


df.write.option("header","true")
  .csv("hdfs://nn1home:8020/csvfile")

The above example writes data from DataFrame to CSV file with a header on HDFS location.

5. Save Modes

Spark DataFrameWriter also has a method mode() to specify SaveMode; the argument to this method either takes below string or a constant from SaveMode class.

overwrite – mode is used to overwrite the existing file, alternatively, you can use SaveMode.Overwrite.

append – To add the data to the existing file, alternatively, you can use SaveMode.Append.

ignore – Ignores write operation when the file already exists, alternatively you can use SaveMode.Ignore.

errorifexists or error – This is a default option when the file already exists, it returns an error, alternatively, you can use SaveMode.ErrorIfExists.


df.write.mode(SaveMode.Overwrite)
  .csv("/tmp/spark_output/datacsv")

6. Conclusion

I hope you have learned some basic points about how to save a Spark DataFrame to CSV file with header, save to S3, HDFS and use multiple options and save modes.

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

This Post Has 4 Comments

  1. Somayeh

    Hello, I have problem. When I want to save sparke dataframe to csv format. All columns with double format convert to string and all values in the columns change. Can anybody help how I can fix this problem?

    1. NNK

      After writing, do you see all values in double-quoted for double fields? Also, what do you mean values in the column change?

  2. Victor Diaz

    Hi – the write CSV example with option(“header”, true) throws an error as it is supposed to be True in python.

    1. NNK

      Thanks, Victor. I have corrected the typo.