Spark Save a File without a Directory

  • Post category:Apache Spark

In this quick article, I will explain how to save a Spark DataFrame into a CSV File without a directory.

When you write a Spark DataFrame, Spark creates a directory and saves all part files inside it. Sometimes you don’t want a directory; you just want a single data file (CSV, JSON, Parquet, Avro, etc.) with the name specified in the path.

Unfortunately, Spark doesn’t support creating a data file without a folder. However, you can use the Hadoop FileSystem library to achieve this.

First, use Spark coalesce() or repartition() to reduce the DataFrame to a single partition, so that the write produces a single part file.

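import org.apache.spark.sql.SparkSession
// master and appName are placeholder values; adjust them for your environment
val spark: SparkSession = SparkSession.builder().master("local[1]").appName("SingleFileExample").getOrCreate()
// Read the source CSV, then write it back as a single partition (one part file)
val df = spark.read.option("header", true).csv("address.csv")
df.coalesce(1).write.option("header", true).csv("c:/tmp/address")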

The above example creates an address directory containing a single part-0000* file, along with a _SUCCESS file and hidden CRC files.

Now, let’s use the Hadoop FileSystem API to copy the part-0000* file from the directory to the desired location under the new file name, and then remove the directory.

import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Copy the part file out of the directory and rename it to the custom name
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)

val srcPath = new Path("c:/tmp/address")
val destPath = new Path("c:/tmp/address_merged.csv")
// Pick the single part file, skipping _SUCCESS and the hidden .crc files
val srcFile = FileUtil.listFiles(new File("c:/tmp/address")).filter(f => f.getName.endsWith(".csv"))(0)
// Copy the CSV file out of the directory under the desired file name
FileUtil.copy(srcFile, hdfs, destPath, true, hadoopConfig)
// Remove the CRC file created by the copy above
hdfs.delete(new Path("c:/tmp/.address_merged.csv.crc"), true)
// Remove the directory created by df.write
hdfs.delete(srcPath, true)
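
The steps above can also be wrapped in a small helper. The following is a minimal sketch, assuming the output directory holds exactly one part file; moveSinglePartFile is a hypothetical name, not a Spark or Hadoop API. It relies on FileSystem.globStatus and FileSystem.rename instead of java.io.File, so it should also apply to non-local filesystems such as HDFS.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper (not part of Spark or Hadoop): moves the single part file
// out of a Spark output directory to destFile, then deletes the directory.
def moveSinglePartFile(fs: FileSystem, outputDir: Path, destFile: Path): Boolean = {
  // globStatus may return null or an empty array when nothing matches;
  // "part-*" does not match _SUCCESS or the hidden .crc files
  val parts = Option(fs.globStatus(new Path(outputDir, "part-*")))
    .getOrElse(Array.empty[FileStatus])
  require(parts.length == 1, s"expected exactly one part file, found ${parts.length}")
  val moved = fs.rename(parts(0).getPath, destFile)
  // Remove the directory only after the part file has been renamed
  if (moved) fs.delete(outputDir, true)
  moved
}

// Example call, reusing the paths from this article:
// val fs = FileSystem.get(new Configuration())
// moveSinglePartFile(fs, new Path("c:/tmp/address"), new Path("c:/tmp/address_merged.csv"))
```

A rename avoids copying the data a second time. On the local filesystem, a hidden .crc checksum file may still end up next to the destination file, so the cleanup step shown earlier can still be useful.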

You can also achieve this without coalescing to a single partition, by merging the part files after the write.
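
One way to do such a merge is sketched below, assuming the part files can simply be concatenated in name order. mergePartFiles is a hypothetical helper written for illustration, not a Hadoop API; Hadoop 2.x shipped FileUtil.copyMerge for this purpose, but it was removed in Hadoop 3.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper: concatenates every part-* file in outputDir into destFile.
def mergePartFiles(fs: FileSystem, outputDir: Path, destFile: Path, conf: Configuration): Unit = {
  val parts = Option(fs.globStatus(new Path(outputDir, "part-*")))
    .getOrElse(Array.empty[FileStatus])
    .map(_.getPath)
    .sortBy(_.getName) // part-00000, part-00001, ... keeps the partition order
  val out = fs.create(destFile, true) // second argument: overwrite if it exists
  try {
    parts.foreach { part =>
      val in = fs.open(part)
      // close = false keeps `out` open for the next part; `in` is closed here
      try IOUtils.copyBytes(in, out, conf, false)
      finally in.close()
    }
  } finally out.close()
}
```

Note that when the DataFrame is written with the header option enabled, every part file carries its own header row, so a plain concatenation repeats the header. Writing without a header, or stripping it from all but the first part, avoids that.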

Hope this helps and Happy Learning !!

