You are currently viewing Spark Save a File without a Directory

In this quick article, I will explain how to save a Spark DataFrame into a CSV File without a directory.

When you write a Spark DataFrame, it creates a directory and saves all part files inside a directory, sometimes you don’t want to create a directory instead you just want a single data file (CSV, JSON, Parquet, Avro e.t.c) with the name specified in the path.

Unfortunately, Spark doesn’t support creating a data file without a folder, However, you can use the Hadoop file system library in order to achieve this.

First, Using Spark coalesce() or repartition(), create a single part (partition) file.

val spark:SparkSession = SparkSession.builder()

val df ="header",true).csv("address.csv")

The above example creates an address directory and creates a part-000* file along with _SUCCESS and CRC hidden files.

Now, Let’s use Hadoop Filesystem API to copy the part-0000* file from the directory to the desired location with the new file name and remove the directory.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Copy the actual file from Directory and Renames to custom name
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)

val srcPath=new Path("c:/tmp/address")
val destPath= new Path("c:/tmp/address_merged.csv")
val srcFile=FileUtil.listFiles(new File("c:/tmp/address"))
//Copy the CSV file outside of Directory and rename to desired file name
//Removes CRC File that create from above statement
hdfs.delete(new Path(".address_merged.csv.crc"),true)
//Remove Directory created by df.write()

You can also achieve this with out using coalesce to single partition file.

Hope this helps and Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply