In this quick article, I will explain how to save a Spark DataFrame into a single CSV file without creating a directory.
When you write a Spark DataFrame, Spark creates a directory and saves all part files inside it. Sometimes you don't want a directory; instead, you just want a single data file (CSV, JSON, Parquet, Avro, etc.) with the name specified in the path.
Unfortunately, Spark doesn't support writing a data file without a folder; however, you can use the Hadoop FileSystem library to achieve this.
First, use Spark coalesce() or repartition() to create a single part (partition) file.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Read the source file and write it back out as a single partition
val df = spark.read.option("header", true).csv("address.csv")
df.coalesce(1).write.csv("c:/tmp/address")
The above example creates a c:/tmp/address directory and writes a part-000* file inside it, along with _SUCCESS and hidden .crc files.
Now, let's use the Hadoop FileSystem API to copy the part-0000* file from that directory to the desired location with a new file name, and then remove the directory.
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Copy the actual part file out of the directory and rename it to a custom name
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val srcPath = new Path("c:/tmp/address")
val destPath = new Path("c:/tmp/address_merged.csv")

// Find the part-0000* CSV file inside the output directory
val srcFile = FileUtil.listFiles(new File("c:/tmp/address"))
  .filter(f => f.getPath.endsWith(".csv"))(0)

// Copy the CSV file outside of the directory and rename it to the desired file name
// (the fourth argument deletes the source part file after the copy)
FileUtil.copy(srcFile, hdfs, destPath, true, hadoopConfig)

// Remove the CRC file created by the copy above
hdfs.delete(new Path("c:/tmp/.address_merged.csv.crc"), true)

// Remove the directory created by df.write()
hdfs.delete(srcPath, true)
You can also achieve this without coalescing to a single partition file, by merging all the part files in the output directory into one file, as shown in the sketch below.
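For example, here is a minimal sketch of the merge approach, assuming the same c:/tmp paths as above and a Hadoop 2.x client on the classpath (FileUtil.copyMerge() was removed in Hadoop 3.0):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Write normally; Spark may create several part files in the directory
df.write.csv("c:/tmp/address")

// Merge all part files in the directory into a single CSV file.
// Note: FileUtil.copyMerge() exists only in Hadoop 2.x; it was removed in Hadoop 3.0.
val hadoopConfig = new Configuration()
val fs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(
  fs, new Path("c:/tmp/address"),            // source file system and directory
  fs, new Path("c:/tmp/address_merged.csv"), // destination file system and file
  true,                                      // delete the source directory after merging
  hadoopConfig,
  null                                       // no separator string between merged files
)

Keep in mind that copyMerge() simply concatenates the part files, so if each part file was written with a header row, the merged file will contain repeated header lines.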
Hope this helps and Happy Learning!!