In this quick article, I will explain how to save a Spark DataFrame into a single CSV file without creating a directory.
When you write a Spark DataFrame, Spark creates a directory and saves all part files inside it. Sometimes you don't want a directory; instead, you just want a single data file (CSV, JSON, Parquet, Avro, etc.) with the name specified in the path.
Unfortunately, Spark doesn't support writing a data file without a folder; however, you can use the Hadoop FileSystem library to achieve this.
First, use Spark coalesce() or repartition() to create a single part (partition) file.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Read the source file and write it back out as a single partition
val df = spark.read.option("header", true).csv("address.csv")
df.coalesce(1).write.csv("c:/tmp/address")
The above example creates a c:/tmp/address directory and writes a part-000* file inside it, along with _SUCCESS and hidden .crc files.
Now, let's use the Hadoop FileSystem API to copy the part-0000* file from that directory to the desired location with a new file name, and then remove the directory.
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Copy the actual part file out of the directory and rename it to a custom name
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val srcPath = new Path("c:/tmp/address")
val destPath = new Path("c:/tmp/address_merged.csv")

// Find the part-0000* CSV file inside the output directory
val srcFile = FileUtil.listFiles(new File("c:/tmp/address"))
  .filter(f => f.getPath.endsWith(".csv"))(0)

// Copy the CSV file outside of the directory and rename it to the desired file name
// (the fourth argument deletes the source part file after the copy)
FileUtil.copy(srcFile, hdfs, destPath, true, hadoopConfig)

// Remove the CRC file created by the copy above
hdfs.delete(new Path("c:/tmp/.address_merged.csv.crc"), true)

// Remove the directory created by df.write()
hdfs.delete(srcPath, true)
You can also achieve this without coalescing to a single partition file, by merging all the part files in the output directory into one file, as shown in the sketch below.
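For example, here is a minimal sketch of the merge approach, assuming the same c:/tmp paths as above and a Hadoop 2.x client on the classpath (FileUtil.copyMerge() was removed in Hadoop 3.0):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Write normally; Spark may create several part files in the directory
df.write.csv("c:/tmp/address")

// Merge all part files in the directory into a single CSV file.
// Note: FileUtil.copyMerge() exists only in Hadoop 2.x; it was removed in Hadoop 3.0.
val hadoopConfig = new Configuration()
val fs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(
  fs, new Path("c:/tmp/address"),            // source file system and directory
  fs, new Path("c:/tmp/address_merged.csv"), // destination file system and file
  true,                                      // delete the source directory after merging
  hadoopConfig,
  null                                       // no separator string between merged files
)

Keep in mind that copyMerge() simply concatenates the part files, so if each part file was written with a header row, the merged file will contain repeated header lines.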
Hope this helps and Happy Learning!!