Spark – Rename and Delete a File or Directory From HDFS

In this Spark article, I will explain how to rename and delete a File or a Directory from HDFS. The same approach can be used to rename or delete a file or folder from the Local File system, AWS S3, or Azure Blob/Data lake (ADLS).

You typically need these operations when you want to move or rename part files to a custom location, or delete a directory that Spark created.

First, let's create a SparkSession.


import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()


Spark Rename File or a Directory

Spark libraries provide no operation to rename or delete a file; however, Spark natively supports the Hadoop FileSystem API, so we can use it to rename or delete files and directories.

To perform file system operations in Spark, we use the org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.FileSystem classes from the Hadoop FileSystem library. This library ships with the Apache Spark distribution, so no additional dependency is needed.

First, obtain a Hadoop FileSystem object from the Hadoop Configuration (org.apache.hadoop.conf.Configuration) held by the SparkContext.


import org.apache.hadoop.fs.FileSystem
//Create a FileSystem from the Hadoop Configuration held by the SparkContext
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

Now, create org.apache.hadoop.fs.Path objects for the source and destination file paths, and pass them to fs.rename() to rename the file.


import org.apache.hadoop.fs.{FileSystem, Path}
val srcPath=new Path("/tmp/address_rename_merged")
val destPath= new Path("/tmp/address_merged")

//Rename a File
if(fs.exists(srcPath) && fs.isFile(srcPath))
     fs.rename(srcPath,destPath)

Note that in the above example we first check that the file exists using the fs.exists(path) method. Without this check, the rename fails when the source file does not exist.
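If you prefer a reusable helper, the existence check and the rename can be wrapped together. This is a minimal sketch, assuming the fs handle created above; renameIfExists is a hypothetical name, not part of the Hadoop API:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

//Hypothetical helper: rename src to dest only when src exists,
//returning the Boolean result of fs.rename() (false if src is missing)
def renameIfExists(fs: FileSystem, src: Path, dest: Path): Boolean =
  fs.exists(src) && fs.rename(src, dest)

//renameIfExists(fs, new Path("/tmp/address_rename_merged"), new Path("/tmp/address_merged"))
```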

Alternatively, you can create a Hadoop Configuration yourself and obtain the FileSystem from it to rename the file:


val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
hdfs.rename(srcPath,destPath)

Spark Delete File or a Directory

To delete a file or a directory in Spark, use the delete() method of Hadoop FileSystem.


//Delete a file (the recursive flag is irrelevant for a single file)
if(fs.exists(srcPath) && fs.isFile(srcPath))
    fs.delete(srcPath, false)

//Delete a directory and its contents
if(fs.exists(srcPath) && fs.isDirectory(srcPath))
    fs.delete(srcPath, true)

The delete() method of FileSystem works for both files and directories; pass recursive = true to remove a non-empty directory along with its contents.
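A common use of delete() is clearing an output directory left behind by a previous Spark job, since DataFrameWriter fails by default when the target path already exists. A minimal sketch, assuming the fs handle created above; /tmp/address_out is a hypothetical output path:

```scala
import org.apache.hadoop.fs.Path

//Hypothetical output directory; adjust to your environment
val outPath = new Path("/tmp/address_out")

//Delete the directory (and its contents) if a previous run left it behind
if (fs.exists(outPath))
  fs.delete(outPath, true)

//The subsequent write now succeeds even without SaveMode.Overwrite
//df.write.csv(outPath.toString)
```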

Rename or Delete with Scala using Hadoop Commands

You can also use the Scala library scala.sys.process to run Hadoop HDFS commands to perform Filesystem operations.


import scala.sys.process._
//Delete a file
s"hdfs dfs -rm /tmp/.address_merged2.csv.crc" !

//Delete a directory and its contents
s"hdfs dfs -rm -r /tmp/address_merged" !
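The ! operator from scala.sys.process returns the process exit status, so you can check whether the HDFS command actually succeeded. A sketch, reusing the /tmp/address_merged path from the rename example:

```scala
import scala.sys.process._

//The ! operator runs the command and returns its exit status (0 = success)
val status = Seq("hdfs", "dfs", "-rm", "-r", "/tmp/address_merged").!

if (status != 0)
  println(s"hdfs dfs -rm failed with exit code $status")
```

Using a Seq of arguments rather than a single interpolated string avoids shell-quoting issues when paths contain spaces.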

Rename or Delete Files from Databricks

Databricks provides the dbutils utilities to perform file operations.


//Removes a file or directory; set recurse = true to delete a directory
dbutils.fs.rm(dir: String, recurse: Boolean = false)

//Moves a file or directory, possibly across file systems.
//Can also be used to rename a file or directory.
dbutils.fs.mv(from: String, to: String, recurse: Boolean = false)

Using dbutils you can perform file operations on Azure blob, Data lake (ADLS) and AWS S3 storages.
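For example, on a Databricks notebook you could delete a directory and rename a file like this (the /mnt/data paths below are hypothetical mount points, not from this article):

```scala
//Delete a directory and everything under it (hypothetical path)
dbutils.fs.rm("/mnt/data/address_merged", recurse = true)

//Rename a file by moving it within the same directory (hypothetical paths)
dbutils.fs.mv("/mnt/data/part-00000.csv", "/mnt/data/address.csv")
```

Note that dbutils is available only in the Databricks runtime; this snippet will not compile in a plain Spark application.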

Conclusion

Since Spark natively supports Hadoop, we can use the Hadoop FileSystem library's delete() and rename() methods on files and directories. On Databricks, use the dbutils utilities to perform these operations.

Happy Learning !!


NNK

