Spark – Rename and Delete a File or Directory From HDFS

In this Spark article, I will explain how to rename and delete a File or a Directory from HDFS. The same approach can be used to rename or delete a file or folder from the Local File system, AWS S3, or Azure Blob/Data lake (ADLS).

We typically need these operations when we want to move or rename the part files Spark writes to a custom location, or delete a directory that Spark created.

First, let’s create a Spark Session

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

1. Spark Rename File or a Directory

Spark libraries have no operation to rename or delete a file; however, Spark natively supports the Hadoop FileSystem API, so we can use it to rename or delete files and directories.

To perform file system operations in Spark, we use the org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.FileSystem classes of the Hadoop FileSystem library. This library ships with the Apache Spark distribution, so no additional dependency is needed.

First, obtain an org.apache.hadoop.fs.FileSystem object from the Hadoop Configuration (org.apache.hadoop.conf.Configuration) held by the SparkContext.

import org.apache.hadoop.fs.FileSystem
// Create a FileSystem object from Spark's Hadoop configuration
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

Now, create org.apache.hadoop.fs.Path variables for the source and destination file paths. Use fs.rename() by passing the source and destination paths to rename a file.

import org.apache.hadoop.fs.{FileSystem, Path}
val srcPath = new Path("/tmp/address_rename_merged")
val destPath = new Path("/tmp/address_merged")

// Rename a file (rename() returns true on success)
if(fs.exists(srcPath) && fs.isFile(srcPath))
  fs.rename(srcPath, destPath)

Note that in the above example we also check whether the file exists using the fs.exists(path) method. Without this check, the operation fails when the source file does not exist.
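The exists-check pattern above can be wrapped in a small reusable helper. renameIfExists is a hypothetical name, not part of any Spark or Hadoop API; the sketch assumes a FileSystem handle obtained as shown earlier:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: renames only when the source exists,
// returning false instead of failing on a missing path.
def renameIfExists(fs: FileSystem, src: String, dest: String): Boolean = {
  val srcPath = new Path(src)
  fs.exists(srcPath) && fs.rename(srcPath, new Path(dest))
}
```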

Alternatively, you can create the Hadoop Configuration yourself and obtain a FileSystem from it to rename a file.

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
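A brief sketch of how this standalone handle could be used for the same rename. The paths are the example paths from above; note that without Spark's configuration, the handle resolves fs.defaultFS from the core-site.xml on the classpath (or falls back to the local file system):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)

// Same rename as before, but via the standalone configuration
val src = new Path("/tmp/address_rename_merged")
val dest = new Path("/tmp/address_merged")
if (hdfs.exists(src) && hdfs.isFile(src))
  hdfs.rename(src, dest)
```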

2. Spark Delete File or a Directory

To delete a file or a directory in Spark, use the delete() method of the Hadoop FileSystem.

// Delete a file
if(fs.exists(srcPath) && fs.isFile(srcPath))
  fs.delete(srcPath, false)

// Delete a directory (second argument enables recursive delete)
if(fs.exists(srcPath) && fs.isDirectory(srcPath))
  fs.delete(srcPath, true)

The delete() method of FileSystem works for both files and directories; the boolean second argument requests recursive deletion and must be true to delete a non-empty directory.

3. Rename or Delete with Scala using Hadoop Commands

You can also use the Scala scala.sys.process library to run Hadoop HDFS shell commands and perform file system operations.

import scala.sys.process._
// Delete a File
s"hdfs dfs -rm /tmp/.address_merged2.csv.crc" !
// Delete a Directory
s"hdfs dfs -rm -r /tmp/address_merged2" !

4. Rename or Delete Files from Databricks

Databricks provides the dbutils.fs utilities to perform file operations.

// Deletes a file or a directory (set recurse = true for a non-empty directory)
dbutils.fs.rm("/tmp/address_merged2", recurse = true)

// Moves a file or directory, possibly across FileSystems.
// Can also be used to rename a file or directory.
// Signature: dbutils.fs.mv(from: String, to: String, recurse: Boolean = false)
dbutils.fs.mv("/tmp/address_rename_merged", "/tmp/address_merged")

Using dbutils you can perform file operations on Azure blob, Data lake (ADLS) and AWS S3 storages.
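For example, assuming a cloud container is already mounted at /mnt/data (a hypothetical mount point; replace it with your own mount or an abfss:// or s3a:// URI), the same calls work unchanged:

```scala
// Hypothetical mount point; the calls are identical for DBFS, ADLS, and S3 paths
dbutils.fs.rm("/mnt/data/address_merged2", recurse = true)
dbutils.fs.mv("/mnt/data/address_rename_merged", "/mnt/data/address_merged")
```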


5. Conclusion

Since Spark natively supports Hadoop, we can use the Hadoop FileSystem library to delete() and rename() a file or a directory. On Databricks, use the dbutils library to perform these operations.

Happy Learning !!

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen on LinkedIn.
