
Though Spark supports reading from and writing to files on multiple file systems such as Amazon S3, Hadoop HDFS, Azure, and GCP, the HDFS file system is the most widely used at the time of writing this article. Also, like any other file system, we can read and write TEXT, CSV, Avro, Parquet, and JSON files into HDFS.

Spark RDD natively supports reading text files; later, with DataFrame, Spark added data sources such as CSV, JSON, Avro, and Parquet. Depending on the data source, you may need a third-party dependency, and Spark can read and write all of these files from/to HDFS.

In this article, you will learn how to read and write TEXT, CSV, Avro, Parquet, and JSON file formats from/to the Hadoop HDFS file system using the Scala language.


Spark Hadoop/HDFS dependencies

The Spark distribution binary comes with Hadoop and HDFS libraries, hence we don’t have to explicitly specify the dependency library when running with spark-submit. And even when you want to read a file from HDFS in your Spark program, you don’t have to use any Hadoop & HDFS libraries, as they are abstracted away from us by Spark.

HDFS file system path

Unlike other file systems, to access files from HDFS you need to provide the Hadoop NameNode path. You can find this in the core-site.xml file under the Hadoop configuration folder. In this file, look for the fs.defaultFS property and pick its value; it will be in the format hdfs://nn1home:port. Replace nn1home and port with the values from your fs.defaultFS property.


Write & Read Text file from HDFS

Use the textFile() and wholeTextFiles() methods of the SparkContext to read files from any file system. To read from HDFS, you need to provide the HDFS path as an argument to the function.

val rddFromFile = spark.sparkContext.textFile("hdfs://nn1home:8020/text01.txt")
val rddWhole = spark.sparkContext.wholeTextFiles("hdfs://nn1home:8020/text01.txt")

If you want to read a text file from HDFS into a DataFrame or Dataset, use spark.read.text() and spark.read.textFile() respectively.

val df:DataFrame = spark.read.text("hdfs://nn1home:8020/text01.txt")
val ds:Dataset[String] = spark.read.textFile("hdfs://nn1home:8020/text01.txt")
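To complete the round trip promised by the heading, a write can be sketched with DataFrameWriter.text(), which requires the DataFrame to have a single string column; the output path below is a placeholder, not from the original article.

```scala
// Write a single-string-column DataFrame back to HDFS as plain text files.
// "output/text" is an illustrative path; mode("overwrite") replaces prior output.
df.write
  .mode("overwrite")
  .text("hdfs://nn1home:8020/output/text")
```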

Write & Read CSV & TSV file from HDFS

In Spark, CSV/TSV files can be read using spark.read.csv("path"); replace the path with the HDFS path.

val df = spark.read.csv("hdfs://nn1home:8020/file.csv")
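Since csv() defaults to a comma separator, reading a TSV file means overriding the delimiter. A minimal sketch, assuming the file path and header layout shown here:

```scala
// Read a tab-separated file from HDFS by overriding the delimiter option.
val tsvDF = spark.read
  .option("sep", "\t")        // tab delimiter for TSV
  .option("header", "true")   // treat the first line as column names
  .csv("hdfs://nn1home:8020/file.tsv")
```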

And write a CSV file to HDFS using the syntax below.

Use the csv() method of the Spark DataFrameWriter object (obtained via df.write) to write a Spark DataFrame to a CSV file.
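The write side can be sketched as follows; the output directory is a placeholder chosen for illustration:

```scala
// Write the DataFrame to HDFS as CSV files.
df.write
  .option("header", "true")   // emit a header row
  .mode("overwrite")          // replace any existing output directory
  .csv("hdfs://nn1home:8020/output/csv")
```

Note that Spark writes a directory of part files rather than a single CSV file; use coalesce(1) before write if a single file is required.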


Write & Read JSON file from HDFS

Using"path") or"json").load("path") you can read a JSON file into a Spark DataFrame, these methods take a HDFS path as an argument. Unlike reading a CSV, By default JSON data source inferschema from an input file

val df ="hdfs://nn1home:8020/file.json")

And write a JSON file to HDFS using the syntax below.
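A minimal sketch of the JSON write, with an illustrative output path:

```scala
// Write the DataFrame to HDFS in JSON format (one JSON object per line).
df.write
  .mode("overwrite")
  .json("hdfs://nn1home:8020/output/json")
```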


Write & Read Avro file from HDFS

Since Spark 2.4, Spark SQL provides built-in support for reading and writing Apache Avro data files, and you can use it to read a file from HDFS. However, the spark-avro module is external and is not included in spark-submit or spark-shell by default; hence, accessing the Avro file format in Spark is enabled by providing the package.

While using spark-submit, provide spark-avro_2.12 and its dependencies directly using --packages, such as,

./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4

The avro() function is not provided in Spark DataFrameReader; hence, we should use the DataSource format as “avro” or “org.apache.spark.sql.avro”, and load() is used to read the Avro file. Pass the HDFS path as an argument to the load() function.

val personDF="avro").load("hdfs://nn1home:8020/file.avro")

Since the Avro library is external to Spark, it doesn’t provide an avro() function on DataFrameWriter; hence, we should use the DataSource format “avro” or “org.apache.spark.sql.avro” to write a Spark DataFrame to an Avro file. Pass the HDFS path as an argument to the save() function.
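The Avro write can be sketched as below, assuming the spark-avro package is on the classpath and using a placeholder output path:

```scala
// Write the DataFrame to HDFS in Avro format via the external spark-avro package.
personDF.write
  .format("avro")
  .mode("overwrite")
  .save("hdfs://nn1home:8020/output/avro")
```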


Write & Read Parquet file from HDFS

DataFrameReader provides the parquet() function ( to read Parquet files and create a Spark DataFrame.

val parqDF ="hdfs://nn1home:8020/people.parquet")

Using the df.write.parquet() function we can write a Spark DataFrame to a Parquet file; the parquet() function is provided by the DataFrameWriter class.
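A minimal sketch of the Parquet write, reusing the DataFrame read above and an illustrative output path:

```scala
// Write the DataFrame to HDFS in Parquet format (Spark's default columnar format).
parqDF.write
  .mode("overwrite")
  .parquet("hdfs://nn1home:8020/output/parquet")
```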



In this article, you have learned how to read and write TEXT, CSV, Avro, Parquet, and JSON file formats from/to the Hadoop HDFS file system using the Scala language. I hope you like this article.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
