Spark Read Files from HDFS (TXT, CSV, AVRO, PARQUET, JSON)

Spark can read from and write to files on many file systems, such as Amazon S3, Hadoop HDFS, Azure, and GCP, but HDFS is the most widely used at the time of writing this article. As with any other file system, we can read and write TEXT, CSV, Avro, Parquet, and JSON files on HDFS.

Spark RDDs natively support reading text files. Later, with DataFrames, Spark added data sources such as CSV, JSON, Avro, and Parquet. Depending on the data source, you may need a third-party dependency, and Spark can read and write all of these formats from and to HDFS.

In this article, you will learn how to read and write TEXT, CSV, Avro, Parquet, and JSON files from and to the Hadoop HDFS file system using Scala.

Prerequisites

Spark Hadoop/HDFS dependencies

The Spark distribution binary comes with the Hadoop and HDFS libraries, so we don’t have to specify them as explicit dependencies when running with spark-submit. Likewise, when you read a file from HDFS in your Spark program, you don’t have to call any Hadoop or HDFS libraries directly, as they are abstracted away by Spark.

HDFS file system path

To access files on HDFS, you need to provide the Hadoop name node address. You can find it in the core-site.xml file under the Hadoop configuration folder: look for the fs.defaultFS property and take its value. The value has the format shown below; replace nn1home and the port with the host and port from your fs.defaultFS property.


hdfs://nn1home:8020
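
As a rough illustration of where this value comes from, the sketch below pulls fs.defaultFS out of the text of a core-site.xml file using only the JDK's XML parser. The object name CoreSite, the method defaultFs, and the sample XML are assumptions made for this example; they are not part of Spark or Hadoop.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper for illustration; not part of Spark or Hadoop.
object CoreSite {
  // Sample content shaped like a core-site.xml file (host and port are placeholders).
  val sampleXml: String =
    """<configuration>
      |  <property>
      |    <name>fs.defaultFS</name>
      |    <value>hdfs://nn1home:8020</value>
      |  </property>
      |</configuration>""".stripMargin

  // Text of the first descendant element named `tag`, or "" when there is none.
  private def textOf(e: Element, tag: String): String = {
    val nodes = e.getElementsByTagName(tag)
    if (nodes.getLength == 0) "" else nodes.item(0).getTextContent.trim
  }

  // Returns the fs.defaultFS value found in the given XML text, if any.
  def defaultFs(coreSiteXml: String): Option[String] = {
    val bytes = coreSiteXml.getBytes(StandardCharsets.UTF_8)
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(bytes))
    val props = doc.getElementsByTagName("property")
    (0 until props.getLength)
      .map(i => props.item(i))
      .collect { case e: Element => e }
      .find(e => textOf(e, "name") == "fs.defaultFS")
      .map(e => textOf(e, "value"))
  }
}
```

Calling CoreSite.defaultFs(CoreSite.sampleXml) returns Some("hdfs://nn1home:8020"); appending a file path such as /text01.txt to that value gives the full HDFS path used in the examples in this article.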

Write & Read Text file from HDFS

Use the textFile() and wholeTextFiles() methods of SparkContext to read text files from any file system. To read from HDFS, pass the HDFS path as the argument. textFile() returns one record per line, while wholeTextFiles() returns one (path, content) pair per file.


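// RDD[String]: one element per line
val rddFromFile = spark.sparkContext.textFile("hdfs://nn1home:8020/text01.txt")
// RDD[(String, String)]: one (file path, file content) pair per file
val rddWhole = spark.sparkContext.wholeTextFiles("hdfs://nn1home:8020/text01.txt")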

To read a text file from HDFS into a DataFrame or a Dataset[String], use spark.read.text() or spark.read.textFile().


import org.apache.spark.sql.{DataFrame, Dataset}
val df: DataFrame = spark.read.text("hdfs://nn1home:8020/text01.txt")
val ds: Dataset[String] = spark.read.textFile("hdfs://nn1home:8020/text01.txt")
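
The examples above only read text. A minimal sketch of the write side, assuming a Dataset[String] and an output directory of your choosing, could look like the following. The object TextRoundTrip and its method name are hypothetical; write.mode("overwrite").text() and read.textFile() are the standard DataFrameWriter and DataFrameReader calls.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical helper for illustration.
object TextRoundTrip {
  // Writes each string as one line under `dir`, then reads the lines back.
  // `dir` is a placeholder; on a cluster it would look like
  // "hdfs://nn1home:8020/textOut". Spark writes a directory of part files.
  def writeAndRead(spark: SparkSession, lines: Seq[String], dir: String): Dataset[String] = {
    import spark.implicits._
    lines.toDS().write.mode("overwrite").text(dir)
    spark.read.textFile(dir)
  }
}
```

Note that the order of lines read back is not guaranteed to match the order in which they were written.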

Write & Read CSV & TSV file from HDFS

In Spark, CSV files can be read using spark.read.csv("path"); pass the HDFS path as the argument. For TSV files, additionally set the sep (or delimiter) option to a tab character.


spark.read.csv("hdfs://nn1home:8020/file.csv")

To write a CSV file to HDFS, call write on the DataFrame to get a DataFrameWriter, then call its csv() method with the HDFS path.


df2.write.option("header","true")
 .csv("hdfs://nn1home:8020/csvfile")
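
The section title mentions TSV, so here is a hedged sketch of reading and writing delimited files with an explicit separator. The object DelimitedFiles and its method names are assumptions for this example; header, sep, and inferSchema are options of Spark's CSV data source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration.
object DelimitedFiles {
  // Reads a delimited file whose first line is a header.
  // Pass "\t" as `sep` for TSV; `path` may be an hdfs:// URI.
  def read(spark: SparkSession, path: String, sep: String = ","): DataFrame =
    spark.read
      .option("header", "true")      // use the first line as column names
      .option("sep", sep)            // field separator
      .option("inferSchema", "true") // extra pass over the data to guess column types
      .csv(path)

  // Writes the DataFrame with a header line, replacing any existing output.
  def write(df: DataFrame, path: String, sep: String = ","): Unit =
    df.write
      .mode("overwrite")
      .option("header", "true")
      .option("sep", sep)
      .csv(path)
}
```

For example, DelimitedFiles.read(spark, "hdfs://nn1home:8020/file.tsv", "\t") would read a tab-separated file; the file name here is a placeholder.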

Write & Read JSON file from HDFS

Using spark.read.json("path") or spark.read.format("json").load("path"), you can read a JSON file into a Spark DataFrame; both methods take an HDFS path as an argument. Unlike the CSV data source, the JSON data source infers the schema from the input file by default.


val df = spark.read.json("hdfs://nn1home:8020/file.json")

To write a JSON file to HDFS, use the syntax below.


df2.write.json("hdfs://nn1home:8020/jsonfile")
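
Because schema inference requires Spark to scan the input, you may prefer to supply the schema yourself. The sketch below assumes a hypothetical record layout with name and age fields; the object JsonWithSchema is likewise invented for illustration, while DataFrameReader.schema() is the standard Spark call.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper for illustration.
object JsonWithSchema {
  // Assumed record layout; replace with the fields of your own file.
  val personSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Reads JSON with the schema above instead of inferring it.
  // `path` may be an hdfs:// URI such as "hdfs://nn1home:8020/file.json".
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read.schema(personSchema).json(path)
}
```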

Write & Read Avro file from HDFS

Since Spark 2.4, Spark SQL provides built-in support for reading and writing Apache Avro data files, which you can use to read a file from HDFS. However, the spark-avro module is external and is not included in spark-submit or spark-shell by default, so you enable the Avro format by providing it as a package.

When using spark-submit, provide spark-avro_2.12 and its dependencies directly with --packages, choosing the artifact that matches your Scala and Spark versions, for example:


./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4

DataFrameReader does not provide an avro() function, so we specify the data source format as “avro” or “org.apache.spark.sql.avro” and call load() to read the Avro file, passing the HDFS path as the argument.


val personDF = spark.read.format("avro").load("hdfs://nn1home:8020/file.avro")

Likewise, because the Avro library is external to Spark, DataFrameWriter does not provide an avro() function, so we use the format “avro” or “org.apache.spark.sql.avro” to write a Spark DataFrame to an Avro file, passing the HDFS path as the argument to the save() function.


df.write.format("avro").save("hdfs://nn1home:8020/avroFile")

Write & Read Parquet file from HDFS

DataFrameReader provides the parquet() function (spark.read.parquet) to read Parquet files into a Spark DataFrame.


val parqDF = spark.read.parquet("hdfs://nn1home:8020/people.parquet")

Using the parquet() function of the DataFrameWriter class (df.write.parquet), we can write a Spark DataFrame to a Parquet file.


df.write.parquet("hdfs://nn1home:8020/parquetFile")
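
By default, a DataFrame write fails if the target path already exists, which matters when a job is re-run against the same HDFS directory. The following sketch, under the assumption that replacing the old output is acceptable, sets the save mode explicitly and reads the result back. The object ParquetRoundTrip and its method name are hypothetical; SaveMode.Overwrite is part of Spark's API.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper for illustration.
object ParquetRoundTrip {
  // Replaces whatever is at `path` with `df` in Parquet format, then reads it back.
  // `path` is a placeholder such as "hdfs://nn1home:8020/parquetFile".
  // Do not pass a DataFrame that is itself read from `path`.
  def overwriteAndRead(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode(SaveMode.Overwrite).parquet(path)
    spark.read.parquet(path)
  }
}
```

Other save modes, such as SaveMode.Append, can be passed the same way, and the same mode() call applies to the CSV, JSON, and Avro writers shown earlier.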

Conclusion

In this article, you have learned how to read and write TEXT, CSV, Avro, Parquet, and JSON files from and to the Hadoop HDFS file system using Scala. I hope you like this article.

Happy Learning !!

NNK
