Though Spark supports reading from and writing to files on multiple file systems such as Amazon S3, Hadoop HDFS, Azure, and GCP, the HDFS file system is the most widely used at the time of writing this article. Like any other file system, we can read and write TEXT, CSV, Avro, Parquet, and JSON files into HDFS.
Spark RDD natively supports reading text files; later, with DataFrame, Spark added data sources such as CSV, JSON, Avro, and Parquet. Depending on the data source you may need a third-party dependency, and Spark can read and write all of these formats from/to HDFS.
In this article, you will learn how to read and write TEXT, CSV, Avro, Parquet, and JSON file formats from/to the Hadoop HDFS file system using the Scala language.
Prerequisites
Spark Hadoop/HDFS dependencies
The Spark distribution binary comes with the Hadoop and HDFS libraries, hence we don't have to explicitly specify the dependency when running with spark-submit. Even when you want to read a file from HDFS in your Spark program, you don't have to use any Hadoop or HDFS libraries directly, as they are abstracted away by Spark.
HDFS file system path
Unlike other file systems, to access files on HDFS you need to provide the Hadoop name node path. You can find this in the core-site.xml file under the Hadoop configuration folder. In this file, look for the fs.defaultFS property and pick its value; it will be in the format below, with nn1home and the port replaced by whatever your fs.defaultFS property specifies.
hdfs://nn1home:8020
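For reference, the relevant entry in core-site.xml looks like the following (the host name and port here are placeholders; use the values from your own cluster):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nn1home:8020</value>
</property>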
Write & Read Text file from HDFS
Use the textFile() and wholeTextFiles() methods of SparkContext to read files from any file system; to read from HDFS, provide the HDFS path as an argument to the function.
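All snippets below assume a SparkSession named spark, which spark-shell creates for you automatically. For a standalone application, a minimal sketch looks like this (the application name and master are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HDFSReadWrite") // hypothetical application name
  .master("local[*]")       // use your cluster's master in production
  .getOrCreate()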
//RDD of lines
val rddFromFile = spark.sparkContext.textFile("hdfs://nn1home:8020/text01.txt")
//RDD of (fileName, fileContent) pairs
val rddWhole = spark.sparkContext.wholeTextFiles("hdfs://nn1home:8020/text01.txt")
To read a text file from HDFS into a DataFrame or a Dataset:
import org.apache.spark.sql.{DataFrame, Dataset}

val df:DataFrame = spark.read.text("hdfs://nn1home:8020/text01.txt")
val ds:Dataset[String] = spark.read.textFile("hdfs://nn1home:8020/text01.txt")
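To complete the write side of this section, a DataFrame with a single string column (such as df above) can be written back to HDFS with DataFrameWriter.text; the output path here is an assumption for illustration:

// Writes each row of the single string column as a line of text
df.write.text("hdfs://nn1home:8020/textOut")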
Write & Read CSV & TSV file from HDFS
In Spark, CSV and TSV files can be read using spark.read.csv("path"); to read from HDFS, simply use an HDFS path. For TSV, use the same reader with the delimiter option set to a tab, as shown in the sketch after the CSV example below.
spark.read.csv("hdfs://nn1home:8020/file.csv")
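For a TSV file, the CSV reader accepts a sep option (the path and header setting here are assumptions for illustration):

// TSV is CSV with a tab delimiter
val tsvDF = spark.read
  .option("header", "true")
  .option("sep", "\t")
  .csv("hdfs://nn1home:8020/file.tsv")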
To write a CSV file to HDFS, call write on the DataFrame to get a DataFrameWriter, then call its csv() method:
df2.write.option("header","true")
.csv("hdfs://nn1home:8020/csvfile")
Write & Read JSON file from HDFS
Using spark.read.json("path") or spark.read.format("json").load("path") you can read a JSON file into a Spark DataFrame; these methods take an HDFS path as an argument. Unlike the CSV reader, the JSON data source infers the schema from the input file by default.
val df = spark.read.json("hdfs://nn1home:8020/file.json")
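Note that, by default, Spark expects one JSON object per line. If your file contains a single multi-line JSON document, set the multiLine option (the path here is an assumption for illustration):

val multiDF = spark.read
  .option("multiLine", "true")
  .json("hdfs://nn1home:8020/multiline.json")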
Write a DataFrame as a JSON file to HDFS using the syntax below.
df2.write.json("hdfs://nn1home:8020/jsonfile")
Write & Read Avro file from HDFS
Since Spark 2.4, Spark SQL provides built-in support for reading and writing Apache Avro data files, and you can use it to read a file from HDFS. However, the spark-avro module is external and is not included in spark-submit or spark-shell by default; hence, accessing the Avro file format in Spark is enabled by providing the package.
While using spark-submit, provide spark-avro_2.12 and its dependencies directly using --packages, for example:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4
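The same --packages option works for spark-shell (the version shown mirrors the spark-submit example above):

./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4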
The avro() function is not provided in Spark's DataFrameReader, hence we should use the DataSource format "avro" or "org.apache.spark.sql.avro" and call load() to read the Avro file. Pass the HDFS path as an argument to the load function.
val personDF= spark.read.format("avro").load("hdfs://nn1home:8020/file.avro")
Since the Avro library is external to Spark, it doesn't provide the avro() function on DataFrameWriter either; hence we should use the DataSource format "avro" or "org.apache.spark.sql.avro" to write a Spark DataFrame to an Avro file. Pass the HDFS path as an argument to the save function.
df.write.format("avro").save("hdfs://nn1home:8020/avroFile")
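Equivalently, the fully qualified format name mentioned above can be used:

df.write.format("org.apache.spark.sql.avro").save("hdfs://nn1home:8020/avroFile")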
Write & Read Parquet file from HDFS
DataFrameReader provides the parquet() function (spark.read.parquet) to read Parquet files and create a Spark DataFrame.
val parqDF = spark.read.parquet("hdfs://nn1home:8020/people.parquet")
Using df.write.parquet() we can write a Spark DataFrame to a Parquet file; the parquet() function is provided by the DataFrameWriter class.
df.write.parquet("hdfs://nn1home:8020/parquetFile")
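Parquet also pairs well with partitioned output; a minimal sketch, assuming the DataFrame has a gender column (the column name and path are hypothetical examples):

// Creates one sub-directory per distinct value of the partition column
df.write
  .partitionBy("gender") // hypothetical column; use a column from your DataFrame
  .parquet("hdfs://nn1home:8020/parquetPartFile")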
Conclusion
In this article, you have learned how to read and write TEXT, CSV, Avro, Parquet, and JSON file formats from/to the Hadoop HDFS file system using the Scala language. I hope you like this article.
Happy Learning !!