Though Spark supports reading from and writing to files on multiple file systems and cloud storage services such as GCP, HDFS is the most widely used at the time of writing this article. As with any other file system, we can read and write TEXT, CSV, Avro, Parquet, and JSON files on HDFS.
Spark RDDs natively support reading text files; later, with DataFrames, Spark added data sources such as CSV, JSON, Avro, and Parquet. Depending on the data source, you may need a third-party dependency, and Spark can read and write all of these formats from and to HDFS.
In this article, you will learn how to read and write TEXT, CSV, Avro, Parquet, and JSON files from and to the Hadoop HDFS file system using Scala.
Spark Hadoop/HDFS dependencies
The Spark binary distribution ships with the Hadoop and HDFS libraries, so we don’t have to specify them as explicit dependencies when running with spark-submit. Likewise, when you read a file from HDFS in a Spark program, you don’t call any Hadoop or HDFS libraries directly, since Spark abstracts them away.
HDFS file system path
Unlike with other file systems, to access files on HDFS you need to provide the Hadoop NameNode address. You can find it in the core-site.xml file under the Hadoop configuration folder: look for the fs.defaultFS property and take its value. The examples in this article use hdfs://nn1home:8020; replace nn1home and the port with the host and port from your fs.defaultFS value.
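As a rough sketch of how the NameNode address ends up in a file path, the snippet below reads fs.defaultFS from the Hadoop configuration that Spark carries and joins it with a file name. The object name HdfsPathExample and the helper hdfsPath are hypothetical names introduced only for illustration, and the lookup returns your cluster's value only when core-site.xml is visible to Spark; otherwise expect the Hadoop default of file:///.

```scala
import org.apache.spark.sql.SparkSession

object HdfsPathExample {
  // Hypothetical helper: joins a NameNode URI such as "hdfs://nn1home:8020"
  // with a file path, avoiding a doubled or missing slash between the two.
  def hdfsPath(defaultFs: String, file: String): String =
    defaultFs.stripSuffix("/") + "/" + file.stripPrefix("/")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HdfsPathExample").getOrCreate()

    // fs.defaultFS as Spark sees it. Assumption: core-site.xml is on the
    // classpath; if it is not, this is the Hadoop default "file:///".
    val defaultFs = spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")
    println(hdfsPath(defaultFs, "text01.txt"))

    spark.stop()
  }
}
```

When fs.defaultFS is already set on the cluster, a path without the scheme and host is resolved against it, so spelling out the full hdfs:// URI mainly helps when a job talks to more than one file system.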
Write & Read Text file from HDFS
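// Read a text file from HDFS into an RDD[String], one element per line
val rddFromFile = spark.sparkContext.textFile("hdfs://nn1home:8020/text01.txt")
// Read whole files into an RDD[(String, String)] of (file path, file content)
val rddWhole = spark.sparkContext.wholeTextFiles("hdfs://nn1home:8020/text01.txt")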
To read a text file from HDFS into a DataFrame or a Dataset, use spark.read.text() or spark.read.textFile().
import org.apache.spark.sql.{DataFrame, Dataset}
val df: DataFrame = spark.read.text("hdfs://nn1home:8020/text01.txt")
val ds: Dataset[String] = spark.read.textFile("hdfs://nn1home:8020/text01.txt")
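The section title also mentions writing, so here is a minimal sketch of the write side, assuming a SparkSession is available and that the output directory (a hypothetical hdfs://nn1home:8020/text_out) is writable. The object and method names are illustrative only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object WriteTextExample {
  // Writes each string as one line of text under outputDir, then reads it back.
  def roundTrip(spark: SparkSession, lines: Seq[String], outputDir: String): Dataset[String] = {
    import spark.implicits._
    // write.text() expects exactly one string column, which a Dataset[String] has.
    lines.toDS().write.mode("overwrite").text(outputDir)
    spark.read.textFile(outputDir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WriteTextExample").getOrCreate()
    // Hypothetical output directory on the NameNode used in this article.
    roundTrip(spark, Seq("first line", "second line"), "hdfs://nn1home:8020/text_out").show(false)
    spark.stop()
  }
}
```

Spark writes a directory of part files rather than a single file, and the order of lines read back is not guaranteed to match the input order.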
Write & Read CSV & TSV file from HDFS
In Spark, CSV and TSV files can be read using spark.read.csv("path"); replace path with the HDFS location of the file.
To write a Spark DataFrame to a CSV file on HDFS, use the DataFrame’s write method, which returns a DataFrameWriter, and call csv() on it with an HDFS path.
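The read and write calls described above might look like the following sketch. It assumes the files have a header row and uses the CSV data source's delimiter option to cover TSV; the object name, the helper names, and the csv_out directory are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvHdfsExample {
  // Reads a delimited file with a header row; pass "\t" as the delimiter for TSV.
  def readDelimited(spark: SparkSession, path: String, delimiter: String = ","): DataFrame =
    spark.read
      .option("header", "true")
      .option("delimiter", delimiter)
      .csv(path)

  // Writes the DataFrame as delimited text files under outputDir, with a header row.
  def writeDelimited(df: DataFrame, outputDir: String, delimiter: String = ","): Unit =
    df.write
      .mode("overwrite")
      .option("header", "true")
      .option("delimiter", delimiter)
      .csv(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvHdfsExample").getOrCreate()
    // Hypothetical input file and output directory on HDFS.
    val df = readDelimited(spark, "hdfs://nn1home:8020/file.csv")
    writeDelimited(df, "hdfs://nn1home:8020/csv_out")
    spark.stop()
  }
}
```

Without an explicit schema or the inferSchema option, every column read this way is a string.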
Write & Read JSON file from HDFS
Using spark.read.json("path") or spark.read.format("json").load("path"), you can read a JSON file into a Spark DataFrame; both methods take an HDFS path as an argument. Unlike the CSV data source, the JSON data source infers the schema from the input file by default.
val df = spark.read.json("hdfs://nn1home:8020/file.json")
To write a DataFrame to HDFS as JSON, call json() on the DataFrameWriter with an HDFS path.
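A minimal sketch of the JSON write, assuming the input file from the read example exists; the object name, the helper writeJson, and the json_out directory are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonHdfsExample {
  // Writes the DataFrame under outputDir as JSON, one object per line.
  def writeJson(df: DataFrame, outputDir: String): Unit =
    df.write.mode("overwrite").json(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JsonHdfsExample").getOrCreate()
    val df = spark.read.json("hdfs://nn1home:8020/file.json")
    // Hypothetical output directory on HDFS.
    writeJson(df, "hdfs://nn1home:8020/json_out")
    spark.stop()
  }
}
```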
Write & Read Avro file from HDFS
Since Spark 2.4, Spark SQL provides built-in support for reading and writing Apache Avro data files, and you can use it to read a file from HDFS. However, the spark-avro module is external and is not included in spark-submit or spark-shell by default, so you enable the Avro file format by supplying it as a package. You can add spark-avro_2.12 (choosing the version that matches your Spark and Scala versions) and its dependencies directly using --packages, such as,
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4
DataFrameReader does not provide an avro() function, so we specify the data source format as “avro” or “org.apache.spark.sql.avro” and use load() to read the Avro file, passing the HDFS path as an argument to load().
val personDF = spark.read.format("avro").load("hdfs://nn1home:8020/file.avro")
Since the Avro library is external to Spark, DataFrameWriter does not provide an avro() function either, so we use the data source “avro” or “org.apache.spark.sql.avro” to write a Spark DataFrame to an Avro file, passing the HDFS path as an argument to the save() function.
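The write described above could be sketched as follows. This assumes the job is launched with the spark-avro package on the classpath; the object name, the helper writeAvro, and the avro_out directory are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object AvroHdfsExample {
  // Writes df as Avro files under outputDir, replacing any existing output.
  // Assumption: the spark-avro module was added, for example via --packages.
  def writeAvro(df: DataFrame, outputDir: String): Unit =
    df.write.format("avro").mode(SaveMode.Overwrite).save(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AvroHdfsExample").getOrCreate()
    val personDF = spark.read.format("avro").load("hdfs://nn1home:8020/file.avro")
    // Hypothetical output directory on HDFS.
    writeAvro(personDF, "hdfs://nn1home:8020/avro_out")
    spark.stop()
  }
}
```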
Write & Read Parquet file from HDFS
DataFrameReader provides the parquet() function (spark.read.parquet) to read Parquet files into a Spark DataFrame.
val parqDF = spark.read.parquet("hdfs://nn1home:8020/people.parquet")
Using the parquet() function of the DataFrameWriter class (df.write.parquet, where df is a DataFrame), we can write a Spark DataFrame to a Parquet file.
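As an illustration, the sketch below writes a small DataFrame to the people.parquet path used in the read example and reads it back. The sample rows, the object name, and the helper writeParquet are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ParquetHdfsExample {
  // Writes df as Parquet files under outputDir, replacing any existing output.
  def writeParquet(df: DataFrame, outputDir: String): Unit =
    df.write.mode(SaveMode.Overwrite).parquet(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetHdfsExample").getOrCreate()
    import spark.implicits._

    // Made-up sample rows, only to have something to write.
    val people = Seq(("James", 30), ("Anna", 25)).toDF("name", "age")
    writeParquet(people, "hdfs://nn1home:8020/people.parquet")

    // Parquet stores the schema with the data, so no options are needed to read it back.
    spark.read.parquet("hdfs://nn1home:8020/people.parquet").printSchema()
    spark.stop()
  }
}
```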
In this article, you have learned how to read and write TEXT, CSV, Avro, Parquet, and JSON files from and to the Hadoop HDFS file system using Scala. I hope you like this article.
Happy Learning !!