How to read multiple CSV files in Spark? Spark SQL provides a csv() method on the DataFrameReader (accessed via spark.read) that reads a file or a directory of files into a single Spark DataFrame. Using this method, we can also read files from a directory that match a specific pattern.
In this article, let us see how to read single or multiple CSV files in a single load using Scala in Databricks.
1. Exploring Files in Databricks
For our demo, let us explore the COVID dataset in Databricks. Listing the COVID hospital beds dataset below, we can see multiple source files in CSV format.
//Listing COVID dataset in Databricks
%fs ls /databricks-datasets/COVID/ESRI_hospital_beds/

Now let us try processing single, multiple, and all CSV files in the directory using the SparkSession.
2. Spark Read Multiple CSV Files
Spark SQL provides spark.read.csv("file_name") to read a file, multiple files, or all files from a directory into a Spark DataFrame.
2.1. Read Multiple CSV Files
We can pass multiple absolute paths of CSV files, separated by commas, to the csv() method to read multiple CSV files into a single DataFrame.
Syntax:
// Read multiple CSV files
spark.read.csv("path1", "path2", "path3", ...)
In the example below, we pass the absolute paths of two CSV files to the csv() method at once.
val weatherDF = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("/databricks-datasets/COVID/ESRI_hospital_beds/Definitive_Healthcare__USA_Hospital_Beds_2020_03_24.csv",
"/databricks-datasets/COVID/ESRI_hospital_beds/Definitive_Healthcare__USA_Hospital_Beds_2020_03_30.csv")
println(s"total number of records in these two files ${weatherDF.count}")

2.2. Read All CSV files from Directory
In Spark, we can pass the absolute path of a directory containing CSV files to the csv() method; it reads all the CSV files in that directory and returns a single DataFrame.
Syntax:
spark.read.csv("absolutepath_directory")
In the example below, we pass the absolute path of the COVID ESRI_hospital_beds dataset directory to the csv() method.
val totalCovidDF = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("/databricks-datasets/COVID/ESRI_hospital_beds")
println(s"total number of records in all the files ${totalCovidDF.count}")
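As mentioned earlier, the csv() method also accepts glob patterns, so we can read only the files in a directory whose names match a pattern. A minimal sketch against the same dataset; the *2020_03* pattern below is illustrative, not from the article:

```scala
// Read only the CSV files whose names match the glob pattern
// (the *2020_03* pattern is a hypothetical example)
val marchCovidDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/COVID/ESRI_hospital_beds/*2020_03*.csv")
println(s"total number of records in the matched files ${marchCovidDF.count}")
```

This is useful when a directory mixes files from several dates or sources and only a subset should be loaded.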

3. Options while reading CSV file
Spark CSV dataset provides multiple options to work with CSV files. Below are some of the most important options explained with examples.
delimiter
The delimiter option specifies the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to pipe (|), tab, space, or any other character using this option.
val df2 = spark.read.options(Map("delimiter"->","))
.csv("src/main/resources/zipcodes.csv")
inferSchema
The default value of this option is false. When set to true, Spark automatically infers column types based on the data. Note that this requires reading the data one more time to infer the schema.
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->","))
.csv("src/main/resources/zipcodes.csv")
header
This option reads the first line of the CSV file as column names. By default its value is false, and all columns are read as type String.
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
.csv("src/main/resources/zipcodes.csv")
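Since inferSchema costs an extra pass over the data, an alternative is to supply the schema explicitly with schema(). A minimal sketch; the column names and types below are hypothetical, since the article does not describe the layout of zipcodes.csv:

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical schema for zipcodes.csv - adjust to the real file's columns
val zipSchema = StructType(Seq(
  StructField("zipcode", IntegerType, nullable = true),
  StructField("city", StringType, nullable = true),
  StructField("state", StringType, nullable = true)
))

// schema() replaces inferSchema, avoiding the extra read over the data
val df3 = spark.read
  .options(Map("delimiter" -> ",", "header" -> "true"))
  .schema(zipSchema)
  .csv("src/main/resources/zipcodes.csv")
```

Providing the schema up front also guards against Spark inferring a different type when the data changes.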
4. Conclusion
In this article, you have learned how to read multiple CSV files using spark.read.csv(). To read all files from a directory, pass the directory path as the parameter; to read selected files, pass comma-separated file paths.