Spark Read Multiple CSV Files

How do you read multiple CSV files in Spark? Spark SQL provides the csv() method on the DataFrameReader (accessed through spark.read on a SparkSession), which reads a single file or a directory of files into one Spark DataFrame. Using this method, we can also read files from a directory that match a specific pattern.
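For example, we can pass a glob pattern in the path to read only the files whose names match it. Below is a minimal sketch; the directory and the pattern report_2020_*.csv are hypothetical and only for illustration.

// Sketch: read only the CSV files whose names match a glob pattern
// (the path and pattern below are hypothetical, for illustration only)
val marchDF = spark.read
  .option("header", "true")
  .csv("/tmp/data/report_2020_*.csv")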

In this article, let us see how we can read single or multiple CSV files in a single load using Scala in Databricks.

1. Exploring Files in Databricks

For our demo, let us explore the COVID dataset in Databricks. In the below screenshot, we are listing the COVID hospital beds dataset, and we can see multiple source files in CSV format.


//Listing COVID dataset in Databricks
%fs ls /databricks-datasets/COVID/ESRI_hospital_beds/
Databricks COVID dataset listing
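The same listing can also be done from Scala code using Databricks' dbutils utility (a minimal sketch; dbutils is available only inside Databricks notebooks):

// Sketch: list the directory from Scala instead of the %fs magic command
// (dbutils is provided by the Databricks notebook environment)
val files = dbutils.fs.ls("/databricks-datasets/COVID/ESRI_hospital_beds/")
files.filter(_.name.endsWith(".csv")).foreach(f => println(f.path))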

Now let us try processing single, multiple, and all CSV files in the directory using the Spark session.

2. Spark Read Multiple CSV Files

Spark SQL provides spark.read.csv("file_name") to read a file, multiple files, or all files from a directory into a Spark DataFrame.

2.1. Read Multiple CSV Files from a Directory

We can pass multiple absolute paths of CSV files, separated by commas, to the csv() method of the DataFrameReader to read multiple CSV files and create a single DataFrame.

Syntax:


// Read multiple csv files
spark.read.csv("path1", "path2", "path3", ...)

In the below example, we pass the absolute paths of two CSV files to the same csv() method at once.


val weatherDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/COVID/ESRI_hospital_beds/Definitive_Healthcare__USA_Hospital_Beds_2020_03_24.csv",
       "/databricks-datasets/COVID/ESRI_hospital_beds/Definitive_Healthcare__USA_Hospital_Beds_2020_03_30.csv")

println(s"Total number of records in these two files: ${weatherDF.count}")
Spark reading multiple CSV files
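If the file paths are already held in a Scala collection, the same csv() method (which takes varargs) accepts them with the : _* expansion. A minimal sketch using the same two files:

// Sketch: pass a Seq of paths to csv() using Scala varargs expansion
val paths = Seq(
  "/databricks-datasets/COVID/ESRI_hospital_beds/Definitive_Healthcare__USA_Hospital_Beds_2020_03_24.csv",
  "/databricks-datasets/COVID/ESRI_hospital_beds/Definitive_Healthcare__USA_Hospital_Beds_2020_03_30.csv"
)
val twoFilesDF = spark.read
  .option("header", "true")
  .csv(paths: _*)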

2.2. Read All CSV Files from a Directory

In Spark, we can pass the absolute path of a directory containing CSV files to the csv() method; it reads all the CSV files in that directory and returns a DataFrame.

Syntax:


 spark.read.csv("absolutepath_directory")

In the below example, we pass the absolute path of the COVID ESRI_hospital_beds dataset directory to the csv() method.


val totalCovidDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/COVID/ESRI_hospital_beds")

println(s"total number of records in all the files ${totalCovidDF.count}")
Spark reading a directory of CSV files
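When a whole directory is read into one DataFrame, it is often useful to know which file each row came from. Below is a minimal sketch using Spark's built-in input_file_name() function; the column name source_file is just an illustrative choice.

import org.apache.spark.sql.functions.input_file_name

// Sketch: tag every row with the file it was read from,
// then count the records per source file
val withSourceDF = totalCovidDF.withColumn("source_file", input_file_name())
withSourceDF.groupBy("source_file").count().show(false)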

3. Options While Reading CSV Files

Spark's CSV data source provides multiple options to work with CSV files. Below are some of the most important options, explained with examples.

delimiter

The delimiter option is used to specify the column delimiter of the CSV file. By default, it is the comma (,) character, but it can be set to pipe (|), tab, space, or any other character using this option.


val df2 = spark.read.options(Map("delimiter"->","))
  .csv("src/main/resources/zipcodes.csv")

inferSchema

The default value of this option is false. When it is set to true, Spark automatically infers the column types based on the data. Note that this requires reading the data one more time to infer the schema.

val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",")) .csv("src/main/resources/zipcodes.csv")

header

The header option is used to read the first line of the CSV file as column names. By default, the value of this option is false, and all column types are assumed to be string.

val df2 = spark.read
  .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("src/main/resources/zipcodes.csv")

4. Conclusion

In this article, you have learned how to read multiple CSV files using spark.read.csv(). To read all files from a directory, pass the directory path as the parameter to the method; to read selected files, pass the comma-separated file paths.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium