Spark SQL provides
spark.read.csv("path") to read a CSV file into a Spark DataFrame and
dataframe.write.csv("path") to save or write a DataFrame to a CSV file. In this tutorial, you will learn how to read a single file, multiple files, and all files from a directory into a DataFrame, apply some transformations, and finally write the DataFrame back to a CSV file, using Scala and Python (PySpark) examples.
Note: Out of the box, Spark supports reading files in CSV, JSON, and many other file formats into a Spark DataFrame.
Table of contents:
- Spark Read CSV file into DataFrame
- Options while reading CSV file
- Read CSV files with a user-specified schema
- Applying DataFrame transformations
- Write DataFrame to CSV file
The examples explained in this tutorial use the CSV file from the following GitHub location.
Spark Read CSV file into DataFrame
Using spark.read.format("csv").load("path") you can read a CSV file into a Spark DataFrame. This method takes a file path as an argument. By default, the read method treats the header as a data record, so it reads the column names in the file as data; to overcome this, you need to explicitly set the header option to "true". It also reads all columns as strings (StringType) by default.
I will explain in later sections how to use the
inferSchema option, which reads the column names from the header and infers each column's type from the data.
val df = spark.read.csv("src/main/resources/zipcodes.csv")
df.show(false)
df.printSchema()
When you use the format("csv") method, you can also specify the data source by its fully qualified name (i.e.,
org.apache.spark.sql.csv), but for built-in sources you can also use the short name (csv).
Example in Python (PySpark)
Here is a similar example in Python (PySpark) using the format and load methods:
spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('zipcodes.csv')
Without the header option, Spark reads the data into DataFrame columns named "_c0" for the first column, "_c1" for the second, and so on, and by default the type of all these columns is String.
Read multiple CSV files
Using the spark.read.csv() method you can also read multiple CSV files; just pass all the file names, separated by commas, as the path. For example:
val df = spark.read.csv("path1,path2,path3")
Read all CSV files in a directory
We can read all CSV files from a directory into a DataFrame just by passing the directory as a path to the csv() method.
val df = spark.read.csv("Folder path")
Options while reading CSV file
Spark's CSV data source provides multiple options to work with CSV files; some of the most commonly used options are explained below.
delimiter
The delimiter option is used to specify the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character using this option.
val df2 = spark.read.options(Map("delimiter"->",")) .csv("src/main/resources/zipcodes.csv")
inferSchema
The default value of this option is
false; when set to true, Spark automatically infers column types based on the data. Note that this requires reading the data one more time to infer the schema.
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",")) .csv("src/main/resources/zipcodes.csv")
header
This option is used to read the first line of the CSV file as column names. By default the value of this option is
false, and all column types are assumed to be strings.
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")) .csv("src/main/resources/zipcodes.csv")
quote
When you have a column whose value contains the delimiter used to split the columns, use the
quote option to specify the quote character; by default it is " and delimiters inside quotes are ignored, but using this option you can set it to any character.
nullValue
Using the nullValue option you can specify a string in the CSV that should be read as null. For example, if you want a date column with the value "1900-01-01" to be set to null on the DataFrame, use this option.
Note: Besides the above options, Spark CSV dataset also supports many other options, please refer to this article for details.
Reading CSV files with a user-specified custom schema
If you know the schema of the file ahead of time and do not want to use the
inferSchema option for column names and types, you can specify a user-defined schema with custom column names and types using the schema() method, as shown below.
val schema = new StructType()
  .add("RecordNumber",IntegerType,true)
  .add("Zipcode",IntegerType,true)
  .add("ZipCodeType",StringType,true)
  .add("City",StringType,true)
  .add("State",StringType,true)
  .add("LocationType",StringType,true)
  .add("Lat",DoubleType,true)
  .add("Long",DoubleType,true)
  .add("Xaxis",IntegerType,true)
  .add("Yaxis",DoubleType,true)
  .add("Zaxis",DoubleType,true)
  .add("WorldRegion",StringType,true)
  .add("Country",StringType,true)
  .add("LocationText",StringType,true)
  .add("Location",StringType,true)
  .add("Decommisioned",BooleanType,true)
  .add("TaxReturnsFiled",StringType,true)
  .add("EstimatedPopulation",IntegerType,true)
  .add("TotalWages",IntegerType,true)
  .add("Notes",StringType,true)

val df_with_schema = spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .load("src/main/resources/zipcodes.csv")
df_with_schema.printSchema()
df_with_schema.show(false)
Applying DataFrame transformations
Once you have created a DataFrame from the CSV file, you can apply all the transformations and actions that DataFrames support. Please refer to the link for more details.
Write Spark DataFrame to CSV file
Use the write method of the Spark DataFrameWriter object on a DataFrame to write it to a CSV file.
df2.write
  .option("header","true")
  .csv("/tmp/spark_output/zipcodes")
While writing a CSV file you can use several options: for example, whether to output the column names as a header using the
header option, which delimiter to use in the CSV file using the
delimiter option, and many more.
Other options available
Spark DataFrameWriter also has a method mode() to specify the SaveMode; the argument to this method takes either one of the strings below or a constant from the SaveMode class.
overwrite – this mode is used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.
append – to add the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists or error – this is the default option; it returns an error when the file already exists; alternatively, you can use SaveMode.ErrorIfExists.
In this tutorial, you have learned how to read a CSV file, multiple CSV files, and all files in a folder into a Spark DataFrame, how to use options to change the default behavior, and how to write the DataFrame back to CSV files using different save modes.
Happy Learning !!