Spark Read Multiline (Multiple Line) CSV File

The Spark CSV data source API supports reading a multiline (records containing newline characters) CSV file by using spark.read.option("multiLine", true). Before you start using this option, read through this article to understand the behavior with a few different options.

  • Spark read CSV (default behavior)
  • Spark read CSV using the multiline option (with the double-quote escape character)
  • Load when the multiline record is surrounded with single quotes or another escape character
  • Load when the multiline record doesn’t have an escape character

Sample CSV File With Multiline Records

Below are the sample records I use in this article to explain how to read a CSV file with multiline records. The second record has a multiline value in the second column (I inserted a newline character into the Address Line1 field).


Id,Address Line1,City,State,Zipcode
1,9182 Clear Water Rd,Fayetteville,AR,72704
2,"9920 State
Highway 89",Ringling,OK,73456
3,9724 E Landon Ln,Kennewick,WA,99338

Spark Read CSV File (Default)

Let’s read a CSV file into a Spark DataFrame without any options.


val spark:SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExamples.com")
    .getOrCreate()

val df = spark.read.option("header",true)
    .csv("src/main/resources/address-multiline.csv")

df.show()

Yields the below output. Notice that, by default, Spark is unable to parse the multiline record, hence the second record is corrupted in the DataFrame.


+-----------+-------------------+------------+-----+-------+
|         Id|      Address Line1|        City|State|Zipcode|
+-----------+-------------------+------------+-----+-------+
|          1|9182 Clear Water Rd|Fayetteville|   AR|  72704|
|          2|         9920 State|        null| null|   null|
|Highway 89"|           Ringling|          OK|73456|   null|
|          3|   9724 E Landon Ln|   Kennewick|   WA|  99338|
+-----------+-------------------+------------+-----+-------+

By default, the Spark CSV data source assumes that records in the file are comma-delimited. If your file uses another delimiter, such as the pipe character (|), specify it with spark.read.option("delimiter","|"), as shown in the sketch below.
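
For example, here is a minimal sketch, assuming a hypothetical pipe-delimited variant of the sample file named address-multiline-pipe.csv:


val df = spark.read
    .option("header", true)
    .option("delimiter", "|") // pipe delimiter instead of the default comma
    .csv("src/main/resources/address-multiline-pipe.csv") // hypothetical file

df.show()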

Spark Read CSV using Multiline Option

In order to process a CSV file whose row values are scattered across multiple lines, use option("multiLine",true).


val df = spark.read
    .option("header",true)
    .option("multiLine",true)
    .csv("src/main/resources/address-multiline.csv")

df.show()

This yields the below output.


+---+---------------------+------------+-----+-------+
|Id |Address Line1        |City        |State|Zipcode|
+---+---------------------+------------+-----+-------+
|1  |9182 Clear Water Rd  |Fayetteville|AR   |72704  |
|2  |9920 State Highway 89|Ringling    |OK   |73456  |
|3  |9724 E Landon Ln     |Kennewick   |WA   |99338  |
+---+---------------------+------------+-----+-------+

Note: By default, with the multiline option, Spark assumes that multiline values are enclosed in double quotes or another escape character. If there is no such enclosing character, the data will be read incorrectly and you need special processing to handle it (see the last section).

Load when the multiline record is surrounded with single quotes or another escape character

In case your multiline records are enclosed in a different character, such as a single quote, provide that character with the additional “escape” option. Depending on the file, you may also need to set the “quote” option to the same character so that Spark recognizes it as the enclosing quote.


val df = spark.read
    .option("header",true)
    .option("multiLine",true)
    .option("escape","\'")
    .csv("src/main/resources/address-multiline.csv")

df.show()
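
For example, here is a minimal sketch, assuming a hypothetical variant of the sample file named address-multiline-single-quote.csv, in which the multiline value is wrapped in single quotes instead of double quotes:


val df = spark.read
    .option("header", true)
    .option("multiLine", true)
    .option("quote", "'")  // the single quote encloses the multiline value
    .option("escape", "'") // escape character matching the enclosing quote
    .csv("src/main/resources/address-multiline-single-quote.csv") // hypothetical file

df.show()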

Load when the multiline record doesn’t have an escape character

When your multiline records don’t have an escape character, parsing such a CSV file gets complicated. Use the below process to read the file (a sketch follows the list).

  • First, read the CSV file as a text file (spark.read.text()).
  • Replace every delimiter with escape character + delimiter + escape character. For a comma-separated file, this replaces , with ",".
  • Add the escape character to the start and end of each record (write logic to skip lines that are continuations of a multiline record).
  • Write the data into another temporary file.
  • Finally, read this file as CSV with the multiline and escape options.
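
Here is a minimal sketch of these steps, assuming a hypothetical file address-multiline-noescape.csv that matches the sample data but has no quotes around the multiline value, and assuming that every complete record contains exactly four commas (lines with fewer commas are continuations of the previous record):


import java.nio.file.{Files, Paths}

// 1. Read the CSV file as plain text; each physical line becomes one element.
// The file is small, so it is fine to collect it to the driver.
val lines = spark.read.textFile("src/main/resources/address-multiline-noescape.csv")
    .collect()

// Stitch continuation lines back onto the previous record: a complete
// record in this sample has exactly 4 commas (an assumption for this file).
val stitched = lines.foldLeft(Vector.empty[String]) { (acc, line) =>
    if (acc.nonEmpty && acc.last.count(_ == ',') < 4)
        acc.init :+ (acc.last + "\n" + line) // previous record is incomplete
    else
        acc :+ line // start of a new record
}

// 2 & 3. Replace each delimiter with "," and add quotes at both ends.
val records = stitched.map(r => "\"" + r.replace(",", "\",\"") + "\"")

// 4. Write the repaired records into a temporary file.
val tmpPath = "src/main/resources/address-multiline-fixed.csv"
Files.write(Paths.get(tmpPath), records.mkString("\n").getBytes("UTF-8"))

// 5. Finally, read the temporary file with the multiline option.
val df = spark.read
    .option("header", true)
    .option("multiLine", true)
    .csv(tmpPath)

df.show(false)

Note that the four-comma heuristic only works because no field value in this sample contains a comma; a real file may need more robust logic to detect continuation lines.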

Conclusion

Spark can load a CSV file with multiline records by using the multiline and escape options. If you have no escape character, you can still process the file, but you need to write custom code to handle it.

Happy Learning !!

