PySpark Read Multiple Lines (multiline) JSON File


Problem: How do you read a JSON file whose records span multiple lines in PySpark, using the multiline option?

Solution: The PySpark JSON data source provides the multiline option to read records that span multiple lines. By default, PySpark expects each line of a JSON file to hold one complete, self-contained record.
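The difference between the two modes can be illustrated without Spark. The sketch below uses only Python's standard json module as an analogy; it is not Spark's parser, and the helper names parse_line_by_line and parse_whole_document are hypothetical. The sample text follows the shape of this article's multiline-zipcode.json, with the first record's City value taken from the output table further down.

```python
import json

# Sample in the shape of multiline-zipcode.json: a JSON array whose
# records each span several lines (assumed layout, for illustration).
MULTILINE_TEXT = """[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]"""


def parse_line_by_line(text):
    """Mimic a one-record-per-line reader: parse every non-empty line alone."""
    records, bad_lines = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # A fragment such as '{' or '"Zipcode": 704,' is not valid JSON.
            bad_lines.append(line)
    return records, bad_lines


def parse_whole_document(text):
    """Mimic multiline mode: parse the entire text as one JSON document."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


line_records, bad_lines = parse_line_by_line(MULTILINE_TEXT)
whole_records = parse_whole_document(MULTILINE_TEXT)

print("records parsed line by line:", len(line_records))
print("lines rejected line by line:", len(bad_lines))
print("records parsed as one document:", len(whole_records))
```

Parsed one line at a time, no line forms a valid record, while parsing the text as a whole recovers both records. This is the reason the multiline option is needed for files laid out this way.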

PySpark Read JSON multiple lines (Option multiline)

In this PySpark example, we set the multiline option to true to read JSON records that span multiple lines in a file. By default, this option is set to false.

Consider the JSON file below, named “multiline-zipcode.json”, whose records span multiple lines.

[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]
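For comparison, the default mode (multiline set to false) expects the same data laid out with one complete record per line. The following is a minimal sketch in plain Python, not a PySpark API, that rewrites a multiline JSON document into that layout. The helper name to_json_lines is hypothetical, and the sample is a shortened version of the records above.

```python
import json


def to_json_lines(multiline_text):
    """Rewrite a JSON array (or single object) as one compact record per line."""
    data = json.loads(multiline_text)
    records = data if isinstance(data, list) else [data]
    return "\n".join(json.dumps(record) for record in records)


# Shortened sample (assumption: only three of the fields are kept here).
sample = """[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "State": "PR"
}]"""

json_lines = to_json_lines(sample)
print(json_lines)
```

Each line of the result is a complete JSON object on its own, which is the layout a reader with the multiline option left at false can consume.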

Use read.option to set the multiline property as shown below.

spark.read.option("multiline", "true")

Example of using multiline option.

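# Read multiline json file
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("") \
    .getOrCreate()
# multiline=true lets a single record span several lines.
# The path assumes the file sits in the current working directory.
multiline_df = spark.read.option("multiline", "true") \
    .json("multiline-zipcode.json")
multiline_df.show(truncate=False)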

multiline_df.printSchema() yields the schema below.

root
 |-- City: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

The multiline_df.show(truncate=False) statement yields the output below.

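+-------------------+------------+-----+-----------+-------+
|City               |RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|2           |PR   |STANDARD   |704    |
|BDA SAN LUIS       |10          |PR   |STANDARD   |709    |
+-------------------+------------+-----+-----------+-------+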

This complete example is available at GitHub.

Happy Learning !!

Naveen (NNK) is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.
