PySpark Read Multiple Lines (multiline) JSON File

Last modified: March 27, 2024

Problem: How do you read a JSON file whose records span multiple lines in PySpark, using the multiline option?

Solution: The PySpark JSON data source provides the multiline option for reading records that span multiple lines. By default, PySpark expects each line of a JSON file to contain one complete, self-contained record (the JSON Lines format).
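To see why line-oriented parsing fails on a pretty-printed file, the sketch below imitates the idea with Python's standard json module. It is only an illustration, not Spark's actual parser; the sample text and the helper name parse_line_by_line are assumptions made for this example.

```python
import json

# Hypothetical sample: one JSON array whose records span several lines.
multiline_text = """[{
  "RecordNumber": 2,
  "Zipcode": 704
},
{
  "RecordNumber": 10,
  "Zipcode": 709
}]"""

# The same records in JSON Lines form: one complete record per line.
json_lines_text = (
    '{"RecordNumber": 2, "Zipcode": 704}\n'
    '{"RecordNumber": 10, "Zipcode": 709}'
)


def parse_line_by_line(text):
    """Parse every line as its own JSON document; collect lines that fail."""
    records, bad_lines = [], []
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line)
    return records, bad_lines


# Line-oriented parsing works when each line is a complete record ...
records, bad_lines = parse_line_by_line(json_lines_text)
print("JSON Lines:", len(records), "records,", len(bad_lines), "bad lines")

# ... but no single line of the multi-line document is a valid record.
records_ml, bad_ml = parse_line_by_line(multiline_text)
print("Multi-line:", len(records_ml), "records,", len(bad_ml), "bad lines")

# Parsing the whole text as one document recovers both records.
whole = json.loads(multiline_text)
print("Whole document:", len(whole), "records")
```

Reading the file as a single document is, roughly, what the multiline option asks Spark to do.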

PySpark Read JSON multiple lines (Option multiline)

In this PySpark example, we set the multiline option to true so that JSON records spread across multiple lines are read correctly. By default, this option is set to false.

Consider the following JSON file, named “multiline-zipcode.json”, in which each record spans multiple lines.


[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]
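If you want to reproduce the example locally, a file like this can be generated with Python's standard json module. This is a minimal sketch: the temporary directory is an assumption that stands in for the resources folder used later, and json.dump lays out the brackets slightly differently from the listing above, although the content is the same.

```python
import json
import os
import tempfile

# Records copied from the sample file shown above.
records = [
    {"RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD",
     "City": "PASEO COSTA DEL SUR", "State": "PR"},
    {"RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD",
     "City": "BDA SAN LUIS", "State": "PR"},
]

# Assumption: a temporary directory stands in for the "resources" folder.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "multiline-zipcode.json")

# indent=2 pretty-prints the array, so every record spans several lines.
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# Read the file back to confirm it is one valid JSON document.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(path, len(loaded))
```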

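Use read.option() to set the multiline property, as shown below. The Spark documentation spells the option multiLine; option names are case-insensitive, so either spelling works.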


spark.read.option("multiline", "true") 

Here is a complete example that uses the multiline option.


# Read a multiline JSON file
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Enable the multiline option so that records spanning several lines are parsed
multiline_df = spark.read.option("multiline", "true") \
    .json("resources/multiline-zipcode.json")

multiline_df.printSchema()
multiline_df.show()

multiline_df.printSchema() yields the following schema. Spark infers the column types from the data and lists the columns in alphabetical order.


root
 |-- City: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

The multiline_df.show() statement yields the following output.


+-------------------+------------+-----+-----------+-------+
|City               |RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|2           |PR   |STANDARD   |704    |
|BDA SAN LUIS       |10          |PR   |STANDARD   |709    |
+-------------------+------------+-----+-----------+-------+

This complete example is available on GitHub.

Happy Learning !!