PySpark Read Multiple Lines (multiline) JSON File

Problem: How do you read a JSON file whose records span multiple lines (using the multiline option) in PySpark?

Solution: The PySpark JSON data source provides the multiline option for reading records that span multiple lines. By default, PySpark expects each line of a JSON file to hold one complete, self-contained record.
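To make the default behavior concrete, here is a minimal sketch that uses only Python's standard json module (not Spark itself) to mimic line-by-line parsing. The helper name parse_line_by_line and the trimmed-down two-field records are assumptions made for illustration. It shows that one-record-per-line text parses cleanly line by line, while the same data spread across several lines only parses when the whole text is read as a single document.

```python
import json

# One complete record per line (the layout the JSON reader expects by default).
json_lines_text = (
    '{"RecordNumber": 2, "Zipcode": 704}\n'
    '{"RecordNumber": 10, "Zipcode": 709}\n'
)

# The same two records written as an array spread across multiple lines.
multiline_text = """[{
  "RecordNumber": 2,
  "Zipcode": 704
},
{
  "RecordNumber": 10,
  "Zipcode": 709
}]"""


def parse_line_by_line(text):
    """Parse every non-empty line as its own JSON document.

    A line that is not valid JSON on its own is recorded as None.
    (Hypothetical helper, used here only to illustrate the idea.)
    """
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            results.append(None)
    return results


line_results = parse_line_by_line(json_lines_text)       # two parsed records
multiline_results = parse_line_by_line(multiline_text)   # every line fails
whole_document = json.loads(multiline_text)              # parses as one document

print(line_results)
print(multiline_results)
print(whole_document)
```

Reading the whole text as one document is, roughly speaking, what the multiline option asks the reader to do for each file.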

PySpark Read JSON multiple lines (Option multiline)

In this PySpark example, we set the multiline option to true so that JSON records spread across multiple lines of the file are read correctly. By default, this option is set to false.

Consider the following JSON file, named “multiline-zipcode.json”, in which each record spans multiple lines.


[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]

Use read.option to set the multiline option, as shown below.


spark.read.option("multiline", "true") 

Example of using multiline option.


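# Read multiline json file
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Enable the multiline option so each record may span several lines
multiline_df = spark.read.option("multiline", "true") \
      .json("resources/multiline-zipcode.json")

# Print the inferred schema and the parsed rows
multiline_df.printSchema()
multiline_df.show()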

multiline_df.printSchema() yields the schema below.


root
 |-- City: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

multiline_df.show() yields the output below.


+-------------------+------------+-----+-----------+-------+
|City               |RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|2           |PR   |STANDARD   |704    |
|BDA SAN LUIS       |10          |PR   |STANDARD   |709    |
+-------------------+------------+-----+-----------+-------+

This complete example is available on GitHub.

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and manage Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.
