PySpark Read Multiple Lines (multiline) JSON File

Problem: How do you read a JSON file whose records span multiple lines (the multiline option) in PySpark, with a Python example?

Solution: The PySpark JSON data source API provides the multiline option to read records that span multiple lines. By default, PySpark expects every record in a JSON file to be a complete JSON object on a single line.
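
For contrast, below is a minimal sketch of the default single-line (JSON Lines) layout PySpark expects out of the box, where each record sits entirely on one line (the file name single-line-zipcode.json is hypothetical):


{"RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR"}
{"RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}

A file in this layout can be read with spark.read.json() without any extra options.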

PySpark Read JSON multiple lines (Option multiline)

In this PySpark example, we set the multiline option to true to read JSON records that span multiple lines in a file. By default, this option is set to false.

Let’s assume we have the below JSON file, named “multiline-zipcode.json”, whose records span multiple lines.


[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]

Use read.option() to set the multiline property as shown below.


spark.read.option("multiline", "true") 

Below is a complete example of using the multiline option.


# Read a multiline JSON file into a DataFrame
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Enable the multiline option and read the JSON file
multiline_df = spark.read.option("multiline", "true") \
      .json("resources/multiline-zipcode.json")
multiline_df.printSchema()
multiline_df.show()

multiline_df.printSchema() yields the below schema.


root
 |-- City: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

The multiline_df.show() statement yields the below output.


+-------------------+------------+-----+-----------+-------+
|City               |RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|2           |PR   |STANDARD   |704    |
|BDA SAN LUIS       |10          |PR   |STANDARD   |709    |
+-------------------+------------+-----+-----------+-------+
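
As an alternative to chaining .option(), the same behavior can be achieved by passing the multiLine keyword argument directly to json(). Below is a minimal sketch that assumes the same file path as above.


# Equivalent: pass multiLine=True directly to the json() reader
multiline_df2 = spark.read.json("resources/multiline-zipcode.json", multiLine=True)
multiline_df2.show()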

The complete example is available on GitHub.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium