Problem: How do you read a JSON file whose records span multiple lines in PySpark, using the multiline option, with a Python example?
Solution: The PySpark JSON data source API provides the multiline option for reading records that span multiple lines. By default, PySpark expects every record in a JSON file to be a complete JSON object on a single line.
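To see why the default matters, here is a minimal sketch in plain Python. It is only an analogy and does not use Spark's parser: the helper parse_line_by_line and the shortened two-field records are assumptions made for illustration. It parses each line as its own JSON document, which works for one-record-per-line input and fails for a pretty-printed array.

```python
import json

# One record per line: the layout a line-oriented reader expects.
json_lines = '{"Zipcode": 704, "State": "PR"}\n{"Zipcode": 709, "State": "PR"}\n'

# The same records as one array spread over several lines.
pretty_printed = """[{
"Zipcode": 704,
"State": "PR"
},
{
"Zipcode": 709,
"State": "PR"
}]"""


def parse_line_by_line(text):
    """Parse every non-empty line as a separate JSON document.

    A line that is not valid JSON on its own is recorded as None.
    (Hypothetical helper, for illustration only.)
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            records.append(None)
    return records


line_records = parse_line_by_line(json_lines)        # every line parses
broken_records = parse_line_by_line(pretty_printed)  # no line parses on its own
whole_document = json.loads(pretty_printed)          # parsing the whole text works
```

Reading the whole text as one document is, loosely speaking, what the multiline option asks for.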
PySpark Read JSON multiple lines (Option multiline)
In this PySpark example, we set the multiline option to true to read JSON records that span multiple lines in a file. By default, this option is set to false.
Let’s consider the JSON file below, named “multiline-zipcode.json”, in which each record spans multiple lines.
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
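To follow along, the sample file can be created with a short script. This is a sketch using only Python's standard library. The function name write_multiline_json is hypothetical, and indent=2 produces slightly different whitespace than the listing above while still spreading each record across several lines.

```python
import json
from pathlib import Path

# The two records from the sample file above.
records = [
    {
        "RecordNumber": 2,
        "Zipcode": 704,
        "ZipCodeType": "STANDARD",
        "City": "PASEO COSTA DEL SUR",
        "State": "PR",
    },
    {
        "RecordNumber": 10,
        "Zipcode": 709,
        "ZipCodeType": "STANDARD",
        "City": "BDA SAN LUIS",
        "State": "PR",
    },
]


def write_multiline_json(path):
    """Write the records as one pretty-printed JSON array (hypothetical helper)."""
    target = Path(path)
    # Create the parent folder (for example "resources") if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return target
```

Calling write_multiline_json("resources/multiline-zipcode.json") would place the file at the path used in the example further down.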
Use read.option to set the multiline property, as shown below.
spark.read.option("multiline", "true")
The complete example below reads the file with the multiline option enabled.
# Read a multiline JSON file
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Enable multiline so that a record may span several lines
multiline_df = spark.read.option("multiline", "true") \
    .json("resources/multiline-zipcode.json")
multiline_df.printSchema()
multiline_df.show()
multiline_df.printSchema() yields the schema below.
root
|-- City: string (nullable = true)
|-- RecordNumber: long (nullable = true)
|-- State: string (nullable = true)
|-- ZipCodeType: string (nullable = true)
|-- Zipcode: long (nullable = true)
The multiline_df.show() statement yields the output below.
+-------------------+------------+-----+-----------+-------+
|City               |RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|2           |PR   |STANDARD   |704    |
|BDA SAN LUIS       |10          |PR   |STANDARD   |709    |
+-------------------+------------+-----+-----------+-------+
This complete example is available at GitHub.
Happy Learning !!