PySpark printSchema() to String or JSON


Post author: Naveen Nelamali
Post category: PySpark
Post last modified: March 27, 2024
Reading time: 5 mins read


How do you export the Spark/PySpark printSchema() result to a String or JSON? As you know, printSchema() prints the schema to the console or log depending on how you are running the job; however, sometimes you may need to convert it into a String or save it as a JSON file. In this article, I will explain how to convert the printSchema() result to a String and how to convert the PySpark DataFrame schema to JSON.


First, let’s create a DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
                    
# Create DataFrame                
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]

df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

This yields the schema below.


# printSchema() result
root
 |-- language: string (nullable = true)
 |-- fee: long (nullable = true)

1. Save PySpark printSchema() result to String

Now let’s save this printSchema() result to a string variable. If you look at the PySpark source code of printSchema(), it internally does the following.


# printSchema() internally uses below line
print(self._jdf.schema().treeString())

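So, you can save the printSchema() result to a string as follows. Keep in mind that _jdf is an internal attribute (the underlying Java DataFrame), so this approach depends on PySpark internals.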


# Save printSchema() result to String
schemaString = df._jdf.schema().treeString()
print(schemaString)
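
If you would rather not rely on the internal _jdf attribute, another option is to capture whatever printSchema() writes to standard output. The sketch below uses Python's contextlib.redirect_stdout and assumes printSchema() prints through Python's print(), as in the source line shown above. The helper name capture_printed and the stand-in function fake_print_schema are assumptions made for illustration so that the snippet runs without a Spark session; with a real DataFrame you would pass df.printSchema instead.

```python
import io
from contextlib import redirect_stdout


def capture_printed(fn, *args, **kwargs):
    """Run fn and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        fn(*args, **kwargs)
    return buffer.getvalue()


# Stand-in for df.printSchema (hypothetical, for illustration only)
def fake_print_schema():
    print("root")
    print(" |-- language: string (nullable = true)")
    print(" |-- fee: long (nullable = true)")


# Capture the printed schema into a string variable
schema_string = capture_printed(fake_print_schema)
print(schema_string)

# With a real DataFrame, the call would look like this:
# schema_string = capture_printed(df.printSchema)
```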

2. Convert printSchema() result to JSON

To convert the schema shown by printSchema() to JSON, use the DataFrame.schema.json() method. The DataFrame.schema attribute holds the schema of the DataFrame as a StructType, and schema.json() returns it as a JSON string.


# Using schema.json()
print(df.schema.json())

This prints the DataFrame schema as a JSON string.


# Schema in JSON
{"fields":[{"metadata":{},"name":"language","nullable":true,"type":"string"},{"metadata":{},"name":"fee","nullable":true,"type":"long"}],"type":"struct"}
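
Since the goal is often a JSON file rather than console output, here is a minimal sketch that writes the JSON string to disk and reads it back with Python's json module. To keep it runnable without Spark, the schema_json literal is copied from the output above and stands in for df.schema.json(); the file name schema.json and the temporary directory are assumptions for illustration.

```python
import json
import os
import tempfile

# Stand-in for df.schema.json(); copied from the output shown above
schema_json = (
    '{"fields":[{"metadata":{},"name":"language","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"fee","nullable":true,"type":"long"}],"type":"struct"}'
)

# Write the schema to a JSON file (hypothetical file name, temp directory)
schema_path = os.path.join(tempfile.mkdtemp(), "schema.json")
with open(schema_path, "w") as f:
    f.write(schema_json)

# Read the file back and parse it into a Python dictionary
with open(schema_path) as f:
    schema_dict = json.load(f)

# Inspect each field's name, type, and nullability
for field in schema_dict["fields"]:
    print(field["name"], field["type"], field["nullable"])
```

In PySpark, a dictionary loaded this way can be turned back into a schema with StructType.fromJson(schema_dict) from pyspark.sql.types.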

3. DataFrame.schema to String

Alternatively, you can use the DataFrame.schema.simpleString() method to convert the schema to a compact, single-line String.


# Using schema.simpleString()
print(df.schema.simpleString())

This yields the output below.


# Schema to string
struct<language:string,fee:bigint>

Happy Learning !!
