
How to export the Spark/PySpark printSchema() result to a String or JSON? As you know, printSchema() prints the schema to the console or log, depending on how you are running the application. However, sometimes you may be required to capture it as a String or write it to a JSON file. In this article, I will explain how to convert the printSchema() result to a String and how to convert the PySpark DataFrame schema to JSON.

First, let’s create a DataFrame.

# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

# Create DataFrame
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]

df = spark.createDataFrame(data).toDF(*columns)
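To inspect the schema, call printSchema() on the DataFrame.

# Display the DataFrame schema on the console
df.printSchema()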

Yields below schema.

# printSchema() result
root
 |-- language: string (nullable = true)
 |-- fee: long (nullable = true)

1. Save PySpark printSchema() result to String

Now let’s save this printSchema() result to a string variable. If you look at the source code of printSchema(), it internally does the following.

# printSchema() internally uses the line below
print(self._jdf.schema().treeString())

So, you can save the printSchema() result to a string as shown below.

# Save printSchema() result to String
schemaString = df._jdf.schema().treeString()
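For example, you can print the captured string or write it to a text file; the file path below is just an illustration.

# Print the captured schema tree (same text that printSchema() writes to the console)
print(schemaString)

# Optionally write the schema tree to a text file; the path is an example
with open("/tmp/df_schema_tree.txt", "w") as f:
    f.write(schemaString)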

2. Convert printSchema() result to JSON

In order to convert the schema (printSchema() result) to JSON, use the DataFrame.schema.json() method. The DataFrame.schema attribute holds the schema of the DataFrame, and schema.json() returns it as a JSON-formatted string.

# Using schema.json()
print(df.schema.json())

This prints the DataFrame schema as a JSON string.

# Schema in JSON
{"fields":[{"metadata":{},"name":"language","nullable":true,"type":"string"},{"metadata":{},"name":"fee","nullable":true,"type":"long"}],"type":"struct"}
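Since a common use case is persisting the schema as a JSON file, here is a minimal sketch (the file path is just an example) that writes the JSON schema to a file and rebuilds a StructType from it using StructType.fromJson().

import json
from pyspark.sql.types import StructType

# Write the schema JSON to a file; the path is an example
with open("/tmp/df_schema.json", "w") as f:
    f.write(df.schema.json())

# Read the JSON back and rebuild the schema as a StructType
with open("/tmp/df_schema.json") as f:
    restored_schema = StructType.fromJson(json.load(f))

print(restored_schema)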

3. DataFrame.schema to String

Alternatively, you can use the DataFrame.schema.simpleString() method to convert the schema to a String.

# Using schema.simpleString()
print(df.schema.simpleString())

Yields below output.

# Schema to string
struct<language:string,fee:bigint>
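One illustrative use of simpleString() is a quick schema comparison between two DataFrames, since this compact form is easy to log and compare; this is just a sketch.

# Compare schemas of two DataFrames using their compact string form
df2 = spark.createDataFrame(data).toDF(*columns)
if df.schema.simpleString() == df2.schema.simpleString():
    print("Schemas match")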

Happy Learning !!
