PySpark printSchema() to String or JSON

How do you export the result of Spark/PySpark printSchema() to a String or to JSON? printSchema() prints the schema to the console or to the log, depending on how you run the job, but sometimes you need the schema as a String value or in a JSON file instead. In this article, I will explain how to get the printSchema() result as a String and how to convert the PySpark DataFrame schema to JSON.

First, let’s create a DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
                    
# Create DataFrame                
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]

df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

This yields the schema below.


# printSchema() result
root
 |-- language: string (nullable = true)
 |-- fee: long (nullable = true)

1. Save PySpark printSchema() result to String

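Now let’s save this printSchema() result to a string variable. In classic PySpark, printSchema() is a thin wrapper: it asks the underlying JVM DataFrame for the tree-formatted schema string and prints it. The exact source can differ between releases, but it has essentially looked like the following.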


# printSchema() internally does essentially this (details vary by release)
print(self._jdf.schema().treeString())

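So you can get the same text as a string by calling treeString() yourself instead of printing it. Note that _jdf is a private attribute that refers to the JVM DataFrame, so it is not part of the public API and may not be available in every deployment (for example, under Spark Connect).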


# Save printSchema() result to String
schemaString = df._jdf.schema().treeString()
print(schemaString)
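
Because printSchema() writes through Python's print(), another option is to capture standard output while it runs, which avoids touching the private _jdf attribute. The following is a minimal sketch using only the standard library; the helper name capture_stdout is hypothetical (it is not part of PySpark), and the lambda is a stand-in so the snippet runs without a Spark session.

```python
import io
from contextlib import redirect_stdout


def capture_stdout(print_fn):
    """Run a zero-argument callable and return everything it printed."""
    buffer = io.StringIO()
    # Temporarily point sys.stdout at the in-memory buffer
    with redirect_stdout(buffer):
        print_fn()
    return buffer.getvalue()


# Stand-in for df.printSchema so the example is self-contained
captured = capture_stdout(lambda: print("root\n |-- language: string (nullable = true)"))
print(captured)
```

With the DataFrame created earlier, the call would be capture_stdout(df.printSchema). This only captures text written by Python itself; output produced directly by the JVM is not redirected.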

2. Convert printSchema() result to JSON

To convert the schema (the printSchema() result) to JSON, use the DataFrame.schema.json() method. The DataFrame.schema property holds the schema of the DataFrame as a StructType, and schema.json() returns that schema as a JSON-formatted string.


# Using schema.json()
print(df.schema.json())

This prints the DataFrame schema as a JSON string.


# Schema in JSON
{"fields":[{"metadata":{},"name":"language","nullable":true,"type":"string"},{"metadata":{},"name":"fee","nullable":true,"type":"long"}],"type":"struct"}
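
Since the goal mentioned at the start is a JSON file, here is a minimal sketch of writing that string to disk and reading it back with the standard json module. The schema string is copied from the output above (in a live session it would come from df.schema.json()), and the file name language_fee_schema.json is a hypothetical choice for illustration.

```python
import json
import tempfile
from pathlib import Path

# Schema JSON copied from the output above; in a live session this
# would be the value returned by df.schema.json().
schema_json = (
    '{"fields":[{"metadata":{},"name":"language","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"fee","nullable":true,"type":"long"}],"type":"struct"}'
)

# Hypothetical output location; any writable path works.
schema_path = Path(tempfile.gettempdir()) / "language_fee_schema.json"
schema_path.write_text(schema_json, encoding="utf-8")

# Read the file back and parse it into a plain Python dict
loaded = json.loads(schema_path.read_text(encoding="utf-8"))

# Map each column name to its type
field_types = {field["name"]: field["type"] for field in loaded["fields"]}
print(field_types)  # {'language': 'string', 'fee': 'long'}
```

In a Spark session, the parsed dictionary can be turned back into a schema object with StructType.fromJson(loaded) from pyspark.sql.types.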

3. DataFrame.schema to String

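Alternatively, you can use the DataFrame.schema.simpleString() method, which returns the schema as a compact, single-line String. Note that this form uses the SQL type name, so the long column shown by printSchema() appears as bigint.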


# Using schema.simpleString()
print(df.schema.simpleString())

This yields the output below.


# Schema to string
struct<language:string,fee:bigint>

Happy Learning !!

Naveen (NNK)
