  • Post last modified:March 27, 2024
PySpark printSchema() Example

pyspark.sql.DataFrame.printSchema() prints the schema of a DataFrame in tree format, showing the name, data type, and nullability of each column. If the DataFrame has a nested structure, the schema is displayed as a nested tree.

1. printSchema() Syntax

Following is the syntax of the printSchema() method. Called without arguments, it prints the complete schema of the PySpark DataFrame. Since Spark 3.5 it also accepts an optional level argument that limits how many nesting levels are printed.


# printSchema() Syntax
DataFrame.printSchema()

2. PySpark printSchema() Example

First, let’s create a PySpark DataFrame with column names.


# Create DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
df = spark.createDataFrame(data).toDF(*columns)

The above example creates a DataFrame with two columns, language and fee. Since we have not specified any data types, Spark infers the type of each column from the values in data. Now let's use printSchema(), which prints the schema of the DataFrame to the console or driver logs.


# Print Schema
df.printSchema()

# Output
#root
# |-- language: string (nullable = true)
# |-- fee: long (nullable = true)

Now let’s assign a data type to each column by using PySpark StructType and StructField.


# With Specific data types
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

schema = StructType([ \
    StructField("language",StringType(),True), \
    StructField("fee",IntegerType(),True)
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This yields output similar to the above, except that fee is now an integer column instead of long. To display the contents of the DataFrame, use the PySpark show() method.


# Output
root
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)

3. printSchema() with Nested Structure

While working with DataFrames we often need nested struct columns, which are defined using StructType. In the example below, the data type of the name column is itself a StructType, which makes it a nested column.

The printSchema() method shows StructType columns as struct.


# Nested structure
schema = StructType([ \
    StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),                 
    StructField("language",StringType(),True), \
    StructField("fee",IntegerType(),True)
  ])
data =  [(("James","","Smith"),"Java",20000),
         (("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This yields the following output.


# Output
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)

4. Using ArrayType and MapType

StructType also supports ArrayType and MapType, which define DataFrame columns for array and map collections respectively. In the example below, the languages column is defined as ArrayType(StringType) and the properties column as MapType(StringType,StringType), meaning both key and value are strings.


# Using ArrayType & MapType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, MapType
schema = StructType([
       StructField('name', StringType(), True),
       StructField('languages', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])

data =  [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
         ("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This outputs the schema below. Note that the languages field is an array type and the properties field is a map type.


# Output
root
 |-- name: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Complete Example


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
# Example 1 - printSchema()                    
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]

df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

# Example 2 - Using StructType & StructField
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

schema = StructType([ \
    StructField("language",StringType(),True), \
    StructField("fee",IntegerType(),True)
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

# Example 3 - Using Nested StructType
schema = StructType([ \
    StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),                 
    StructField("language",StringType(),True), \
    StructField("fee",IntegerType(),True)
  ])
data =  [(("James","","Smith"),"Java",20000),
         (("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

# Example 4 - Using MapType & ArrayType
from pyspark.sql.types import  StringType, ArrayType,MapType
schema = StructType([
       StructField('name', StringType(), True),
       StructField('languages', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])

data =  [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
         ("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

Conclusion

In this article, you have learned the syntax and usage of the PySpark printSchema() method through several examples, including how printSchema() displays the schema of a DataFrame that has nested structure, array, and map (dict) types.

Happy Learning !!

Naveen Nelamali
