PySpark printSchema() Example

The printSchema() method in PySpark is a helpful function for displaying the schema of a DataFrame in a readable, hierarchical format. It provides a detailed view of the DataFrame's structure, including column names, their data types, and whether they are nullable.

Using printSchema() is particularly important when working with large datasets or complex data transformations, as it allows you to quickly verify the schema after performing operations like reading data from a source, applying transformations, or joining multiple DataFrames. This can help catch errors early, such as incorrect data types or unexpected null values.
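
For instance, a quick printSchema() call right after a cast or a join can confirm the result is what you expect. The sketch below is illustrative only; it assumes a DataFrame df with a fee column like the one created in the next section.


# Illustrative sketch: verify a type change took effect
# (assumes an existing DataFrame df with a "fee" column)
from pyspark.sql.functions import col

df2 = df.withColumn("fee", col("fee").cast("double"))
df2.printSchema()   # confirm fee is now double before continuing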

1. printSchema() Syntax

The following is the syntax of the printSchema() method. It takes no parameters and prints the schema of the PySpark DataFrame to the console or logs.


# printSchema() Syntax
DataFrame.printSchema()
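
Note that printSchema() prints to the console and returns None. If you need the schema as an object instead, the DataFrame.schema property returns the underlying StructType (shown here with the df created in the next section).


# Schema as an object rather than printed text
print(df.schema)               # StructType([StructField('language', StringType(), True), ...])
print(df.schema.fieldNames())  # ['language', 'fee']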

2. PySpark printSchema() Example

First, let’s create a PySpark DataFrame with column names.


# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession and DataFrame
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

columns = ["language","fee"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
df = spark.createDataFrame(data).toDF(*columns)

The above example creates a DataFrame with two columns, language and fee. Since we have not specified data types, Spark infers the type of each column from the column values; here the integer fee values are inferred as long. Now let's use printSchema(), which displays the schema of the DataFrame on the console or in the logs.


# Print Schema
df.printSchema()

# Output
#root
# |-- language: string (nullable = true)
# |-- fee: long (nullable = true)
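
Since the schema was inferred, the integer fee values came out as long (LongType). For a programmatic check of the inferred types, df.dtypes returns the same information as (column, type) pairs.


# Programmatic check of inferred types
print(df.dtypes)   # [('language', 'string'), ('fee', 'bigint')]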

Now let’s assign a data type to each column by using PySpark StructType and StructField.


# With Specific data types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This yields a similar output to the one above. To display the contents of the DataFrame, use the PySpark show() method, as shown after the output below.


# Output
root
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)
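
As mentioned, show() displays the data itself rather than the schema. For the sample data used here, it prints something like the following.


# Display DataFrame contents
df.show()

# +--------+------+
# |language|   fee|
# +--------+------+
# |    Java| 20000|
# |  Python|100000|
# |   Scala|  3000|
# +--------+------+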

3. printSchema() with Nested Structure

While working with DataFrames, we often need to handle nested struct columns, which can be defined using StructType. In the example below, the column name has a StructType data type, i.e., it is nested.

The printSchema() method on a PySpark DataFrame shows StructType columns as struct.


# Nested structure
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
])
data =  [(("James","","Smith"),"Java",20000),
         (("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This yields the output below.


# Output
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)
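
Once a nested schema is in place, individual struct fields can be referenced with dot notation. The short sketch below selects one nested field and prints the schema of the projection.


# Select a nested field with dot notation
df.select("name.firstname", "language").printSchema()

# Output
# root
#  |-- firstname: string (nullable = true)
#  |-- language: string (nullable = true)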

4. Using ArrayType and MapType

StructType also supports ArrayType and MapType for defining DataFrame columns that hold array and map collections, respectively. In the example below, the column languages is defined as ArrayType(StringType) and properties as MapType(StringType, StringType), meaning both keys and values are strings.


# Using ArrayType & MapType
from pyspark.sql.types import StringType, ArrayType, MapType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('languages', ArrayType(StringType()), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])

data =  [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
         ("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This outputs the schema below. Note that the field languages is an array type and properties is a map type.


# Output
root
 |-- name: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
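
Array elements and map values from such columns can be accessed by index and key, respectively. A brief sketch using the sample data above:


# Access an array element and a map value
from pyspark.sql.functions import col

df.select(col("languages")[0].alias("first_language"),
          col("properties")["hair"].alias("hair")).printSchema()

# Output
# root
#  |-- first_language: string (nullable = true)
#  |-- hair: string (nullable = true)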

5. Complete Example


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
# Example 1 - printSchema()                    
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]

df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

# Example 2 - Using StructType & StructField
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

# Example 3 - Using Nested StructType
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
])
data =  [(("James","","Smith"),"Java",20000),
         (("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

# Example 4 - Using MapType & ArrayType
from pyspark.sql.types import StringType, ArrayType, MapType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('languages', ArrayType(StringType()), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])

data =  [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
         ("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

Conclusion

In this article, you have learned the syntax and usage of the PySpark printSchema() method through several examples, including how printSchema() displays the schema of a DataFrame containing nested struct, array, and map (dict) types.

printSchema() is an essential tool in PySpark for inspecting and verifying the structure of DataFrames, ensuring data integrity, and aiding the development of robust data processing pipelines.

Happy Learning !!