PySpark printSchema() Example

pyspark.sql.DataFrame.printSchema() is used to print or display the schema of the DataFrame in tree format, along with each column name and data type. If the DataFrame has a nested structure, it displays the schema in a nested tree format.

1. printSchema() Syntax

Following is the syntax of the printSchema() method. It takes no required parameters and prints/displays the schema of the PySpark DataFrame to the console (since Spark 3.1, an optional level argument limits how many nesting levels are printed).


# printSchema() Syntax
DataFrame.printSchema()

2. PySpark printSchema() Example

First, let’s create a PySpark DataFrame with column names.


# Create DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
df = spark.createDataFrame(data).toDF(*columns)

The above example creates a DataFrame with two columns, language and fee. Since we have not specified data types, Spark infers the data type of each column from the column values (data). Now let's use printSchema(), which displays the schema of the DataFrame on the console or in the logs.


# Print Schema
df.printSchema()

# Output
#root
# |-- language: string (nullable = true)
# |-- fee: long (nullable = true)

Now let’s assign a data type to each column by using PySpark StructType and StructField.


# With Specific data types
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

schema = StructType([
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This yields similar output as above. To display the contents of the DataFrame, use the show() method.


# Output
root
 |-- language: string (nullable = true)
 |-- fee: int (nullable = true)

3. printSchema() with Nested Structure

While working with DataFrames, we often need to handle nested struct columns, which can be defined using StructType. In the example below, the data type of the name column is StructType, which is nested.

The printSchema() method on the PySpark DataFrame shows StructType columns as struct.


# Nested structure
schema = StructType([
    StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
  ])
data =  [(("James","","Smith"),"Java",20000),
         (("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

Yields below output.


# Output
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- language: string (nullable = true)
 |-- fee: integer (nullable = true)

4. Using ArrayType and MapType

StructType also supports ArrayType and MapType to define DataFrame columns for array and map collections, respectively. In the example below, the languages column is defined as ArrayType(StringType) and properties as MapType(StringType, StringType), meaning both keys and values are strings.


# Using ArrayType & MapType
from pyspark.sql.types import StringType, ArrayType,MapType
schema = StructType([
       StructField('name', StringType(), True),
       StructField('languages', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])

data =  [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
         ("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This outputs the schema below. Note that the languages field is of array type and properties is of map type.


# Output
root
 |-- name: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Complete Example


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
# Example 1 - printSchema()                    
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]

df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

# Example 2 - Using StructType & StructField
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

schema = StructType([
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

# Example 3 - Using Nested StructType
schema = StructType([
    StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
    StructField("language", StringType(), True),
    StructField("fee", IntegerType(), True)
  ])
data =  [(("James","","Smith"),"Java",20000),
         (("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

# Example 4 - Using MapType & ArrayType
from pyspark.sql.types import  StringType, ArrayType,MapType
schema = StructType([
       StructField('name', StringType(), True),
       StructField('languages', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])

data =  [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
         ("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

Conclusion

In this article, you have learned the syntax and usage of the PySpark printSchema() method with several examples, including how printSchema() displays the schema of a DataFrame that has nested structure, array, and map (dict) types.

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.
