pyspark.sql.DataFrame.printSchema() prints the schema of a DataFrame in a tree format, showing each column's name, data type, and nullability. If the DataFrame has a nested structure, the schema is displayed as a nested tree.
1. printSchema() Syntax
The syntax of the printSchema() method is shown below. It is called here without arguments and prints the schema of the PySpark DataFrame. Spark 3.5 and later also accept an optional level argument that limits how many nesting levels are printed.
# printSchema() Syntax
DataFrame.printSchema()
2. PySpark printSchema() Example
First, let’s create a PySpark DataFrame with column names.
# Create DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
.appName('SparkByExamples.com') \
.getOrCreate()
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
df = spark.createDataFrame(data).toDF(*columns)
The above example creates a DataFrame with two columns, language and fee. Since we have not specified the data types, Spark infers the type of each column from the column values (data). Now let's use printSchema(), which displays the schema of the DataFrame on the console or in the logs.
# Print Schema
df.printSchema()
# Output
#root
# |-- language: string (nullable = true)
# |-- fee: long (nullable = true)
Now let’s assign a data type to each column by using PySpark StructType and StructField.
# With Specific data types
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
schema = StructType([ \
StructField("language",StringType(),True), \
StructField("fee",IntegerType(),True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
This yields output similar to the above, except that fee now uses the declared integer type instead of the inferred long. To display the contents of the DataFrame, use the PySpark show() method.
# Output
root
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
3. printSchema() with Nested Structure
While working with DataFrames, we often need nested struct columns, which can be defined using StructType. In the example below, the data type of the name column is a nested StructType. The printSchema() method on the PySpark DataFrame shows StructType columns as struct.
# Nested structure
schema = StructType([ \
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField("language",StringType(),True), \
StructField("fee",IntegerType(),True)
])
data = [(("James","","Smith"),"Java",20000),
(("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
This yields the output below.
# Output
root
|-- name: struct (nullable = true)
|    |-- firstname: string (nullable = true)
|    |-- middlename: string (nullable = true)
|    |-- lastname: string (nullable = true)
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
4. Using ArrayType and MapType
StructType also supports ArrayType and MapType for defining DataFrame columns that hold array and map collections, respectively. In the example below, the column languages is defined as ArrayType(StringType) and properties is defined as MapType(StringType,StringType), meaning both key and value are strings.
# Using ArrayType & MapType
from pyspark.sql.types import StringType, ArrayType,MapType
schema = StructType([
StructField('name', StringType(), True),
StructField('languages', ArrayType(StringType()), True),
StructField('properties', MapType(StringType(),StringType()), True)
])
data = [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
This outputs the schema below. Note that the field languages is of array type and properties is of map type.
# Output
root
|-- name: string (nullable = true)
|-- languages: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- properties: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)
Complete Example
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
.appName('SparkByExamples.com') \
.getOrCreate()
# Example 1 - printSchema()
columns = ["language","fee"]
data = [("Java", 20000), ("Python", 10000), ("Scala", 10000)]
df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()
# Example 2 - Using StructType & StructField
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
schema = StructType([ \
StructField("language",StringType(),True), \
StructField("fee",IntegerType(),True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
# Example 3 - Using Nested StructType
schema = StructType([ \
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField("language",StringType(),True), \
StructField("fee",IntegerType(),True)
])
data = [(("James","","Smith"),"Java",20000),
(("Michael","Rose",""),"Python",10000)]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
# Example 4 - Using MapType & ArrayType
from pyspark.sql.types import StringType, ArrayType,MapType
schema = StructType([
StructField('name', StringType(), True),
StructField('languages', ArrayType(StringType()), True),
StructField('properties', MapType(StringType(),StringType()), True)
])
data = [("James",["Java","Scala"],{'hair':'black','eye':'brown'}),
("Michael",["Python","PHP"],{'hair':'brown','eye':None})]
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
Conclusion
In this article, you have learned the syntax and usage of the PySpark printSchema() method with several examples, including how printSchema() displays the schema of a DataFrame that has nested structure, array, and map (dict) types.
Happy Learning !!