Using the PySpark select() and selectExpr() transformations, one can select the nested struct columns from the DataFrame. When working with semi-structured files like JSON or structured files like Avro, Parquet, or ORC, we often have to deal with complex nested structures.
When you read these files into DataFrame, all nested structure elements are converted into struct type StructType. Read Understand PySpark StructType for a better understanding of StructType.
Nested columns in PySpark refer to columns that contain complex data types such as StructType, ArrayType, MapType, or combinations thereof. These complex data types allow you to represent structured or nested data within a single DataFrame column.
When dealing with a struct (StructType) column in a PySpark DataFrame, you should use a specific column qualifier to access the nested struct columns.
Related Articles: Flatten Nested Struct Columns
To explain this scenario with an example, create a DataFrame with a nested column.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Data
data = [
(("James",None,"Smith"),"OH","M"),
(("Anna","Rose",""),"NY","F"),
(("Julia","","Williams"),"OH","F"),
(("Maria","Anne","Jones"),"NY","M"),
(("Jen","Mary","Brown"),"NY","M"),
(("Mike","Mary","Williams"),"OH","M")
]
# Schema
schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('state', StringType(), True),
StructField('gender', StringType(), True)
])
# Create DataFrame
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # shows all columns
In the above example, the column name
is a struct
type, which is a nested column that has child columns firstname
, middlename
, and lastname
.
# Output:
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- state: string (nullable = true)
|-- gender: string (nullable = true)
+----------------------+-----+------+
|name |state|gender|
+----------------------+-----+------+
|[James,, Smith] |OH |M |
|[Anna, Rose, ] |NY |F |
|[Julia, , Williams] |OH |F |
|[Maria, Anne, Jones] |NY |M |
|[Jen, Mary, Brown] |NY |M |
|[Mike, Mary, Williams]|OH |M |
+----------------------+-----+------+
Now, using the select(), select the struct column and see what it returns.
# Select struct type
df2.select("name").show()
This returns the nested column name
as is. PySpark may truncate long nested columns in the output. To display nested columns without truncation, you can pass the truncate=False
parameter to the show()
method.
# Output:
+----------------------+
|name |
+----------------------+
|[James,, Smith] |
|[Anna, Rose, ] |
|[Julia, , Williams] |
|[Maria, Anne, Jones] |
|[Jen, Mary, Brown] |
|[Mike, Mary, Williams]|
+----------------------+
To select nested columns in PySpark DataFrame, you can use the dot notation or the select()
method with the appropriate column qualifier.
# Select columns from struct type
df2.select("name.firstname","name.lastname").show(truncate=False)
This outputs the columns firstname
and lastname
from the struct column.
# Output:
+---------+--------+
|firstname|lastname|
+---------+--------+
|James |Smith |
|Anna | |
|Julia |Williams|
|Maria |Jones |
|Jen |Brown |
|Mike |Williams|
+---------+--------+
To extract all columns from a struct column from a DataFrame, you can use the select()
method along with the *
operator.
# Extract all columns from struct type
df2.select("name.*").show(truncate=False)
This results in the below output.
# Output:
+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|James |null |Smith |
|Anna |Rose | |
|Julia | |Williams|
|Maria |Anne |Jones |
|Jen |Mary |Brown |
|Mike |Mary |Williams|
+---------+----------+--------+
Happy Learning !!