Using PySpark select() transformations one can select the nested struct columns from DataFrame. While working with semi-structured files like JSON or structured files like Avro, Parquet, ORC we often have to deal with complex nested structures.
When you read these files into DataFrame, all nested structure elements are converted into struct type StructType. Read Understand PySpark StructType for a better understanding of StructType.
If you have a struct (StructType) column on PySpark DataFrame, you need to use an explicit column qualifier in order to select the nested struct columns.
Related Articles: Flatten Nested Struct Columns
First, let’s create a new DataFrame with a struct type.
data = [
(("James",None,"Smith"),"OH","M"),
(("Anna","Rose",""),"NY","F"),
(("Julia","","Williams"),"OH","F"),
(("Maria","Anne","Jones"),"NY","M"),
(("Jen","Mary","Brown"),"NY","M"),
(("Mike","Mary","Williams"),"OH","M")
]
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('state', StringType(), True),
StructField('gender', StringType(), True)
])
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # shows all columns
If you notice the column name
is a struct
type which consists of nested columns firstname
, middlename
, lastname
.
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- state: string (nullable = true)
|-- gender: string (nullable = true)
+----------------------+-----+------+
|name |state|gender|
+----------------------+-----+------+
|[James,, Smith] |OH |M |
|[Anna, Rose, ] |NY |F |
|[Julia, , Williams] |OH |F |
|[Maria, Anne, Jones] |NY |M |
|[Jen, Mary, Brown] |NY |M |
|[Mike, Mary, Williams]|OH |M |
+----------------------+-----+------+
Now, let’s select struct column as-is.
df2.select("name").show(truncate=False)
This returns struct column name
as is.
+----------------------+
|name |
+----------------------+
|[James,, Smith] |
|[Anna, Rose, ] |
|[Julia, , Williams] |
|[Maria, Anne, Jones] |
|[Jen, Mary, Brown] |
|[Mike, Mary, Williams]|
+----------------------+
In order to get the specific column from a struct, you need to explicitly qualify.
df2.select("name.firstname","name.lastname").show(truncate=False)
This outputs firstname
and lastname
from the name struct column.
+---------+--------+
|firstname|lastname|
+---------+--------+
|James |Smith |
|Anna | |
|Julia |Williams|
|Maria |Jones |
|Jen |Brown |
|Mike |Williams|
+---------+--------+
In order to get all columns from struct column.
df2.select("name.*").show(truncate=False)
This yields below output.
+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|James |null |Smith |
|Anna |Rose | |
|Julia | |Williams|
|Maria |Anne |Jones |
|Jen |Mary |Brown |
|Mike |Mary |Williams|
+---------+----------+--------+
Happy Learning !!