PySpark Select Nested struct Columns

Using PySpark select() transformations one can select the nested struct columns from DataFrame. While working with semi-structured files like JSON or structured files like Avro, Parquet, ORC we often have to deal with complex nested structures.

When you read these files into DataFrame, all nested structure elements are converted into struct type StructType. Read Understand PySpark StructType for a better understanding of StructType.

If you have a struct (StructType) column on PySpark DataFrame, you need to use an explicit column qualifier in order to select the nested struct columns.

Related Articles: Flatten Nested Struct Columns

First, let’s create a new DataFrame with a struct type.

data = [

from pyspark.sql.types import StructType,StructField, StringType        
schema = StructType([
    StructField('name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True)
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema() # shows all columns

If you notice the column name is a struct type which consists of nested columns firstname, middlename, lastname.

 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

|name                  |state|gender|
|[James,, Smith]       |OH   |M     |
|[Anna, Rose, ]        |NY   |F     |
|[Julia, , Williams]   |OH   |F     |
|[Maria, Anne, Jones]  |NY   |M     |
|[Jen, Mary, Brown]    |NY   |M     |
|[Mike, Mary, Williams]|OH   |M     |

Now, let’s select struct column as-is."name").show(truncate=False)

This returns struct column name as is.

|name                  |
|[James,, Smith]       |
|[Anna, Rose, ]        |
|[Julia, , Williams]   |
|[Maria, Anne, Jones]  |
|[Jen, Mary, Brown]    |
|[Mike, Mary, Williams]|

In order to get the specific column from a struct, you need to explicitly qualify."name.firstname","name.lastname").show(truncate=False)

This outputs firstname and lastname from the name struct column.

|James    |Smith   |
|Anna     |        |
|Julia    |Williams|
|Maria    |Jones   |
|Jen      |Brown   |
|Mike     |Williams|

In order to get all columns from struct column."name.*").show(truncate=False)

This yields below output.

|James    |null      |Smith   |
|Anna     |Rose      |        |
|Julia    |          |Williams|
|Maria    |Anne      |Jones   |
|Jen      |Mary      |Brown   |
|Mike     |Mary      |Williams|

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK) working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and managing Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passion about sharing my knowledge in Apache Spark, Hive, PySpark, R etc.

Leave a Reply

You are currently viewing PySpark Select Nested struct Columns