PySpark Select Nested struct Columns

Using the PySpark select() and selectExpr() transformations, one can select the nested struct columns from the DataFrame. When working with semi-structured files like JSON or structured files like Avro, Parquet, or ORC, we often have to deal with complex nested structures.

Advertisements

When you read these files into DataFrame, all nested structure elements are converted into struct type StructType. Read Understand PySpark StructType for a better understanding of StructType.

Nested columns in PySpark refer to columns that contain complex data types such as StructType, ArrayType, MapType, or combinations thereof. These complex data types allow you to represent structured or nested data within a single DataFrame column.

When dealing with a struct (StructType) column in a PySpark DataFrame, you should use a specific column qualifier to access the nested struct columns.

Related Articles: Flatten Nested Struct Columns

To explain this scenario with an example, create a DataFrame with a nested column.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType        

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Data
data = [
        (("James",None,"Smith"),"OH","M"),
        (("Anna","Rose",""),"NY","F"),
        (("Julia","","Williams"),"OH","F"),
        (("Maria","Anne","Jones"),"NY","M"),
        (("Jen","Mary","Brown"),"NY","M"),
        (("Mike","Mary","Williams"),"OH","M")
        ]

# Schema
schema = StructType([
    StructField('name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True)
         ])),
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
     ])

# Create DataFrame
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # shows all columns

In the above example, the column name is a struct type, which is a nested column that has child columns firstname, middlename, and lastname.


# Output:
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

+----------------------+-----+------+
|name                  |state|gender|
+----------------------+-----+------+
|[James,, Smith]       |OH   |M     |
|[Anna, Rose, ]        |NY   |F     |
|[Julia, , Williams]   |OH   |F     |
|[Maria, Anne, Jones]  |NY   |M     |
|[Jen, Mary, Brown]    |NY   |M     |
|[Mike, Mary, Williams]|OH   |M     |
+----------------------+-----+------+

Now, using the select(), select the struct column and see what it returns.


# Select struct type
df2.select("name").show()

This returns the nested column name as is. PySpark may truncate long nested columns in the output. To display nested columns without truncation, you can pass the truncate=False parameter to the show() method.


# Output:
+----------------------+
|name                  |
+----------------------+
|[James,, Smith]       |
|[Anna, Rose, ]        |
|[Julia, , Williams]   |
|[Maria, Anne, Jones]  |
|[Jen, Mary, Brown]    |
|[Mike, Mary, Williams]|
+----------------------+

To select nested columns in PySpark DataFrame, you can use the dot notation or the select() method with the appropriate column qualifier.


# Select columns from struct type
df2.select("name.firstname","name.lastname").show(truncate=False)

This outputs the columns firstname and lastname from the struct column.


# Output:
+---------+--------+
|firstname|lastname|
+---------+--------+
|James    |Smith   |
|Anna     |        |
|Julia    |Williams|
|Maria    |Jones   |
|Jen      |Brown   |
|Mike     |Williams|
+---------+--------+

To extract all columns from a struct column from a DataFrame, you can use the select() method along with the * operator.


# Extract all columns from struct type
df2.select("name.*").show(truncate=False)

This results in the below output.


# Output:
+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|James    |null      |Smith   |
|Anna     |Rose      |        |
|Julia    |          |Williams|
|Maria    |Anne      |Jones   |
|Jen      |Mary      |Brown   |
|Mike     |Mary      |Williams|
+---------+----------+--------+

Happy Learning !!