  • Post category: PySpark
  • Post last modified: October 16, 2025

In PySpark, the posexplode() function is used to explode an array or map column into multiple rows, just like explode(), but with an additional positional index column. This index column represents the position of each element in the array (starting from 0), which is useful for tracking element order or performing position-based operations.


The posexplode() function is part of the pyspark.sql.functions module and is commonly used when working with arrays, maps, structs, or nested JSON data.

Key Points-

  • posexplode() creates a new row for each element of an array or key-value pair of a map.
  • It adds a position index column (pos) showing the element’s position within the array.
  • When used with arrays, it returns two columns: pos and col.
  • When used with maps, it returns pos, key, and value.
  • Rows with null or empty arrays are removed by default.
  • Use posexplode_outer() to retain rows even when arrays or maps are null or empty.
  • Ideal for flattening complex or nested data while retaining element order.

PySpark posexplode() Function

The PySpark posexplode() function generates a new row for each element in an array or map along with its position. By default, it assigns the column name pos to represent the element’s position and col for the element itself when used with arrays, or key and value when used with maps, unless you specify custom names.

Syntax

Following is the syntax of the posexplode() function.


# Syntax of the posexplode()
from pyspark.sql.functions import posexplode
posexplode(col)

Parameters

  • col: The column name or expression containing an array or map to be exploded.

Return Value

It returns multiple columns, with each row representing an array element or a map key-value pair together with its position.

Let’s start with a sample DataFrame containing arrays and maps.


# Create SparkSession and Prepare sample Data
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode, col

spark = SparkSession.builder.appName('pyspark-by-examples').getOrCreate()

arrayData = [
    ('James', ['Java', 'Scala'], {'hair': 'black', 'eye': 'brown'}),
    ('Michael', ['Spark', 'Java', None], {'hair': 'brown', 'eye': None}),
    ('Robert', ['CSharp', ''], {'hair': 'red', 'eye': ''}),
    ('Washington', None, None),
    ('Jefferson', ['1', '2'], {})
]

df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages', 'properties'])
df.printSchema()
df.show(truncate=False)

This yields the schema below.


# Output:
root
 |-- name: string (nullable = true)
 |-- knownLanguages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

PySpark posexplode() on Array Column

You can use the posexplode() function on an array column to generate new rows, each containing the element’s index position (pos) and its value (col) in separate columns.


# Posexplode on an array column
df_pos = df.select(df.name, posexplode(df.knownLanguages))
df_pos.show(truncate=False)

This yields the output below. Note that Washington's row is dropped because its array is null.


# Output:
+---------+---+------+
|name     |pos|col   |
+---------+---+------+
|James    |0  |Java  |
|James    |1  |Scala |
|Michael  |0  |Spark |
|Michael  |1  |Java  |
|Michael  |2  |NULL  |
|Robert   |0  |CSharp|
|Robert   |1  |      |
|Jefferson|0  |1     |
|Jefferson|1  |2     |
+---------+---+------+

PySpark posexplode() on Map Column

You can apply the posexplode() function to a map column in a DataFrame to transform each key-value pair into individual rows. By default, it generates three columns: pos (position), key, and value, unless custom aliases are provided.


# Posexplode map column
df_pos = df.select(df.name, posexplode(df.properties).alias("pos", "key", "value"))
df_pos.show(truncate=False)

This yields the output below.


# Output:
+-------+---+----+-----+
|name   |pos|key |value|
+-------+---+----+-----+
|James  |0  |eye |brown|
|James  |1  |hair|black|
|Michael|0  |eye |NULL |
|Michael|1  |hair|brown|
|Robert |0  |eye |     |
|Robert |1  |hair|red  |
+-------+---+----+-----+

Exploding a map this way preserves both the order (via pos) and the structure (key and value) of map-type data.

PySpark posexplode_outer()

When your dataset contains null or empty arrays, posexplode() skips those rows. To retain null rows, use posexplode_outer() instead.


# PySpark posexplode_outer() to retain null rows
from pyspark.sql.functions import posexplode_outer

df_outer = df.select(df.name, posexplode_outer(df.knownLanguages))
df_outer.show(truncate=False)

This yields the output below.


# Output:
+----------+----+------+
|name      |pos |col   |
+----------+----+------+
|James     |0   |Java  |
|James     |1   |Scala |
|Michael   |0   |Spark |
|Michael   |1   |Java  |
|Michael   |2   |NULL  |
|Robert    |0   |CSharp|
|Robert    |1   |      |
|Washington|NULL|NULL  |
|Jefferson |0   |1     |
|Jefferson |1   |2     |
+----------+----+------+

PySpark posexplode() JSON Column

You can also apply posexplode() after parsing a JSON column. Here's an example of applying posexplode() to a JSON array parsed with from_json().


# PySpark posexplode() JSON Column
from pyspark.sql.functions import from_json, schema_of_json

json_schema = schema_of_json('{"lang":["Python","Java"],"level":"Intermediate"}')

data_json = [("James", '{"lang":["Python","Java"],"level":"Intermediate"}')]
df_json = spark.createDataFrame(data_json, ["name", "json_data"])

df_parsed = df_json.withColumn("parsed", from_json(col("json_data"), json_schema))
df_exploded_json = df_parsed.select("name", posexplode(col("parsed.lang")).alias("pos", "language"))
df_exploded_json.show(truncate=False)

This yields the output below.


# Output:
+-----+---+--------+
|name |pos|language|
+-----+---+--------+
|James|0  |Python  |
|James|1  |Java    |
+-----+---+--------+

This gives both the element and its position inside the JSON array.

Compare explode() vs posexplode()

The table below highlights the key differences between explode() and posexplode() in PySpark:

  • explode(): Generates a new row for each element in an array or map, but does not include the element's position.
  • posexplode(): Same as explode(), but adds an additional column indicating the position (index) of each element in the array or map.

Frequently Asked Questions of PySpark posexplode()

What is the PySpark posexplode() function used for?

It’s used to flatten arrays or maps while retaining each element’s index position, which is not available in explode().

What’s the difference between explode() and posexplode()?

posexplode() includes an additional positional column (pos), while explode() only returns the value.

How do I handle null or empty arrays with posexplode()?

Use posexplode_outer() to retain null or empty rows.

How can I use posexplode() on multiple columns?

Spark allows only one generator function such as posexplode() per select(). To flatten multiple array columns together, zip them with arrays_zip() and apply a single posexplode(), or explode the columns in separate select() steps.

What are the default output column names?

For arrays: pos, col
For maps: pos, key, value
You can rename them using .alias().

Conclusion

In this article, you learned how to use PySpark posexplode() to flatten arrays and maps into multiple rows while retaining each element’s position index.

We also covered:

  • Using posexplode() on arrays and maps
  • posexplode_outer() function
  • Applying posexplode() on JSON data
  • Comparison with explode()

The posexplode() function is particularly valuable when the order of elements matters, such as in sequence-based data or position-dependent structures.

Happy Learning!!
