How to Convert PySpark Column to Python List?

To convert a PySpark column to a Python list, first select the column and then call collect() on the DataFrame. By default, collect() returns a list of Row objects rather than a list of plain values, so you need to either pre-transform the data with a map() transformation or post-process the collected rows to get a plain Python list.
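
The examples in this article use a DataFrame named df whose definition is not shown. The sketch below builds a comparable one and demonstrates the post-processing route. The firstname and state values follow the outputs shown in this article; the lastname and country columns, their values, and the helper names (build_sample_df, rows_to_values, demo) are assumptions made for illustration.

```python
# Sample rows: firstname and state follow the outputs shown in this article;
# lastname and country are assumed values for illustration only.
SAMPLE_DATA = [
    ("James", "Smith", "USA", "CA"),
    ("Michael", "Rose", "USA", "NY"),
    ("Robert", "Williams", "USA", "CA"),
    ("Maria", "Jones", "USA", "FL"),
]
SAMPLE_COLUMNS = ["firstname", "lastname", "country", "state"]


def build_sample_df(spark):
    """Create the sample DataFrame from an existing SparkSession."""
    return spark.createDataFrame(SAMPLE_DATA, SAMPLE_COLUMNS)


def rows_to_values(rows, position):
    """Post-process collected rows: keep the value at one position of each row."""
    return [row[position] for row in rows]


def demo():
    # PySpark is imported here so the helpers above can be read without it installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("column-to-list")
        .getOrCreate()
    )
    df = build_sample_df(spark)
    rows = df.select("state").collect()
    print(rows)                     # a list of Row objects, not plain strings
    print(rows_to_values(rows, 0))  # a plain Python list of the state values
    spark.stop()
```

Calling demo() needs a local PySpark installation; the two helper functions work on any sequence of tuple-like rows.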

1. Convert PySpark Column to List Using map()

Because DataFrame collect() returns Row objects, one way to get a plain Python list is to convert the DataFrame to an RDD, pick the column you want inside an rdd.map() lambda expression, and then collect the result. In the example below, I am extracting the 4th column (index 3) of the DataFrame into a Python list.

Note that a Python list is a built-in, ordered, mutable sequence (comparable to an array in other languages), and it is one of the most commonly used types in Python.

In the example below, map() is an RDD transformation that applies a function (here a lambda) to each row of the RDD and returns a new RDD containing the results.


# PySpark Column to List
states1=df.rdd.map(lambda x: x[3]).collect()
print(states1)

#['CA', 'NY', 'CA', 'FL']

1.1 Remove Duplicates After Converting to the List

The above code converts the column into a list; however, the list contains duplicate values. You can remove duplicates either before or after converting to a list. The example below removes duplicates from the Python list after converting, keeping the first occurrence of each value in its original order.


# Remove duplicates after converting to List
from collections import OrderedDict 
res = list(OrderedDict.fromkeys(states1)) 
print(res)

#['CA', 'NY', 'FL']
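
On Python 3.7 and later, a plain dict also preserves insertion order, so the same de-duplication can be written without OrderedDict. A minimal sketch of that variant follows; dedupe_keep_order is a hypothetical helper name.

```python
def dedupe_keep_order(values):
    """Return the values with duplicates removed, keeping first occurrences in order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(values))


states = ["CA", "NY", "CA", "FL"]
print(dedupe_keep_order(states))  # ['CA', 'NY', 'FL']
```

To remove duplicates before converting instead, you can call distinct() on the selected column prior to collecting, for example df.select("state").distinct().rdd.flatMap(lambda x: x).collect(). Spark does not guarantee the order of the distinct values, so sort the resulting list if order matters.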

2. Convert Specific Column to List

Another way to get a column as a Python list is to refer to the column by name instead of by index in the map() transformation. Once you have selected the column, use the collect() function to convert it to a list.


# Refer to the column you want to convert by name
states2=df.rdd.map(lambda x: x.state).collect()
print(states2)

#['CA', 'NY', 'CA', 'FL']
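
The same result can be obtained without going through the RDD API, by selecting the column and reading the field by name from each collected Row. This is a minimal sketch, assuming df has a column with the given name; column_to_list is a hypothetical helper name.

```python
def column_to_list(df, column_name):
    """Collect a single DataFrame column into a plain Python list."""
    # select() keeps only the requested column, so only that data reaches the driver
    rows = df.select(column_name).collect()
    # each Row supports lookup by field name, e.g. row["state"]
    return [row[column_name] for row in rows]
```

For example, column_to_list(df, "state") should produce the same list as the map() example above.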

3. Using flatMap() Transformation

You can also select a column with the DataFrame select() method, apply the flatMap() transformation, and then call collect() to convert the PySpark DataFrame column to a Python list. flatMap() is an RDD method, so you first need to convert the DataFrame to an RDD by using .rdd.


# Using flatMap() to get list
states4=df.select(df.state).rdd.flatMap(lambda x: x).collect()
print(states4)

#['CA', 'NY', 'CA', 'FL']
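
flatMap(lambda x: x) works here because each Row is iterable like a tuple: the lambda returns the row itself, and flatMap() concatenates the elements of every returned iterable into one sequence. The plain-Python analogy below uses tuples as stand-ins for Row objects and itertools.chain.from_iterable as a stand-in for flatMap(); it only illustrates the flattening behavior and is not Spark code.

```python
from itertools import chain

# Tuples stand in for the single-field Row objects produced by df.select(df.state)
single_column_rows = [("CA",), ("NY",), ("CA",), ("FL",)]

# Similar in spirit to rdd.flatMap(lambda x: x): flatten every row into one sequence
flattened = list(chain.from_iterable(single_column_rows))
print(flattened)  # ['CA', 'NY', 'CA', 'FL']

# With two fields per row, the values of both columns end up interleaved
two_column_rows = [("CA", "James"), ("NY", "Michael")]
interleaved = list(chain.from_iterable(two_column_rows))
print(interleaved)  # ['CA', 'James', 'NY', 'Michael']
```

The second case is why this approach selects a single column before calling flatMap().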

4. Convert Column to List Using Pandas

The example below converts the PySpark DataFrame to pandas, uses pandas to select the column you want, and finally uses the list() function to convert that column to a Python list. pandas is a popular open-source Python library that is widely used for data science, data analysis, and machine learning applications. Note that toPandas() brings the selected data to the driver, so pandas must be installed there and the data must fit in driver memory.

Once the PySpark DataFrame is converted to pandas, you can select the column you want as a pandas Series and then call list(series) to convert it to a list.


# Convert single column to list using toPandas()
states5=df.select(df.state).toPandas()['state']
states6=list(states5)
print(states6)
#['CA', 'NY', 'CA', 'FL']

5. Getting Column in Row Type

If you want to keep the values of a DataFrame column as Row objects, use the example below. It returns a list in which each element is a Row.


# Column in Row Type
states3=df.select(df.state).collect()
print(states3)
#[Row(state='CA'), Row(state='NY'), Row(state='CA'), Row(state='FL')]

6. Convert Multiple Columns to Python List

Finally, let's convert multiple PySpark columns to lists. To do this, I will use the pandas API again.


# Multiple columns to list using toPandas()
pandDF=df.select(df.state,df.firstname).toPandas()
print(list(pandDF['state']))
print(list(pandDF['firstname']))
#['CA', 'NY', 'CA', 'FL']
#['James', 'Michael', 'Robert', 'Maria']
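
If pandas is not available, multiple columns can also be turned into lists from a single collect(). The sketch below is one possible approach, assuming the named columns exist in df; columns_to_lists is a hypothetical helper name.

```python
def columns_to_lists(df, column_names):
    """Collect several columns once and return {column name: list of values}."""
    # a single collect() brings all requested columns to the driver together
    rows = df.select(*column_names).collect()
    # build one list per column by reading that field from every Row
    return {name: [row[name] for row in rows] for name in column_names}
```

For example, lists = columns_to_lists(df, ["state", "firstname"]) gives a dictionary where lists["state"] and lists["firstname"] hold the values of each column.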

7. Frequently Asked Questions

What is the difference between using collect() and rdd.map for column-to-list conversion?

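collect() is an action: it brings every selected row to the driver program as a list of Row objects, which can be memory-intensive if the column is large. rdd.map() is a transformation that runs on the worker nodes and only reshapes each row (for example, from a Row to a plain value); the collect() that follows it still brings the full result to the driver. In other words, rdd.map() changes the type of the elements in the resulting list, not the amount of data the driver has to hold.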

What should I consider when using toPandas() to convert a column to a list?

toPandas() collects the entire DataFrame it is called on into a pandas DataFrame on the driver, which can be memory-intensive for large datasets, so select only the columns you need before calling it. It’s suitable for smaller datasets or when you want to work with pandas data structures.

What are some best practices for efficiently converting columns to lists in PySpark?

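Reduce the data on the cluster before bringing it to the driver: select only the column you need, and filter, de-duplicate, aggregate, or limit it with DataFrame operations first. Whichever method you use, the resulting Python list lives in driver memory, so be mindful of its size.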
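
Two ways to keep driver memory in check are sketched below: capping the number of rows with limit() before collecting, and streaming values with toLocalIterator() instead of building the whole list at once. The helper names take_column_values and iter_column_values are hypothetical, and the sketch assumes df contains the named column.

```python
def take_column_values(df, column_name, max_rows):
    """Collect at most max_rows values of one column into a list."""
    # limit() is applied by Spark before the rows are sent to the driver
    rows = df.select(column_name).limit(max_rows).collect()
    return [row[column_name] for row in rows]


def iter_column_values(df, column_name):
    """Yield the values of one column without building the full list at once."""
    # toLocalIterator() returns an iterator over all rows of the DataFrame
    for row in df.select(column_name).toLocalIterator():
        yield row[column_name]
```

The iterator variant is useful when the values are processed one by one and a complete list is never actually needed.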

Conclusion

In this article, I have explained several ways to convert a PySpark column to a Python list. Once the column is converted to a list, the list can be used for various data modeling and analytical purposes. I have also explained what collect() returns by default and covered how to extract a column to a list by using map(), flatMap(), etc.

Happy Learning !!