Convert PySpark DataFrame Column to Python List

By default, the PySpark DataFrame collect() action returns results as Row objects rather than as a plain list. To convert a DataFrame column to a Python list, you therefore either pre-transform the data with a map() transformation or post-process the collected rows. There are several ways to convert all the values of a DataFrame column to a Python list, and some perform better than others, so it is worth knowing them all.

A list is a data structure in Python that holds an ordered collection of items. List items are enclosed in square brackets, like this: [data1, data2, data3].

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('') \
                    .getOrCreate()

data = [("James","Smith","USA","CA"),("Michael","Rose","USA","NY"), \
    ("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL") \
  ]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data=data, schema=columns)
print(df.collect())

#Outputs below Row Type
#[Row(firstname='James', lastname='Smith', country='USA', state='CA'),
# Row(firstname='Michael', lastname='Rose', country='USA', state='NY'),
# Row(firstname='Robert', lastname='Williams', country='USA', state='CA'),
# Row(firstname='Maria', lastname='Jones', country='USA', state='FL')]

Note: The collect() action gathers all rows from all workers to the PySpark driver. If your data is too large to fit in driver memory, it fails with an out-of-memory error, so use collect() with care.

1. Convert DataFrame Column to Python List

As the output above shows, collect() returns Row objects. To convert a DataFrame column to a Python list, first extract the column you want with rdd.map() and a lambda expression, then collect the result. The example below extracts the 4th column (index 3) from the DataFrame into a Python list.

states1 = df.rdd.map(lambda x: x[3]).collect()
print(states1)
#['CA', 'NY', 'CA', 'FL']

1.1 Remove Duplicates from List

The code above converts the DataFrame column into a Python list, but the list contains duplicate values. You can remove duplicates either before or after converting to a list. The example below removes duplicates from the Python list after converting, keeping the first occurrence of each value.

#Remove duplicates after converting to List
from collections import OrderedDict 
res = list(OrderedDict.fromkeys(states1)) 
#['CA', 'NY', 'FL']

2. Referring to the Column by Name

Another way to get a DataFrame column as a Python list is to refer to the column by name on the Row object instead of by index.

states2 = df.rdd.map(lambda x: x.state).collect()
print(states2)
#['CA', 'NY', 'CA', 'FL']

3. Using flatMap() Transformation

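You can also get the list from a DataFrame by using the PySpark flatMap() transformation. Select the column first, because flatMap() flattens every value of each Row into the result.

states3 = df.select(df.state).rdd.flatMap(lambda x: x).collect()
print(states3)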
#['CA', 'NY', 'CA', 'FL']

4. Convert to Python List Using Pandas

The example below converts the PySpark DataFrame to pandas and uses pandas to get the column you want as a Python list.

states5 = df.select(df.state).toPandas()['state']
states6 = list(states5)
print(states6)
#['CA', 'NY', 'CA', 'FL']

5. Getting Column in Row Type

If you want to collect the DataFrame column as Row objects, use the example below.

states4 = df.select(df.state).collect()
print(states4)
#[Row(state='CA'), Row(state='NY'), Row(state='CA'), Row(state='FL')]

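6. Convert Multiple Columns to Python List

The example below selects two columns, converts them to pandas, and then extracts each column as a Python list.

pandDF = df.select(df.state,df.firstname).toPandas()
print(list(pandDF['state']))
print(list(pandDF['firstname']))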
#['CA', 'NY', 'CA', 'FL']
#['James', 'Michael', 'Robert', 'Maria']


In this article, I have explained several ways to get a Python list from a DataFrame column, with examples. I hope these are helpful.

Happy Learning !!

