To convert a PySpark column to a Python list, first select the column and then call collect() on the DataFrame. By default, collect() returns a list of Row objects rather than a plain list of values, so you either need to pre-transform the data with an RDD map() transformation or post-process the collected rows to get a Python list.
There are multiple ways to do the conversion, and some perform better than others, so it is worth knowing all of them. A list is a data structure in Python that holds an ordered collection of items. List items are enclosed in square brackets, like this: [data1, data2, data3]. A PySpark DataFrame, on the other hand, consists of columns that hold the data, and sometimes you need to convert one of these columns to a Python list.
# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local") \
    .appName('SparkByExamples.com') \
    .getOrCreate()

# Create DataFrame
data = [("James","Smith","USA","CA"),("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL")]
columns=["firstname","lastname","country","state"]
df=spark.createDataFrame(data=data,schema=columns)
print(df.collect())
Note: collect() is an action that brings all rows from all workers to the PySpark driver. If the data is large and does not fit in driver memory, the driver fails with an out-of-memory error, so use collect() with care.
1. Convert PySpark Column to List Using map()
As noted above, DataFrame collect() returns Row objects. To convert a PySpark column to a list, use an rdd.map() lambda expression to pick the column you want from each row, and then collect the result. In the example below, I extract the 4th column (index 3) from the DataFrame into a Python list.
Note that a Python list is an ordered, mutable sequence (implemented internally as a dynamic array) and is one of the most commonly used types in Python.
In the example below, map() is an RDD transformation that iterates over each row of the RDD and applies a function, here a lambda, to it.
# PySpark Column to List
# Pick the 4th field (index 3) of each Row, then collect to the driver
states1=df.rdd.map(lambda x: x[3]).collect()
print(states1)
#['CA', 'NY', 'CA', 'FL']
1.1 Remove Duplicates After Converting to the List
The above code converts the column into a list; however, the list contains duplicate values. You can remove duplicates either before or after converting to a list. The example below removes duplicates from the Python list after the conversion.
# Remove duplicates after converting to List
from collections import OrderedDict
res = list(OrderedDict.fromkeys(states1))
print(res)
#['CA', 'NY', 'FL']
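On Python 3.7 and later, a plain dict also preserves insertion order, so the same de-duplication can be sketched without the OrderedDict import. The snippet below is a minimal illustration on a hard-coded copy of the list (the literal values are an assumption standing in for the collected result). set() is shown as the option for when order does not matter.

```python
# Hard-coded stand-in for the list collected from the DataFrame.
states1 = ['CA', 'NY', 'CA', 'FL']

# dict keys are unique and keep first-seen order (Python 3.7+).
unique_ordered = list(dict.fromkeys(states1))
print(unique_ordered)

# A set also removes duplicates but does not guarantee any order,
# so sort it if a stable result is needed.
unique_sorted = sorted(set(states1))
print(unique_sorted)
```

The dict-based form keeps values in the order they first appeared, which usually matches what a reader expects from the original column.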
2. Convert Specific Column to List
Here is an alternative that gets a column as a Python list by referring to the column by name instead of by index in the map() transformation. Once the column is selected, use collect() to get the values as a list.
# Refer column by name you wanted to convert
states2=df.rdd.map(lambda x: x.state).collect()
print(states2)
#['CA', 'NY', 'CA', 'FL']
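Index-based and name-based access return the same values because a Row behaves like a tuple whose fields also have names. The sketch below illustrates this with collections.namedtuple as a stand-in for Row (an assumption made so the snippet runs without a Spark session). The two lambdas mirror index-based and name-based selection.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: a tuple with named fields (illustrative only).
Person = namedtuple("Person", ["firstname", "lastname", "country", "state"])

rows = [
    Person("James", "Smith", "USA", "CA"),
    Person("Michael", "Rose", "USA", "NY"),
    Person("Robert", "Williams", "USA", "CA"),
    Person("Maria", "Jones", "USA", "FL"),
]

# Select the 4th field by position ...
states_by_index = list(map(lambda x: x[3], rows))

# ... and the same field by name.
states_by_name = list(map(lambda x: x.state, rows))

print(states_by_index)
print(states_by_index == states_by_name)
```

Name-based access tends to be the safer choice, since it keeps working if the column order of the DataFrame changes.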
3. Using flatMap() Transformation
You can also select a column with the DataFrame select() function, apply a flatMap() transformation, and then call collect() to convert the PySpark DataFrame column to a Python list. flatMap() is an RDD function, so you need to convert the DataFrame to an RDD first by using .rdd.
# Using flatMap() to get list
states4=df.select(df.state).rdd.flatMap(lambda x: x).collect()
print(states4)
#['CA', 'NY', 'CA', 'FL']
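To see why flatMap() yields plain values here, it helps to remember that each Row is itself iterable, so flattening a one-field Row leaves just that field. The following is a local, pure-Python analogy using itertools.chain.from_iterable on tuples (stand-ins for Row objects, not Spark code). It also hints at why the column is selected first, since flattening full rows would emit every field.

```python
from itertools import chain

# Tuples standing in for the one-field rows of df.select(df.state).
single_column_rows = [("CA",), ("NY",), ("CA",), ("FL",)]

# Flattening one-field rows leaves only the column values.
states = list(chain.from_iterable(single_column_rows))
print(states)

# Without the select, every field of every row would be emitted.
full_rows = [("James", "Smith", "USA", "CA"), ("Michael", "Rose", "USA", "NY")]
all_fields = list(chain.from_iterable(full_rows))
print(all_fields)
```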
4. Convert Column to List Using Pandas
The example below converts the PySpark DataFrame to pandas, uses pandas to get the column you want, and finally uses the list() function to convert the column to a Python list. pandas is a popular open-source Python library that is widely used for data science, data analysis, and machine learning applications.
Once the PySpark DataFrame is converted to pandas, you can select the column you want as a pandas Series and call list(series) to convert it to a list.
# Convert single column to list using toPandas()
states5=df.select(df.state).toPandas()['state']
states6=list(states5)
print(states6)
#['CA', 'NY', 'CA', 'FL']
5. Getting Column in Row Type
If you want to collect the DataFrame column as Row objects, use the example below. It returns a list in which each element is a Row.
# Column in Row Type
states3=df.select(df.state).collect()
print(states3)
#[Row(state='CA'), Row(state='NY'), Row(state='CA'), Row(state='FL')]
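A list of Row objects like this can also be post-processed on the driver with a list comprehension, which is the post-processing route mentioned in the introduction. The sketch below uses a namedtuple as a stand-in for Row (an assumption so that it runs without Spark). With a real DataFrame, the same comprehension would iterate over the result of collect().

```python
from collections import namedtuple

# Stand-in for the single-field Row objects returned by collect().
StateRow = namedtuple("StateRow", ["state"])
collected = [StateRow("CA"), StateRow("NY"), StateRow("CA"), StateRow("FL")]

# Pull the field out of each row by name ...
states_by_name = [row.state for row in collected]

# ... or by position, since each row has exactly one field.
states_by_position = [row[0] for row in collected]

print(states_by_name)
```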
6. Convert Multiple Columns to Python List
Finally, let's convert multiple PySpark columns to lists. To do this, I will use the pandas API again.
# Multiple columns to list using toPandas()
pandDF=df.select(df.state,df.firstname).toPandas()
print(list(pandDF['state']))
print(list(pandDF['firstname']))
#['CA', 'NY', 'CA', 'FL']
#['James', 'Michael', 'Robert', 'Maria']
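If pandas is not available, the collected rows can be split into per-column lists on the driver, because each row is a tuple. This is an alternative to the approach above rather than the article's method. The sketch uses plain tuples as stand-ins for the rows of df.select(df.state, df.firstname).collect() and transposes them with zip().

```python
# Tuples standing in for collected (state, firstname) rows.
rows = [("CA", "James"), ("NY", "Michael"), ("CA", "Robert"), ("FL", "Maria")]

# zip(*rows) transposes rows into columns; each column comes back as a tuple,
# so convert each one to a list.
states, firstnames = (list(col) for col in zip(*rows))

print(states)
print(firstnames)
```

Note that the unpacking assumes at least one row and exactly two selected columns; an empty result would need to be handled separately.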
7. Frequently Asked Questions
What is the difference between collect() and rdd.map for column-to-list conversion?
collect() is an action that brings the selected data to the driver program as a list, which can be memory-intensive if the column is large.
rdd.map is a transformation that runs in a distributed manner on the worker nodes. It only shapes the data, so a final collect() is still needed to get a list on the driver.
Are there performance considerations when using toPandas() to convert a column to a list?
toPandas() converts the entire DataFrame to a Pandas DataFrame, which can be memory-intensive for large datasets. It’s suitable for smaller datasets or when you want to work with Pandas data structures.
Use distributed methods like rdd.map or SQL functions when dealing with large datasets. Be mindful of memory usage and data distribution when working with PySpark.
In this article, I have explained several ways to convert a PySpark column to a list. Once the column is converted to a list, the list can easily be used for various data modeling and analytical purposes. I have also explained what collect() returns by default and covered how to extract a column to a list by using map(), flatMap(), etc.
Happy Learning !!