How to Convert PySpark Column to List?


In order to convert a PySpark column to a Python list, you need to first select the column and then perform collect() on the DataFrame. By default, the PySpark DataFrame collect() action returns results as Row() objects rather than a list; hence, you either need to pre-transform with the map() transformation or post-process the result in order to convert the PySpark DataFrame column to a Python list.

There are multiple ways to convert a column to a list, and some approaches perform better than others, so it is worth knowing all of them. A list is a data structure in Python that holds a collection of items; list items are enclosed in square brackets, like this: [data1, data2, data3]. A DataFrame in PySpark, on the other hand, consists of columns that hold our data, and sometimes it is required to convert these columns to a Python list.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

# Create DataFrame
data = [("James","Smith","USA","CA"),("Michael","Rose","USA","NY"), \
    ("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL") \
  ]
columns=["firstname","lastname","country","state"]
df=spark.createDataFrame(data=data,schema=columns)
print(df.collect())

Note: The collect() action collects all rows from all workers to the PySpark driver; hence, if your data is huge and doesn't fit in driver memory, it throws an OutOfMemory error, so be careful when using collect().
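
If the result may be too large to collect at once, a memory-friendlier sketch (the variable name is illustrative) streams rows to the driver one partition at a time with toLocalIterator(); note that the final list of column values is still built in driver memory.

# Stream rows to the driver one partition at a time instead of all at once
states_iter = [row.state for row in df.toLocalIterator()]
print(states_iter)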


1. Convert PySpark Column to List

As you can see from the above output, DataFrame collect() returns Row objects; hence, in order to convert a PySpark column to a Python list, you first need to select the DataFrame column you want using an rdd.map() lambda expression and then collect the DataFrame. In the below example, I am extracting the 4th column (3rd index) from the DataFrame into a Python list.

Note that a list in Python is backed by an array; it is one of the most used types in Python.

In the example below, map() is an RDD transformation used to iterate over each row in the RDD and perform an operation or function using a lambda.


# PySpark Column to List
states1=df.rdd.map(lambda x: x[3]).collect()
print(states1)

#['CA', 'NY', 'CA', 'FL']

1.1 Remove Duplicates After Converting to a List

The above code converts the column into a list; however, it contains duplicate values. You can remove duplicates either before or after converting to a list. The below example removes duplicates from the Python list after converting; a sketch of the before-collecting approach follows it.


# Remove duplicates after converting to List
from collections import OrderedDict 
res = list(OrderedDict.fromkeys(states1)) 
print(res)

#['CA', 'NY', 'FL']
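
If you prefer to drop duplicates before collecting, a minimal sketch using the RDD distinct() transformation (note that distinct() does not guarantee the original order):

# Remove duplicates before collecting using distinct()
states_unique = df.rdd.map(lambda x: x[3]).distinct().collect()
print(states_unique)

# e.g. ['CA', 'NY', 'FL'] (order not guaranteed)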

2. Convert Specific Column to List

Here is another alternative for getting a column as a Python list, referring to the column by name instead of by index in the map() transformation. Once you have selected the column, use the collect() function to convert it to a list.


# Refer column by name you wanted to convert
states2=df.rdd.map(lambda x: x.state).collect()
print(states2)

#['CA', 'NY', 'CA', 'FL']
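
Row objects also support dictionary-style access by field name, so the same result can be obtained with bracket notation (the variable name states2b is illustrative):

# Same result using dictionary-style access on the Row object
states2b = df.rdd.map(lambda x: x["state"]).collect()
print(states2b)

#['CA', 'NY', 'CA', 'FL']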

3. Using flatMap() Transformation

You can also select a column using the select() function of the DataFrame, apply the flatMap() transformation, and then collect() to convert the PySpark DataFrame column to a Python list. Here flatMap() is a function of the RDD; hence, you need to convert the DataFrame to an RDD by using .rdd.


# Convert PySpark Column to List using flatMap()
states4=df.select(df.state).rdd.flatMap(lambda x: x).collect()
print(states4)

#['CA', 'NY', 'CA', 'FL']
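
Note that flatMap() flattens each Row into its individual fields, so if you select more than one column, the values interleave into a single flat list; a minimal sketch of that behavior (the variable name is illustrative):

# flatMap() over two columns interleaves values into one flat list
mixed = df.select(df.state, df.firstname).rdd.flatMap(lambda x: x).collect()
print(mixed)

#['CA', 'James', 'NY', 'Michael', 'CA', 'Robert', 'FL', 'Maria']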

4. Convert Column to List Using Pandas

The below example converts the PySpark DataFrame to pandas, uses pandas to get the column you want, and finally uses the list() function to convert that column to a Python list. pandas is one of the most popular open-source libraries in the Python programming language and is widely used for data science/data analysis and machine learning applications.

Once the PySpark DataFrame is converted to pandas, you can select the column you want as a pandas Series and finally call list(series) to convert it to a list.


# Convert to a Python list using pandas
states5=df.select(df.state).toPandas()['state']
states6=list(states5)
print(states6)
#['CA', 'NY', 'CA', 'FL']
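
Alternatively, the pandas Series itself provides a tolist() method that achieves the same result (the variable name states7 is illustrative):

# Same conversion using the pandas Series tolist() method
states7 = df.select(df.state).toPandas()['state'].tolist()
print(states7)
#['CA', 'NY', 'CA', 'FL']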

5. Getting Column in Row Type

In case you want to collect the DataFrame column as Row objects, use the below example; this returns each row of the DataFrame as a list of Row objects (each element in the list is of Row type).


# Collect the column as a list of Row objects
states3=df.select(df.state).collect()
print(states3)
#[Row(state='CA'), Row(state='NY'), Row(state='CA'), Row(state='FL')]
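
If you then need the plain values rather than Row objects, a minimal sketch extracts them with a list comprehension (the variable name is illustrative):

# Extract plain values from the list of Row objects
states_plain = [row.state for row in states3]
print(states_plain)
#['CA', 'NY', 'CA', 'FL']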

6. Convert Multiple Columns to Python List

Finally, let's convert multiple PySpark columns to lists. In order to do this, I will again use the pandas API.


# Convert multiple columns using pandas
pandDF=df.select(df.state,df.firstname).toPandas()
print(list(pandDF['state']))
print(list(pandDF['firstname']))
#['CA', 'NY', 'CA', 'FL']
#['James', 'Michael', 'Robert', 'Maria']
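
As an alternative that avoids the pandas dependency, a minimal sketch collects the rows and unpacks the fields with list comprehensions (variable names are illustrative):

# Alternative without pandas: collect Row objects and unpack the fields
rows = df.select(df.state, df.firstname).collect()
states = [row.state for row in rows]
firstnames = [row.firstname for row in rows]
print(states)
print(firstnames)
#['CA', 'NY', 'CA', 'FL']
#['James', 'Michael', 'Robert', 'Maria']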

Conclusion

In this article, I have explained several ways to convert a PySpark column to a Python list. Once you convert the column to a list, it can easily be used for various data modeling and analytical purposes. I have also explained what collect() returns by default and covered how to extract a column to a list by using map(), flatMap(), etc.

Happy Learning !!
