Spark – Extract DataFrame Column as List

Let’s see how to convert/extract a Spark DataFrame column as a List (Scala/Java Collection). There are multiple ways to do this, and I will explain most of them with examples. Remember that when you use DataFrame collect() you get an Array[Row], not a List[String], so you need to use a map() transformation to extract the first column from each Row before converting it to a Scala/Java Collection list.

I will also cover how to extract a Spark DataFrame column as a list without duplicates.

Let’s Create a Spark DataFrame


val data = Seq(("James","Smith","USA","CA"),("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL")
  )
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)
df.show()
//+---------+--------+-------+-----+
//|firstname|lastname|country|state|
//+---------+--------+-------+-----+
//|    James|   Smith|    USA|   CA|
//|  Michael|    Rose|    USA|   NY|
//|   Robert|Williams|    USA|   CA|
//|    Maria|   Jones|    USA|   FL|
//+---------+--------+-------+-----+

From the above data, I will extract the state column values as a List.
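
To see why the map() transformation is needed in the examples below, here is a minimal sketch (using the df created above) showing that collect() on the selected column returns an Array[Row] rather than an Array[String]:


// collect() without map() returns Array[Row], not Array[String]
val rows = df.select("state").collect()
println(rows.mkString(", "))
// [CA], [NY], [CA], [FL]  -- each element is a Row, not a String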

Example 1 – Convert Spark DataFrame Column to List

In order to convert a Spark DataFrame column to a List, first select() the column you want, next use the Spark map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].

Among all the examples explained here, this is the best approach and it performs well with both small and large datasets.


val listValues=df.select("state").map(f=>f.getString(0))
                 .collect.toList
println(listValues)
// List(CA, NY, CA, FL)

The above example extracts all values from a DataFrame column as a List, including duplicate values. If you want to remove the duplicates, use distinct.


println(listValues.distinct)
// List(CA, NY, FL)

A better option is to run distinct() on the Spark DataFrame before collecting it as a List or Array. If the column has many values, this performs better because the duplicates are removed on the executors rather than on the driver.


val dis=df.select("state").distinct().map(f=>f.getString(0))
          .collect().toList
println(dis)
// List(CA, NY, FL)

Example 2 – Using Typed Dataset to Extract Column List

If you are using a Dataset, use the below approach. Since we are using the typed String encoder, we don’t have to use the map() transformation.


// Using Typed Dataset to Extract Column List
val ex3=df.select("state").as[String]
          .collect.toList
println(ex3)
// List(CA, NY, CA, FL)
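
As a side note, the same select() can also be written with the $ column interpolator or with the col() function. This is a small sketch of equivalent variants, assuming spark.implicits._ is already imported as shown earlier:


// Equivalent ways to reference the column for the typed extraction
import org.apache.spark.sql.functions.col
val viaDollar = df.select($"state").as[String].collect.toList
val viaCol    = df.select(col("state")).as[String].collect.toList
println(viaDollar)
// List(CA, NY, CA, FL)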

Example 3 – Using RDD to Get Column List

In this example, I use an RDD to get the column as a List, with the RDD map() transformation to extract the column we want. The RDD collect() action returns an Array[Any] because row(0) is untyped. This performs well and is the preferred approach if you are working with RDDs or a PySpark DataFrame.


// Using RDD to Get Column List
val ex4=df.select("state").rdd.map(row => row(0))
          .collect().toList
println(ex4.toString)
// List(CA, NY, CA, FL)
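
Note that the list above is a List[Any] because row(0) is untyped. If you want a typed List[String] from the RDD, a small variation of the same example (only the accessor changes) looks like this:


// row.getString(0) keeps the element type as String instead of Any
val ex4Typed = df.select("state").rdd
                 .map(row => row.getString(0))
                 .collect().toList
println(ex4Typed)
// List(CA, NY, CA, FL)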

Example 4 – Use collectAsList() to Get Column List

Spark also provides the collectAsList() action, which returns a java.util.List instead of an Array. On a plain DataFrame it returns java.util.List[Row]; here, because we map each Row to a String first, it returns java.util.List[String]. If you are using Java, this is the way to go.


// Use collectAsList() to Get Column List
val ex2=df.select("state").map(f=>f.getString(0))
          .collectAsList
println(ex2)
// List(CA, NY, CA, FL)
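
Since collectAsList() returns a java.util.List, you may want to convert it to a Scala List on the driver. A minimal sketch, assuming Scala 2.13 (on older Scala versions use scala.collection.JavaConverters instead):


// Convert the java.util.List returned by collectAsList() to a Scala List
import scala.jdk.CollectionConverters._
val scalaList = ex2.asScala.toList
println(scalaList)
// List(CA, NY, CA, FL)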

Example 5 – Other Alternatives to Convert Column to List

This approach does not perform as well as the previous ones. Here, we first collect the entire DataFrame to the driver and then extract the first column from each Row on the driver, without utilizing the Spark cluster.


// Other Alternatives to Convert Column to List
df.select("state").collect.map(f=>f.getString(0)).toList()

Conclusion

In this article, I have provided many examples of how to extract/convert a Spark DataFrame column as a list, with or without duplicates.

Happy Learning !!

