Let’s see how to convert/extract a Spark DataFrame column as a List (Scala/Java collection). There are multiple ways to do this, and I will explain most of them with examples. Remember that when you use DataFrame collect() you get Array[Row], not List[String], hence you need a map() function to extract the first column from each row before converting it to a Scala/Java collection list. I will also cover how to extract the Spark DataFrame column as a list without duplicates.
Let’s Create a Spark DataFrame
val data = Seq(("James","Smith","USA","CA"),("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)
df.show()
//+---------+--------+-------+-----+
//|firstname|lastname|country|state|
//+---------+--------+-------+-----+
//| James| Smith| USA| CA|
//| Michael| Rose| USA| NY|
//| Robert|Williams| USA| CA|
//| Maria| Jones| USA| FL|
//+---------+--------+-------+-----+
From the above data, I will extract the state values as a List.
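To see why the map() step is needed, here is a quick sketch of what collect() alone returns (the variable names are just for illustration):
// collect() alone returns Array[Row], not a list of strings
val rows = df.select("state").collect()
// Each element is a Row; the value has to be extracted, e.g. with getString(0)
println(rows.head.getString(0))
// CA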
Example 1 – Spark Convert DataFrame Column to List
In order to convert a Spark DataFrame column to a List, first select() the column you want, next use the Spark map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String]. Among all the examples explained here, this is the best approach and performs well with both small and large datasets.
val listValues=df.select("state").map(f=>f.getString(0))
.collect.toList
println(listValues)
// List(CA, NY, CA, FL)
The above example extracts all values from the DataFrame column as a List, including duplicate values. If you want to remove the duplicates, use distinct.
println(listValues.distinct)
// List(CA, NY, FL)
The better option is to run distinct() on the Spark DataFrame before collecting it as a List or Array, since the deduplication then happens in parallel on the cluster rather than on the driver. If the column has many values, this performs better.
val dis=df.select("state").distinct().map(f=>f.getString(0))
.collect().toList
println(dis)
// List(CA, NY, FL)
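For reference, dropDuplicates() behaves the same as distinct() for this use case; a minimal sketch (disAlt is just an illustrative name):
// dropDuplicates() is equivalent to distinct() when applied to all selected columns
val disAlt=df.select("state").dropDuplicates().map(f=>f.getString(0))
  .collect().toList
println(disAlt)
// List(CA, NY, FL)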
Example 2 – Using Typed Dataset to Extract Column List
If you are using a Dataset, use the below approach; since we use the typed String encoder (brought in by import spark.implicits._), we don’t have to use the map() transformation.
// Using Typed Dataset to Extract Column List
val ex3=df.select("state").as[String]
.collect.toList
println(ex3)
// List(CA, NY, CA, FL)
Example 3 – Using RDD to Get Column List
In this example, I have used the underlying RDD to get the column list and used the RDD map() transformation to extract the column we want. Since row(0) is untyped, the RDD collect() action returns Array[Any]. This also performs well and is the preferred approach if you are working with RDDs or a PySpark DataFrame.
// Using RDD to Get Column List
val ex4=df.select("state").rdd.map(row => row(0))
.collect().toList
println(ex4.toString)
// List(CA, NY, CA, FL)
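Because row(0) is untyped, the list above is a List[Any]. If you want a typed List[String], a minimal variation is to call getString(0) on each Row instead (ex4Typed is just an illustrative name):
// getString(0) yields a typed RDD[String] instead of RDD[Any]
val ex4Typed=df.select("state").rdd.map(row => row.getString(0))
  .collect().toList
println(ex4Typed)
// List(CA, NY, CA, FL)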
Example 4 – Use collectAsList() to Get Column List
Spark also provides a collectAsList() action to collect a DataFrame column as a java.util.List. On a plain DataFrame it returns java.util.List[Row]; after the map() below, it returns java.util.List[String]. If you are using Spark from Java, this is the way to go.
// Use collectAsList() to Get Column List
val ex2=df.select("state").map(f=>f.getString(0))
.collectAsList
println(ex2)
// List(CA, NY, CA, FL)
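Note that calling collectAsList() directly on the DataFrame, without the map(), returns java.util.List[Row]. A minimal sketch of converting that back to a Scala list, assuming Scala 2.13’s scala.jdk.CollectionConverters (on Scala 2.12, use scala.collection.JavaConverters instead):
import scala.jdk.CollectionConverters._

// Without map(), collectAsList() yields java.util.List[Row]
val javaRows = df.select("state").collectAsList()
// Convert to a Scala List[String] on the driver
val scalaList = javaRows.asScala.map(_.getString(0)).toList
println(scalaList)
// List(CA, NY, CA, FL)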
Example 5 – Other Alternatives to Convert Column to List
This approach does not perform as well: it first collects the entire DataFrame to the driver and then extracts the first column from each row on the driver, without utilizing the Spark cluster.
// Other Alternatives to Convert Column to List
df.select("state").collect.map(f=>f.getString(0)).toList()
Conclusion
In this article, I have provided many examples of how to extract/convert a Spark DataFrame column as a list, with or without duplicates.
Happy Learning !!