A Spark DataFrame can be created from various sources, for example from a Scala List of Iterable objects. Creating a DataFrame from a Scala List of Iterables in Apache Spark is a convenient way to test Spark features on a small dataset in your development environment before performing complex data transformations on large datasets in a distributed environment.
1. Spark Scala List of Iterables
In Scala, you can create a List of Iterables by using the List constructor and passing in one or more Iterables as arguments. Here’s an example:
// Create List of iterable
val list: List[Iterable[Int]] = List(Seq(1, 2, 3), List(4, 5, 6), Vector(7, 8, 9))
In this example, we’re creating a List of Iterables that contains three elements: a Seq of integers (1, 2, 3), a List of integers (4, 5, 6), and a Vector of integers (7, 8, 9). Note that each element of the List is of type Iterable[Int].
Once you have a List of Iterables, you can access each Iterable in the list using the apply method (which can also be written using the shortcut notation ()), like this:
// Access elements
val firstIterable: Iterable[Int] = list(0)
val secondIterable: Iterable[Int] = list(1)
val thirdIterable: Iterable[Int] = list(2)
The output of the above snippet looks like this:
// Output:
firstIterable: Iterable[Int] = List(1, 2, 3)
secondIterable: Iterable[Int] = List(4, 5, 6)
thirdIterable: Iterable[Int] = Vector(7, 8, 9)
You can then operate on each Iterable as you normally would, using methods like foreach, map, filter, etc.
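As a quick sketch (plain Scala, no Spark needed), here are a few of these methods applied to the list defined above:

```scala
// The List of Iterables from the example above
val list: List[Iterable[Int]] = List(Seq(1, 2, 3), List(4, 5, 6), Vector(7, 8, 9))

// map: double every element of the first Iterable
val doubled = list(0).map(_ * 2)        // 2, 4, 6

// filter: keep the even numbers from the second Iterable
val evens = list(1).filter(_ % 2 == 0)  // 4, 6

// flatten all three Iterables into one collection and sum the elements
val total = list.flatten.sum            // 45
```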
2. Create DataFrame from Scala List of Iterables
Creating a DataFrame from Scala’s List of Iterables in Apache Spark is a convenient way to test Spark features with a small dataset during development.
2.1. Using the toDF() method on an RDD
You can convert the List of Iterables to a Spark RDD, call the map() function to convert each inner List to a tuple, and then call the toDF() method on the resulting RDD to create the DataFrame.
// Import
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark = SparkSession.builder()
  .appName("Create DataFrame from List of List of Iterables")
  .master("local")
  .getOrCreate()

// Required for toDF()
import spark.implicits._

// Create list
val list = List(
  List(Seq(1, "John"), Seq("[email protected]")),
  List(Seq(2, "Jane"), Seq("[email protected]")),
  List(Seq(3, "Bob"), Seq("[email protected]"))
)

// Create RDD
val rdd = spark.sparkContext.parallelize(list)

// Create DataFrame; the inner Seqs are Seq[Any], which has no Spark encoder,
// so convert each element to String first
val df = rdd.map(row => (row(0).map(_.toString), row(1).map(_.toString)))
  .toDF("person", "email")
df.show()
In this example,
- We create an RDD from the List of Lists and call the map method to extract the first and second Seqs from each inner List.
- We use tuple notation to create a pair of values for each row, where the first value holds the person’s ID and name, and the second holds their email address.
- Finally, we call the toDF method on the RDD, passing the column names as arguments, to create a DataFrame. The resulting DataFrame has two columns: “person” and “email”.
The DataFrame created from the above code looks like this:
// Output:
+---------+-----------------+
|person   |email            |
+---------+-----------------+
|[1, John]|[[email protected]]|
|[2, Jane]|[[email protected]]|
|[3, Bob] |[[email protected]]|
+---------+-----------------+
Note that the toDF() method has named the columns based on the names provided as arguments. If you don’t provide any column names, Spark names the columns _1, _2, and so on.
2.2. Using the toDF() method on a Seq of Seqs
To create a DataFrame from Scala’s List of Iterables, you can call the toDF() method on the collection directly, after converting each row into a shape Spark can encode, for example a tuple of Seqs.
// Create List
val list = List(
  List(Seq(1, "John"), Seq("[email protected]")),
  List(Seq(2, "Jane"), Seq("[email protected]")),
  List(Seq(3, "Bob"), Seq("[email protected]"))
)

// Convert each row to a tuple of Seq[String]
// (the inner Seqs are Seq[Any], which has no Spark encoder)
val seq = list.map(row => (row(0).map(_.toString), row(1).map(_.toString)))

// Create DataFrame (requires import spark.implicits._)
import spark.implicits._
val df = seq.toDF("person", "email")
df.show()
In this example,
- We define the List of Lists of Iterables that represents the data for the DataFrame.
- We convert each inner List into a row shape that Spark can encode, using the map function.
- We call the toDF method on the resulting Seq and pass the column names as arguments to create the DataFrame.
The DataFrame created from the above code looks like this:
// Output:
+---------+-----------------+
|person   |email            |
+---------+-----------------+
|[1, John]|[[email protected]]|
|[2, Jane]|[[email protected]]|
|[3, Bob] |[[email protected]]|
+---------+-----------------+
2.3. Using case classes and the createDataFrame() method
You can define case classes that represent the schema of the DataFrame, and then call the createDataFrame() method after converting the List of Iterables to a Seq of case class instances.
// Case classes for the Person and Email attributes
case class Person(id: Int, name: String)
case class Email(address: String)

// Create List of Iterables
val list = List(
  List(Person(1, "John"), Email("[email protected]")),
  List(Person(2, "Jane"), Email("[email protected]")),
  List(Person(3, "Bob"), Email("[email protected]"))
)

// Convert each inner List to a (Person, Email) pair,
// casting each element to its type with asInstanceOf
val data = list.map(row => (row(0).asInstanceOf[Person], row(1).asInstanceOf[Email]))

// Call createDataFrame on the resulting Seq and name the columns
val df = spark.createDataFrame(data).toDF("person", "email")
df.show()
In this example,
- We define the case classes Person and Email that represent the structure of the DataFrame.
- We define a List of Lists called list that represents the data for the DataFrame.
- We use the map function to convert each inner List to a (Person, Email) pair, casting each element to its appropriate type using asInstanceOf.
- Finally, we call the createDataFrame method on the resulting Seq of case class instances to create the DataFrame.
The DataFrame created from the above code looks like this:
// Output:
+---------+-----------------+
|person   |email            |
+---------+-----------------+
|[1, John]|[[email protected]]|
|[2, Jane]|[[email protected]]|
|[3, Bob] |[[email protected]]|
+---------+-----------------+
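The asInstanceOf casting step can be tried out at the plain-Scala level, without Spark. Here is a minimal sketch reusing the same case classes from the example above:

```scala
case class Person(id: Int, name: String)
case class Email(address: String)

// When case classes of different types share one List, the element type degrades to Any
val rows: List[List[Any]] = List(
  List(Person(1, "John"), Email("[email protected]")),
  List(Person(2, "Jane"), Email("[email protected]"))
)

// asInstanceOf recovers the concrete types, so the fields become accessible again
val pairs: List[(Person, Email)] =
  rows.map(row => (row(0).asInstanceOf[Person], row(1).asInstanceOf[Email]))

val names = pairs.map(_._1.name)  // John, Jane
```

Note that asInstanceOf fails at runtime with a ClassCastException if an element is not of the expected type, so this approach relies on the rows being built consistently.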
3. Conclusion
In conclusion, there are several ways to create a DataFrame from Scala’s List of Iterables in Spark:
- Using the toDF() method on a Seq of Seqs: convert the List of Iterables to a Seq of encodable rows and call the toDF() method on it, passing the column names as arguments.
- Using the toDF() method on an RDD: convert the List of Iterables to an RDD and call the toDF() method on it, passing the column names as arguments.
- Using case classes and the createDataFrame() method: define case classes that represent the structure of the DataFrame, convert the List of Iterables to a Seq of case class instances, and call the createDataFrame() method on it.
Each approach has its advantages and disadvantages, and the choice of which one to use depends on the specific use case and the preference of the developer.