A Spark DataFrame can be created from various sources, for example from a Scala List of Iterable objects. Creating a DataFrame from a Scala List of Iterables is a convenient way to test Spark features in your development environment before working with large datasets and performing complex data transformations in a distributed environment.

1. Spark Scala List of Iterables

In Scala, you can create a List of Iterables by using the List constructor and passing in one or more Iterables as arguments. Here’s an example:


// Create List of iterable
val list: List[Iterable[Int]] = List(Seq(1, 2, 3), List(4, 5, 6), Vector(7, 8, 9))

In this example, we’re creating a List of Iterables that contains three elements: a Seq of integers (1, 2, 3), a List of integers (4, 5, 6), and a Vector of integers (7, 8, 9). Note that each element of the List is of type Iterable[Int].

Once you have a List of Iterables, you can access each Iterable in the list using the apply method (which can also be written using the shortcut notation ()), like this:


// Access elements
val firstIterable: Iterable[Int] = list(0)
val secondIterable: Iterable[Int] = list(1)
val thirdIterable: Iterable[Int] = list(2)

In the Scala REPL, the statements above produce the following output:


// Output:
firstIterable: Iterable[Int] = List(1, 2, 3)
secondIterable: Iterable[Int] = List(4, 5, 6)
thirdIterable: Iterable[Int] = Vector(7, 8, 9)

You can then operate on each Iterable as you normally would, using methods like foreach, map, filter, etc.
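As a small illustration of the point above, the following sketch applies foreach, map, and filter to the same list. It is meant to be pasted into the Scala REPL or spark-shell; the names doubled, evens, and sizes are only illustrative and do not appear elsewhere in this article.

```scala
// Same list as in the example above
val list: List[Iterable[Int]] = List(Seq(1, 2, 3), List(4, 5, 6), Vector(7, 8, 9))

// foreach: print the elements of each Iterable on its own line
list.foreach(iterable => println(iterable.mkString(", ")))

// map: double every element of the first Iterable
val doubled: Iterable[Int] = list(0).map(_ * 2)

// filter: keep only the even numbers of the second Iterable
val evens: Iterable[Int] = list(1).filter(_ % 2 == 0)

// map over the outer List: number of elements in each Iterable
val sizes: List[Int] = list.map(_.size)
```

Because every element is typed as Iterable[Int], the same calls work regardless of whether the underlying collection is a Seq, a List, or a Vector.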

2. Create DataFrame from Scala List of Iterables

Creating a DataFrame from a Scala List of Iterables lets you try out Spark features on a small, hand-written dataset during development. The sections below show three ways to do it.

2.1. Using toDF() method on an RDD:

You can convert the List of Iterables to a Spark RDD, call the map() function to convert each inner List to a tuple, and then call the toDF() method on the resulting RDD to create the DataFrame.


// Import
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark = SparkSession.builder().appName("Create DataFrame from List of List of Iterables").master("local").getOrCreate()

// Implicit conversions that add toDF() to RDDs and local Seqs
import spark.implicits._

// Create list
val list = List(
  List(Seq(1, "John"), Seq("[email protected]")),
  List(Seq(2, "Jane"), Seq("[email protected]")),
  List(Seq(3, "Bob"), Seq("[email protected]"))
)

// Create RDD
val rdd = spark.sparkContext.parallelize(list)

// Create DataFrame
// Seq(1, "John") is a Seq[Any], for which Spark has no encoder,
// so every value is converted to String before calling toDF()
val df = rdd.map(row => (row(0).map(_.toString), row(1).map(_.toString))).toDF("person", "email")

df.show()

In this example,

  • We’re creating an RDD from the List of Lists, and calling the map method to extract the first and second Sequences from each inner List.
  • We’re also using tuple notation to create a pair of values for each row, where the first value is the person’s ID and name, and the second value is their email address.
  • Finally, we’re calling the toDF method on the RDD to create a DataFrame, and pass column names as arguments. The resulting DataFrame will have two columns: “person” and “email”.

The dataframe created from the above code looks like this:


// Output:
+---------+-------------------+
|   person|              email|
+---------+-------------------+
|[1, John]|[[email protected]]|
|[2, Jane]|[[email protected]]|
| [3, Bob]|[[email protected]]|
+---------+-------------------+

Note that the toDF() method names the columns using the arguments you pass to it, in order. If you don’t provide any column names, Spark names the columns of a tuple-based DataFrame _1, _2, and so on.
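The difference between the two naming behaviours can be checked with a short sketch like the one below. It assumes a local Spark installation and is intended for spark-shell or a Scala script; the people data and the variable names unnamed and named are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; the app name is arbitrary
val spark = SparkSession.builder().appName("toDF column names").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: each tuple becomes one row
val people = Seq((1, "John"), (2, "Jane"), (3, "Bob"))

// Without arguments, toDF() falls back to the tuple field names _1 and _2
val unnamed = people.toDF()

// With arguments, the names are applied to the columns positionally
val named = people.toDF("id", "name")

println(unnamed.columns.mkString(", "))
println(named.columns.mkString(", "))
```

The number of names passed to toDF() has to match the number of columns, otherwise Spark rejects the call.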

2.2. Using toDF() method on a Seq of Seqs:

To create a DataFrame from a Scala List of Iterables without going through an RDD, you can convert each row to a tuple of Seqs and call the toDF() method on the resulting Seq directly. Spark needs an encoder for every column, so the mixed Int and String values are converted to String first.


// Implicit conversions that add toDF() to local Seqs
// (reuses the SparkSession named spark created in section 2.1)
import spark.implicits._

// Create List
val list = List(
  List(Seq(1, "John"), Seq("[email protected]")),
  List(Seq(2, "Jane"), Seq("[email protected]")),
  List(Seq(3, "Bob"), Seq("[email protected]"))
)

// Convert each row to a tuple of Seq[String], one tuple field per column
val seq = list.map(row => (row(0).map(_.toString), row(1).map(_.toString)))

// Create DataFrame
val df = seq.toDF("person", "email")

df.show()

In this example,

  1. We define the List of List of Iterables that represents the data for the DataFrame.
  2. We convert each inner List to a tuple of Seqs using the map function, turning every value into a String with toString.
  3. We call the toDF method on the resulting Seq and pass the column names as arguments to create the DataFrame.

The DataFrame created from the above code looks like this:


// Output:
+---------+-------------------+
|   person|              email|
+---------+-------------------+
|[1, John]|[[email protected]]|
|[2, Jane]|[[email protected]]|
| [3, Bob]|[[email protected]]|
+---------+-------------------+

2.3. Using case classes and createDataFrame() method

You can define case classes that describe the data, convert each row of the List of Iterables to a tuple, and then call the createDataFrame() method on the resulting Seq.


// Case classes for the Person and Email attributes
case class Person(id: Int, name: String)
case class Email(address: String)

// Create a List of Iterables holding case class instances
val list = List(
  List(Person(1, "John"), Email("[email protected]")),
  List(Person(2, "Jane"), Email("[email protected]")),
  List(Person(3, "Bob"), Email("[email protected]"))
)

// Convert every case class instance to a Seq of String values
// (id is converted with toString so that each Seq has a single element type)
val seq = list.map(row => row.map {
  case Person(id, name) => Seq(id.toString, name)
  case Email(address) => Seq(address)
})

// Call createDataFrame on the List of tuples, then name the columns with toDF
// (reuses the SparkSession named spark created in section 2.1)
val df = spark.createDataFrame(seq.map(row => (row(0), row(1)))).toDF("person", "email")

df.show()

In this example,

  1. We define the case classes Person and Email that describe the values held in each row.
  2. We then define a List of Lists of case class instances called list that represents the data for the DataFrame.
  3. We use the map function with pattern matching to convert each case class instance to a Seq of its values.
  4. Finally, we convert each row to a tuple, call the createDataFrame method on the resulting List, and name the columns person and email.

The dataframe created from the above code looks like this:


// Output:
+---------+-------------------+
|   person|              email|
+---------+-------------------+
|[1, John]|[[email protected]]|
|[2, Jane]|[[email protected]]|
| [3, Bob]|[[email protected]]|
+---------+-------------------+
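A case class can also describe a whole row, in which case createDataFrame() takes the column names and types from its fields. The sketch below is one possible variation, not the approach used above: the Contact case class, its flattened id, name, and email fields, and the example.com addresses are all hypothetical placeholders. It assumes spark-shell or a Scala script; in a compiled application the case class would need to be defined outside the method that uses it.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; the app name is arbitrary
val spark = SparkSession.builder().appName("createDataFrame with a row case class").master("local[*]").getOrCreate()

// Hypothetical row type: one field per DataFrame column
case class Contact(id: Int, name: String, email: String)

// Illustrative data; the addresses are placeholders
val contacts = Seq(
  Contact(1, "John", "john@example.com"),
  Contact(2, "Jane", "jane@example.com"),
  Contact(3, "Bob", "bob@example.com")
)

// createDataFrame derives column names and types from the case class fields
val contactsDf = spark.createDataFrame(contacts)

contactsDf.printSchema()
contactsDf.show()
```

With this layout the id column keeps its integer type instead of being converted to a String, at the cost of fixing the row structure up front.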

3. Conclusion

In conclusion, there are several ways to create a DataFrame from Scala’s List of Iterables in Spark:

  1. Using the toDF() method on a Seq of Seqs: Convert each row of the List of Iterables to a tuple of Seqs and call the toDF() method on the resulting Seq, passing the column names as arguments.
  2. Using the toDF() method on an RDD: Convert the List of Iterables to an RDD, map each row to a tuple, and call the toDF() method on it, passing the column names as arguments.
  3. Using case classes and the createDataFrame() method: Define case classes that describe the data, convert each row of the List of Iterables to a tuple, and call the createDataFrame() method on the result.

Each approach has its advantages and disadvantages, and the choice of which one to use depends on the specific use case and the preference of the developer.

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.