Create a Spark RDD using Parallelize

Let’s see how to create a Spark RDD with the sparkContext.parallelize() method, using both the Spark shell and a Scala example.

Before we start, let me explain what an RDD is. A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
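
To make "immutable" and "logical partitions" concrete, here is a minimal sketch. The object name RDDBasics, the local[2] master, and the app name are illustrative assumptions, not part of the original example. It asks for two partitions explicitly, uses glom() to show which elements land in which partition, and shows that map() returns a new RDD rather than modifying the existing one.

```scala
import org.apache.spark.sql.SparkSession

object RDDBasics {
  def main(args: Array[String]): Unit = {
    // Illustrative settings (assumptions): two local threads and a made-up app name
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("RDDBasics")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 6, 2) // request 2 partitions explicitly
    val doubled = numbers.map(_ * 2)        // returns a new RDD; `numbers` is unchanged

    // glom() groups the elements of each partition into one array
    numbers.glom().collect().foreach(p => println(p.mkString("[", ",", "]")))
    // prints [1,2,3] then [4,5,6]

    println(numbers.collect().mkString(",")) // prints 1,2,3,4,5,6
    println(doubled.collect().mkString(",")) // prints 2,4,6,8,10,12

    spark.stop()
  }
}
```

On a real cluster each partition could be processed on a different node; in local mode the partitions are simply processed by local threads.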

  • Parallelizing an existing collection in your driver program

Below is an example of how to create an RDD using the parallelize method from SparkContext. sparkContext.parallelize(Array(1,2,3,4,5,6,7,8,9,10)) creates an RDD from an Array of integers.

Using sc.parallelize on Spark Shell or REPL

The Spark shell provides a SparkContext variable named “sc”; use sc.parallelize() to create an RDD.

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at

Using Spark sparkContext.parallelize in Scala

If you are using Scala, get the SparkContext object from SparkSession and use sparkContext.parallelize() to create an RDD. This function has another signature that additionally takes an integer argument specifying the number of partitions. Partitions are the basic units of parallelism in Apache Spark, and an RDD is a collection of partitions.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDParallelize {

  def main(args: Array[String]): Unit = {
      // Build a local session that runs with a single worker thread
      val spark:SparkSession = SparkSession.builder().master("local[1]")
          .getOrCreate()
      val rdd:RDD[Int] = spark.sparkContext.parallelize(List(1,2,3,4,5))
      // collect() is an action that returns all elements to the driver
      val rddCollect:Array[Int] = rdd.collect()
      println("Number of Partitions: "+rdd.getNumPartitions)
      println("Action: First element: "+rdd.first())
      println("Action: RDD converted to Array[Int] : ")
      rddCollect.foreach(println)
  }
}

Running the above program should produce the output below.

Number of Partitions: 1
Action: First element: 1
Action: RDD converted to Array[Int] : 
1
2
3
4
5
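
The second signature of parallelize mentioned above can be sketched as follows. This is a minimal illustration, not the article's own code: the object name RDDPartitionCount and the app name are assumptions. When no partition count is passed, parallelize falls back to sparkContext.defaultParallelism, which is why a local[1] session reports a single partition unless spark.default.parallelism is set to something else.

```scala
import org.apache.spark.sql.SparkSession

object RDDPartitionCount {
  def main(args: Array[String]): Unit = {
    // Illustrative settings (assumptions): same local[1] master as the example above
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("RDDPartitionCount")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = List(1, 2, 3, 4, 5)
    val defaultRdd = sc.parallelize(data)    // partition count defaults to sc.defaultParallelism
    val fourParts  = sc.parallelize(data, 4) // partition count given explicitly

    println("defaultParallelism: " + sc.defaultParallelism)
    println("Default partitions: " + defaultRdd.getNumPartitions)
    println("Explicit partitions: " + fourParts.getNumPartitions) // 4, as requested

    spark.stop()
  }
}
```

With local[1] and no configuration override, the first two lines should both report 1. Changing the master to local[4], for instance, would be expected to raise the default to 4.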

Create an empty RDD using sparkContext.parallelize
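
The code for this section did not survive on the page, so the following is only a sketch of one common way to do it: passing an empty typed collection to parallelize. The object name EmptyRDDExample, the element type String, and the app name are assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EmptyRDDExample {
  def main(args: Array[String]): Unit = {
    // Illustrative settings (assumptions): local master and a made-up app name
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("EmptyRDDExample")
      .getOrCreate()
    val sc = spark.sparkContext

    // An empty Seq needs an explicit element type so the RDD is typed RDD[String]
    val emptyRdd: RDD[String] = sc.parallelize(Seq.empty[String])

    println("Is empty: " + emptyRdd.isEmpty()) // prints Is empty: true
    println("Count: " + emptyRdd.count())      // prints Count: 0

    spark.stop()
  }
}
```

SparkContext also offers emptyRDD[T], which creates an empty RDD without going through parallelize.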


The complete code can be downloaded from GitHub – Spark Scala Examples project



This Post Has 13 Comments

  1. mohammed

    I really appreciate what you have done. I cleared my interview because of you. Again, thank you.

    1. NNK

      Congratulations, Mohammed. I am glad this site helped you clear the interview.

  2. Anonymous

    the information is very helpful to beginners
    good work

  3. Tejas

    Hi. Just an appreciation post. You are doing great. This Spark by Examples blog is good for learning, and for the coding and examples in Spark with Scala, which I did not find anywhere else. Keep it up.

  4. Ashwathy

    I tried implementing this piece of code in IntelliJ with Maven.
    The number of partitions created there was 1.
    How is the number of partitions decided in Spark?

  5. Anonymous

    Thank you

  6. Siraj

    Thanks for the detailed analysis of Spark RDD.
    I have one request. Can you please do a post on how to implement SCD#2.0 in Spark with Scala?

  7. rah

    One humble request: can you please give an example in Java as well?

  8. Anonymous

    How do I stop this moving text?
    I am unable to read it.

    1. NNK

      Hi, apologies for the inconvenience. The scrolling issue has been fixed.

  9. Akash

    When we use SparkSession, SQLContext, or HiveContext, the work is converted into RDD operations, which go through SparkContext. This means that whenever you use any Spark functionality, a SparkContext has to be created (it is mandatory).

  10. rachana

    Just a small question: in Spark 2.x we use SparkSession instead of SparkContext/SQLContext, so why do we use sc instead of the Spark session when defining the RDD?

    1. NNK

      Hi Rachana, not every method of SparkContext is defined in SparkSession. For example, all operations on RDDs are still present in SparkContext, so you have to use sc to create an RDD. Hope this helps.
