Let’s see how to create a Spark RDD using the sparkContext.parallelize() method, both from the Spark shell and with a Scala example.
Before we start, let me explain what an RDD is. A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark: an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Note: parallelize() distributes an existing collection in your driver program across the cluster.
Below is an example of creating an RDD using the parallelize method of SparkContext. For example, sparkContext.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
creates an RDD from an Array of integers.
Using sc.parallelize on Spark Shell or REPL
The Spark shell provides the SparkContext variable “sc”; use sc.parallelize()
to create an RDD.
scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
Using Spark sparkContext.parallelize in Scala
If you are using Scala, get the SparkContext object from the SparkSession and use sparkContext.parallelize() to create an RDD. This method also has another signature that additionally takes an integer argument specifying the number of partitions. Partitions are the basic units of parallelism in Apache Spark: an RDD is a collection of partitions that are processed in parallel across the cluster. The parallelize method takes a collection of objects, such as a List, Array, or Seq, and creates an RDD from it. The RDD is then distributed across the nodes in the Spark cluster for parallel processing.
// Imports
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDParallelize {

  def main(args: Array[String]): Unit = {
    // Create a local SparkSession with a single core
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()

    // Parallelize a local collection into an RDD
    val rdd: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5))

    val rddCollect: Array[Int] = rdd.collect()
    println("Number of Partitions: " + rdd.getNumPartitions)
    println("Action: First element: " + rdd.first())
    println("Action: RDD converted to Array[Int] : ")
    rddCollect.foreach(println)
  }
}
Executing the above program produces the output below.
Number of Partitions: 1
Action: First element: 1
Action: RDD converted to Array[Int] :
1
2
3
4
5
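As noted above, parallelize() has a second signature whose extra integer argument (numSlices) sets the number of partitions explicitly. A minimal sketch under that signature (the object name, master setting, and sample values are illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ParallelizeWithPartitions {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[2]")
      .appName("ParallelizeWithPartitions")
      .getOrCreate()

    // The second argument (numSlices) sets the number of partitions explicitly
    val rdd: RDD[Int] = spark.sparkContext.parallelize(1 to 10, 4)
    println("Number of Partitions: " + rdd.getNumPartitions) // 4

    // glom() turns each partition into an array, so we can see how
    // the data was split across the 4 partitions
    rdd.glom().collect().foreach(part => println(part.mkString(",")))

    spark.stop()
  }
}
```

Without the second argument, the partition count falls back to the default parallelism of the master (which is why the earlier local[1] example printed 1).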
Create an empty RDD using sparkContext.parallelize
We can create an empty RDD in two ways: using sparkContext.parallelize() with an empty collection, or using the sparkContext.emptyRDD() method. The difference between the two lies in the partitions: emptyRDD() creates an RDD with zero partitions, so saving it writes no part files to disk, whereas parallelize(Seq.empty) creates an RDD with the default number of partitions, so saving it writes that many empty part files.
Using sparkContext.parallelize
// Create emptyRdd using parallelize
sparkContext.parallelize(Seq.empty[String])
Using sparkContext.emptyRDD
// Create emptyRdd using emptyRDD
val emptyRDD = sparkContext.emptyRDD[String]
emptyRDD.saveAsTextFile("src/main/output2/EmptyRDD2")
When the above code with emptyRDD is executed, the output directory contains no part files (only the _SUCCESS marker), because the RDD has zero partitions; saving an empty RDD created with parallelize() instead writes one empty part file per partition.
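The partition difference between the two empty-RDD approaches can be checked directly with getNumPartitions; a small sketch (the object name and master setting are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object EmptyRDDPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("EmptyRDDPartitions")
      .getOrCreate()
    val sc = spark.sparkContext

    // emptyRDD has zero partitions, so saving it writes no part files
    val e1 = sc.emptyRDD[String]
    println("emptyRDD partitions: " + e1.getNumPartitions) // 0

    // parallelize(Seq.empty) gets the default number of partitions,
    // so saving it writes that many empty part files
    val e2 = sc.parallelize(Seq.empty[String])
    println("parallelize partitions: " + e2.getNumPartitions)

    spark.stop()
  }
}
```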
The complete code can be downloaded from GitHub – Spark Scala Examples project
Related Articles
- Spark RDD fold() function example
- Spark RDD reduce() function example
- Spark RDD Actions with examples
- Different ways to create Spark RDD
- Convert Spark RDD to DataFrame | Dataset
- Spark RDD Transformations with examples
- Spark sortByKey() with RDD Example
- Spark – How to create an empty RDD?
- How to create an RDD using parallelize
hello,
just a small qn: in Spark 2.x we use SparkSession instead of SparkContext/SQLContext, but when defining an RDD, why do we use sc instead of the Spark session?
Hi Rachana, not every method of SparkContext is defined on SparkSession. For example, all operations on RDDs are still on SparkContext, hence you have to use sc (available as spark.sparkContext) to create an RDD. Hope this helps.
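To make the answer above concrete: in Spark 2.x the SparkContext is still reachable through the session’s sparkContext accessor. A minimal sketch (the object name and sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object SessionToContext {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SessionToContext")
      .getOrCreate()

    // SparkSession wraps a SparkContext; the RDD APIs live on the context
    val sc = spark.sparkContext
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    println(rdd.count()) // 3

    spark.stop()
  }
}
```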
Whether we use SparkSession, SQLContext, or HiveContext, the work is ultimately executed as RDD operations that go through SparkContext, which means whenever you use any Spark functionality a SparkContext must exist (the SparkSession creates one for you).
One humble request: can you please give an example with Java also?
Hi,
Thanks for the detailed analysis of Spark RDD.
I had one request to make: can you please do a post on how to implement SCD Type 2 in Spark-Scala?
Thanks.
Thank you
I tried implementing this piece of code in IntelliJ with Maven.
The number of partitions created there was 1.
How is the number of partitions decided in Spark?
Hi, just an appreciation post. You are doing great; this Spark By Examples blog is good for learning, and the coding part and examples in Spark-Scala are something I did not find anywhere else. Keep it up.
The information is very helpful to beginners.
Good work.
I really appreciate what you have done. I cleared my interview because of you. Again, thank you.
Congratulations Mohammed. I am glad this site helped you to clear the interview.
My heartfelt thanks to the creator and author of this site. Through this site, I got an opportunity to get a job at an MNC.
Thank you Mohan for the comment. I am glad it helped you to get the job.