PySpark parallelize() - Create RDD from a list data

| *** Please Subscribe for Ad Free & Premium Content ***

Post author:Naveen Nelamali
Post category:PySpark
Post last modified:March 27, 2024
Reading time:5 mins read

You are currently viewing PySpark parallelize() – Create RDD from a list data

PySpark parallelize() is a function in SparkContext and is used to create an RDD from a list collection. In this article, I will explain the usage of parallelize to create RDD and how to create an empty RDD with PySpark example.

Using sc.parallelize on PySpark Shell or REPL

PySpark shell provides SparkContext variable “sc”, use sc.parallelize() to create an RDD.


rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10])

Using PySpark sparkContext.parallelize() in application

Since PySpark 2.0, First, you need to create a SparkSession which internally creates a SparkContext for you.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
sparkContext=spark.sparkContext

Now, use sparkContext.parallelize() to create rdd from a list or collection.


rdd=sparkContext.parallelize([1,2,3,4,5])
rddCollect = rdd.collect()
print("Number of Partitions: "+str(rdd.getNumPartitions()))
print("Action: First element: "+str(rdd.first()))
print(rddCollect)

By executing the above program you should see below output.


Number of Partitions: 4
Action: First element: 1
[1, 2, 3, 4, 5]

parallelize() function also has another signature which additionally takes integer argument to specifies the number of partitions. Partitions are basic units of parallelism in PySpark.

Remember, RDDs in PySpark are a collection of partitions.

create empty RDD by using sparkContext.parallelize

Some times we may need to create empty RDD and you can also use parallelize() in order to create it.


emptyRDD = sparkContext.emptyRDD()
emptyRDD2 = rdd=sparkContext.parallelize([])

print("is Empty RDD : "+str(emptyRDD2.isEmpty()))

The complete code can be downloaded from GitHub – PySpark Examples project

Tags: emptyRDD, parallelize, RDD

This Post Has 2 Comments

NNK March 1, 2021

Thanks Nic. I have fixed it now. Hope articles are good enough for learning. Please suggest for any improvements.
nlc February 27, 2021

`sparkContext.parallelize([1,2,3,4,5])`
shoud be
`rdd = sparkContext.parallelize([1,2,3,4,5])`

Comments are closed.

Using sc.parallelize on PySpark Shell or REPL

Using PySpark sparkContext.parallelize() in application

create empty RDD by using sparkContext.parallelize

Related Articles

This Post Has 2 Comments