PySpark parallelize() – Create RDD from a list data

  • Post author:
  • Post category:PySpark
  • Post last modified:November 29, 2023
  • Reading time:5 mins read

PySpark parallelize() is a function in SparkContext and is used to create an RDD from a list collection. In this article, I will explain the usage of parallelize to create RDD and how to create an empty RDD with PySpark example.

Before we start let me explain what is RDD, Resilient Distributed Datasets (RDD) is a fundamental data structure of PySpark, It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

  • PySpark Parallelizing an existing collection in your driver program.

Below is an example of how to create an RDD using a parallelize method from Sparkcontext. sparkContext.parallelize([1,2,3,4,5,6,7,8,9,10]) creates an RDD with a list of Integers.

Using sc.parallelize on PySpark Shell or REPL

PySpark shell provides SparkContext variable “sc”, use sc.parallelize() to create an RDD.


rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10])

Using PySpark sparkContext.parallelize() in application

Since PySpark 2.0, First, you need to create a SparkSession which internally creates a SparkContext for you.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
sparkContext=spark.sparkContext

Now, use sparkContext.parallelize() to create rdd from a list or collection.


rdd=sparkContext.parallelize([1,2,3,4,5])
rddCollect = rdd.collect()
print("Number of Partitions: "+str(rdd.getNumPartitions()))
print("Action: First element: "+str(rdd.first()))
print(rddCollect)

By executing the above program you should see below output.


Number of Partitions: 4
Action: First element: 1
[1, 2, 3, 4, 5]

parallelize() function also has another signature which additionally takes integer argument to specifies the number of partitions. Partitions are basic units of parallelism in PySpark.

Remember, RDDs in PySpark are a collection of partitions.

create empty RDD by using sparkContext.parallelize

Some times we may need to create empty RDD and you can also use parallelize() in order to create it.


emptyRDD = sparkContext.emptyRDD()
emptyRDD2 = rdd=sparkContext.parallelize([])

print("is Empty RDD : "+str(emptyRDD2.isEmpty()))

The complete code can be downloaded from GitHub – PySpark Examples project

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply

This Post Has 2 Comments

  1. nlc

    `sparkContext.parallelize([1,2,3,4,5])`
    shoud be
    `rdd = sparkContext.parallelize([1,2,3,4,5])`

    1. NNK

      Thanks Nic. I have fixed it now. Hope articles are good enough for learning. Please suggest for any improvements.