There are several ways to create an RDD in PySpark. In this article, I will cover the two most common ones:
- parallelizing an existing collection, and
- referencing a dataset in an external storage system (HDFS, S3, and many more).
Before we look into examples, let's first create a PySpark SparkSession using the builder pattern defined in the SparkSession class. While initializing, we need to provide the master and application name as shown below. In a real-time application, you would pass the master from spark-submit instead of hardcoding it in the Spark application.
# Create SparkSession
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
Note: Creating a SparkSession object internally creates one SparkContext per JVM.
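As a quick check (a minimal sketch, assuming the session above was created successfully), the SparkContext used by the examples below is available from the session and reflects the settings we passed to the builder:
# The SparkContext created by the session is reused in the examples below
sc = spark.sparkContext
print(sc.master)    # local[1]
print(sc.appName)   # SparkByExamples.com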
1. Create RDD using sparkContext.parallelize()
By using the parallelize() function of SparkContext (sparkContext.parallelize()), you can create an RDD. This function distributes an existing collection from your driver program into an RDD. It is a basic way to create an RDD and is used when you already have data in memory, either loaded from a file or from a database; it requires all the data to be present on the driver program prior to creating the RDD.
# Create RDD from parallelize
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)
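To quickly verify the RDD (a small check; collect() brings all data back to the driver, so use it only on small datasets), you can print the partition count and the contents:
# Verify the RDD: partition count and contents (small data only)
print(rdd.getNumPartitions())  # 1 here, since the master is local[1]
print(rdd.collect())           # [1, 2, 3, ..., 12]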
For production applications, we mostly create RDDs by using external storage systems like HDFS, S3, HBase, etc. To keep it simple, I am using files from the local system or a Python list to create RDDs.
2. Create RDD using sparkContext.textFile()
Using the textFile() method, we can read a text (.txt) file into an RDD.
# Create RDD from an external data source
rdd2 = spark.sparkContext.textFile("/path/textFile.txt")
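As a quick sanity check (assuming the hypothetical /path/textFile.txt above exists and is readable), each line of the file becomes one element of the RDD:
# Each line of the text file becomes one record in the RDD
print(rdd2.count())   # total number of lines
print(rdd2.take(3))   # first three lines as a list of strings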
3. Create RDD using sparkContext.wholeTextFiles()
The wholeTextFiles() function returns a PairRDD with the key being the file path and the value being the file content.
# Reads the entire file into an RDD as a single record
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")
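Because the result is a pair RDD, each element is a (file path, file content) tuple; a minimal way to inspect it (again assuming the hypothetical path exists):
# Each record is a (file path, file content) tuple
for path, content in rdd3.collect():
    print(path)           # e.g. file:/path/textFile.txt
    print(content[:50])   # first 50 characters of the file content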
Besides text files, we can also create an RDD from CSV, JSON, and other file formats.
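For example, a plain CSV file can be read with textFile() and split into columns with map(); this is only an illustrative sketch with a hypothetical path, and for serious CSV or JSON work the DataFrame readers (spark.read.csv, spark.read.json) are usually a better fit:
# Sketch: read a CSV file as text and split each line into columns
csvRdd = spark.sparkContext.textFile("/path/data.csv") \
    .map(lambda line: line.split(","))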
4. Create empty RDD using sparkContext.emptyRDD()
Using the emptyRDD() method on sparkContext, we can create an RDD with no data. This method creates an empty PySpark RDD with no partitions.
# Create an empty RDD with no partitions
rdd = spark.sparkContext.emptyRDD()
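You can confirm the RDD is empty and has no partitions:
# An empty RDD has zero partitions and no records
print(rdd.getNumPartitions())   # 0
print(rdd.isEmpty())            # True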
5. Creating empty RDD with partitions
Sometimes we may need to write an empty RDD to files by partition. In this case, you should create an empty RDD with partitions.
# Create empty RDD with partitions
rdd2 = spark.sparkContext.parallelize([], 10)  # This creates 10 partitions
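To confirm the partition count, and to sketch writing it out by partition (the output path below is just a placeholder), you could do something like:
# Verify partitions and write one (empty) part file per partition
print(rdd2.getNumPartitions())            # 10
rdd2.saveAsTextFile("/path/emptyOutput")  # writes 10 empty part files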
6. Conclusion
In this article, you have learned how to create a PySpark RDD in different ways, including how to create an empty RDD with and without partitions.
Related Articles
- PySpark RDD Actions with examples
- PySpark RDD Transformations with examples
- Convert PySpark RDD to DataFrame
- PySpark Convert DataFrame to RDD
- Print the contents of RDD in Spark & PySpark
- PySpark Row using on DataFrame and RDD
- PySpark – Create an Empty DataFrame & RDD
- PySpark parallelize() – Create RDD from a list data