There are several ways to create an RDD in PySpark. In this article, I will cover the two most common:

- parallelizing an existing collection, and
- referencing a dataset in an external storage system (S3, HBase, and many more).
Before we look into examples, let's first create a PySpark SparkSession using the builder pattern defined in the SparkSession class. While initializing, we need to provide the master and application name as shown below. In a real-time application, you would pass the master from spark-submit instead of hardcoding it in the Spark application.
# Create SparkSession
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder \
    .master("local") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
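When the master comes from spark-submit rather than the code, the invocation looks roughly like the sketch below (the script name `my_app.py` and the `local[2]` master are placeholders; on a cluster you would pass `yarn`, `spark://host:port`, etc.):

```shell
# Pass the master and app name at submit time instead of hardcoding them
# (my_app.py is a hypothetical script name for this sketch)
spark-submit --master "local[2]" --name "SparkByExamples.com" my_app.py
```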
1. Create RDD using sparkContext.parallelize()
Using the parallelize() function of SparkContext (sparkContext.parallelize()), you can create an RDD. This function distributes an existing collection from your driver program across the cluster as an RDD. It is the most basic way to create an RDD and is used when you already have data in memory, loaded either from a file or from a database. Note that it requires all of the data to be present on the driver program prior to creating the RDD.
# Create RDD from parallelize
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = spark.sparkContext.parallelize(data)
For production applications, we mostly create RDDs by using external storage systems like S3, HBase, etc. To keep it simple, I am using files from the local system, or a Python list, to create the RDDs.
2. Create RDD using sparkContext.textFile()
Using the textFile() method, we can read a text (.txt) file into an RDD, where each line of the file becomes a separate element.
# Create RDD from external data source
rdd2 = spark.sparkContext.textFile("/path/textFile.txt")
3. Create RDD using sparkContext.wholeTextFiles()
# Reads entire file into an RDD as a single record
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")
Besides using text files, we can also create an RDD from CSV, JSON, and other file formats.
4. Create empty RDD using sparkContext.emptyRDD()
Using the emptyRDD() method on sparkContext, we can create an RDD with no data. This method creates an empty PySpark RDD with no partitions.
# Creates empty RDD with no partition
rdd = spark.sparkContext.emptyRDD()
5. Creating empty RDD with partition
Sometimes we may need to write an empty RDD to files by partition. In this case, you should create an empty RDD with partitions.
# Create empty RDD with partitions
rdd2 = spark.sparkContext.parallelize([], 10)  # This creates 10 partitions
In this article, you have learned how to create a PySpark RDD in different ways, how to create an empty RDD, and how to create an empty RDD with partitions.