PySpark Create RDD with Examples

There are several ways to create an RDD in PySpark. In this article, I will cover the most commonly used ways with examples.

Before we look into examples, let's first create a PySpark SparkSession using the builder pattern defined in the SparkSession class. While initializing, we need to provide the master and application name as shown below. In a real-world application, you would pass the master from spark-submit instead of hardcoding it in the Spark application.


# Create SparkSession
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate()
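
Once the session is created, the SparkContext used in the examples below is available as spark.sparkContext. As a quick sanity check, you can print its settings (the values reflect the builder configuration above):

# Verify the SparkContext created along with the SparkSession
print(spark.sparkContext.appName)  # SparkByExamples.com
print(spark.sparkContext.master)   # local[1]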

Note: Creating a SparkSession object internally creates one SparkContext per JVM.

1. Create RDD using sparkContext.parallelize()

By using the parallelize() function of SparkContext (sparkContext.parallelize()), you can create an RDD. This function distributes an existing collection from your driver program to form a parallelized RDD. It is the basic method to create an RDD and is used when you already have data in memory, either loaded from a file or from a database; note that it requires all the data to be present on the driver program prior to creating the RDD.


# Create RDD from parallelize
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)
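
Since parallelize() returns the RDD immediately, you can verify its contents and partitioning with actions such as count(), getNumPartitions(), and collect():

# Inspect the RDD created above
print(rdd.count())             # 12
print(rdd.getNumPartitions())  # 1, since the master is local[1]
print(rdd.collect())           # [1, 2, 3, ..., 12]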

For production applications, we mostly create RDDs using external storage systems like HDFS, S3, HBase, etc. To keep it simple, I am using files from the local system or loading data from a Python list to create the RDDs.

2. Create RDD using sparkContext.textFile()

Using the textFile() method, we can read a text (.txt) file into an RDD; each line in the file becomes a separate element (record) of the RDD.


# Create RDD from external data source
rdd2 = spark.sparkContext.textFile("/path/textFile.txt")
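
Because each element of the resulting RDD is one line of the file, line-based transformations apply directly. As a small sketch, here is a classic word count built on rdd2:

# Each element of rdd2 is one line of the file
print(rdd2.count())  # number of lines in the file

# Classic word count over the lines
wordCounts = rdd2.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
print(wordCounts.collect())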

3. Create RDD using sparkContext.wholeTextFiles()

The wholeTextFiles() function returns a PairRDD, with the key being the file path and the value being the file content.


# Reads the entire file into an RDD as a single record
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")
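
Since each element is a (path, content) tuple, you can unpack the pair directly when iterating:

# Each element of rdd3 is a (filePath, fileContent) tuple
for path, content in rdd3.collect():
    print(path)          # e.g. file:/path/textFile.txt
    print(content[:50])  # first 50 characters of the file content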

Besides text files, we can also create an RDD from CSV files, JSON, and other formats, as shown below.
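
For structured formats such as CSV and JSON, a common approach is to read the file into a DataFrame first and then call .rdd on it. A minimal sketch, using a hypothetical /path/file.csv:

# Read a CSV into a DataFrame, then convert it to an RDD of Row objects
df = spark.read.option("header", True).csv("/path/file.csv")
rdd4 = df.rdd
print(rdd4.take(2))  # first two rows as Row objects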

4. Create empty RDD using sparkContext.emptyRDD()

Using the emptyRDD() method on sparkContext, we can create an RDD with no data. This method creates an empty PySpark RDD with no partitions.


# Creates an empty RDD with no partitions
rdd = spark.sparkContext.emptyRDD()
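
You can confirm that the RDD is indeed empty and has no partitions:

# Verify the empty RDD
print(rdd.isEmpty())           # True
print(rdd.getNumPartitions())  # 0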

5. Create empty RDD with partitions

Sometimes we may need to write an empty RDD to files by partition; in this case, you should create an empty RDD with partitions.


# Create empty RDD with partitions
rdd2 = spark.sparkContext.parallelize([], 10)  # This creates 10 partitions
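
Writing this RDD out produces one empty part file per partition, which is why the partition count matters here. A quick check, using a hypothetical output path:

# Verify the partition count; saving creates one empty part file per partition
print(rdd2.getNumPartitions())         # 10
rdd2.saveAsTextFile("/path/emptyRDD")  # writes part-00000 ... part-00009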

6. Conclusion

In this article, you have learned how to create a PySpark RDD in different ways, how to create an empty RDD, and how to create an empty RDD with partitions.
