Spark – How to create an empty RDD?

We often need to create an empty RDD in Spark, and it can be created in several ways: with partitions, without partitions, and as a pair RDD. In this article, we will see these with Scala, Java, and PySpark examples. Spark sc.emptyRDD – creates an empty RDD with no partition. Create…

Continue Reading Spark – How to create an empty RDD?
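A minimal sketch of the options the excerpt mentions, assuming a local SparkSession (the object and app names here are placeholders for illustration):

```scala
import org.apache.spark.sql.SparkSession

object EmptyRDDExample extends App {
  // Local session for illustration; master and app name are placeholders
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("EmptyRDDExample")
    .getOrCreate()
  val sc = spark.sparkContext

  // sc.emptyRDD: an empty RDD with no partitions
  val rdd = sc.emptyRDD[String]
  println(rdd.getNumPartitions) // 0

  // Parallelizing an empty Seq: an empty RDD with the default partitions
  val rddWithPartitions = sc.parallelize(Seq.empty[String])
  println(rddWithPartitions.getNumPartitions)

  // An empty pair (key/value) RDD
  val pairRdd = sc.emptyRDD[(String, Int)]
}
```

Note that sc.emptyRDD reports zero partitions, while parallelizing an empty Seq still gets the default parallelism of the context.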

Convert Spark RDD to DataFrame | Dataset

While working in Apache Spark with Scala, we often need to convert a Spark RDD to a DataFrame or Dataset, as these provide several advantages over RDDs. For instance, a DataFrame is a distributed collection of data organized into named columns, similar to database tables, and provides optimization and performance improvements. In this…

Continue Reading Convert Spark RDD to DataFrame | Dataset
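A short sketch of both conversions, assuming a local SparkSession; the Language case class and the column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Case class defined at top level so Spark can derive an encoder for it
case class Language(name: String, users: Int)

object RddToDataFrameExample extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("RddToDataFrameExample")
    .getOrCreate()
  import spark.implicits._

  val rdd = spark.sparkContext.parallelize(Seq(("Scala", 3000), ("Java", 20000)))

  // toDF() turns an RDD of tuples into a DataFrame with named columns
  val df = rdd.toDF("language", "users")
  df.printSchema()

  // Mapping to a case class and calling toDS() yields a typed Dataset
  val ds = rdd.map { case (name, users) => Language(name, users) }.toDS()
  ds.show()
}
```

Both toDF() and toDS() come from the implicits imported via import spark.implicits._, which is why that import is required.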

Different ways to create Spark RDD

A Spark RDD can be created in several ways using the Scala and PySpark languages; for example, it can be created by using sparkContext.parallelize(), from a text file, from another RDD, or from a DataFrame or Dataset. Though we have covered most of the examples in Scala here, the same concept can be used to create…

Continue Reading Different ways to create Spark RDD
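A compact sketch of those creation paths, assuming a local SparkSession (the file path and column names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object CreateRDDExample extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("CreateRDDExample")
    .getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._

  // 1. From a local collection
  val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // 2. From a text file (the path here is a placeholder)
  // val rdd2 = sc.textFile("/tmp/data.txt")

  // 3. From another RDD, via a transformation
  val rdd3 = rdd1.map(_ * 2)

  // 4. From a DataFrame or Dataset, via .rdd (yields an RDD[Row])
  val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
  val rdd4 = df.rdd
}
```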

Create a Spark RDD using Parallelize

Let's see how to create a Spark RDD using the sparkContext.parallelize() method, with Spark shell and Scala examples. Before we start, let me explain what an RDD is: Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is…

Continue Reading Create a Spark RDD using Parallelize
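A minimal sketch of parallelize in action, assuming a local SparkSession; the sample data and partition count are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ParallelizeExample extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("ParallelizeExample")
    .getOrCreate()
  val sc = spark.sparkContext

  // Distribute a local collection as an RDD
  val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
  println(s"Partitions: ${rdd.getNumPartitions}")
  println(s"First: ${rdd.first()}")

  // The second argument controls how many partitions the data is split into
  val rdd2 = sc.parallelize(1 to 100, 4)
  println(s"Sum: ${rdd2.sum()}")
}
```

When the partition count is omitted, parallelize splits the data according to the context's default parallelism.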