Convert PySpark RDD to DataFrame

In PySpark, the toDF() function of the RDD is used to convert an RDD to a DataFrame. We often need to convert an RDD to a DataFrame because a DataFrame provides more advantages than an RDD. For instance, a DataFrame is a distributed collection of data organized into named columns, similar to database tables, and provides optimization and…
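As a minimal sketch of the conversion the article covers (the sample data and column names below are illustrative, not from the article), calling toDF() on an RDD of tuples looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()

# parallelize() turns a local list of tuples into an RDD.
rdd = spark.sparkContext.parallelize([("Java", 20000), ("Python", 100000)])

# Without arguments, toDF() assigns default column names _1, _2, ...;
# passing a list of names labels the columns explicitly.
df = rdd.toDF(["language", "users_count"])
df.printSchema()
df.show()
```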

Continue Reading Convert PySpark RDD to DataFrame

PySpark parallelize() – Create RDD from a list

PySpark parallelize() is a function in SparkContext that is used to create an RDD from a list collection. In this article, I will explain how to use parallelize() to create an RDD and how to create an empty RDD, with PySpark examples. Before we start, let me explain what an RDD is,…
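A quick hedged sketch of both calls the teaser mentions (the sample list is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelize-example").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python list across the cluster as an RDD.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())          # 5

# emptyRDD() creates an RDD with no elements and no partitions.
empty_rdd = sc.emptyRDD()
print(empty_rdd.isEmpty())  # True
```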

Continue Reading PySpark parallelize() – Create RDD from a list

Convert Spark RDD to DataFrame | Dataset

While working in Apache Spark with Scala, we often need to convert a Spark RDD to a DataFrame or Dataset, as these provide more advantages than the RDD. For instance, a DataFrame is a distributed collection of data organized into named columns, similar to database tables, and provides optimization and performance improvements. In this…
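The full article works in Scala, where the typed Dataset API lives; as a rough PySpark analogue of the DataFrame half (the schema and data below are illustrative), createDataFrame() with an explicit schema performs the same conversion:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df-dataset").getOrCreate()

rdd = spark.sparkContext.parallelize([("Scala", 3000), ("Java", 20000)])

# An explicit schema gives the resulting DataFrame named, typed columns.
schema = StructType([
    StructField("language", StringType(), True),
    StructField("users", IntegerType(), True),
])
df = spark.createDataFrame(rdd, schema)
df.show()
```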

Continue Reading Convert Spark RDD to DataFrame | Dataset

Different ways to create Spark RDD

A Spark RDD can be created in several ways using the Scala and PySpark languages: for example, by using sparkContext.parallelize(), from a text file, from another RDD, or from a DataFrame or Dataset. Though we have covered most of the examples in Scala here, the same concepts can be used to create…
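A hedged PySpark sketch of several of those creation paths (the file path and sample data are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-rdd").getOrCreate()
sc = spark.sparkContext

# 1. From a local collection.
rdd_from_list = sc.parallelize(["a", "b", "c"])

# 2. From a text file, one element per line (the path is hypothetical).
# rdd_from_file = sc.textFile("/tmp/data.txt")

# 3. From another RDD, via a transformation.
rdd_upper = rdd_from_list.map(lambda s: s.upper())

# 4. From a DataFrame: DataFrame.rdd returns an RDD of Row objects.
df = spark.createDataFrame([(1, "x")], ["id", "value"])
rdd_from_df = df.rdd
```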

Continue Reading Different ways to create Spark RDD

Create a Spark RDD using Parallelize

Let's see how to create a Spark RDD using the sparkContext.parallelize() method, with a Spark shell and Scala example. Before we start, let me explain what an RDD is: a Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark; it is an immutable distributed collection of objects. Each dataset in an RDD is…
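The article's example uses the Spark shell and Scala; the PySpark call mirrors it closely. A minimal sketch (the element range and partition count are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelize-rdd").getOrCreate()

# The optional second argument sets the number of partitions (slices)
# the data is split into.
rdd = spark.sparkContext.parallelize(range(1, 11), 3)

print(rdd.getNumPartitions())  # 3
print(rdd.collect())           # [1, 2, ..., 10], gathered back to the driver
```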

Continue Reading Create a Spark RDD using Parallelize