Spark Persistence Storage Levels

All the different persistence (persist() method) storage levels that Spark/PySpark supports are available in the org.apache.spark.storage.StorageLevel and pyspark.StorageLevel classes, respectively. The storage level specifies how and where to persist or cache a Spark/PySpark RDD, DataFrame, or Dataset. All these storage levels are passed as an argument to the persist() method of the Spark/PySpark RDD, DataFrame…
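As a minimal sketch of passing a storage level to persist(), the following Scala snippet caches a small DataFrame with MEMORY_AND_DISK; the sample data is made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("PersistExample").getOrCreate()
      import spark.implicits._

      val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

      // Pass an explicit storage level; MEMORY_AND_DISK keeps partitions
      // in memory and spills to disk when memory is insufficient.
      df.persist(StorageLevel.MEMORY_AND_DISK)

      df.count()      // the first action materializes and caches the data
      df.unpersist()  // release the cached data when no longer needed
      spark.stop()
    }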

Continue Reading Spark Persistence Storage Levels

Spark RDD fold() function example

In this tutorial, you will learn the syntax and usage of the Spark RDD fold() function and how to use it to calculate the min, max, and total of the elements, with a Scala example; the same approach can be used for Java and PySpark (Python). Syntax def fold(zeroValue: T)(op: (T, T)…
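A minimal Scala sketch of the fold() usage described above; the input numbers are made up, and the zero value must be the identity of the operation, since it is applied once per partition and once in the final merge:

    import org.apache.spark.sql.SparkSession

    object FoldExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("FoldExample").getOrCreate()
      val rdd = spark.sparkContext.parallelize(List(5, 1, 9, 3))

      // The zero value is the identity element of each operation.
      val total = rdd.fold(0)(_ + _)
      val max   = rdd.fold(Int.MinValue)(math.max)
      val min   = rdd.fold(Int.MaxValue)(math.min)

      println(s"total=$total max=$max min=$min")  // total=18 max=9 min=1
      spark.stop()
    }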

Continue Reading Spark RDD fold() function example

Spark RDD reduce() function example

The Spark RDD reduce() aggregate action function is used to calculate the min, max, and total of the elements in a dataset. In this tutorial, I will explain the RDD reduce() function's syntax and usage with the Scala language; the same approach can be used with the Java and PySpark (Python) languages. Syntax def reduce(f:…
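A minimal Scala sketch of reduce() on made-up numbers; unlike fold(), reduce() takes no zero value, and the function should be commutative and associative:

    import org.apache.spark.sql.SparkSession

    object ReduceExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("ReduceExample").getOrCreate()
      val rdd = spark.sparkContext.parallelize(List(5, 1, 9, 3))

      // reduce() merges all elements with a binary function and
      // returns a plain value, triggering execution.
      val total = rdd.reduce(_ + _)
      val max   = rdd.reduce(math.max)
      val min   = rdd.reduce(math.min)

      println(s"total=$total max=$max min=$min")  // total=18 max=9 min=1
      spark.stop()
    }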

Continue Reading Spark RDD reduce() function example

Spark RDD aggregate() operation example

In this tutorial, you will learn how to aggregate elements using the Spark RDD aggregate() action to calculate the min, max, total, and count of RDD elements with the Scala language; the same approach can be used for Java and PySpark (Python). RDD aggregate() Syntax def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U,…
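A minimal Scala sketch of aggregate() computing a sum and a count in one pass over made-up numbers; seqOp folds elements into a per-partition accumulator, and combOp merges the accumulators:

    import org.apache.spark.sql.SparkSession

    object AggregateExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("AggregateExample").getOrCreate()
      val rdd = spark.sparkContext.parallelize(List(5, 1, 9, 3))

      // The accumulator is (sum, count); the zero value is (0, 0).
      val (sum, count) = rdd.aggregate((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: fold in one element
        (a, b)   => (a._1 + b._1, a._2 + b._2)   // combOp: merge partitions
      )

      println(s"sum=$sum count=$count")  // sum=18 count=4
      spark.stop()
    }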

Continue Reading Spark RDD aggregate() operation example

Spark RDD Actions with examples

RDD actions are operations that return raw values; in other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming. In this tutorial, we will learn RDD actions with Scala examples. As mentioned in RDD Transformations, all transformations are lazy, meaning they do…
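A minimal Scala sketch of a few common actions; each returns a plain value rather than an RDD and triggers evaluation of the lazy lineage (the sample strings are made up):

    import org.apache.spark.sql.SparkSession

    object ActionsExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("ActionsExample").getOrCreate()
      val rdd = spark.sparkContext.parallelize(List("a", "b", "c"))

      println(rdd.count())                // Long: 3
      println(rdd.first())                // String: a
      println(rdd.take(2).mkString(","))  // Array[String]: a,b
      rdd.collect().foreach(println)      // brings all elements to the driver
      spark.stop()
    }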

Continue Reading Spark RDD Actions with examples

Spark Pair RDD Functions

Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs). In this tutorial, we will learn these functions with Scala examples. Pair RDDs come in handy when you need to apply transformations like hash partitioning, set operations, joins, etc. All these functions are…
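A minimal Scala sketch of two PairRDDFunctions methods, reduceByKey() and join(), on made-up key-value data; these methods become available implicitly on any RDD[(K, V)]:

    import org.apache.spark.sql.SparkSession

    object PairRDDExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("PairRDDExample").getOrCreate()
      val sc = spark.sparkContext

      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

      // reduceByKey() merges values per key: (a,4), (b,2)
      val sums = pairs.reduceByKey(_ + _)
      sums.collect().foreach(println)

      // join() matches keys across two pair RDDs: (a,(1,x)), (a,(3,x))
      val other  = sc.parallelize(Seq(("a", "x")))
      val joined = pairs.join(other)
      joined.collect().foreach(println)
      spark.stop()
    }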

Continue Reading Spark Pair RDD Functions

Spark RDD Transformations with examples

RDD transformations are Spark operations that, when executed on an RDD, result in one or more new RDDs. Since RDDs are immutable in nature, transformations always create new RDDs without updating an existing one; hence, this creates an RDD lineage. RDD lineage is also known as the RDD operator graph or RDD…
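A minimal Scala sketch of lineage-building transformations on made-up lines; each call below only records a new RDD in the lineage, and nothing executes until the final action:

    import org.apache.spark.sql.SparkSession

    object TransformationsExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("TransformationsExample").getOrCreate()
      val lines = spark.sparkContext.parallelize(Seq("spark rdd", "spark"))

      // Each transformation returns a new immutable RDD.
      val words    = lines.flatMap(_.split(" "))
      val filtered = words.filter(_ == "spark")

      println(filtered.count())  // the action triggers execution: 2
      spark.stop()
    }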

Continue Reading Spark RDD Transformations with examples

Spark Load CSV File into RDD

In this tutorial, I will explain how to load a CSV file into a Spark RDD using a Scala example. Using the textFile() method in the SparkContext class, we can read CSV files, multiple CSV files (based on pattern matching), or all files from a directory into an RDD[String] object. Before…
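A minimal Scala sketch of this approach; the path data/text01.csv is hypothetical, and each RDD[String] element is one line, which we split on commas:

    import org.apache.spark.sql.SparkSession

    object CsvToRddExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("CsvToRddExample").getOrCreate()

      // Hypothetical path; adjust to a real CSV file.
      val rddFromFile = spark.sparkContext.textFile("data/text01.csv")

      // Split each line on the delimiter to get an RDD of column arrays.
      val columns = rddFromFile.map(_.split(","))
      columns.collect().foreach(cols => println(cols.mkString(" | ")))
      spark.stop()
    }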

Continue Reading Spark Load CSV File into RDD

Spark – Read multiple text files into single RDD?

Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods, we can also read all files from a directory and files matching a specific pattern. textFile() - Read single or…
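A minimal Scala sketch of both methods; the data/ paths are hypothetical. textFile() accepts comma-separated paths, wildcards, and directories, while wholeTextFiles() pairs each file path with its full content:

    import org.apache.spark.sql.SparkSession

    object MultiFileExample extends App {
      val spark = SparkSession.builder().master("local[1]")
        .appName("MultiFileExample").getOrCreate()
      val sc = spark.sparkContext

      // Comma-separated paths merged into one RDD[String] of lines.
      val merged = sc.textFile("data/text01.txt,data/text02.txt")

      // Wildcard pattern: all .txt files in the directory.
      val fromPattern = sc.textFile("data/*.txt")

      // RDD[(String, String)] of (filePath, fileContent).
      val whole = sc.wholeTextFiles("data")

      println(merged.count())
      spark.stop()
    }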

Continue Reading Spark – Read multiple text files into single RDD?