Spark RDD aggregateByKey()
In Spark/PySpark, aggregateByKey() is one of the fundamental RDD transformations. The most common problem when working with key-value pairs is grouping values and aggregating them by a common key.…
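To make the grouping-and-aggregating idea concrete, here is a minimal pure-Python sketch of the semantics behind aggregateByKey(). It is not Spark code: the helper `aggregate_by_key` and the sample data are hypothetical, and the nested lists stand in for RDD partitions. In PySpark the corresponding call would be `rdd.aggregateByKey(zeroValue, seqFunc, combFunc)`.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Hypothetical pure-Python model of aggregateByKey (not the Spark API)."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op folds one value into the key's accumulator inside a partition
            acc[key] = seq_op(acc.get(key, zero_value), value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            # comb_op merges accumulators that came from different partitions
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Two "partitions" of (key, value) pairs; the data is made up for illustration.
partitions = [[("a", 1), ("b", 4)], [("a", 3), ("b", 6), ("a", 5)]]

# The accumulator is a (sum, count) pair, so its type differs from the value type.
sum_count = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {key: total / count for key, (total, count) in sum_count.items()}
```

The accumulator having a different type from the input values is the main reason to reach for aggregateByKey() rather than reduceByKey().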
Spark/PySpark RDD join supports all the basic join types: INNER, LEFT, RIGHT, and OUTER JOIN. Spark RDD joins are wide transformations that shuffle data over the network, hence they have huge performance issues when…
The Spark or PySpark groupByKey() is the most frequently used wide transformation operation; it shuffles data across the executors when the data is not partitioned on the key. It…
Spark sortByKey() is an RDD transformation that sorts the elements of a pair RDD by key, in ascending or descending order. The sortByKey() function operates on pair RDDs (key/value pairs)…
In Spark, foreachPartition() is used when you have a heavy initialization (such as a database connection) and want to initialize it once per partition, whereas foreach() is used to apply a function…
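The difference in initialization cost can be sketched in plain Python. This is an illustrative model, not Spark code: `FakeConnection`, `foreach`, and `foreach_partition` are hypothetical stand-ins, and the nested lists play the role of partitions. In PySpark the real calls are `rdd.foreach(f)` and `rdd.foreachPartition(f)`.

```python
class FakeConnection:
    """Hypothetical stand-in for an expensive resource such as a DB connection."""
    opened = 0  # counts how many connections were created

    def __init__(self):
        FakeConnection.opened += 1
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def foreach(partitions, func):
    # Model of foreach(): func is called once per element.
    for part in partitions:
        for element in part:
            func(element)


def foreach_partition(partitions, func):
    # Model of foreachPartition(): func is called once per partition
    # and receives an iterator over that partition's elements.
    for part in partitions:
        func(iter(part))


partitions = [[1, 2, 3], [4, 5]]


def save_row(row):
    FakeConnection().write(row)  # a new connection for every element


def save_partition(rows):
    conn = FakeConnection()  # one connection shared by the whole partition
    for row in rows:
        conn.write(row)


foreach(partitions, save_row)
per_element_opens = FakeConnection.opened

FakeConnection.opened = 0
foreach_partition(partitions, save_partition)
per_partition_opens = FakeConnection.opened
```

With five elements in two partitions, the per-element version opens five connections and the per-partition version opens two.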
In Spark, foreach() is an action available on RDD, DataFrame, and Dataset that iterates over each element in the dataset. It is similar to for with advance…
Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function. It is a wide transformation, as it shuffles data across multiple partitions…
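As a rough illustration of merging values per key, the following pure-Python sketch models what reduceByKey() computes. The helper `reduce_by_key` and the word list are hypothetical, and no shuffle or partitioning is modeled. In PySpark the corresponding call is `rdd.reduceByKey(func)`.

```python
from functools import reduce
from operator import add


def reduce_by_key(pairs, func):
    """Hypothetical pure-Python model of reduceByKey (not the Spark API)."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # func must be associative: Spark may apply it to partial results
    # in any grouping, so the grouping must not change the answer.
    return {key: reduce(func, values) for key, values in grouped.items()}


words = ["spark", "rdd", "spark", "key", "rdd", "spark"]
counts = reduce_by_key([(word, 1) for word in words], add)
```

Unlike aggregateByKey(), the result here has the same type as the input values, since the same function is used for every merge step.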
Spark map() is a transformation that applies a function to every element of an RDD, DataFrame, or Dataset and returns a new RDD/Dataset. In this…
Spark flatMap() transformation flattens the RDD/DataFrame column after applying a function to every element and returns a new RDD/DataFrame. The returned RDD/DataFrame can have the same count or more…
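The contrast between map() and flatMap() can be shown with ordinary Python lists. This is only a sketch of the semantics, assuming a made-up list of lines in place of an RDD. In PySpark the corresponding calls are `rdd.map(f)` and `rdd.flatMap(f)`.

```python
from itertools import chain

# Sample input invented for illustration; each string plays the role of one RDD element.
lines = ["spark rdd", "flat map", "one"]

# map(): exactly one output element per input element (here, a list of words).
mapped = [line.split(" ") for line in lines]

# flatMap(): the per-element results are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
```

Here `mapped` keeps three elements, one list per line, while `flat_mapped` has five elements, one per word.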
In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access and use them.…