Hive Aggregate Functions (UDAF) with Examples

Hive aggregate functions are among the most used built-in functions. They take a set of values and return a single value; when used with a group, they aggregate all values in each group and return one value per group. Like in SQL, Aggregate Functions in Hive can be used with…
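The grouping behavior described above can be sketched in plain Python. This is only an illustration of the semantics, not Hive code; the sample rows and the `aggregate_by_group` helper are hypothetical names made up for the example.

```python
from collections import defaultdict

# Hypothetical sample rows: (department, salary)
rows = [("sales", 3000), ("sales", 4600), ("finance", 3900), ("finance", 3300)]

def aggregate_by_group(rows, agg):
    """Collect the values for each key, then reduce every group to one value."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# With a group: one result per group
total_by_dept = aggregate_by_group(rows, sum)
max_by_dept = aggregate_by_group(rows, max)

# Without a group: the whole set of values collapses to a single value
overall_total = sum(value for _, value in rows)
```

Any function that reduces a list to one value (`sum`, `max`, `min`, `len`) can be passed as `agg`, which loosely mirrors how a single aggregate call behaves with and without grouping.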

PySpark orderBy() and sort() explained

You can use either the sort() or orderBy() function of a PySpark DataFrame to sort it in ascending or descending order based on single or multiple columns. You can also sort using PySpark SQL sorting functions. In this article, I will explain all these different ways using PySpark examples. Using sort()…
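To make "multiple columns, ascending or descending" concrete, here is a minimal plain-Python sketch of that ordering. It does not use the PySpark API; the sample rows and variable names are hypothetical.

```python
# Hypothetical sample rows: (name, department, salary)
rows = [
    ("James", "Sales", 3000),
    ("Anna", "Finance", 3900),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
]

# Goal: department ascending, then salary descending within each department.
# Python's sort is stable, so sort by the secondary key first and the
# primary key last; ties on the primary key keep the secondary order.
by_salary_desc = sorted(rows, key=lambda r: r[2], reverse=True)
result = sorted(by_salary_desc, key=lambda r: r[1])

names_in_order = [name for name, _, _ in result]
```

The two-pass approach is used here because the columns sort in different directions; when every column sorts the same way, a single `sorted` call with a tuple key is enough.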

Start H2O Cluster on Hadoop (External Backend)

In external backend mode, the H2O cluster runs outside of the Spark application. This makes the cluster more stable, since it doesn't go down when Spark executors are killed, and provides high availability of the H2O cluster. You can set up an external backend by using the configuration below. 1. Downloading External Jar…

Install & Running Sparkling Water on Ubuntu

In this tutorial, you will learn how to install H2O Sparkling Water on Ubuntu Linux and run the H2O sparkling-shell and Flow web interface. To run Sparkling Water, you need to have Apache Spark installed. Sparkling Water enables users to run H2O machine learning algorithms on the Spark…

Spark RDD Actions with examples

RDD actions are operations that return raw values. In other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming. In this tutorial, we will learn RDD actions with Scala examples. As mentioned in RDD Transformations, all transformations are lazy, meaning they do…

Spark Pair RDD Functions

Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs). In this tutorial, we will learn these functions with Scala examples. Pair RDDs come in handy when you need to apply transformations like hash partitioning, set operations, joins, etc. All these functions are…

Spark SQL StructType & StructField with examples

The Spark SQL StructType and StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns. StructType is a collection of StructFields. Using StructField we can define column name, column data type, nullable column (boolean to specify if the…
