Hive Aggregate Functions (UDAF) with Examples

Hive Aggregate Functions are the most used built-in functions that take a set of values and return a single value, when used with a group, it aggregates all values in each group and returns one value for each group.

Hive Aggregate Functions List

Hive Aggregate Functions	Syntax & Description
COUNT()	Returns the count of all rows in a table including rows containing NULL values When you specify a column as an input, it ignores NULL values in the column for the count. Also ignores duplicates by using DISTINCT. Return: BIGINT
SUM()	Returns the sum of all values in a column. When used with a group it returns the sum for each group. Also ignores duplicates by using DISTINCT. Return: DOUBLE
AVG()	Returns the average of all values in a column. When used with a group it returns an average for each group. Return: DOUBLE
MIN()	Returns the minimum value of the column from all rows. When used with a group it returns a minimum for each group. Return: DOUBLE
MAX()	Returns the maximum value of the column from all rows. When used with a group it returns a maximum for each group. Return: DOUBLE
Variance(col), var_pop(col)	Returns the variance of a numeric column for all rows or for each group. Return: DOUBLE
Var_samp(col)	Returns the unbiased sample variance of a numeric column or for each group. Return: DOUBLE
Stddev_pop(col)	Returns the statistical standard deviation of all values in a column or for each group. Return: DOUBLE
Stddev_samp(col)	Returns the sample statistical standard deviation of all values in a column or for each group. Return: DOUBLE
Covar_pop(col1, col2)	Returns the population covariance of a pair of numeric columns for all rows or for each group. Return: DOUBLE
Covar_samp(col1, col2)	Returns the sample covariance of a pair of numeric for all rows or for each group. Return: DOUBLE
Corr(col1, col2)	Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group. Return: DOUBLE
Percentile(BIGINT col, p)	For each group, it returns the exact percentile of a column. p must be between 0 and 1. Return: DOUBLE
Percentile(BIGINT col, array(p₁ [, p₂]…))	Returns the exact percentiles p₁, p₂, … of a column in the group. p_i must be between 0 and 1. Return: array<double>
Regr_Avgx(independent, dependent)	Equivalent to avg(dependent). Return: DOUBLE
Regr_avgy(independent, dependent)	Equivalent to avg(independent). Return: DOUBLE
Regr_count(independent, dependent)	Returns the number of non-null pairs used to fit the linear regression line. Return: DOUBLE
Regr_intercept(independent, dependent)	Returns the y-intercept of the linear regression line. Return: DOUBLE
Regr_r2(independent, dependent)	Returns the coefficient of determination for the regression. As of Hive 2.2.0. Return: DOUBLE
Regr_slope(independent, dependent)	Returns the slope of the linear regression line Return: DOUBLE
Regr_sxx(independent, dependent)	Equivalent to regr_count(independent, dependent) * var_pop(dependent). Return: DOUBLE
Regr_sxy(independent, dependent)	Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). Return:
Regr_syy(independent, dependent)	Equivalent to regr_count(independent, dependent) * var_pop(independent). Return: DOUBLE
Histogram_Numeric(col, b)	Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. Return: Array<struct {`'x','y'`}>
Collect_set(col)	Returns a collection of elements in a group as a set by eliminating duplicate elements. Return: Array
Collect_list(col)	Returns a collection of elements in a group as a list including duplicate elements. Return: Array
Ntile(INTEGER x)	This assigns the bucket number to each row in a partition after partition into x groups. Return: INTEGER

Hive Aggregate Functions Examples

Below are most used Hive Aggregate functions examples.

Hive Select Count and Count Distinct


Syntax: count(*)
Syntax: count(expr)
Syntax: count(DISTINCT expr[, expr...])

Return: BIGINT

count(*) – Returns the count of all rows in a table including rows containing NULL values.

count(expr) – Returns the total number of rows for expression excluding null.

count(DISTINCT expr[, expr]) – Returns the count of distinct rows of expression (or expressions) excluding null values.

You can optimize the performance of this using hive.optimize.distinct.rewrite.

Example


jdbc:hive2://> select count(*) from employee;
+------+
| _c0  |
+------+
| 7    |
+------+
jdbc:hive2://> select count(salary) from employee;
+------+
| _c0  |
+------+
| 5    |
+------+
select count(distinct gender, salary) from employee;
+------+
| _c0  |
+------+
| 4    |
+------+

Hive Sum of a Column and sum of Distinct Columns


Syntax: sum(col)
Syntax: sum(DISTINCT col)

Return: DOUBLE

Returns the total sum of the elements in the group or the sum of the distinct values of the column in the group.


jdbc:hive2://> select sum(salary) from employee;
+---------+
|   _c0   |
+---------+
| 110000  |
+---------+
jdbc:hive2://> select sum(distinct salary) from employee;
+---------+
|   _c0   |
+---------+
| 100000  |
+---------+
jdbc:hive2://> select age,sum(salary) from employee group by age;
+-------+--------+
|  age  |  _c1   |
+-------+--------+
| NULL  | NULL   |
| 20    | 40000  |
| 30    | 20000  |
| 40    | 20000  |
| 41    | 30000  |
+-------+--------+

Hive Average (Avg) of a Column & Average of Distinct Column


Syntax: avg(col)
Syntax: avg(DISTINCT col)

Return: DOUBLE

Returns the average of the elements in the group or the average of the distinct values of the column in the group.

Example


jdbc:hive2://> select avg(salary) from employee group by age;
+-------------+
|     _c0     |
+-------------+
| 22000.0000  |
+-------------+
jdbc:hive2://> select avg(distinct salary) from employee group by age;
+-------------+
|     _c0     |
+-------------+
| 25000.0000  |
+-------------+
jdbc:hive2://> select age,avg(salary) from employee group by age;
+-------+-------------+
|  age  |     _c1     |
+-------+-------------+
| NULL  | NULL        |
| 20    | 40000.0000  |
| 30    | 10000.0000  |
| 40    | 20000.0000  |
| 41    | 30000.0000  |
+-------+-------------+

min(col) – Get Minimum value of a column

Returns the minimum of the column in the group.
Return: DOUBLE


jdbc:hive2://> select min(salary) from employee;
+--------+
|  _c0   |
+--------+
| 10000  |
+--------+

max(col) – Get Maximum value of a column

Returns the maximum value of the column in the group.
Return: DOUBLE


jdbc:hive2://> select max(salary) from employee;
+--------+
|  _c0   |
+--------+
| 40000  |
+--------+

collect_set(col) – Collapse the records by Group and Converts into an Array

Returns a set of objects with duplicate elements eliminated.
Return: Array


jdbc:hive2://> select gender, collect_set(age) from employee group by gender;
+---------+----------+
| gender  |   _c1    |
+---------+----------+
|         | []       |
| F       | [40,20]  |
| M       | [30,41]  |
+---------+----------+

collect_list(col) – Collapse the records by Group and Converts values into an Array

Returns a list of objects with duplicates. (As of Hive 0.13.0.)
Return: Array


jdbc:hive2://> select gender, collect_list(age) from employee group by gender;
+---------+----------------+
| gender  |      _c1       |
+---------+----------------+
|         | []             |
| F       | [40,20]        |
| M       | [30,41,30,30]  |
+---------+----------------+

variance(col), var_pop(col)

The variance() and var_pop() aggregation functions returns the statistical variance of column in a group.
Return: DOUBLE


jdbc:hive2://> select variance(salary) from employee;
jdbc:hive2://> select var_pop(salary) from employee;
+---------+
|   _c0   |
+---------+
| 1.36E8  |
+---------+

var_samp(col)

The var_samp() function returns the statistical variance of column in a group.
Return: DOUBLE


jdbc:hive2://> select var_samp(salary) from employee;
+--------+
|  _c0   |
+--------+
| 1.7E8  |
+--------+

stddev_pop(col) – Get the Standard Deviation of a column

The stddev_pop() aggregation function returns the statistical standard deviation of numeric column values provided in the group.
Return: DOUBLE


jdbc:hive2://> select stddev_pop(salary) from employee;
+---------------------+
|         _c0         |
+---------------------+
| 11661.903789690601  |
+---------------------+

stddev_samp(col) – Get the Standard Deviation of a column

The stddev_samp() aggregation function returns the statistical standard deviation of numeric column values provided in the group.
Return: DOUBLE


jdbc:hive2://> select stddev_samp(salary) from employee;
+---------------------+
|         _c0         |
+---------------------+
| 13038.404810405298  |
+---------------------+

Happy Learning !!