You are currently viewing Hive Aggregate Functions (UDAF) with Examples

Hive Aggregate Functions are the most used built-in functions that take a set of values and return a single value, when used with a group, it aggregates all values in each group and returns one value for each group.

Advertisements

Like in SQL, Aggregate Functions in Hive can be used with or without GROUP BY functions however these aggregation functions are mostly used with GROUP BY hence, here I will cover examples of how to use aggregation functions with and without applying groups.

Note: Most of these functions ignore NULL values.

Below are some of the examples we will see in details besides syntax, usage and return types.

Hive Aggregate Functions List

Hive Aggregate FunctionsSyntax & Description
COUNT()Returns the count of all rows in a table including rows containing NULL values
When you specify a column as an input, it ignores NULL values in the column for the count.
Also ignores duplicates by using DISTINCT.
Return: BIGINT
SUM()Returns the sum of all values in a column.
When used with a group it returns the sum for each group.
Also ignores duplicates by using DISTINCT.
Return: DOUBLE
AVG()Returns the average of all values in a column.
When used with a group it returns an average for each group.
Return: DOUBLE
MIN()Returns the minimum value of the column from all rows.
When used with a group it returns a minimum for each group.
Return: DOUBLE
MAX()Returns the maximum value of the column from all rows.
When used with a group it returns a maximum for each group.
Return: DOUBLE
Variance(col), var_pop(col)Returns the variance of a numeric column for all rows or for each group.
Return: DOUBLE
Var_samp(col)Returns the unbiased sample variance of a numeric column or for each group.
Return: DOUBLE
Stddev_pop(col)Returns the statistical standard deviation of all values in a column or for each group.
Return: DOUBLE
Stddev_samp(col)Returns the sample statistical standard deviation of all values in a column or for each group.
Return: DOUBLE
Covar_pop(col1, col2)Returns the population covariance of a pair of numeric columns for all rows or for each group.
Return: DOUBLE
Covar_samp(col1, col2)Returns the sample covariance of a pair of numeric for all rows or for each group.
Return: DOUBLE
Corr(col1, col2)Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.
Return: DOUBLE
Percentile(BIGINT col, p)For each group, it returns the exact percentile of a column.
p must be between 0 and 1.
Return: DOUBLE
Percentile(BIGINT col, array(p1 [, p2]…))Returns the exact percentiles p1, p2, … of a column in the group. pi must be between 0 and 1.
Return: array<double>
Regr_Avgx(independent, dependent)Equivalent to avg(dependent).
Return: DOUBLE
Regr_avgy(independent, dependent)Equivalent to avg(independent).
Return: DOUBLE
Regr_count(independent, dependent)Returns the number of non-null pairs used to fit the linear regression line.
Return: DOUBLE
Regr_intercept(independent, dependent)Returns the y-intercept of the linear regression line.
Return: DOUBLE
Regr_r2(independent, dependent)Returns the coefficient of determination for the regression. As of Hive 2.2.0.
Return: DOUBLE
Regr_slope(independent, dependent)Returns the slope of the linear regression line
Return: DOUBLE
Regr_sxx(independent, dependent)Equivalent to regr_count(independent, dependent) * var_pop(dependent).
Return: DOUBLE
Regr_sxy(independent, dependent)Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent).
Return:
Regr_syy(independent, dependent)Equivalent to regr_count(independent, dependent) * var_pop(independent).
Return: DOUBLE
Histogram_Numeric(col, b)Computes a histogram of a numeric column in the group using b non-uniformly spaced bins.
Return: Array<struct {'x','y'}>
Collect_set(col)Returns a collection of elements in a group as a set by eliminating duplicate elements.
Return: Array
Collect_list(col)Returns a collection of elements in a group as a list including duplicate elements.
Return: Array
Ntile(INTEGER x)This assigns the bucket number to each row in a partition after partition into x groups.
Return: INTEGER

Hive Aggregate Functions Examples

Below are most used Hive Aggregate functions examples.

Hive Select Count and Count Distinct


Syntax: count(*)
Syntax: count(expr)
Syntax: count(DISTINCT expr[, expr...])

Return: BIGINT

count(*) – Returns the count of all rows in a table including rows containing NULL values.

count(expr) – Returns the total number of rows for expression excluding null.

count(DISTINCT expr[, expr]) – Returns the count of distinct rows of expression (or expressions) excluding null values.

You can optimize the performance of this using hive.optimize.distinct.rewrite.

Example


jdbc:hive2://> select count(*) from employee;
+------+
| _c0  |
+------+
| 7    |
+------+
jdbc:hive2://> select count(salary) from employee;
+------+
| _c0  |
+------+
| 5    |
+------+
select count(distinct gender, salary) from employee;
+------+
| _c0  |
+------+
| 4    |
+------+

Hive Sum of a Column and sum of Distinct Columns


Syntax: sum(col)
Syntax: sum(DISTINCT col)

Return: DOUBLE

Returns the total sum of the elements in the group or the sum of the distinct values of the column in the group.


jdbc:hive2://> select sum(salary) from employee;
+---------+
|   _c0   |
+---------+
| 110000  |
+---------+
jdbc:hive2://> select sum(distinct salary) from employee;
+---------+
|   _c0   |
+---------+
| 100000  |
+---------+
jdbc:hive2://> select age,sum(salary) from employee group by age;
+-------+--------+
|  age  |  _c1   |
+-------+--------+
| NULL  | NULL   |
| 20    | 40000  |
| 30    | 20000  |
| 40    | 20000  |
| 41    | 30000  |
+-------+--------+

Hive Average (Avg) of a Column & Average of Distinct Column


Syntax: avg(col)
Syntax: avg(DISTINCT col)

Return: DOUBLE

Returns the average of the elements in the group or the average of the distinct values of the column in the group.

Example


jdbc:hive2://> select avg(salary) from employee group by age;
+-------------+
|     _c0     |
+-------------+
| 22000.0000  |
+-------------+
jdbc:hive2://> select avg(distinct salary) from employee group by age;
+-------------+
|     _c0     |
+-------------+
| 25000.0000  |
+-------------+
jdbc:hive2://> select age,avg(salary) from employee group by age;
+-------+-------------+
|  age  |     _c1     |
+-------+-------------+
| NULL  | NULL        |
| 20    | 40000.0000  |
| 30    | 10000.0000  |
| 40    | 20000.0000  |
| 41    | 30000.0000  |
+-------+-------------+

min(col) – Get Minimum value of a column

Returns the minimum of the column in the group.
Return: DOUBLE


jdbc:hive2://> select min(salary) from employee;
+--------+
|  _c0   |
+--------+
| 10000  |
+--------+

max(col) – Get Maximum value of a column

Returns the maximum value of the column in the group.
Return: DOUBLE


jdbc:hive2://> select max(salary) from employee;
+--------+
|  _c0   |
+--------+
| 40000  |
+--------+

collect_set(col) – Collapse the records by Group and Converts into an Array

Returns a set of objects with duplicate elements eliminated.
Return: Array


jdbc:hive2://> select gender, collect_set(age) from employee group by gender;
+---------+----------+
| gender  |   _c1    |
+---------+----------+
|         | []       |
| F       | [40,20]  |
| M       | [30,41]  |
+---------+----------+

collect_list(col) – Collapse the records by Group and Converts values into an Array

Returns a list of objects with duplicates. (As of Hive 0.13.0.)
Return: Array


jdbc:hive2://> select gender, collect_list(age) from employee group by gender;
+---------+----------------+
| gender  |      _c1       |
+---------+----------------+
|         | []             |
| F       | [40,20]        |
| M       | [30,41,30,30]  |
+---------+----------------+

variance(col), var_pop(col)

The variance() and var_pop() aggregation functions returns the statistical variance of column in a group.
Return: DOUBLE


jdbc:hive2://> select variance(salary) from employee;
jdbc:hive2://> select var_pop(salary) from employee;
+---------+
|   _c0   |
+---------+
| 1.36E8  |
+---------+

var_samp(col)

The var_samp() function returns the statistical variance of column in a group.
Return: DOUBLE


jdbc:hive2://> select var_samp(salary) from employee;
+--------+
|  _c0   |
+--------+
| 1.7E8  |
+--------+

stddev_pop(col) – Get the Standard Deviation of a column

The stddev_pop() aggregation function returns the statistical standard deviation of numeric column values provided in the group.
Return: DOUBLE


jdbc:hive2://> select stddev_pop(salary) from employee;
+---------------------+
|         _c0         |
+---------------------+
| 11661.903789690601  |
+---------------------+

stddev_samp(col) – Get the Standard Deviation of a column

The stddev_samp() aggregation function returns the statistical standard deviation of numeric column values provided in the group.
Return: DOUBLE


jdbc:hive2://> select stddev_samp(salary) from employee;
+---------------------+
|         _c0         |
+---------------------+
| 13038.404810405298  |
+---------------------+

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply