Hive Aggregate Functions are the most used built-in functions that take a set of values and return a single value, when used with a group, it aggregates all values in each group and returns one value for each group.
Like in SQL, Aggregate Functions in Hive can be used with or without GROUP BY functions however these aggregation functions are mostly used with GROUP BY hence, here I will cover examples of how to use aggregation functions with and without applying groups.
Note: Most of these functions ignore NULL values.
Below are some of the examples we will see in details besides syntax, usage and return types.
- Hive Select Count and Count Distinct
- Hive Sum of a Column and sum of Distinct column
- Get a Distinct column of Average in Hive
- Get Minimum value of a column
- Get Maximum value of a column
Hive Aggregate Functions List
Hive Aggregate Functions | Syntax & Description |
---|---|
COUNT() | Returns the count of all rows in a table including rows containing NULL values When you specify a column as an input, it ignores NULL values in the column for the count. Also ignores duplicates by using DISTINCT. Return: BIGINT |
SUM() | Returns the sum of all values in a column. When used with a group it returns the sum for each group. Also ignores duplicates by using DISTINCT. Return: DOUBLE |
AVG() | Returns the average of all values in a column. When used with a group it returns an average for each group. Return: DOUBLE |
MIN() | Returns the minimum value of the column from all rows. When used with a group it returns a minimum for each group. Return: DOUBLE |
MAX() | Returns the maximum value of the column from all rows. When used with a group it returns a maximum for each group. Return: DOUBLE |
Variance(col), var_pop(col) | Returns the variance of a numeric column for all rows or for each group. Return: DOUBLE |
Var_samp(col) | Returns the unbiased sample variance of a numeric column or for each group. Return: DOUBLE |
Stddev_pop(col) | Returns the statistical standard deviation of all values in a column or for each group. Return: DOUBLE |
Stddev_samp(col) | Returns the sample statistical standard deviation of all values in a column or for each group. Return: DOUBLE |
Covar_pop(col1, col2) | Returns the population covariance of a pair of numeric columns for all rows or for each group. Return: DOUBLE |
Covar_samp(col1, col2) | Returns the sample covariance of a pair of numeric for all rows or for each group. Return: DOUBLE |
Corr(col1, col2) | Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group. Return: DOUBLE |
Percentile(BIGINT col, p) | For each group, it returns the exact percentile of a column. p must be between 0 and 1. Return: DOUBLE |
Percentile(BIGINT col, array(p1 [, p2]…)) | Returns the exact percentiles p1, p2, … of a column in the group. pi must be between 0 and 1. Return: array<double> |
Regr_Avgx(independent, dependent) | Equivalent to avg(dependent). Return: DOUBLE |
Regr_avgy(independent, dependent) | Equivalent to avg(independent). Return: DOUBLE |
Regr_count(independent, dependent) | Returns the number of non-null pairs used to fit the linear regression line. Return: DOUBLE |
Regr_intercept(independent, dependent) | Returns the y-intercept of the linear regression line. Return: DOUBLE |
Regr_r2(independent, dependent) | Returns the coefficient of determination for the regression. As of Hive 2.2.0. Return: DOUBLE |
Regr_slope(independent, dependent) | Returns the slope of the linear regression line Return: DOUBLE |
Regr_sxx(independent, dependent) | Equivalent to regr_count(independent, dependent) * var_pop(dependent). Return: DOUBLE |
Regr_sxy(independent, dependent) | Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). Return: |
Regr_syy(independent, dependent) | Equivalent to regr_count(independent, dependent) * var_pop(independent). Return: DOUBLE |
Histogram_Numeric(col, b) | Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. Return: Array<struct { 'x','y' }> |
Collect_set(col) | Returns a collection of elements in a group as a set by eliminating duplicate elements. Return: Array |
Collect_list(col) | Returns a collection of elements in a group as a list including duplicate elements. Return: Array |
Ntile(INTEGER x) | This assigns the bucket number to each row in a partition after partition into x groups. Return: INTEGER |
Hive Aggregate Functions Examples
Below are most used Hive Aggregate functions examples.
Hive Select Count and Count Distinct
Syntax: count(*)
Syntax: count(expr)
Syntax: count(DISTINCT expr[, expr...])
Return: BIGINT
count(*)
– Returns the count of all rows in a table including rows containing NULL values.
count(expr)
– Returns the total number of rows for expression excluding null.
count(DISTINCT expr[, expr])
– Returns the count of distinct rows of expression (or expressions) excluding null values.
You can optimize the performance of this using hive.optimize.distinct.rewrite
.
Example
jdbc:hive2://> select count(*) from employee;
+------+
| _c0 |
+------+
| 7 |
+------+
jdbc:hive2://> select count(salary) from employee;
+------+
| _c0 |
+------+
| 5 |
+------+
select count(distinct gender, salary) from employee;
+------+
| _c0 |
+------+
| 4 |
+------+
Hive Sum of a Column and sum of Distinct Columns
Syntax: sum(col)
Syntax: sum(DISTINCT col)
Return: DOUBLE
Returns the total sum of the elements in the group or the sum of the distinct values of the column in the group.
jdbc:hive2://> select sum(salary) from employee;
+---------+
| _c0 |
+---------+
| 110000 |
+---------+
jdbc:hive2://> select sum(distinct salary) from employee;
+---------+
| _c0 |
+---------+
| 100000 |
+---------+
jdbc:hive2://> select age,sum(salary) from employee group by age;
+-------+--------+
| age | _c1 |
+-------+--------+
| NULL | NULL |
| 20 | 40000 |
| 30 | 20000 |
| 40 | 20000 |
| 41 | 30000 |
+-------+--------+
Hive Average (Avg) of a Column & Average of Distinct Column
Syntax: avg(col)
Syntax: avg(DISTINCT col)
Return: DOUBLE
Returns the average of the elements in the group or the average of the distinct values of the column in the group.
Example
jdbc:hive2://> select avg(salary) from employee group by age;
+-------------+
| _c0 |
+-------------+
| 22000.0000 |
+-------------+
jdbc:hive2://> select avg(distinct salary) from employee group by age;
+-------------+
| _c0 |
+-------------+
| 25000.0000 |
+-------------+
jdbc:hive2://> select age,avg(salary) from employee group by age;
+-------+-------------+
| age | _c1 |
+-------+-------------+
| NULL | NULL |
| 20 | 40000.0000 |
| 30 | 10000.0000 |
| 40 | 20000.0000 |
| 41 | 30000.0000 |
+-------+-------------+
min(col) – Get Minimum value of a column
Returns the minimum of the column in the group.
Return: DOUBLE
jdbc:hive2://> select min(salary) from employee;
+--------+
| _c0 |
+--------+
| 10000 |
+--------+
max(col) – Get Maximum value of a column
Returns the maximum value of the column in the group.
Return: DOUBLE
jdbc:hive2://> select max(salary) from employee;
+--------+
| _c0 |
+--------+
| 40000 |
+--------+
collect_set(col) – Collapse the records by Group and Converts into an Array
Returns a set of objects with duplicate elements eliminated.
Return: Array
jdbc:hive2://> select gender, collect_set(age) from employee group by gender;
+---------+----------+
| gender | _c1 |
+---------+----------+
| | [] |
| F | [40,20] |
| M | [30,41] |
+---------+----------+
collect_list(col) – Collapse the records by Group and Converts values into an Array
Returns a list of objects with duplicates. (As of Hive 0.13.0.)
Return: Array
jdbc:hive2://> select gender, collect_list(age) from employee group by gender;
+---------+----------------+
| gender | _c1 |
+---------+----------------+
| | [] |
| F | [40,20] |
| M | [30,41,30,30] |
+---------+----------------+
variance(col), var_pop(col)
The variance()
and var_pop()
aggregation functions returns the statistical variance of column in a group.
Return: DOUBLE
jdbc:hive2://> select variance(salary) from employee;
jdbc:hive2://> select var_pop(salary) from employee;
+---------+
| _c0 |
+---------+
| 1.36E8 |
+---------+
var_samp(col)
The var_samp()
function returns the statistical variance of column in a group.
Return: DOUBLE
jdbc:hive2://> select var_samp(salary) from employee;
+--------+
| _c0 |
+--------+
| 1.7E8 |
+--------+
stddev_pop(col) – Get the Standard Deviation of a column
The stddev_pop()
aggregation function returns the statistical standard deviation of numeric column values provided in the group.
Return: DOUBLE
jdbc:hive2://> select stddev_pop(salary) from employee;
+---------------------+
| _c0 |
+---------------------+
| 11661.903789690601 |
+---------------------+
stddev_samp(col) – Get the Standard Deviation of a column
The stddev_samp()
aggregation function returns the statistical standard deviation of numeric column values provided in the group.
Return: DOUBLE
jdbc:hive2://> select stddev_samp(salary) from employee;
+---------------------+
| _c0 |
+---------------------+
| 13038.404810405298 |
+---------------------+
Happy Learning !!