You are currently viewing Hive Collection Functions with Examples

Hive comes with a set of collection functions to work with Map and Array data types. These functions are used to find the size of the array, map types, get all map keys, values, sort array, and finding if an element exists in an array.

Advertisements

Related:

Hive Collection Functions List

Collection FunctionsUsage & Description
size(Map<K.V>)Returns length of the map type.
Return: int
size(Array<T>)Returns length of the array type .columnReturn: int
map_keys(Map<K.V>)Returns all keys from the map.
Return: array<K>
map_values(Map<K.V>)Returns all values from the map.
Return: array<V>
array_contains(Array<T>, value)Returns TRUE if the array contains a value.
Return: boolean
sort_array(Array<T>)Sorts the input array in ascending order.
Return: array<t>
Hive Collection Functions

Hive Collection Functions Examples

Before we jump into Hive collection functions examples, let’s create a Hive table with Array and Map types.


CREATE TABLE IF NOT EXISTS collection_example (
 id int,
 languages list,
 properties map
)
COMMENT 'This is Hive collection Example'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';

On this table, both array and map collections values are separated by pipe (|) <em>COLLECTION ITEMS TERMINATED BY '|'</em> and map key and values are separated by colon (:) MAP KEYS TERMINATED BY ':'

Now, create a collection_types.csv file with below contents. To make it simple I will use local file system instead of HDFS.


1,Java|C#|Python,age:21|gender:M
2,PHP|Java|Scala,age:22|gender:F
3,PHP,age:40|gender:F
4,Java|Go|SAS,age:34|gender:M
5,,age:50

Load this file into our table using below command.


jdbc:hive2://> LOAD DATA LOCAL INPATH '/tmp/collection_types.csv' 
 . . . . . . > INTO TABLE collection_example;
jdbc:hive2://>SELECT * FROM collection_example; // Returns below data
Hive collection Functions
Hive Collection Types Example

size(Map<K.V>)

Hive size() function is used to get the length of the map type column. This returns the number of elements in the map type.

Return: int


jdbc:hive2://> SELECT properties,size(properties) as size 
 . . . . . . > FROM collection_example;
OK
+----------------------------+------+
|         properties         | size |
+----------------------------+------+
| {"age":"21","gender":"M"}  | 2    |
| {"age":"22","gender":"F"}  | 2    |
| {"age":"40","gender":"F"}  | 2    |
| {"age":"34","gender":"M"}  | 2    |
| {"age":"50"}               | 1    |
+----------------------------+------+
5 rows selected (0.238 seconds)
0: jdbc:hive2://>

All of the rows on table contains 2 map elements and last record has just one.

size(Array<T>)

Hive size() has another alternative that takes an array type column as an argument and returns the number of elements(length) in the array type.

Return: int


jdbc:hive2://> SELECT languages,size(languages) as size 
 . . . . . . > FROM collection_example;
OK
+-------------------------+------+
|        languages        | size |
+-------------------------+------+
| ["Java","C#","Python"]  | 3    |
| ["PHP","Java","Scala"]  | 3    |
| ["PHP"]                 | 1    |
| ["Java","Go","SAS"]     | 3    |
| []                      | 0    |
+-------------------------+------+
5 rows selected (0.258 seconds)

map_keys(Map<K.V>)

map_keys() function returns an array containing all keys of the map column.
Return: array<K>


jdbc:hive2://> SELECT properties,map_keys(properties) as map_keys 
 . . . . . . > FROM collection_example;
OK
+----------------------------+-------------------+
|         properties         |        map_keys   |
+----------------------------+-------------------+
| {"age":"21","gender":"M"}  | ["age","gender"]  |
| {"age":"22","gender":"F"}  | ["age","gender"]  |
| {"age":"40","gender":"F"}  | ["age","gender"]  |
| {"age":"34","gender":"M"}  | ["age","gender"]  |
| {"age":"50"}               | ["age"]           |
+----------------------------+-------------------+
5 rows selected (0.273 seconds)
0: jdbc:hive2://>

map_values(Map<K.V>)

map_values() function returns an array containing all values of the map column.

Return: array<V>


jdbc:hive2://> SELECT properties,map_values(properties) as map_values 
 . . . . . . > FROM collection_example;
OK
+----------------------------+-------------+
|         properties         | map_values  |
+----------------------------+-------------+
| {"age":"21","gender":"M"}  | ["21","M"]  |
| {"age":"22","gender":"F"}  | ["22","F"]  |
| {"age":"40","gender":"F"}  | ["40","F"]  |
| {"age":"34","gender":"M"}  | ["34","M"]  |
| {"age":"50"}               | ["50"]      |
+----------------------------+-------------+
5 rows selected (0.193 seconds)
0: jdbc:hive2://>

array_contains(Array<T>, value)

array_contains() function returns TRUE if the array column contains a value.

Return: boolean


jdbc:hive2://> SELECT languages,array_contains(languages,'Java') 
 . . . . . . > FROM collection_example;
OK
+-------------------------+--------+
|        languages        |  _c1   |
+-------------------------+--------+
| ["Java","C#","Python"]  | true   |
| ["PHP","Java","Scala"]  | true   |
| ["PHP"]                 | false  |
| ["Java","Go","SAS"]     | true   |
| []                      | false  |
+-------------------------+--------+
5 rows selected (0.359 seconds)
0: jdbc:hive2://>

sort_array(Array<T>)

sort_array() function sorts the input array column in ascending order.

Return: array<t>


jdbc:hive2://> SELECT sort_array(languages) 
 . . . . . . > FROM collection_example;
OK
+-------------------------+
|           _c0           |
+-------------------------+
| ["C#","Java","Python"]  |
| ["Java","PHP","Scala"]  |
| ["PHP"]                 |
| ["Go","Java","SAS"]     |
| []                      |
+-------------------------+
5 rows selected (0.232 seconds)
0: jdbc:hive2://>

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply