Hive Bucketing is a way to split the table into a managed number of clusters with or without partitions. With partitions, Hive divides(creates a directory) the table into smaller parts for every distinct value of a column whereas with bucketing you can specify the number of buckets to create at the time of creating a Hive table.
In my previous article, I have explained Hive Partitions with Examples, in this article let’s learn Hive Bucketing with Examples, the advantages of using bucketing, limitations, and how bucketing works.
What is Hive Bucketing
Hive Bucketing a.k.a (Clustering) is a technique to split the data into more manageable files, (By specifying the number of buckets to create). The value of the bucketing column will be hashed by a user-defined number into buckets.
Bucketing can be created on just one column, you can also create bucketing on a partitioned table to further split the data to improve the query performance of the partitioned table.
Each bucket is stored as a file within the table’s directory or the partitions directories on HDFS.
Records with the same value in a column will always be stored in the same bucket.
Hive bucketing commonly created in two scenarios.
- Create a bucket on top of the Partitioned table to further divide the table for better query performance.
- Create Bucketing on the table where you cannot choose the partition column due to (too many distinct values on columns).
In our example below, I will be explaining the first approach where I create Bucketing on top of the partitioned table.
Hive Bucketing Advantages
Before jumping into the Advantages of Hive bucketing, first let’s see the limitation of Partition, with the partition you cannot control the number of partitions as it creates a partition for every distinct value of the partitioned column; which ideally creates a subdirectory for each partition inside the table directory on HDFS.
If you have too many distinct values on the partitioned column, you will end up with too many directories on HDFS which leads to higher maintenance for Name Node.
Now let’s see the Advantages of Bucketing
- Hive Bucketing overcomes creating too many directories by specifying the number of buckets you wanted to create (you are in control).
- On a larger table, creating a bucketing gives you 2-3x better query performance than a non-bucket table.
Hive Create Bucketing Table
To create a Hive table with bucketing, use CLUSTERED BY
clause with the column name you wanted to bucket and the count of the buckets.
CREATE TABLE zipcodes(
RecordNumber int,
Country string,
City string,
Zipcode int)
PARTITIONED BY(state string)
CLUSTERED BY (Zipcode) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Load Data into Bucket
Loading/inserting data into the Bucketing table would be the same as inserting data into the table.
If you are using Hive < 2.x version, you need to set the hive.enforce.bucketing
property to true
. You don’t have to set this if you are using Hive 2.x or later.
#This property is not needed if you are using Hive 2.x or later
set hive.enforce.bucketing = true;
Now let’s load the csv file into the table, since our partitioned column state
is the last column on the input file we can directly load the data into the table without needing to create a temporary table.
Below examples loads the zipcodes from HDFS into Hive partitioned table where we have a bucketing on zipcode
column.
LOAD DATA INPATH '/data/zipcodes.csv' INTO TABLE zipcodes;
On below image, each file is a bucket.
On above image, each file is a bucket which contains records for that specific bucket.
Select Data From Bucket
Since our zipcode is partitioned on state and bucketing on zipcode, if you use these columns on where condition your query returns faster results.
0: jdbc:hive2://> SELECT * FROM zipcodes WHERE state='PR' and zipcode=704;
OK
+------------------------+-------------------+----------------------+-------------------+-----------------+s
| zipcodes.recordnumber | zipcodes.country | zipcodes.city | zipcodes.zipcode | zipcodes.state |
+------------------------+-------------------+----------------------+-------------------+-----------------+
| 3 | US | SECT LANAUSSE | 704 | PR |
| 2 | US | PASEO COSTA DEL SUR | 704 | PR |
| 4 | US | URB EUGENE RICE | 704 | PR |
| 1 | US | PARC PARQUE | 704 | PR |
+------------------------+-------------------+----------------------+-------------------+-----------------+
4 rows selected (0.381 seconds)
0: jdbc:hive2://>
How Hive Distribute the Rows Across the Buckets?
I got this from Hive wiki.
In general, the bucket number is determined by the expression
hash_function(bucketing_column) mod num_buckets
. (There’s a ‘0x7FFFFFFF in there too, but that’s not that important). The hash_function depends on the type of the bucketing column. For an int, it’s easy,hash_int(i) == i
. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id’s that end in 0 to be in bucket 1, all user_id’s that end in a 1 to be in bucket 2, etc. For other datatypes, it’s a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that’s derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id’s in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
How to Decide the Number of Buckets?
When you decide to create bucket on Hive table, million dollar question you should ask to yourself is how many buckets to create? so let’s see how to decide the number of buckets to create.
TO-DO
Conclusion
In summary Hive Bucketing is a performance improvement technique by dividing larger tables into smaller manageable parts by using the hashing technique. Bucketing can also be done on a partitioned table to further divide.
Related Articles
- Hive Partitioning vs Bucketing with Examples?
- Connect to Hive using JDBC connection
- Hive Cast Function to Convert Data Type
- Hive Conditional Functions with Examples
- How to replace NULL values with Default in Hive
- Hive Collection Functions with Examples
- Hive Aggregate Functions (UDAF) with Examples
- Hive Data Types – Primitive | Complex
- Hive Built-in String Functions with Examples
- Apache Hive Installation on Ubuntu
How to Decide the Number of Buckets?
Answer :
Lets take a scenario Where table size is: 2300 MB, HDFS Block Size: 128 MB
Now, Divide 2300/128=17.96
Now, remember number of bucket will always be in the power of 2.
So we need to find n such that 2^n > 17.96
n=5
So, I am going to use number of buckets as 2^5=32.