Hive Bucketing Explained with Examples

Hive Bucketing is a way to split the table into a managed number of clusters with or without partitions. With partitions, Hive divides(creates a directory) the table into smaller parts for every distinct value of a column whereas with bucketing you can specify the number of buckets to create at the time of creating a Hive table.

What is Hive Bucketing

Hive Bucketing a.k.a (Clustering) is a technique to split the data into more manageable files, (By specifying the number of buckets to create). The value of the bucketing column will be hashed by a user-defined number into buckets.

Bucketing can be created on just one column, you can also create bucketing on a partitioned table to further split the data to improve the query performance of the partitioned table.

Each bucket is stored as a file within the table’s directory or the partitions directories on HDFS.

Records with the same value in a column will always be stored in the same bucket.

Hive bucketing commonly created in two scenarios.

Create a bucket on top of the Partitioned table to further divide the table for better query performance.
Create Bucketing on the table where you cannot choose the partition column due to (too many distinct values on columns).

In our example below, I will be explaining the first approach where I create Bucketing on top of the partitioned table.

Hive Bucketing Advantages

Before jumping into the Advantages of Hive bucketing, first let’s see the limitation of Partition, with the partition you cannot control the number of partitions as it creates a partition for every distinct value of the partitioned column; which ideally creates a subdirectory for each partition inside the table directory on HDFS.

If you have too many distinct values on the partitioned column, you will end up with too many directories on HDFS which leads to higher maintenance for Name Node.

Now let’s see the Advantages of Bucketing

Hive Bucketing overcomes creating too many directories by specifying the number of buckets you wanted to create (you are in control).
On a larger table, creating a bucketing gives you 2-3x better query performance than a non-bucket table.

Hive Create Bucketing Table

To create a Hive table with bucketing, use CLUSTERED BY clause with the column name you wanted to bucket and the count of the buckets.


CREATE TABLE zipcodes(
RecordNumber int,
Country string,
City string,
Zipcode int)
PARTITIONED BY(state string)
CLUSTERED BY (Zipcode) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Load Data into Bucket

Loading/inserting data into the Bucketing table would be the same as inserting data into the table.

If you are using Hive < 2.x version, you need to set the hive.enforce.bucketing property to true. You don’t have to set this if you are using Hive 2.x or later.


#This property is not needed if you are using Hive 2.x or later
set hive.enforce.bucketing = true;

Now let’s load the csv file into the table, since our partitioned column state is the last column on the input file we can directly load the data into the table without needing to create a temporary table.

Below examples loads the zipcodes from HDFS into Hive partitioned table where we have a bucketing on zipcode column.


LOAD DATA INPATH '/data/zipcodes.csv' INTO TABLE zipcodes;

On below image, each file is a bucket.

On above image, each file is a bucket which contains records for that specific bucket.

Select Data From Bucket

Since our zipcode is partitioned on state and bucketing on zipcode, if you use these columns on where condition your query returns faster results.


0: jdbc:hive2://> SELECT * FROM zipcodes WHERE state='PR' and zipcode=704;
OK
+------------------------+-------------------+----------------------+-------------------+-----------------+s
| zipcodes.recordnumber  | zipcodes.country  |    zipcodes.city     | zipcodes.zipcode  | zipcodes.state  |
+------------------------+-------------------+----------------------+-------------------+-----------------+
| 3                      | US                | SECT LANAUSSE        | 704               | PR              |
| 2                      | US                | PASEO COSTA DEL SUR  | 704               | PR              |
| 4                      | US                | URB EUGENE RICE      | 704               | PR              |
| 1                      | US                | PARC PARQUE          | 704               | PR              |
+------------------------+-------------------+----------------------+-------------------+-----------------+
4 rows selected (0.381 seconds)
0: jdbc:hive2://>

How Hive Distribute the Rows Across the Buckets?

I got this from Hive wiki.

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There’s a ‘0x7FFFFFFF in there too, but that’s not that important). The hash_function depends on the type of the bucketing column. For an int, it’s easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id’s that end in 0 to be in bucket 1, all user_id’s that end in a 1 to be in bucket 2, etc. For other datatypes, it’s a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that’s derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id’s in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

How to Decide the Number of Buckets?

When you decide to create bucket on Hive table, million dollar question you should ask to yourself is how many buckets to create? so let’s see how to decide the number of buckets to create.

TO-DO

Conclusion

In summary Hive Bucketing is a performance improvement technique by dividing larger tables into smaller manageable parts by using the hashing technique. Bucketing can also be done on a partitioned table to further divide.

References

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

This Post Has One Comment

Saurav Kumar December 30, 2020

How to Decide the Number of Buckets?

Answer :

Lets take a scenario Where table size is: 2300 MB, HDFS Block Size: 128 MB

Now, Divide 2300/128=17.96

Now, remember number of bucket will always be in the power of 2.

So we need to find n such that 2^n > 17.96

n=5

So, I am going to use number of buckets as 2^5=32.