In this article, I will explain what is Hive Partitioning and Bucketing, the difference between Hive Partitioning vs Bucketing by exploring the advantages and disadvantages of each features with examples.
At a high level, Hive Partition is a way to split the large table into smaller tables based on the values of a column(one partition for each distinct values) whereas Bucket is a technique to divide the data in a manageable form (you can specify how many buckets you want).
There are advantages and disadvantages of Partition vs Bucket so you need to choose these based on your data size and the types of Hive queries you run, let’s see the differences in detail.
Hive Partitioning vs Bucketing
Both Partitioning and Bucketing in Hive are used to improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between Partitioning vs Bucketing lives in the way how they split the data.
Hive Partition is a way to organize large tables into smaller logical tables based on values of columns; one logical table (partition) for each distinct value. In Hive, tables are created as a directory on HDFS. A table can have one or more partitions that correspond to a sub-directory for each partition inside a table directory.
let’s assume you have a US census table which contains zip code, city, state and other columns. Creating a partition on state splits the table into around 50 partitions, when searching for a zipcode with in a state (state=’CA’ and zipCode =’92704′) results in faster as it need to scan only in a state=CA partition directory.
When creating partitions you have to be very cautious with the number of partitions it creates, as having too many partitions creates too many sub-directories in a table directory which bring unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.
Hive Bucketing a.k.a (Clustering) is a technique to split the data into more manageable files, (By specifying the number of buckets to create). The value of the bucketing column will be hashed by a user-defined number into buckets.
Bucketing can be created on just one column, you can also create bucketing on a partitioned table to further split the data which further improves the query performance of the partitioned table.
Each bucket is stored as a file within the table’s directory or the partitions directories. Note that partition creates a directory and you can have a partition on one or more columns; these are some of the differences between Hive partition and bucket.
From our example, we already have a partition on state which leads to around 50 subdirectories on a table directory, and creating a bucketing 10 on zipcode column creates 10 files for each partitioned subdirectory.
Some Points to Remember
- Creating a partition on zipcode of US population is not a good practice as it creates nearly 42,000 directories on HDFS (US has nearly 42,000 zip codes).
- Since Bucketing works on hashing, if the data is not equally distributed between hashes, it results in in-equal files and may get into performance issues.
Hive Partitioning vs Bucketing Example
Start your Hive Beeline or Hive terminal and rune the below statement to understand the syntax.
Hive Partitioning Example
Create the managed table as below with the partition column as state.
CREATE TABLE zipcodes( RecordNumber int, Country string, City string, Zipcode int) PARTITIONED BY(state string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
If you load the zipcodes into this table, you will see the below directories on HDFS.
Refer to Hive Partitions with Example to know how to load data into Partitioned table, show, update, and drop partitions.
Hive Bucketing Example
In the below example, we are creating a bucketing on zipcode column on top of partitioned by state.
CREATE TABLE zipcodes( RecordNumber int, Country string, City string, Zipcode int) PARTITIONED BY(state string) CLUSTERED BY Zipcode INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
You can also create a bucket on the table column without having partitioned first.
Differences Between Hive Partitioning vs Bucketing
Below are some of the differences between Partitioning vs bucketing
|Directory is created on HDFS for each partition.||File is created on HDFS for each bucket.|
|You can have one or more Partition columns||You can have only one Bucketing column|
|You can’t manage the number of partitions to create||You can manage the number of buckets to create by specifying the count|
|NA||Bucketing can be created on a partitioned table|
|Uses PARTITIONED BY||Uses CLUSTERED BY|
In this Hive Partitioning vs Bucketing article, you have learned how to improve the performance of the queries by doing Partition and Bucket on Hive tables. These two approaches split the table into defined partitions and/or buckets, which distributes the data into smaller and more manageable parts. This eliminates table scans when you performing queries on partition and bucket columns.