Hive Partitions Explained with Examples

Hive partitions are used to split a larger table into several smaller parts based on one or more columns (the partition key, for example, date or state). A Hive partition is similar to the table partitioning available in SQL Server or any other RDBMS.

In this article, you will learn what Hive partitions are, why we need them, their advantages, and finally how to create a partition table and perform partition operations such as add, rename, update, and delete.

What are Hive Partitions?

A Hive table partition is a way to split a large table into smaller logical tables based on one or more partition keys. These smaller logical tables are not visible to users, who still access the data from just one table.

Partitioning eliminates the need to create many smaller tables and to access and manage them separately.

When you load data into a partition table, Hive internally splits the records based on the partition key and stores each partition's data in a sub-directory of the table's directory on HDFS. Each directory is named after the partition key and its value.
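For example, with a table partitioned on a state column, the layout on HDFS would look roughly like this (the warehouse path shown is the common default and may differ in your installation):

```text
/user/hive/warehouse/zipcodes/
├── state=AL/
├── state=AZ/
└── state=FL/
```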

Also, note that while loading data into a partition table, Hive removes the partition key column from the actual file stored on HDFS, since it is redundant information that can be derived from the partition folder name. We will see this with examples in the next sections.

Partition Table Advantages

As you are aware, Hive is a data warehouse database for Hadoop, used to store large datasets. Below are some of the advantages of using Hive partition tables.

  • Faster access to the data
  • The ability to perform operations on a smaller dataset
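For example, a query that filters on the partition key reads only the matching sub-directory instead of scanning the whole table (a sketch using the zipcodes table created in the next section):

```sql
-- Hive prunes all partitions except state=AZ, so only the
-- files under .../zipcodes/state=AZ are read.
SELECT * FROM zipcodes WHERE state = 'AZ';
```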

Create Hive Partition Table

To create a Hive table with partitions, you need to use the PARTITIONED BY clause along with the column you want to partition on and its type. Let's create a table and load the CSV file.

The data file that I am using to explain partitions can be downloaded from GitHub. It is a simplified zip codes dataset with RecordNumber, Country, City, Zipcode, and State columns. I will be using State as the partition column.

Start your Hive beeline or Hive terminal and create the managed table as below.


CREATE TABLE zipcodes(
    RecordNumber int,
    Country string,
    City string,
    Zipcode int)
PARTITIONED BY(state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Load Data into Partition Table

Download zipcodes.csv from GitHub, upload it to HDFS, and finally load the CSV file into the partition table.


hdfs dfs -put zipcodes.csv /data/
jdbc:hive2://127.0.0.1:10000> LOAD DATA INPATH '/data/zipcodes.csv' INTO TABLE zipcodes;
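Depending on your Hive version and configuration, you may instead need to name the target partition explicitly, or populate partitions dynamically from a staging table. A sketch of both approaches (the file zipcodes_al.csv and the unpartitioned staging table zipcodes_raw are hypothetical examples):

```sql
-- Static partitioning: load a file into one named partition
LOAD DATA INPATH '/data/zipcodes_al.csv'
INTO TABLE zipcodes PARTITION (state = 'AL');

-- Dynamic partitioning: let Hive derive the partition value
-- from the last column of the SELECT list
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE zipcodes PARTITION (state)
SELECT RecordNumber, Country, City, Zipcode, State
FROM zipcodes_raw;
```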

Show All Partitions on Hive Table

After loading the data into the Hive partition table, you can use the SHOW PARTITIONS command to see all partitions that are present.


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AL   |
| state=AZ   |
| state=FL   |
| state=NC   |
| state=PR   |
| state=TX   |
+------------+
6 rows selected (0.074 seconds)

Alternatively, if you know the Hive store location on the HDFS for your table, you can run the HDFS command to check the partitions.

For each partition on the table, you will see a folder created with the partition column name and the partition value.
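For a managed table, this would look something like the following (assuming the default warehouse location /user/hive/warehouse; yours may differ):

```shell
# List the partition directories of the zipcodes table
hdfs dfs -ls /user/hive/warehouse/zipcodes
```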

(Image: Hive partition files on HDFS)

Add New Partition to the Hive Table

A new partition can be added to the table using the ALTER TABLE statement; you can also specify the location where you want to store the partition data on HDFS.


jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes ADD PARTITION (state='CA') LOCATION '/user/data/zipcodes_ca';

SHOW PARTITIONS returns the added partition.


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AL   |
| state=AZ   |
| state=CA   |
| state=FL   |
| state=NC   |
| state=PR   |
| state=TX   |
+------------+
7 rows selected (0.081 seconds)

Rename or Update Hive Partition

Using ALTER TABLE, you can also rename or update a specific partition.


jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes PARTITION (state='AL') RENAME TO PARTITION (state='NY');
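Besides renaming, you can also update a partition to read its data from a different directory using SET LOCATION (the path below is a hypothetical example):

```sql
-- Repoint the CA partition at a different HDFS directory
ALTER TABLE zipcodes PARTITION (state = 'CA')
SET LOCATION '/user/data/zipcodes_ca_new';
```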

Manually Renaming Partitions on HDFS

Alternatively, you can also rename the partition directory on the HDFS.


hdfs dfs -mv /user/hive/warehouse/zipcodes/state=NY /user/hive/warehouse/zipcodes/state=AL

When you manually modify partitions directly on HDFS, you need to run MSCK REPAIR TABLE to update the Hive Metastore. Not doing so leads to inconsistent results:

  • SELECT doesn’t show the renamed partition
  • SHOW PARTITIONS still shows the older partition

jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AZ   |
| state=CA   |
| state=FL   |
| state=NC   |
| state=NY   |
| state=PR   |
| state=TX   |
+------------+

jdbc:hive2://127.0.0.1:10000> MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;

The above command synchronizes the zipcodes table with the Hive Metastore. Now running the SHOW PARTITIONS command shows the state=AL partition.

Drop Hive Partition

Dropping a partition can also be performed using the ALTER TABLE tablename DROP PARTITION statement.


jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes DROP IF EXISTS PARTITION (state='AL');

Not using IF EXISTS results in an error when the specified partition does not exist.
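You can also drop several partitions in a single statement by listing multiple PARTITION specs, for example:

```sql
-- Drop two partitions with one ALTER TABLE statement
ALTER TABLE zipcodes DROP IF EXISTS
  PARTITION (state = 'AZ'),
  PARTITION (state = 'FL');
```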

Manually Dropping Partitions on HDFS

You can also delete a partition directly from HDFS using the command below.


user@namenode:~/hive$ hdfs dfs -rm -R /user/hive/warehouse/zipcodes/state=AL
Deleted /user/hive/warehouse/zipcodes/state=AL

Running a SELECT command on the table doesn't show records from the removed partition; however, SHOW PARTITIONS still shows the deleted partition.


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
|             partition             |
+-----------------------------------+
| state=AL                          |
| state=AZ                          |
| state=FL                          |
| state=NC                          |
| state=PR                          |
| state=TX                          |
+-----------------------------------+
6 rows selected (0.096 seconds)

In order to fix this, you need to run MSCK REPAIR TABLE as shown below.


jdbc:hive2://127.0.0.1:10000> MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
|             partition             |
+-----------------------------------+
| state=AZ                          |
| state=FL                          |
| state=NC                          |
| state=PR                          |
| state=TX                          |
+-----------------------------------+
5 rows selected (0.096 seconds)

Other Useful Partition Commands

Below are some of the additional partition commands that are useful.

How to Filter Partitions?

If you have hundreds of partitions, you can check whether a specific partition exists on the table using the SHOW PARTITIONS tablename PARTITION clause.


SHOW PARTITIONS zipcodes PARTITION(state='NC');
+------------+
| partition  |
+------------+
| state=NC   |
+------------+
1 row selected (0.182 seconds)

How to Know Specific Partition Location on HDFS?

Either of the below statements is used to know the HDFS location of each partition.


DESCRIBE FORMATTED zipcodes PARTITION(state='PR');
SHOW TABLE EXTENDED LIKE zipcodes PARTITION(state='PR');

Running SHOW TABLE EXTENDED on the table and partition results in the output below. The location attribute shows the location of the partition data on HDFS.

(Image: SHOW TABLE EXTENDED output showing the partition location)

Conclusion

In this article, you have learned that a Hive table partition splits a larger table into smaller logical tables based on one or more partition columns. You have also learned the following:

  • PARTITIONED BY is used to create a partition table
  • ALTER TABLE is used to add, rename, and drop partitions
  • SHOW PARTITIONS is used to list the partitions of a table
  • MSCK REPAIR is used to sync the Hive Metastore with the data on HDFS


Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
