Hive Partitions Explained with Examples

Hive partitions are used to split a larger table into several smaller parts based on one or more columns (the partition key, for example, date or state). A Hive partition is similar to the table partitioning available in SQL Server or any other RDBMS.

In this article, you will learn what Hive partitions are, why we need them, their advantages, and finally how to create a partition table and perform partition operations such as add, rename, update, and delete.

What are Hive Partitions?

A Hive table partition is a way to split a large table into smaller logical tables based on one or more partition keys. These smaller logical tables are not visible to users, who still access the data from just one table.

Partitioning eliminates the need to create many smaller tables and to access and manage them separately.

When you load data into a partition table, Hive internally splits the records based on the partition key and stores each partition's data in a sub-directory of the table's directory on HDFS. Each directory is named after the partition key and its value.
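For example, with a table partitioned on a state column, the layout on HDFS would look roughly like this (the warehouse path shown is the common default and may differ in your installation):

```text
/user/hive/warehouse/zipcodes/
├── state=AL/
├── state=AZ/
└── state=FL/
```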

Also, note that while loading data into a partition table, Hive removes the partition key column from the actual file stored on HDFS, since it is redundant information that can be derived from the partition folder name. We will see this with examples in the next sections.

Partition Table Advantages

As you are aware, Hive is a data warehouse database for Hadoop, used to store large datasets. Below are some of the advantages of using Hive partition tables.

  • Faster access to the data
  • The ability to perform operations on a smaller dataset
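For example, a query that filters on the partition key reads only the matching sub-directory instead of scanning the whole table (a sketch using the zipcodes table created in the next section):

```sql
-- Hive prunes all partitions except state=AZ, so only the
-- files under .../zipcodes/state=AZ are read.
SELECT * FROM zipcodes WHERE state = 'AZ';
```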

Create Hive Partition Table

To create a Hive table with partitions, you need to use the PARTITIONED BY clause along with the column you want to partition on and its type. Let's create a table and load the CSV file.

The data file that I am using to explain partitions can be downloaded from GitHub. It is a simplified zip codes dataset with RecordNumber, Country, City, Zipcode, and State columns. I will be using State as the partition column.

Start your Hive beeline or Hive terminal and create the managed table as below.


CREATE TABLE zipcodes(
    RecordNumber int,
    Country string,
    City string,
    Zipcode int)
PARTITIONED BY(state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Load Data into Partition Table

Download zipcodes.csv from GitHub, upload it to HDFS, and finally load the CSV file into the partition table.


hdfs dfs -put zipcodes.csv /data/
jdbc:hive2://127.0.0.1:10000> LOAD DATA INPATH '/data/zipcodes.csv' INTO TABLE zipcodes;
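Depending on your Hive version and configuration, you may instead need to name the target partition explicitly, or populate partitions dynamically from a staging table. A sketch of both approaches (the file zipcodes_al.csv and the unpartitioned staging table zipcodes_raw are hypothetical examples):

```sql
-- Static partitioning: load a file into one named partition
LOAD DATA INPATH '/data/zipcodes_al.csv'
INTO TABLE zipcodes PARTITION (state = 'AL');

-- Dynamic partitioning: let Hive derive the partition value
-- from the last column of the SELECT list
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE zipcodes PARTITION (state)
SELECT RecordNumber, Country, City, Zipcode, State
FROM zipcodes_raw;
```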

Show All Partitions on Hive Table

After loading the data into the Hive partition table, you can use the SHOW PARTITIONS command to see all partitions that are present.


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AL   |
| state=AZ   |
| state=FL   |
| state=NC   |
| state=PR   |
| state=TX   |
+------------+
6 rows selected (0.074 seconds)

Alternatively, if you know the Hive store location on the HDFS for your table, you can run the HDFS command to check the partitions.

For each partition on the table, you will see a folder created with the partition column name and the partition value.
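For a managed table, this would look something like the following (assuming the default warehouse location /user/hive/warehouse; yours may differ):

```shell
# List the partition directories of the zipcodes table
hdfs dfs -ls /user/hive/warehouse/zipcodes
```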

(Image: Hive partition files on HDFS)

Add New Partition to the Hive Table

A new partition can be added to the table using the ALTER TABLE statement; you can also specify the location where you want to store the partition data on HDFS.


jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes ADD PARTITION (state='CA') LOCATION '/user/data/zipcodes_ca';

SHOW PARTITIONS returns the added partition.


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AL   |
| state=AZ   |
| state=CA   |
| state=FL   |
| state=NC   |
| state=PR   |
| state=TX   |
+------------+
7 rows selected (0.081 seconds)

Rename or Update Hive Partition

Using ALTER TABLE, you can also rename or update a specific partition.


jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes PARTITION (state='AL') RENAME TO PARTITION (state='NY');
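Besides renaming, you can also update a partition to read its data from a different directory using SET LOCATION (the path below is a hypothetical example):

```sql
-- Repoint the CA partition at a different HDFS directory
ALTER TABLE zipcodes PARTITION (state = 'CA')
SET LOCATION '/user/data/zipcodes_ca_new';
```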

Manually Renaming Partitions on HDFS

Alternatively, you can also rename the partition directory on the HDFS.


hdfs dfs -mv /user/hive/warehouse/zipcodes/state=NY /user/hive/warehouse/zipcodes/state=AL

When you manually modify partitions directly on HDFS, you need to run MSCK REPAIR TABLE to update the Hive Metastore. Not doing so leads to inconsistent results:

  • SELECT doesn’t show the renamed partition
  • SHOW PARTITIONS still shows the older partition

jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AZ   |
| state=CA   |
| state=FL   |
| state=NC   |
| state=NY   |
| state=PR   |
| state=TX   |
+------------+

jdbc:hive2://127.0.0.1:10000> MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;

The above command synchronizes the zipcodes table with the Hive Metastore. Now running the SHOW PARTITIONS command shows the state=AL partition.

Drop Hive Partition

Dropping a partition can also be performed using the ALTER TABLE tablename DROP PARTITION statement.


jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes DROP IF EXISTS PARTITION (state='AL');

Not using IF EXISTS results in an error when the specified partition does not exist.
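You can also drop several partitions in a single statement by listing multiple PARTITION specs, for example:

```sql
-- Drop two partitions with one ALTER TABLE statement
ALTER TABLE zipcodes DROP IF EXISTS
  PARTITION (state = 'AZ'),
  PARTITION (state = 'FL');
```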

Manually Dropping Partitions on HDFS

You can also delete a partition directly from HDFS using the command below.


user@namenode:~/hive$ hdfs dfs -rm -R /user/hive/warehouse/zipcodes/state=AL
Deleted /user/hive/warehouse/zipcodes/state=AL

Running a SELECT command on the table doesn't show records from the removed partition; however, SHOW PARTITIONS still shows the deleted partition.


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
|             partition             |
+-----------------------------------+
| state=AL                          |
| state=AZ                          |
| state=FL                          |
| state=NC                          |
| state=PR                          |
| state=TX                          |
+-----------------------------------+
6 rows selected (0.096 seconds)

In order to fix this, you need to run MSCK REPAIR TABLE as shown below.


jdbc:hive2://127.0.0.1:10000> MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
|             partition             |
+-----------------------------------+
| state=AZ                          |
| state=FL                          |
| state=NC                          |
| state=PR                          |
| state=TX                          |
+-----------------------------------+
5 rows selected (0.096 seconds)

Other Useful Partition Commands

Below are some of the additional partition commands that are useful.

How to Filter Partitions?

If you have hundreds of partitions, you can check whether a specific partition exists on the table using the SHOW PARTITIONS tablename PARTITION clause.


SHOW PARTITIONS zipcodes PARTITION(state='NC');
+------------+
| partition  |
+------------+
| state=NC   |
+------------+
1 row selected (0.182 seconds)

How to Know Specific Partition Location on HDFS?

Either of the below statements is used to know the HDFS location of each partition.


DESCRIBE FORMATTED zipcodes PARTITION(state='PR');
SHOW TABLE EXTENDED LIKE zipcodes PARTITION(state='PR');

Running SHOW TABLE EXTENDED on the table and partition results in the output below. The location attribute shows the location of the partition data on HDFS.

(Image: SHOW TABLE EXTENDED output showing the partition location)

Conclusion

In this article, you have learned that a Hive table partition splits a larger table into smaller logical tables based on one or more partition columns. You have also learned the following:

  • PARTITIONED BY is used to create a partition table
  • ALTER TABLE is used to add, rename, and drop partitions
  • SHOW PARTITIONS is used to list the partitions of a table
  • MSCK REPAIR is used to sync the Hive Metastore with the data on HDFS


Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
