Hive partitions are used to split the larger table into several smaller parts based on one or multiple columns (partition key, for example, date, state e.t.c). The hive partition is similar to table partitioning available in SQL server or any other RDBMS database tables.
In this article, you’ll discover the concept of Hive partitioning, its significance, benefits, and step-by-step instructions for creating a partitioned table. Additionally, you’ll learn how to execute various partition operations including adding, renaming, updating, and deleting partitions.
- What are Hive Partitions
- Partition Table Advantages
- Create Hive Partition Table
- Load Data into Partition Table
- Show All Partitions
- Add New Partition
- Rename or Update Partition
- Drop Hive Partition
What are Hive Partitions?
Hive table partition is a way to split a large table into smaller logical tables based on one or more partition keys. These smaller logical tables are not visible to users and users still access the data from just one table.
Partition eliminates creating smaller tables, accessing, and managing them separately.
As data is loaded into the partitioned table, Hive internally divides the records based on the partition key. It then stores the data for each partition in a sub-directory within the tables directory on HDFS. The directory is named after the partition key and its corresponding value.
Furthermore, it’s important to highlight that during the data loading process into the partitioned table, Hive omits the partition key from the actual loaded file on HDFS. This is because the partition key is redundant information, retrievable from the partition folder name. We’ll delve deeper into this concept with examples in the upcoming sessions.
Partition Table Advantages
As you are aware Hive is a data warehouse database for the Hadoop and used to store large big data sets. Below are some of the advantages using Hive partition tables.
- Fast access to the data
- Provides the ability to perform an operation on a smaller dataset
Create Hive Partition Table
To create a partitioned table in Hive, you can use the PARTITIONED BY
clause along with the CREATE TABLE
statement. Let’s create a partition table and load data from the CSV file.
The data file I’m using to demonstrate partition has columns RecordNumber, Country, City, Zipcode, and State columns. I will be using State as a partition column.
Let’s create the managed table as shown below.
# Create internal or managed table
CREATE TABLE zipcodes(
RecordNumber int,
Country string,
City string,
Zipcode int)
PARTITIONED BY(state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Load Data into Partition Table
Download the zipcodes.CSV from GitHub, upload it to HDFS, and finally load the CSV file into a partition table.
hdfs dfs -put zipcodes.csv /data/
jdbc:hive2://127.0.0.1:10000> LOAD DATA INPATH '/data/zipcodes.csv' INTO TABLE zipcodes;
Show All Partitions on Hive Table
After loading the data into the Hive partition table, you can use SHOW PARTITIONS
command to see all partitions that are present.
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition |
+------------+
| state=AL |
| state=AZ |
| state=FL |
| state=NC |
| state=PR |
| state=TX |
+------------+
6 rows selected (0.074 seconds)
Alternatively, if you know the Hive store location on the HDFS for your table, you can run the HDFS command to check the partitions.
For each partition on the table, you will see a folder created with the partition column name and the partition value.
Add New Partition to the Hive Table
A new partition can be added to the table using the ALERT TABLE
statement, you can also specify the location where you wanted to store partition data on HDFS.
jdbc:hive2://127.0.0.1:10000>ALTER TABLE zipcodes ADD PARTITION (state='CA') LOCATION '/user/data/zipcodes_ca';
SHOW PARTITIONS
return the added partition.
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition |
+------------+
| state=AL |
| state=AZ |
| state=CA |
| state=FL |
| state=NC |
| state=PR |
| state=TX |
+------------+
7 rows selected (0.081 seconds)
Rename or Update Hive Partition
Using ALTER TABLE
, you can also rename or update the specific partition.
jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes PARTITION (state='AL') RENAME TO PARTITION (state='NY');
Manually Renaming Partitions on HDFS
Just rename the partition directory by accessing the HDFS. Use hdfs dfs -mv
command to rename. This ideally moves the directory name from old to new name.
hdfs dfs -mv /user/hive/warehouse/zipcodes/state=NY /user/hive/warehouse/zipcodes/state=AL
When you make manual modifications to partitions directly on HDFS, it’s essential to run the MSCK REPAIR TABLE command to ensure the Hive Metastore is updated accordingly. Failing to do this can lead to inconsistent results.
SELECT
doesn’t show the renamed partitionSHOW PARTITIONS
still shows the older partition
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition |
+------------+
| state=AZ |
| state=CA |
| state=FL |
| state=NC |
| state=NY |
| state=PR |
| state=TX |
+------------+
jdbc:hive2://127.0.0.1:10000>MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;
The command above synchronizes the ‘zipcodes’ table on the Hive Metastore. Next, execute the show partition command to display the ‘state=AL’ partition.
Drop Hive Partition
Dropping a partition can also be performed using ALTER TABLE tablename DROP
jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes DROP IF EXISTS PARTITION (state='AL');
Not using IF EXISTS
result in error when specified partition not exists.
Manually Dropping Partitions on HDFS
Similarly, you can also drop the partition directory from HDFS using the below command.
user@namenode:~/hive$ hdfs dfs -rm -R /user/hive/warehouse/zipcodes/state=AL
Deleted /user/hive/warehouse/zipcodes/state=AL
When you run the SELECT command on the table, it will not display records from removed partitions. However, SHOW PARTITIONS will still list the deleted partitions.
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
| partition |
+-----------------------------------+
| state=AL |
| state=AZ |
| state=FL |
| state=NC |
| state=PR |
| state=TX |
+-----------------------------------+
6 rows selected (0.096 seconds)
Run MSCK REPAIR TABLE
as shown below to fix it.
jdbc:hive2://127.0.0.1:10000> MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
| partition |
+-----------------------------------+
| state=AZ |
| state=FL |
| state=NC |
| state=PR |
| state=TX |
+-----------------------------------+
5 rows selected (0.096 seconds)
Other Useful Partition Commands
Below are some of the additional partition commands that are useful.
How to Filter Partitions?
If you have 100’s of partitions, you can check if a specific partition exists on the table using SHOW PARTITIONS tablename PARTITION
SHOW PARTITIONS zipcodes PARTITION(state='NC');
+------------+
| partition |
+------------+
| state=NC |
+------------+
1 row selected (0.182 seconds)
How to Know Specific Partition Location on HDFS?
Either of the below statements is used to know the HDFS location of each partition.
DESCRIBE FORMATTED zipcodes PARTITION(state='PR');
SHOW TABLE EXTENDED LIKE zipcodes PARTITION(state='PR');
Running SHOW TABLE EXTENDED
on table and partition results in the below output. location attribute shows the location of the partition file on HDFS.
Conclusion
In this article, you have learned Hive table partition is used to split a larger table into smaller tables by splitting based on one or multiple partitions columns also learned the following
PARTITIONED BY
is used to create a partition tableALTER TABLE
is used to add, rename, drop partitionsSHOW PARTITIONS
is used to show the partitions of the tableMSCK REPAIR
is used to synch Hive Metastore with the HDFS data.
Related Articles
- Hive – How to Show All Partitions of a Table?
- Hive Relational | Arithmetic | Logical Operators
- Apache Hive Installation on Ubuntu
- Hive Bucketing Explained with Examples
- Hive Partitioning vs Bucketing with Examples?
- Hive – INSERT INTO vs INSERT OVERWRITE Explained
- Hive Load CSV File into Table
Reference
Happy Learning !!