You are currently viewing Hive – INSERT INTO vs INSERT OVERWRITE Explained

While working with Hive, we often come across two different types of insert HiveQL commands INSERT INTO and INSERT OVERWRITE to load data into tables and partitions. In this article, I will explain the difference between Hive INSERT INTO vs INSERT OVERWRITE statements with various Hive SQL query examples.

If you have a file and you wanted to load into the table, refer to Hive Load CSV File into Table

In order to explain Hive INSERT INTO vs INSERT OVERWRITE with examples let’s assume we have the employee table with the below contents.

hive insert into vs insert overwrite

I will be using this table for most of the examples below.

1. INSERT INTO Syntax & Examples

Hive first introduced INSERT INTO starting version 0.8 which is used to append the data/records/rows into a table or partition. This doesn’t modify the existing data.

1.1 Syntax

The Hive INSERT INTO syntax will be as follows.


INSERT INTO TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...)] 
select_statement1 FROM from_statement;

1.2 Examples

Example 1: This is a simple insert command to insert a single record into the table.


0: jdbc:hive2://> INSERT INTO employee VALUES 
(11,'Ram',50,'M');

Example 2: This examples inserts multiple rows at a time into the table.


INSERT INTO employee VALUES 
(13,'Mari',51,'M'),
(14,'Pat',34,'F');

Example 3: Let’s see how to insert data into selected columns.


INSERT INTO EMP.EMPLOYEE(id,name) VALUES 
(20,'Bhavi');

Since we are not inserting the data into age and gender columns, these columns inserted with NULL values.

Example 4: You can also use the result of the select query into a table. Here I have created a new Hive table and inserted data from the result of the select query.


CREATE TABLE employee_tmp LIKE employee; 
INSERT INTO employee_tmp SELECT * FROM employee;

SELECT statement on the above example can be any valid select query for example you can add WHERE condition to the SELECT query to filter the rows.

1.3 With Partition Table

To explain INSERT INTO with a partitioned Table, let’s assume we have a ZIPCODES table with STATE as the partition key.

You need to specify the PARTITION optional clause to insert into a specific partition.

Example 5: This example appends the records into FL partition of the Hive partitioned table.


INSERT INTO zipcodes PARTITION(state='FL') VALUES 
(891,'US','TAMPA',33605);

Example 6: Another example to insert data into Hive partition. Here it’s mandatory to keep the partition column as the last column.


INSERT INTO zipcodes PARTITION(state) VALUES 
(891,'US','TAMPA',33605,'FL');

You can also use examples from 1 to 4 to insert into the partitioned table, remember when using these approaches you would need to have the partition column as the last column.

Besides these you can also Load file into Hive partitioned table.

2. INSERT OVERWRITE Syntax & Examples

INSERT OVERWRITE is used to replace any existing data in the table or partition and insert with the new rows.

When working with the partition you can also specify to overwrite only when the partition exists using the IF NOT EXISTS option.

2.1 Syntax

The Hive INSERT OVERWRITE syntax will be as follows. 


INSERT OVERWRITE TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...) 
[IF NOT EXISTS]] 
select_statement1 FROM from_statement;

2.3 Examples

Example 1: This INSERT OVERWRITE example deletes all data from the Hive table and inserts the row specified with the VALUES.


INSERT OVERWRITE TABLE EMP.EMPLOYEE VALUES (11,'Ram',51,'M');
+--------------+----------------+---------------+------------------+
| employee.id  | employee.name  | employee.age  | employee.gender  |
+--------------+----------------+---------------+------------------+
| 11           | Ram            | 51            | M                |
+--------------+----------------+---------------+------------------+
1 row selected (0.241 seconds)

INSERT OVERWRITE also supports all examples specified with INSERT INTO, I will leave these to you to explore.

2.4 With Partitioned Table

To explain INSERT OVERWRITE with a partitioned table, let’s assume we have a ZIPCODES table with STATE as the partition key.

To insert data into a specific partition, you need to specify the PARTITION optional clause.

Example 2: INSERT OVERWRITE with PARTITION clause removes the records from the specified partition and inserts the new records into the partition without touching other partitions.


INSERT OVERWRITE TABLE zipcodes PARTITION(state='FL') VALUES 
(896,'US','TAMPA',33607);

Example 2: You can also write without PARTITION clause as shown below. When you use this approach make sure to keep the partition column as the last column.


INSERT OVERWRITE TABLE zipcodes 
VALUES (896,'US','TAMPA',33607,'FL');

Example 4: By using IF NOT EXISTS, Hive checks if the partition already presents, If it presents it skips the insert.


INSERT OVERWRITE TABLE zipcodes PARTITION(state='NJ') 
IF NOT EXISTS VALUES (921,'US','New Jersey',7097);

Example 5:


INSERT OVERWRITE TABLE zipcodes PARTITION(state='NJ') 
IF NOT EXISTS select id,city,zipcode from other_table;

2.5 Export Table to LOCAL or HDFS

INSERT OVERWRITE statement is also used to export Hive table into HDFS or LOCAL directory, in order to do so, you need to use the DIRECTORY clause.


INSERT OVERWRITE DIRECTORY '/user/data/output/export' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
SELECT * FROM employee;

Let’s run the HDFS command to check the exported file. ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' is used to export the file in CSV format.


hdfs dfs -cat /user/data/output/export/000000_0
1,James,30,M
2,Ann,40,F
3,Jeff,41,M
4,Jennifer,20,F

You can also directly export the table into LOCAL directory.


INSERT OVERWRITE LOCAL DIRECTORY '/tmp/export' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
SELECT * FROM employee;

Check the local system directory to confirm.


cat /tmp/export/000000_0
1,James,30,M
2,Ann,40,F
3,Jeff,41,M
4,Jennifer,20,F
hiveuser@namenode:~/hive$

Conclusion

In summary the difference between Hive INSERT INTO vs INSERT OVERWRITE, INSERT INTO is used to append the data into Hive tables and partitioned tables and INSERT OVERWRITE is used to remove the existing data from the table and insert the new data.

Happy Learning !!

Reference

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply