Apache Hive Tutorial with Examples

What is Apache Hive?

Apache Hive is an open-source data warehouse solution built on top of Hadoop. It is used to process large structured datasets and lets you query them with HiveQL, a SQL-like query language.

What is Hive not?

Hive is not designed for OLTP (online transaction processing)
It’s not a relational database (RDBMS)
It’s not meant for row-level updates in real-time systems.

Apache Hive Advantages

Supports large datasets
Runs on Hadoop infrastructure which uses commodity hardware
Supports SQL syntax
Provides the Beeline command-line client, along with HiveServer2 JDBC and ODBC interfaces that let you connect from Java, Scala, C#, Python, and many more languages.
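
As a rough sketch of what connecting through Beeline involves, the Python snippet below assembles a HiveServer2 JDBC URL and the matching beeline command line. The host, port, database, and user values are placeholders, and the helper names hive_jdbc_url and beeline_command are made up for this illustration; only the jdbc:hive2:// URL scheme and the -u, -n, and -e Beeline options come from Hive itself.

```python
import shlex


def hive_jdbc_url(host="localhost", port=10000, database="default"):
    """Build a HiveServer2 JDBC URL (10000 is the usual HiveServer2 port)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(url, user, query):
    """Return the argument list for running a single query through Beeline.

    -u: JDBC URL, -n: user name, -e: query string to execute.
    """
    return ["beeline", "-u", url, "-n", user, "-e", query]


if __name__ == "__main__":
    # Placeholder values; replace them with your own cluster details.
    url = hive_jdbc_url()
    cmd = beeline_command(url, "hiveuser", "SHOW DATABASES;")
    # Print a shell-safe command line that could be pasted into a terminal.
    print(shlex.join(cmd))
```

The snippet only builds the command. Running it still requires a reachable HiveServer2 instance and the beeline executable on the PATH.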

Different ways to process Hive data

MapReduce application
Pig scripts
HiveQL
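
To make the difference between these approaches concrete, here is a minimal sketch in plain Python (not actual Hadoop code) of the map and reduce steps behind a per-key count, with a HiveQL statement that expresses the same aggregation kept alongside as a string. The employees table, its dept column, and the sample rows are invented for illustration.

```python
from collections import defaultdict

# HiveQL states the aggregation declaratively in a single statement.
# The table and column names are hypothetical.
HIVEQL = "SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept"


def map_phase(rows):
    """Map step: emit a (dept, 1) pair for every input row."""
    for row in rows:
        yield row["dept"], 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample rows standing in for the employees table.
    rows = [
        {"name": "Ann", "dept": "sales"},
        {"name": "Bob", "dept": "hr"},
        {"name": "Cid", "dept": "sales"},
    ]
    print(HIVEQL)
    print(reduce_phase(map_phase(rows)))
```

A hand-written MapReduce application spells out both steps explicitly, whereas with HiveQL the query engine plans the equivalent work from the one statement.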

Hive Installation

Apache Hive Installation on Hadoop HDFS

Start HiveServer2 & Connect Beeline

Hive Clients

Hive CLI (deprecated in newer Hive versions)
Connect to Hive using Beeline

HiveQL DDL Commands

HiveQL DML Commands

Hive Partition and Bucket

Hive Java Examples

Hive Scala Examples

Hive Spark Examples

Spark Union Hive Tables from different Databases

Hive PySpark Examples

Hive Errors and Exceptions

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.