Apache HBase is an open-source, distributed, scalable, non-relational database for storing big data on the Apache Hadoop platform. This HBase tutorial will help you understand what HBase is, its advantages, how to install it, and how to interact with the HBase database using shell commands.
At the end of the guide, we will see how to access HBase from Hive and how to work with HBase programmatically using Spark RDD and Spark DataFrame, with Scala examples.
Table of Contents
- HBase Tutorial Introduction, History & Architecture
- HBase Advantages & Disadvantages
- Installation & HBase Shell Commands
- Learn Accessing HBase using Hive
- HBase Tutorial with Spark & Scala
HBase Tutorial Introduction, History & Architecture
Introduction
HBase provides Google Bigtable-like capabilities on top of the Hadoop Distributed File System (HDFS). It is designed for data lake use cases and is not typically used for web and mobile applications. Unlike a relational database (SQL), it is a column-oriented database, also known as a NoSQL database.
To run HBase on a cluster, you should have Apache Hadoop installed, since HBase uses the Hadoop distributed cluster and HDFS to store data. Alternatively, for development environments where you don’t have a cluster, you can install HBase in Standalone mode. Most of the examples in this tutorial run in Standalone mode for simplicity.
History
HBase has evolved over time. Apache HBase began as a project at the company Powerset for natural language search, which needed to handle massive and sparse data sets.
- 2006: BigTable paper published by Google.
- Late 2006: HBase development starts.
- 2008 to 2010: HBase becomes a Hadoop sub-project and releases several initial versions.
- 2010: HBase becomes an Apache top-level project.
- 2011: First version released by Apache HBase.
Architecture
An HBase deployment consists of an HMaster, which handles DDL operations and region assignment; one or more RegionServers, which serve reads and writes for the regions (horizontal partitions) of each table; and a ZooKeeper quorum, which coordinates the cluster. Table data is persisted to HDFS.
HBase Advantages & Disadvantages
Benefits
Following are a few advantages of HBase.
- Leverages HDFS file storage, so you can store large data sets and run analytical queries on tables in less time.
- Columns can be added and removed dynamically when needed.
- Easy to scale with limited effort.
- Supports random read and write operations.
- Highly fault-tolerant storage for storing large quantities of sparse data.
- Runs on commodity hardware.
- Reads from and writes to tables take less time.
- Provides a REST API to access tables.
- Easy to use with Java and Scala.
- Supports parallel processing via MapReduce and Spark.
- Queries execute in parallel across the cluster (with data locality).
Limitations
Before you start using HBase, understand the limitations below, as these may impact your project; otherwise, you may later come across unexpected issues and delays to your timelines. Following are a few disadvantages of HBase.
- Unlike SQL databases, HBase is hard to query, and only a few SQL-like operations are possible.
- Filtering data in shell commands is not easy and requires Java imports.
- It is not possible to perform join operations across HBase tables.
- HBase integrates with MapReduce jobs, which results in high I/O.
- HBase is CPU and memory intensive.
- Sorting can be done only on row keys.
- Doesn’t support keying on multiple columns; in other words, compound keys are not supported.
- Single point of failure (when the HMaster goes down).
- Requires Hive on top of HBase tables to run SQL-like queries.
- No support for transactions.
HBase vs Cassandra
| HBase | Cassandra |
| --- | --- |
| Modeled on Bigtable (Google) | Modeled on DynamoDB (Amazon) |
| Requires HDFS to store data and leverages Hadoop infrastructure | Doesn’t need HDFS |
| Needs HMaster, RegionServers, and ZooKeeper | Single node type; every node is treated equally |
| Single point of failure when the HMaster goes down | No single point of failure |
| Single-row write | Single-row read |
| Supports range-based row scans | Doesn’t support range-based row scans |
| Optimized for reading | Optimized for writing |
HBase Installation & Setup Modes
You can set up and run HBase in several modes.
- Standalone mode – all HBase services run in a single JVM.
- Pseudo-distributed mode – all HBase services (Master, RegionServers, and ZooKeeper) run on a single node, but each service runs in its own JVM.
- Cluster mode – all services run on different nodes; this is used for production.
Standalone Step-by-Step setup guide
Assuming you are learning HBase in a development environment or on your local system, where you will not have a multi-node cluster set up, this tutorial shows how to set up and start the HBase server in Standalone mode.
Pre-requisites
- You should have Java 8 or later installed on your system
In this tutorial, we assume you already have Java installed on your system; the rest of the sections provide a step-by-step guide to setting up HBase in standalone mode.
As a first step, download Apache HBase and unzip it to any folder; let’s assume you have extracted it to the <HBase_Home> folder.
Regardless of which environment you want to set up and run HBase in, you need to edit the conf/hbase-env.sh file and set JAVA_HOME to the Java installation location, as shown below.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
HBase requires a directory to store its files, including those for ZooKeeper. You can specify this location by editing the conf/hbase-site.xml file; if not specified, HBase automatically creates directories under /tmp. The configuration below stores HBase data in the /Users/hbaseuser/hbase directory and ZooKeeper files in the /Users/hbaseuser/zookeeper directory.
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///Users/hbaseuser/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/hbaseuser/zookeeper</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
</configuration>
Run the bin/start-hbase.sh script to start the HBase server.
system:bin hbaseuser$ ./start-hbase.sh
running master, logging to /Users/hbaseuser/Applications/hbase-2.2.0/bin/../logs/hbase-hbaseuser-master-system.out
From the command line/terminal, run the jps command to verify that the HMaster service is running. Because standalone mode runs HMaster, a single HRegionServer, and ZooKeeper in one JVM, jps shows a single HMaster process.
system:bin hbaseuser$ jps
16360 HMaster
54027
16860 Jps
HBase Tutorial – Shell Commands
The HBase distribution includes a tool called “hbase shell,” which allows interaction with the HBase server. It’s useful for manually executing commands. To explore these commands, you can run ./hbase shell from the bin directory of the HBase installation.
system:bin hbaseuser$ ./hbase shell
2019-08-22 12:44:29,966 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.0, rUnknown, Tue Jun 11 04:30:30 UTC 2019
Took 0.0017 seconds
hbase(main):001:0>
HBase shell commands are broken down into 13 groups for interacting with the database. In this HBase tutorial, we will learn the usage, syntax, description, and examples of a few general-usage commands.
You can get the usage of each shell command by running help ‘<command>’ or help ‘<group-name>’, or by entering the command name without parameters in the HBase shell. While trying these commands, make sure table names, rows, and columns are all enclosed in quote characters.
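For example, the following prints the usage of the create command.
hbase(main):001:0> help 'create'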
Create Table
Use the create command to create a table. It takes a table name and a column family as mandatory arguments. The syntax to create a table is as follows.
Syntax: create ‘<name_space:table_name>’, ‘column_family’
Note that the namespace in the table name is optional; when not specified, the table is created in the default namespace. column_family is mandatory: at least one is needed to create a table successfully, and when none is specified, an error is returned. The example below creates a table ’emp’ with an ‘office’ column family.
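hbase(main):002:0> create 'emp', 'office'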
List Table
Use the list command to show all user tables in HBase. list also supports an optional regular expression to filter the output. The syntax to list tables is as follows.
Syntax: list ‘<namespace>:<regular expression>’
This returns all user tables in the database; you can also pass a regular expression to filter the results.
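hbase(main):003:0> list
hbase(main):004:0> list 'em.*'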
Describe Table
Use the describe command to show the details and configuration of an HBase table, for example, versions, compression, block size, replication, etc. The syntax to describe a table is as follows.
Syntax: describe <‘namespace’:’table_name’>
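The following describes the ’emp’ table.
hbase(main):005:0> describe 'emp'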
Insert Data to Table
Use the put command to insert data into rows and columns of a table. This is similar to an INSERT statement in an RDBMS, but the syntax is completely different.
Syntax: put ‘<name_space:table_name>’, ‘<row_key>’, ‘<cf:column_name>’, ‘<value>’
hbase(main):060:0> put 'emp', '1' , 'office:name', 'Scott'
hbase(main):060:0> put 'emp', '2' , 'office:name', 'Mark'
hbase(main):061:0> put 'emp', '2' , 'office:gender', 'M'
hbase(main):062:0> put 'emp', '2' , 'office:age', '30'
hbase(main):063:0> put 'emp', '2' , 'office:age', '50'
In the above examples, notice that we have added two rows: row key ‘1’ with one column ‘office:name’, and row key ‘2’ with three columns ‘office:name’, ‘office:gender’, and ‘office:age’. If you are coming from the RDBMS world, this may be confusing at first, but once you understand how a column database works, it is not difficult to get used to.
Also, note that the last command in the above example does not add a new column; it writes a new version of the ‘office:age’ column at row key ‘2’ with value ‘50’.
Internally, HBase doesn’t update a value in place; it stores the column with a new timestamp, and a scan fetches the latest version of each column. The same applies to the ‘office:salary’ puts below.
hbase(main):017:0> put 'emp', '3', 'office:salary', '10000'
Took 0.0359 seconds
hbase(main):018:0> put 'emp', '3', 'office:name', 'Jeff'
Took 0.0021 seconds
hbase(main):019:0> put 'emp', '3', 'office:salary', '20000'
Took 0.0032 seconds
hbase(main):020:0> put 'emp', '3', 'office:salary', '30000'
Took 0.0021 seconds
hbase(main):021:0> put 'emp', '3', 'office:salary', '40000'
Took 0.0025 seconds
hbase(main):027:0> put 'emp','1','office:age','20'
hbase(main):027:0> put 'emp','3','office:age','30'
Reading Data from a Table
Use the scan command to read data from an HBase table. By default, it fetches all data from the table.
Syntax: scan ‘<name_space:table_name>’
The following returns all rows from the table.
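hbase(main):006:0> scan 'emp'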
The scan below reads the ’emp’ table and returns the name and age columns, from start row ‘1’ to end row ‘3’.
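hbase(main):007:0> scan 'emp', {COLUMNS => ['office:name', 'office:age'], STARTROW => '1', STOPROW => '3'}
Note that STOPROW is exclusive: the scan above returns rows ‘1’ and ‘2’, so use STOPROW => '4' if you also want row ‘3’.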
Use get to retrieve the data from a single row and its columns. The syntax for the get command is as follows.
Syntax: get ‘<namespace>:<table_name>’, ‘<row_key>’, ‘<column_key>’
The following returns all columns for row ‘2’ from the ’emp’ table.
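hbase(main):008:0> get 'emp', '2'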
We can also specify which columns to return.
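hbase(main):009:0> get 'emp', '2', 'office:name'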
Disabling Table
Use the disable command to disable a table. Before you delete a table or change its settings, you first need to disable it. The syntax to disable a table is as follows.
Syntax: disable ‘<namespace>:<table_name>’
Let’s disable the ’emp’ table and then see how to check whether it is disabled.
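hbase(main):010:0> disable 'emp'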
Use is_disabled to check if a table is disabled. When the table is disabled, it returns ‘true’.
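hbase(main):011:0> is_disabled 'emp'
true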
We can also check whether the table is disabled by using describe; when the table is disabled, the output header shows ‘Table emp is DISABLED’.
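hbase(main):012:0> describe 'emp'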
Note that accessing a disabled table results in an error.
Enabling Table
Use the enable command to enable a disabled table. You need to enable a disabled table before you can run any regular commands against it. The syntax to enable a table is as follows.
Syntax: enable ‘<namespace>:<table_name>’
The following enables the ’emp’ table.
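hbase(main):013:0> enable 'emp'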
Deleting Rows
Use deleteall to remove a specified row from an HBase table. It takes the table name and row as mandatory arguments, and optionally a column and a timestamp. It also supports deleting a row range using a row key prefix. The syntax for deleteall is as follows.
Syntax: deleteall ‘<table_name>’, ‘row_key’, ‘<column_name>’
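The example below removes row ‘1’ entirely, and then removes only the ‘office:gender’ cell from row ‘2’.
hbase(main):014:0> deleteall 'emp', '1'
hbase(main):015:0> deleteall 'emp', '2', 'office:gender'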
Use the delete command to remove a column cell at a row from a table. Unlike deleteall, the delete command takes a column cell as a mandatory argument along with the table name and row key; optionally, it takes a timestamp. The syntax for delete is as follows.
Syntax: delete ‘<table_name>’, ‘row_key’, ‘<column_name>’
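The example below removes the ‘office:age’ cell from row ‘3’.
hbase(main):016:0> delete 'emp', '3', 'office:age'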
Dropping Table
Use the drop command to delete a table. You must disable a table before you drop it.
Syntax: drop ‘<table_name>’
The commands below disable the ’emp’ table and then drop it.
hbase(main):011:0> disable 'emp'
0 row(s) in 1.3660 seconds
hbase(main):012:0> drop 'emp'
0 row(s) in 0.5630 seconds
Use the drop_all command to delete multiple tables matching a regular expression.
hbase(main):041:0> drop_all 'em.*'
Running SQL queries on HBase using Hive Tutorial
Running SQL-like queries directly on HBase is not possible. In this tutorial, we will leverage Apache Hive to access HBase and get the benefit of SQL syntax. To do this, you first need Apache Hive installed and configured to access HBase; this tutorial does not cover that setup.
Having said that, you need to use the HBaseStorageHandler Java class from hive-hbase-handler-x.y.z.jar to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly; you’ll have to use the HBase shell to do so.
As mentioned earlier, the storage handler is part of hive-hbase-handler-x.y.z.jar, which must be available on the Hive client, along with the HBase, Guava, and ZooKeeper jars.
You need to run the hive command below to integrate with HBase; make sure you pass the HBase master node and its port to --hiveconf.
$HIVE_HOME/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-x.x.x.jar,$HIVE_SRC/build/dist/lib/hbase-x.x.x.jar,$HIVE_SRC/build/dist/lib/zookeeper-x.x.x.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar --hiveconf hbase.master=hbase.host.name:60000
Please use the version of these jars according to the Hive & HBase versions you are using.
Hive creating a new HBase table
To create a new HBase table from the Hive shell, use the STORED BY clause when creating the table.
CREATE TABLE hivehbasetable(key INT, name STRING, gender STRING, age STRING, salary STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,office:name,office:gender,office:age,office:salary")
TBLPROPERTIES("hbase.table.name" = "emp");
Let’s look at some of the properties in the above command.
- hbase.columns.mapping: This property is required and maps column names between the HBase and Hive tables.
- hbase.table.name: This property is optional; it controls the name of the table as known by HBase and allows the Hive table to have a different name. In this example, the table is known as hivehbasetable within Hive and as emp within HBase. If not specified, the Hive and HBase table names will be identical.
- hbase.mapred.output.outputtable: This property is optional; it is needed if you plan to insert data into the table (it is used by hbase.mapreduce.TableOutputFormat).
Hive accessing existing HBase table
If you want to access an existing HBase table from Hive, use CREATE EXTERNAL TABLE. Again, hbase.columns.mapping is required (and will be validated against the existing HBase table’s column families), whereas hbase.table.name and hbase.mapred.output.outputtable are optional.
CREATE EXTERNAL TABLE hbase_hive_table(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "old_table", "hbase.mapred.output.outputtable" = "old_table");
This creates a new Hive table “hbase_hive_table” for the existing HBase table “old_table”.
You don’t need to map every HBase column in the Hive table, but unmapped columns will be inaccessible via the Hive table. It is also possible to map multiple Hive tables to the same HBase table.
Finally, let’s run a SQL query using Hive that pulls the data from the HBase table.
select count(key) from hbase_hive_table;
HBase with Spark & Scala Tutorial
This HBase tutorial provides a few pointers on using Spark with HBase and several easy working examples of running Spark programs on HBase tables using the Scala language. By leveraging Spark parallelism through the Spark HBase connector API, you can run bulk operations on HBase tables, for example, bulk inserting a Spark RDD into a table or bulk deleting millions of records.
Also, we will learn how to use the DataSource API to operate on HBase with Spark SQL using DataFrames and Datasets. With DataFrame and Dataset support, the library leverages the Catalyst optimizer and Project Tungsten for performance.
Accessing HBase from Spark RDD
coming soon..
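In the meantime, here is a minimal sketch of bulk-inserting a Spark RDD into the ’emp’ table, assuming the Apache hbase-spark connector (which provides the HBaseContext class) is on the classpath and an hbase-site.xml pointing at your HBase instance is available. The object name and sample rows are just for illustration.
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.SparkSession

object HBaseBulkPutExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HBaseBulkPut")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // (rowKey, name) pairs to bulk insert into the 'emp' table created earlier
    val rdd = sc.parallelize(Seq(("4", "Anna"), ("5", "Raju")))

    // Reads hbase-site.xml from the classpath to locate the HBase instance
    val conf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(sc, conf)

    // bulkPut builds one Put per RDD element and writes them in parallel on the executors
    hbaseContext.bulkPut[(String, String)](rdd, TableName.valueOf("emp"), {
      case (rowKey, name) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("office"), Bytes.toBytes("name"), Bytes.toBytes(name))
        put
    })

    spark.stop()
  }
}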
Accessing HBase from Spark DataFrame
coming soon..
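Until the full example is ready, below is a minimal sketch using the shc-core connector’s DataSource (format "org.apache.spark.sql.execution.datasources.hbase"). The catalog JSON, the Employee case class, and the column mappings are illustrative and assume the ’emp’ table and ‘office’ column family from earlier.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Schema of the rows we read/write; field names must match the catalog below
case class Employee(key: String, name: String, age: String)

object HBaseDataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HBaseDataFrame")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Catalog mapping the DataFrame columns to the 'emp' table and its 'office' column family
    val catalog =
      s"""{
         |  "table":{"namespace":"default", "name":"emp"},
         |  "rowkey":"key",
         |  "columns":{
         |    "key":{"cf":"rowkey", "col":"key", "type":"string"},
         |    "name":{"cf":"office", "col":"name", "type":"string"},
         |    "age":{"cf":"office", "col":"age", "type":"string"}
         |  }
         |}""".stripMargin

    // Write a DataFrame to the HBase table
    Seq(Employee("6", "Maria", "40")).toDF()
      .write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()

    // Read the table back as a DataFrame and query it with the DataFrame API
    val df = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()
    df.filter($"age" === "40").show()

    spark.stop()
  }
}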
Thanks for Reading !!