HBase Tutorial with Spark Scala Examples

Apache HBase is an open-source, distributed, scalable, non-relational database for storing big data on the Apache Hadoop platform. This HBase tutorial explains what HBase is, its advantages, how to install it, and how to interact with an HBase database using shell commands.

At the end of the guide, we will see how to access HBase from Hive and how to work with HBase programmatically from Spark RDDs and Spark DataFrames using Scala examples.

HBase Tutorial Introduction, History & Architecture

Introduction

HBase provides Google Bigtable-like capabilities on top of the Hadoop Distributed File System (HDFS). It is designed for data lake use cases and is not typically used for web and mobile applications. Unlike a relational (SQL) database, it is a column-oriented NoSQL database.

To run HBase on a cluster, you need Apache Hadoop installed, because HBase uses the Hadoop cluster and HDFS to store its data. Alternatively, in a development environment without a cluster, you can install HBase in standalone mode. Most of the examples in this tutorial run in standalone mode for simplicity.

History

HBase has evolved over a period of time. Apache HBase began as a project by the company Powerset for Natural Language Search, which was handling massive and sparse data sets.

  • 2006: BigTable paper published by Google.
  • Late 2006: HBase development starts.
  • 2008 to 2010: HBase becomes a Hadoop sub-project and releases several initial versions.
  • 2010: HBase becomes an Apache top-level project.
  • 2011: First version released from Apache HBase.

Architecture

HBase Advantages & Disadvantages

Benefits

The following are a few advantages of HBase.

  • Builds on HDFS storage, so you can store large data sets and run analytical queries on tables in less time.
  • Columns can be added and removed dynamically when needed.
  • Easy to scale with limited effort.
  • Supports random read and write operations.
  • Highly fault-tolerant storage for large quantities of sparse data.
  • Runs on commodity hardware.
  • Reads from and writes to a table take less time.
  • Provides a REST API to access tables.
  • Easy to use with Java and Scala.
  • Supports parallel processing via MapReduce and Spark.
  • Queries execute in parallel across the cluster, close to where the data is stored (data locality).

Limitations

Before you start using HBase, understand the limitations below, as they may affect your project and cause unexpected issues and delays later. The following are a few disadvantages of HBase.

  • Unlike SQL databases, HBase is hard to query and supports very few SQL-like operations.
  • Filtering data with shell commands is not easy and requires Java imports.
  • Join operations across HBase tables are not possible.
  • HBase integrates with MapReduce jobs, which results in high I/O.
  • HBase is CPU and memory intensive.
  • Sorting is possible only on row keys.
  • Keying on multiple columns is not supported; in other words, there is no compound key.
  • Single point of failure (the HMaster).
  • SQL-like queries require Hive on top of the HBase table.
  • No support for transactions.

HBase vs Cassandra

HBase | Cassandra
Modeled on BigTable (Google) | Modeled on Dynamo (Amazon)
Requires HDFS to store data | Doesn’t need HDFS
Leverages Hadoop infrastructure |
Needs HMaster, Regions, and ZooKeeper | Has a single node type; every node is treated equally
Single point of failure when HMaster goes down | No single point of failure
Single row write | Single row read
Supports range-based row scans | Does not support range-based row scans
Optimized for reading | Optimized for writing

HBase Installation & Setup Modes

You can set up and run HBase in several modes.

  • Standalone mode – All HBase services run in a single JVM.
  • Pseudo-distributed mode – All HBase services (Master, RegionServers, and ZooKeeper) run on a single node, but each service runs in its own JVM.
  • Cluster mode – All services run on different nodes; this is the mode used for production.

Standalone Step-by-Step setup guide

Assuming you are learning HBase on a development machine or local system without a multi-node cluster, this tutorial shows how to set up and start the HBase server in standalone mode.

Pre-requisites

  • You should have Java 8 or later installed on your system

This tutorial assumes you already have Java installed on your system; the rest of this section provides a step-by-step guide to setting up HBase in standalone mode.

As a first step, download Apache HBase and unzip it to any folder. Let’s assume you have extracted it to the <HBase_Home> folder.

Regardless of the environment in which you set up and run HBase, you need to edit the conf/hbase-env.sh file and set JAVA_HOME to the Java installation location, as shown below.


export JAVA_HOME=/usr/jdk64/jdk1.8.0_112

HBase requires a directory to store its files, including those for ZooKeeper. You can specify this location by editing the conf/hbase-site.xml file. If it is not specified, HBase automatically creates directories under /tmp. The configuration below stores HBase data in the /Users/hbaseuser/hbase directory and ZooKeeper files in the /Users/hbaseuser/zookeeper directory. It also sets hbase.unsafe.stream.capability.enforce to false, which standalone setups on a local filesystem generally need because the local filesystem does not offer the stream durability guarantees that HBase otherwise checks for.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///Users/hbaseuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/Users/hbaseuser/zookeeper</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>

Run the bin/start-hbase.sh script to start the HBase server.


system:bin hbaseuser$ ./start-hbase.sh 
running master, logging to /Users/hbaseuser/Applications/hbase-2.2.0/bin/../logs/hbase-hbaseuser-master-system.out

From the command line or terminal, run the jps command to verify that the HMaster service is running. In standalone mode, the HMaster, a single HRegionServer, and ZooKeeper all run inside one JVM, so jps lists a single HMaster process.


system:bin hbaseuser$ jps
16360 HMaster
54027 
16860 Jps

HBase Tutorial – Shell Commands

The HBase distribution includes a tool called “hbase shell,” which allows interaction with the HBase server. It’s useful for manually executing commands. To explore these commands, you can run ./hbase shell from the bin directory of the HBase installation.


system:bin hbaseuser$ ./hbase shell
2019-08-22 12:44:29,966 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.0, rUnknown, Tue Jun 11 04:30:30 UTC 2019
Took 0.0017 seconds                                                             
hbase(main):001:0>

HBase shell commands are organized into 13 groups. In this HBase tutorial, we will learn the usage, syntax, description, and examples of a few commonly used commands.

You can get the usage of each shell command by running help '<command>' or help '<group-name>', or by entering the command name without parameters in the HBase shell. When trying these commands, make sure that table names, rows, and columns are all enclosed in quote characters.
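As a quick illustration of the help command described above, the following sketch asks for help on one command and on one command group. It assumes an open HBase shell session; 'ddl' is used here as an example group name, and a bare help prints the full list of groups.

```hbase
# Show usage and examples for a single command
help 'create'

# Show usage for every command in a group (here, the table-definition group)
help 'ddl'
```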

Create Table

Use the create command to create a table. It takes the table name and a column family as mandatory arguments. The syntax to create a table is as follows.


Syntax: create '<name_space:table_name>', 'column_family'

Note that the namespace is optional; when it is not specified, the table is created in the default namespace. At least one column_family is required to create a table successfully; when none is specified, the command returns an error. The example below creates a table 'emp' with the 'office' column family.

[Image: create shell command]
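A minimal sketch of the create command for the 'emp' table described above. The second half shows the optional namespace form; the 'hr' namespace is a hypothetical name used only for illustration.

```hbase
# Create table 'emp' in the default namespace with a single column family 'office'
create 'emp', 'office'

# Hypothetical: create a namespace first, then a table inside it
create_namespace 'hr'
create 'hr:emp', 'office'
```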

List Table

Use the list command to show all user tables in HBase. It also supports an optional regular expression to filter the output. The syntax to list tables is as follows.


Syntax: list '<namespace>:<regular expression>'

Without arguments, this returns all user tables in the database; you can also pass a regular expression to filter the results.

[Image: list shell command]
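The two forms of list can be sketched as follows. The pattern 'em.*' is only an example that would match the 'emp' table used in this tutorial.

```hbase
# List every user table
list

# List only the tables whose names match a regular expression
list 'em.*'
```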

Describe Table

Use the describe command to show the details and configuration of an HBase table, for example, versions, compression, block size, and replication. The syntax to describe a table is as follows.

Syntax: describe '<namespace>:<table_name>'
[Image: describe shell command]
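For the 'emp' table used in this tutorial, the command would look like the sketch below. The output lists each column family together with its settings, such as VERSIONS, COMPRESSION, and BLOCKSIZE; the exact set of attributes shown depends on the HBase version.

```hbase
# Show the column families of 'emp' and their configuration
describe 'emp'
```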

Insert Data to Table

Use the put command to insert data into rows and columns of a table. It is similar to the INSERT statement in an RDBMS, but the syntax is completely different.

Syntax: put '<name_space:table_name>', '<row_key>', '<cf:column_name>', '<value>'

hbase(main):060:0> put 'emp', '1' , 'office:name', 'Scott'
hbase(main):060:0> put 'emp', '2' , 'office:name', 'Mark'     
hbase(main):061:0> put 'emp', '2' , 'office:gender', 'M'     
hbase(main):062:0> put 'emp', '2' , 'office:age', '30'
hbase(main):063:0> put 'emp', '2' , 'office:age', '50'

In the examples above, notice that we have added two rows: row key '1' with one column, 'office:name', and row key '2' with three columns, 'office:name', 'office:gender', and 'office:age'. If you are coming from the RDBMS world, this may be confusing at first, but once you understand how a column-oriented database works, it is not difficult to get used to.

Also, note that the last command in the example above does not add another column; it writes a new value, '50', to the existing 'office:age' column at row key '2'.

Internally, HBase doesn’t update a value in place. It writes a new version of the cell with a newer timestamp, and scan returns the latest version of each column.
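To see this versioning behavior directly, the sketch below reads back more than one version of a cell. It rests on one assumption that is not part of the original walkthrough: the 'office' column family is first altered to keep up to 3 versions, because recent HBase releases keep only 1 version per cell by default.

```hbase
# Assumption: raise the number of versions kept for the 'office' family (default is 1)
alter 'emp', NAME => 'office', VERSIONS => 3

# Write the same cell twice; the second put does not overwrite, it adds a newer version
put 'emp', '2', 'office:age', '30'
put 'emp', '2', 'office:age', '50'

# A plain read returns only the newest version
get 'emp', '2', 'office:age'

# Ask for up to 3 stored versions; each one is listed with its own timestamp
get 'emp', '2', {COLUMN => 'office:age', VERSIONS => 3}
```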


hbase(main):017:0> put 'emp', '3', 'office:salary', '10000'
Took 0.0359 seconds
hbase(main):018:0> put 'emp', '3', 'office:name', 'Jeff'
Took 0.0021 seconds
hbase(main):019:0> put 'emp', '3', 'office:salary', '20000'
Took 0.0032 seconds
hbase(main):020:0> put 'emp', '3', 'office:salary', '30000'
Took 0.0021 seconds
hbase(main):021:0> put 'emp', '3', 'office:salary', '40000'
Took 0.0025 seconds
hbase(main):027:0> put 'emp','1','office:age','20'
hbase(main):027:0> put 'emp','3','office:age','30'

Reading Data from a Table

Use the scan command to read data from an HBase table. By default, it fetches all data from the table.


Syntax: scan '<name_space:table_name>'

This returns all rows from the table.

[Image: scan shell command]
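For the 'emp' table populated earlier, a full-table scan can be sketched as:

```hbase
# Return every row and every column of 'emp', sorted by row key
scan 'emp'
```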

This scans the 'emp' table and returns the name and age columns, starting at row 1 and ending at row 3.

[Image: scan shell command]
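A range scan like the one described above might look like the following sketch. The exact options used in the original example are not known; COLUMNS, STARTROW, and STOPROW are standard scan options, and note that STOPROW is exclusive.

```hbase
# Return only office:name and office:age for row keys from '1' up to, but not including, '3'
scan 'emp', {COLUMNS => ['office:name', 'office:age'], STARTROW => '1', STOPROW => '3'}

# Because STOPROW is exclusive, a larger stop key such as '4' is needed to include row '3'
scan 'emp', {COLUMNS => ['office:name', 'office:age'], STARTROW => '1', STOPROW => '4'}
```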

Use get to retrieve the data from a single row and its columns. The syntax for the get command is as follows.


Syntax: get '<namespace>:<table_name>', '<row_key>', '<column_key>'

This returns all columns for row '2' from the 'emp' table.

[Image: get shell command]
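A minimal sketch of that get call, assuming the 'emp' table and the rows inserted earlier:

```hbase
# Return every column stored for row key '2'
get 'emp', '2'
```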

We can also specify which columns to return.

[Image: get shell command]
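Restricting a get to specific columns can be sketched in two equivalent ways. The column names are the ones inserted earlier in this tutorial; the original example may have chosen different ones.

```hbase
# Pass the wanted columns as extra arguments
get 'emp', '2', 'office:name', 'office:age'

# Or pass them as a list in the COLUMN option
get 'emp', '2', {COLUMN => ['office:name', 'office:age']}
```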

Disabling Table

Use disable to disable a table. Before you delete a table or change its settings, you first need to disable it. The syntax to disable a table is as follows.

Syntax: disable '<namespace>:<table_name>'

Let’s disable the 'emp' table and then see how to check whether the table is disabled.

[Image: disable shell command]
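Disabling the tutorial's 'emp' table would look like this sketch:

```hbase
# Take 'emp' offline; reads and writes are rejected until it is enabled again
disable 'emp'
```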

Use is_disabled to check whether the table is disabled. When the table is disabled, it returns 'true'.

[Image: disable shell command]
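The check itself is a single command, sketched here for the 'emp' table:

```hbase
# Reports true while 'emp' is disabled and false otherwise
is_disabled 'emp'
```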

Let’s check whether the table is disabled by using describe.

[Image: describe shell command]

Note that accessing a disabled table results in an error.
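Both points can be tried with the sketch below, assuming 'emp' is still disabled. The exact wording of the status line and of the error message varies between HBase versions, so no output is shown.

```hbase
# describe reports the table state along with the column family settings
describe 'emp'

# Reading from a disabled table fails with an error instead of returning rows
scan 'emp'
```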

Enabling Table

Use enable to enable a disabled table. You need to enable a disabled table before you can run any regular commands against it. The syntax to enable a table is as follows.

Syntax: enable '<namespace>:<table_name>'

The example below enables the 'emp' table.

[Image: enable shell command]
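A minimal sketch of enabling the table again, followed by an optional check with is_enabled (the counterpart of is_disabled):

```hbase
# Bring 'emp' back online
enable 'emp'

# Reports true once the table is enabled
is_enabled 'emp'
```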

Deleting Rows

Use deleteall to remove a specified row from an HBase table. It takes the table name and row key as mandatory arguments, and optionally a column and a timestamp. It also supports deleting a range of rows using a row key prefix. The syntax for deleteall is as follows.

Syntax: deleteall '<table_name>', '<row_key>', '<column_name>'
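As a sketch against the 'emp' table populated earlier (the row keys and column are simply the ones used in this tutorial):

```hbase
# Remove the whole row with key '1', including all of its columns
deleteall 'emp', '1'

# Remove only the office:age column from row '3'
deleteall 'emp', '3', 'office:age'
```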

Use the delete command to remove a column from a row of a table. Unlike deleteall, the delete command takes the column cell as a mandatory argument along with the table and row key. Optionally, it takes a timestamp. The syntax for delete is as follows.

Syntax: delete '<table_name>', '<row_key>', '<column_name>'
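A minimal sketch, again assuming the 'emp' rows inserted earlier:

```hbase
# Remove the office:gender cell from row '2'; the other columns of the row are untouched
delete 'emp', '2', 'office:gender'

# Read the row back to confirm which columns remain
get 'emp', '2'
```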

Dropping Table

Use the drop command to delete a table. You must disable a table before you drop it.

Syntax: drop '<table_name>'

The commands below disable the 'emp' table and then drop it.


hbase(main):011:0> disable 'emp'
0 row(s) in 1.3660 seconds

hbase(main):012:0> drop 'emp'
0 row(s) in 0.5630 seconds

Use the drop_all command to delete multiple tables that match a regular expression.

 hbase(main):041:0> drop_all 'em.*'

Running SQL queries on HBase using Hive Tutorial

HBase itself cannot run SQL-like queries. In this tutorial, we will use Apache Hive to access HBase and get the benefit of SQL syntax. To do this, you first need to have Apache Hive installed and set up to access HBase; this tutorial does not cover that setup.

Having said that, you need to use the HBaseStorageHandler Java class from hive-hbase-handler-x.y.z.jar to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly, and you’ll have to use the HBase shell to do so.

As mentioned earlier, the storage handler is part of hive-hbase-handler-x.y.z.jar, which must be available on the Hive client, along with the HBase, Guava, and ZooKeeper jars.

Run the hive command below to integrate with HBase, and make sure you pass the HBase master node and its port to --hiveconf.


$HIVE_HOME/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-x.x.x.jar,$HIVE_SRC/build/dist/lib/hbase-x.x.x.jar,$HIVE_SRC/build/dist/lib/zookeeper-x.x.x.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar --hiveconf hbase.master=hbase.host.name:60000

Please use the version of these jars according to the Hive & HBase versions you are using.

Hive creating a new HBase table

To create a new HBase table from the Hive shell, use the STORED BY clause when creating the table.


CREATE TABLE hivehbasetable(key INT, name STRING, gender STRING, age STRING, salary STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,name:name,gender:gender,age:age,salary:salary")
TBLPROPERTIES("hbase.table.name" = "emp");

Let’s look at some of the properties in the command above.

  • hbase.columns.mapping: This property is required and is used to map the column names between HBase and Hive tables.
  • hbase.table.name: This property is optional; it controls the name of the table as known by HBase, and allows the Hive table to have a different name. In this example, the table is known as hivehbasetable within Hive and as emp within HBase. If not specified, the Hive and HBase table names will be identical.
  • hbase.mapred.output.outputtable: This property is optional; it’s needed if you plan to insert data into the table (the property is used by hbase.mapreduce.TableOutputFormat).

Hive accessing existing HBase table

If you want to access an existing HBase table from Hive, use CREATE EXTERNAL TABLE. Again, hbase.columns.mapping is required (and will be validated against the existing HBase table’s column families), whereas hbase.table.name and hbase.mapred.output.outputtable are optional.


CREATE EXTERNAL TABLE hbase_hive_table(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "old_table", "hbase.mapred.output.outputtable" = "old_table");

This creates a new Hive table, hbase_hive_table, on top of the existing HBase table old_table.
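The Hive statement above assumes that old_table already exists in HBase with a column family named cf1. A minimal sketch of preparing such a table from the HBase shell follows; the row key and value are made-up sample data, not part of the original example.

```hbase
# Create the HBase table that the external Hive table will point at
create 'old_table', 'cf1'

# Hypothetical sample row: key '1' with a value in the mapped column cf1:val
put 'old_table', '1', 'cf1:val', 'first value'

# Confirm the row is stored before querying it through Hive
scan 'old_table'
```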

You do not need to reference every HBase column in the Hive table, but columns that are not mapped will be inaccessible through the Hive table. It is also possible to map multiple Hive tables to the same HBase table.

Finally, let’s run a SQL query using Hive, which pulls the data from the HBase table.



select count(key) from hbase_hive_table;

HBase with Spark & Scala Tutorial

This HBase tutorial provides a few pointers on using Spark with HBase and several easy working examples of running Spark programs on HBase tables using Scala. Using the Spark HBase connector APIs, we should be able to run bulk operations on HBase tables and benefit from Spark parallelism, for example, bulk inserting a Spark RDD into a table, bulk deleting millions of records, and other bulk operations.

Also, we will learn how to use the DataSource APIs to operate on HBase with Spark SQL through DataFrames and Datasets. With DataFrame and Dataset support, the library leverages the Catalyst optimizer and Project Tungsten for performance.

Accessing Hbase from Spark RDD

coming soon..

Accessing Hbase from Spark DataFrame

coming soon..

Thanks for Reading !!