  • Post last modified:March 27, 2024

In this tutorial, you will learn how to use the HBase shell’s scan command with filters to select rows from a table based on column values, similar to the WHERE clause in SQL. To build filter objects such as SingleColumnValueFilter in the HBase shell, you first need to import a few Java classes.

First, let’s use scan to print the data we are going to work with. If you don’t have this data yet, insert it into the HBase table before continuing.

As covered in previous chapters, the scan command reads data from an HBase table.


hbase> scan 'emp'
ROW                         COLUMN+CELL                                                                  
 1                          column=office:age, timestamp=1567542138673, value=20                         
 1                          column=office:name, timestamp=1567541857878, value=Scott                     
 2                          column=office:age, timestamp=1567541901009, value=50                         
 2                          column=office:gender, timestamp=1567541880523, value=M                       
 2                          column=office:name, timestamp=1567541868638, value=Mark                      
 3                          column=office:age, timestamp=1567542149583, value=30                         
 3                          column=office:name, timestamp=1567542103821, value=Jeff                      
 3                          column=office:salary, timestamp=1567542130044, value=40000                   
3 row(s)
Took 0.0823 seconds                                                                                      

SingleColumnValueFilter

To filter rows in the HBase shell using scan, import the org.apache.hadoop.hbase.filter.SingleColumnValueFilter class along with the CompareFilter and BinaryComparator classes, as shown below.


hbase> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter 
=> [Java::OrgApacheHadoopHbaseFilter::SingleColumnValueFilter]

hbase> import org.apache.hadoop.hbase.filter.CompareFilter
=> [Java::OrgApacheHadoopHbaseFilter::CompareFilter]

hbase> import org.apache.hadoop.hbase.filter.BinaryComparator
=> [Java::OrgApacheHadoopHbaseFilter::BinaryComparator]

Now, let’s run some filter examples.

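Example 1: This example returns the rows whose office:name column equals 'Jeff', using CompareFilter::CompareOp.valueOf('EQUAL') together with BinaryComparator.new(Bytes.toBytes('Jeff')). Note that the filter selects whole rows, so every column of a matching row is printed.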


hbase> scan 'emp', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('office'), Bytes.toBytes('name'), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('Jeff')))}
ROW                         COLUMN+CELL                                                                  
 3                          column=office:age, timestamp=1567542149583, value=30                         
 3                          column=office:name, timestamp=1567542103821, value=Jeff                      
 3                          column=office:salary, timestamp=1567542130044, value=40000                   
1 row(s)
Took 0.0480 seconds         
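
To make the row-level behaviour concrete, here is a small plain-Ruby model (not HBase code) of how an EQUAL SingleColumnValueFilter decides which rows to return. The EMP hash and the single_column_value_filter helper are hypothetical names used only for illustration; the hash mirrors the sample data above. The filter_if_missing argument models the filter's setFilterIfMissing setting, which is false by default, so rows that lack the tested column are returned as well.

```ruby
# Simplified in-memory model of the 'emp' table: row key => { column => value }.
EMP = {
  '1' => { 'office:age' => '20', 'office:name' => 'Scott' },
  '2' => { 'office:age' => '50', 'office:gender' => 'M', 'office:name' => 'Mark' },
  '3' => { 'office:age' => '30', 'office:name' => 'Jeff', 'office:salary' => '40000' }
}.freeze

# Hypothetical helper that mimics an EQUAL SingleColumnValueFilter:
# a row is kept whole (all of its columns) when the tested column matches.
# Rows without the column are kept unless filter_if_missing is true.
def single_column_value_filter(table, column, expected, filter_if_missing: false)
  table.select do |_row_key, cells|
    if cells.key?(column)
      cells[column] == expected
    else
      !filter_if_missing
    end
  end
end

by_name = single_column_value_filter(EMP, 'office:name', 'Jeff')
by_gender = single_column_value_filter(EMP, 'office:gender', 'M')
by_gender_strict = single_column_value_filter(EMP, 'office:gender', 'M', filter_if_missing: true)

puts by_name.keys.inspect           # → ["3"]
puts by_gender.keys.inspect         # → ["1", "2", "3"]
puts by_gender_strict.keys.inspect  # → ["2"]
```

The second result is the one to watch for: rows 1 and 3 have no office:gender column, yet they pass the default filter. In the shell itself, the strict variant would presumably be obtained by assigning the filter to a variable and calling setFilterIfMissing(true) on it before passing it to scan.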

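Example 2: Let’s see how to filter rows where age is greater than or equal to 50, using CompareFilter::CompareOp.valueOf('GREATER_OR_EQUAL') together with BinaryComparator.new(Bytes.toBytes('50')). Because the ages are stored as strings, BinaryComparator compares them byte by byte (lexicographically), not numerically. This works here because every age in the sample data has two digits.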


hbase> scan 'emp', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('office'), Bytes.toBytes('age'), CompareFilter::CompareOp.valueOf('GREATER_OR_EQUAL'),BinaryComparator.new(Bytes.toBytes('50')))}
ROW                         COLUMN+CELL                                                                  
 2                          column=office:age, timestamp=1567541901009, value=50                         
 2                          column=office:gender, timestamp=1567541880523, value=M                       
 2                          column=office:name, timestamp=1567541868638, value=Mark                      
1 row(s)
Took 0.0180 seconds                                
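
The comparison in this example is made on the bytes of the string '50', not on the number 50. The plain-Ruby sketch below (not HBase code) illustrates what that means for values of different lengths. The bytes_greater_or_equal? helper is a hypothetical name, and the values 6 and 100 are made up for the illustration; they are not in the sample table.

```ruby
# Hypothetical helper: compares two values the way a byte-wise comparator
# would, i.e. byte by byte on the encoded strings, never as numbers.
def bytes_greater_or_equal?(stored, threshold)
  (stored.bytes <=> threshold.bytes) >= 0
end

# 20, 50 and 30 come from the sample data; 6 and 100 are made-up extras.
ages = %w[20 50 30 6 100]

matches = ages.select { |age| bytes_greater_or_equal?(age, '50') }
puts matches.inspect         # → ["50", "6"]

# One common workaround (an assumption, not part of the sample data):
# store fixed-width, zero-padded strings so byte order equals numeric order.
padded = ages.map { |age| age.rjust(3, '0') }
padded_matches = padded.select { |age| bytes_greater_or_equal?(age, '050') }
puts padded_matches.inspect  # → ["050", "100"]
```

With unpadded strings, '6' passes the filter and '100' does not, which is the opposite of the numeric result. If a table can hold ages of varying length, the stored encoding needs to be chosen with this in mind.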

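Example 3: This example uses ValueFilter to check the value of every column and returns only the cells whose value starts with 40000. Unlike SingleColumnValueFilter, it needs no imports because the filter is written as a string, and it returns only the matching cells rather than the whole row.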


hbase> scan 'emp', {FILTER => "ValueFilter (=,'binaryprefix:40000')"}
ROW                         COLUMN+CELL                                                                  
 3                          column=office:salary, timestamp=1567542130044, value=40000                   
1 row(s)
Took 0.0034 seconds 
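
The plain-Ruby sketch below (not HBase code) models the two things this example relies on: the binaryprefix comparator matches any value that starts with the given bytes, and ValueFilter is applied cell by cell. The CELLS list and the binary_prefix_match? helper are hypothetical names, and the two cells for row 9 are made up to show a longer and a shorter value.

```ruby
# Cells of row 3 from the sample data, plus two made-up cells for contrast.
CELLS = [
  ['3', 'office:age', '30'],
  ['3', 'office:name', 'Jeff'],
  ['3', 'office:salary', '40000'],
  ['9', 'office:salary', '400001'],  # hypothetical: longer value, same prefix
  ['9', 'office:bonus', '4000']      # hypothetical: shorter than the prefix
].freeze

# Hypothetical helper: true when the value's leading bytes equal the prefix.
def binary_prefix_match?(value, prefix)
  value.bytes.first(prefix.bytesize) == prefix.bytes
end

# The filter is evaluated per cell, so non-matching cells of a row are dropped
# even when another cell of the same row matches.
kept = CELLS.select { |_row, _column, value| binary_prefix_match?(value, '40000') }
kept.each { |row, column, value| puts "#{row} #{column} #{value}" }
```

Only the two salary cells survive: '400001' matches because it begins with 40000, while '4000' is too short to match. If an exact match is wanted instead of a prefix match, the filter string would use the binary comparator in place of binaryprefix.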


Naveen Nelamali
