Apache Beam and HBase


HBase is a NoSQL database that lets you store data in many different formats (pictures, PDFs, text files, and others) and supports fast, efficient lookups. It offers two interfaces to choose from: the Java API and the HBase Shell. HBase can also be connected to tools such as Hive or Phoenix and queried with SQL, and it integrates with Apache Beam via the HBaseIO transform. Here I show how to connect HBase and Beam in a Kerberized environment.

HOW TO

I used the Beam 2.3 Java SDK and Spark 2.2 on a Kerberized CDH 5.12.1 Hadoop cluster.

For an introduction to working with Beam, see my previous post.

 
Here’s an example use of HBaseIO:
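A minimal sketch of a read, assuming the HBaseIO API as shipped with Beam 2.x. The table name `my_table`, the ZooKeeper host `zk1.example.com`, and the class name are placeholders I made up for illustration; they are not values from my cluster.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hbase.HBaseIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadExample {

  // Hypothetical table name, used only for illustration.
  static final String TABLE = "my_table";

  // Builds the HBase client configuration handed to HBaseIO.
  static Configuration hbaseConf(String zookeeperQuorum) {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", zookeeperQuorum);
    return conf;
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // HBaseIO.read() yields one Result per HBase row.
    PCollection<Result> rows = pipeline.apply("ReadFromHBase",
        HBaseIO.read()
            .withConfiguration(hbaseConf("zk1.example.com")) // placeholder host
            .withTableId(TABLE));

    // Extract the row key of every Result as a String; further
    // transforms would be applied to this collection.
    PCollection<String> rowKeys = rows.apply("ExtractRowKeys",
        MapElements.into(TypeDescriptors.strings())
            .via((Result r) -> Bytes.toString(r.getRow())));

    pipeline.run().waitUntilFinish();
  }
}
```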

The tricky part is that HBaseIO.read() returns Result objects, whereas HBaseIO.write() expects Mutation objects. This means a transform has to sit in between to convert one into the other.
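One way to bridge the Result/Mutation gap, sketched under the assumption that each row should simply be copied as-is: turn every Result into a Put that carries the same cells. The class and method names below are hypothetical, and a real job would usually change the cells on the way.

```java
import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;

public class ResultToMutation {

  // Copies every cell of a non-empty Result into a Put for the same row key.
  static Put toPut(Result result) throws IOException {
    Put put = new Put(result.getRow());
    for (Cell cell : result.rawCells()) {
      put.add(cell); // the cell's row matches the Put's row, as HBase requires
    }
    return put;
  }

  // DoFn that sits between HBaseIO.read() and HBaseIO.write().
  static class ResultToMutationFn extends DoFn<Result, Mutation> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      Result result = c.element();
      if (!result.isEmpty()) { // skip rows without cells
        c.output(toPut(result));
      }
    }
  }
}
```

In a pipeline this would be applied with `ParDo.of(new ResultToMutation.ResultToMutationFn())`, and its output passed to `HBaseIO.write().withConfiguration(conf).withTableId(...)`.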

The data read can be restricted with the withScan, withKeyRange, or withFilter methods.
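The three options can be sketched as follows. The column family `cf`, the row keys, and the `user-` prefix are invented for illustration, as are the class and method names; only withScan, withKeyRange, and withFilter come from HBaseIO itself.

```java
import org.apache.beam.sdk.io.hbase.HBaseIO;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class LimitedReads {

  // Common part of every read: cluster configuration and table.
  static HBaseIO.Read base(Configuration conf, String table) {
    return HBaseIO.read().withConfiguration(conf).withTableId(table);
  }

  // Restrict the read with a Scan, here to a single column family.
  static HBaseIO.Read byScan(Configuration conf, String table) {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf")); // hypothetical column family
    return base(conf, table).withScan(scan);
  }

  // Restrict the read to row keys in [startRow, stopRow).
  static HBaseIO.Read byKeyRange(Configuration conf, String table) {
    return base(conf, table)
        .withKeyRange(Bytes.toBytes("row-0100"), Bytes.toBytes("row-0200"));
  }

  // Restrict the read with an HBase Filter, here a row-key prefix filter.
  static HBaseIO.Read byFilter(Configuration conf, String table) {
    return base(conf, table).withFilter(new PrefixFilter(Bytes.toBytes("user-")));
  }
}
```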

More info can be found here.

 

Kerberos authentication

To make HBaseIO work with Kerberos I had to:

  • pass the keytab file twice, each time under a different name: once with the --files option and once with the --keytab option. The first copy needs to be available on HDFS with proper read permissions, while the second needs to be stored locally due to a pending issue.
    The two keytabs need different names; otherwise I was getting an error.

     

  • pass the principal parameter
  • set hbase.security.authentication=kerberos
  • specify hbase.zookeeper.quorum property
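The last two items are plain HBase client properties. A sketch of setting them in code, assuming they are not already supplied by an hbase-site.xml on the classpath; the class and method names are hypothetical, and the keytab and principal are still passed on the submit command rather than here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class KerberosHBaseConf {

  // Returns an HBase configuration with Kerberos authentication enabled.
  // zookeeperQuorum is a comma-separated list of ZooKeeper hosts.
  static Configuration create(String zookeeperQuorum) {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.security.authentication", "kerberos");
    conf.set("hbase.zookeeper.quorum", zookeeperQuorum);
    return conf;
  }
}
```

The resulting object is what gets passed to `HBaseIO.read().withConfiguration(...)` or `HBaseIO.write().withConfiguration(...)`.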

 

Libraries

The following jars were needed for my job:

  • beam-sdks-java-io-hadoop-common-2.3.0.jar
  • snappy-java-1.1.7.1.jar
  • beam-sdks-java-io-hbase-2.4.0.jar
  • hbase-client-1.2.6.jar
  • hbase-server-1.2.6.jar
  • hbase-common-1.2.6.jar
  • hbase-protocol-1.2.6.jar
  • htrace-core-3.1.0-incubating.jar

Note: The HBase jars passed to the job cannot be newer than version 1.2.6 (the version used by Beam 2.3). Using newer ones may result in errors.

 

Running job

I used the spark2-submit command to run my job:

My pom.xml contained the following dependencies:

 

Features

I tested three scenarios with HBaseIO.

  • Reading data from HBase

I had no issues with reading data from HBase.

  • Writing data to HBase

Writing to HBase also worked without issues.
Note: The target HBase table needs to exist before the job runs.

  • HBase lookups for retrieving data by key

This scenario also passed the test. Beam integrates nicely with HBase and makes it easy to retrieve data by specifying a record's row key.
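One way to express such a lookup, which may differ from what my job did, is a key range that covers exactly one row. HBase key ranges include the start row and exclude the stop row, so the stop row can be the key followed by a single zero byte, which is the smallest key that sorts after it. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SingleRowRange {

  // Start of the range: the row key itself (inclusive).
  static byte[] startRow(String rowKey) {
    return rowKey.getBytes(StandardCharsets.UTF_8);
  }

  // End of the range: the row key plus a trailing 0x00 byte (exclusive).
  static byte[] stopRow(String rowKey) {
    byte[] start = startRow(rowKey);
    return Arrays.copyOf(start, start.length + 1); // copyOf pads with zero
  }
}
```

The two arrays can then be handed to `HBaseIO.read()...withKeyRange(startRow, stopRow)` to read only that row.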

In my opinion Beam and HBase integrate really nicely. I spotted no issues and no missing features along the way.
