Wednesday, 22 June 2016

HBase Summary



Data Modeling Overview



HBase differs from an RDBMS in that data is not stored in rows with a fixed set of columns; instead, values live in cells that are grouped into column families.




Unlike an RDBMS, HBase addresses every value by row key, column family, column qualifier, and timestamp.


One dimension you don't see in the picture below is the timestamp associated with the value in each cell; HBase can keep several timestamped versions of the same cell.
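
A minimal Java sketch of those four coordinates (the table and column names are assumptions, not from the post). Writing the same cell twice leaves two timestamped versions, provided the column family's VERSIONS setting allows it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("customer"))) {

      byte[] row = Bytes.toBytes("cust-0001");       // row key
      byte[] family = Bytes.toBytes("profile");      // column family
      byte[] qualifier = Bytes.toBytes("city");      // column qualifier

      // Two writes to the same row/family/qualifier differ only in timestamp.
      table.put(new Put(row).addColumn(family, qualifier, Bytes.toBytes("London")));
      table.put(new Put(row).addColumn(family, qualifier, Bytes.toBytes("Paris")));

      // Ask for multiple versions to see the timestamp dimension.
      Get get = new Get(row);
      get.setMaxVersions(3);
      for (Cell cell : table.get(get).getColumnCells(family, qualifier)) {
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
      }
    }
  }
}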








A table such as a Customer table can have multiple column families, and on the region servers each column family's data is stored in its own HFiles.
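
As a hedged illustration (the family names are assumptions, not from the post), column families are declared when the table is created, for example through the Java Admin API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateCustomerTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor customer = new HTableDescriptor(TableName.valueOf("customer"));
      // Each column family below is flushed to its own set of HFiles.
      customer.addFamily(new HColumnDescriptor("profile"));
      customer.addFamily(new HColumnDescriptor("orders").setMaxVersions(3));
      admin.createTable(customer);
    }
  }
}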






Best Practices


Choose row keys that avoid hotspotting: if writes are not concentrated on one region, data and load are distributed nicely across the cluster.
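
One common cause of hotspotting is a monotonically increasing row key (a timestamp or sequence number), because all new writes land on the same region. A minimal sketch of salting such a key, with the bucket count and key format as assumptions:

import java.nio.charset.StandardCharsets;

public class SaltedRowKey {
  // Assumed number of salt buckets; usually matched to the number of pre-split regions.
  private static final int SALT_BUCKETS = 16;

  static byte[] saltedRowKey(String naturalKey) {
    int bucket = Math.floorMod(naturalKey.hashCode(), SALT_BUCKETS);
    // e.g. "07-20160622120000-user42": consecutive writes now spread across
    // up to SALT_BUCKETS regions instead of hammering one hot region.
    return String.format("%02d-%s", bucket, naturalKey).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(new String(saltedRowKey("20160622120000-user41"), StandardCharsets.UTF_8));
    System.out.println(new String(saltedRowKey("20160622120001-user42"), StandardCharsets.UTF_8));
  }
}

The trade-off is that point reads must recompute the salt from the natural key, and range scans in natural-key order have to fan out across all the buckets.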








Note that the row key is physically stored with every column and cell (each cell carries the row key, column family, qualifier, and timestamp), so it occupies a significant amount of space. Keep row keys, column family names, and qualifiers short.
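
A small sketch that makes the cost visible by measuring a single serialized cell with the client-side KeyValue class (the verbose names are deliberately exaggerated examples):

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class CellSizeDemo {
  private static int cellSize(String row, String family, String qualifier, String value) {
    KeyValue kv = new KeyValue(Bytes.toBytes(row), Bytes.toBytes(family),
        Bytes.toBytes(qualifier), System.currentTimeMillis(), Bytes.toBytes(value));
    return kv.getLength(); // serialized size of this one cell, key part included
  }

  public static void main(String[] args) {
    // Same value, but long row key / family / qualifier names...
    int verbose = cellSize("customer-number-0000000001", "customer_profile", "customer_city", "Paris");
    // ...versus short ones: the difference is paid again for every cell in the row.
    int compact = cellSize("c0000000001", "p", "city", "Paris");
    System.out.println("verbose cell: " + verbose + " bytes, compact cell: " + compact + " bytes");
  }
}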



Securing HBase


Server side configuration:




Client side configuration:
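
As a rough client-side sketch for a Kerberos-secured cluster (the principal, keytab path, and whether these properties are already supplied by your site configuration files are assumptions), the client authenticates before opening a connection:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHBaseClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Normally picked up from core-site.xml / hbase-site.xml on a secured cluster.
    conf.set("hadoop.security.authentication", "kerberos");
    conf.set("hbase.security.authentication", "kerberos");

    // Log in from a keytab before creating the connection (placeholder principal/keytab).
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "appuser@EXAMPLE.COM", "/etc/security/keytabs/appuser.keytab");

    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      System.out.println("Secure connection established: " + !connection.isClosed());
    }
  }
}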





MapReduce Integration with HBase

The bin/hbase mapredcp command prints the classpath needed by MapReduce jobs that use HBase (the HBase jars and their dependencies); it is typically added to HADOOP_CLASSPATH when submitting such jobs.








HBase provides atomic, strongly consistent operations at the row level; it is not an eventually consistent store.
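
For example (table and column names assumed), single-row calls such as checkAndPut and incrementColumnValue are atomic on the server, with no client-side locking:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicRowOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("customer"))) {

      byte[] row = Bytes.toBytes("cust-0001");
      byte[] cf = Bytes.toBytes("profile");

      // Atomic compare-and-set: the Put is applied only if "status" currently equals "pending".
      Put activate = new Put(row).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("active"));
      boolean applied = table.checkAndPut(row, cf, Bytes.toBytes("status"),
          Bytes.toBytes("pending"), activate);
      System.out.println("checkAndPut applied: " + applied);

      // Atomic counter: concurrent clients never lose an increment.
      long visits = table.incrementColumnValue(row, cf, Bytes.toBytes("visits"), 1L);
      System.out.println("visits = " + visits);
    }
  }
}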




Standalone (local) mode runs the Master, a RegionServer, and ZooKeeper in a single JVM, which makes it ideal for local testing.



Installing HBase in Local Mode








Set hbase.rootdir and hbase.zookeeper.property.dataDir in conf/hbase-site.xml so that HBase writes its data somewhere other than /tmp.





The bin/start-hbase.sh script starts HBase.



The bin/stop-hbase.sh script stops HBase.







An HBase cluster can have up to 9 backup masters.






HBase Web-Based Management Console



Using the HBase shell

Make sure HBase is running before starting the shell. Launch it with the bin/hbase shell command.


Using HBase as a Data Sink for MapReduce Jobs


TableMapReduceUtil is an HBase-specific utility class that sets up the job configuration needed to read from or write to HBase.
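
A hedged, self-contained sketch of the sink side (the table name "wordcount", column family "stats", and the text input are assumptions): a word count whose reducer writes its totals into HBase through TableMapReduceUtil.initTableReducerJob.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HBaseSinkJob {
  // Tokenizes text lines into (word, 1) pairs.
  static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) { word.set(token); context.write(word, ONE); }
      }
    }
  }

  // Sums the counts and writes one Put per word into the target table.
  static class HBaseSinkReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("count"), Bytes.toBytes(sum));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "wordcount-into-hbase");
    job.setJarByClass(HBaseSinkJob.class);
    job.setMapperClass(TokenMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Wires in TableOutputFormat, the target table, and the reducer class.
    TableMapReduceUtil.initTableReducerJob("wordcount", HBaseSinkReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}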


Using HBase as a Data Source for MapReduce Jobs

TableMapReduceUtil.initTableMapperJob takes the name of the HBase table to read from, a Scan (which may contain filters), the mapper class, the mapper's output key and value classes (for example ImmutableBytesWritable.class and IntWritable.class), and the Job being configured.
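
A sketch of the source side (the table name and the row-counting mapper are illustrative; the output is discarded and the job only bumps a counter):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseSourceJob {
  // The mapper receives one (row key, Result) pair per row of the scanned table.
  static class RowCountMapper extends TableMapper<ImmutableBytesWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      context.getCounter("demo", "rows").increment(1);
      context.write(rowKey, ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-hbase-table");
    job.setJarByClass(HBaseSourceJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching suits full-table MR scans
    scan.setCacheBlocks(false);  // don't pollute the block cache with a one-off scan
    // scan.setFilter(...) could restrict which rows reach the mapper.

    TableMapReduceUtil.initTableMapperJob(
        "customer",                    // source table (illustrative name)
        scan,
        RowCountMapper.class,
        ImmutableBytesWritable.class,  // mapper output key class
        IntWritable.class,             // mapper output value class
        job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}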



Bulk Loading Data
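
As a rough sketch of the usual bulk-load flow (input format, table name, and column family are assumptions): a MapReduce job writes HFiles with HFileOutputFormat2, and the completebulkload tool (LoadIncrementalHFiles) then moves them into the table's regions, bypassing the normal write path.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepareJob {
  // Emits (row key, Put); configureIncrementalLoad wires in the sorter and
  // partitioner so the HFiles written below line up with region boundaries.
  static class CsvToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);   // expected: rowkey,value
      if (parts.length < 2) return;
      Put put = new Put(Bytes.toBytes(parts[0]));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("raw"), Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepareJob.class);
    job.setMapperClass(CsvToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HFile staging directory

    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      TableName table = TableName.valueOf("customer");
      HFileOutputFormat2.configureIncrementalLoad(
          job, conn.getTable(table), conn.getRegionLocator(table));
    }
    // After the job succeeds, hand the HFiles to the regions, for example with:
    //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <staging-dir> customer
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}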





Splitting Map Tasks when Sourcing an HBase Table



Accessing Other HBase Tables within a MapReduce Job



Taking a Snapshot
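
A brief sketch using the Java Admin API (the snapshot and table names are assumptions): a snapshot is a point-in-time capture of a table that can later be cloned or restored without the data having been copied when the snapshot was taken.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Take the snapshot.
      admin.snapshot("customer-snap-20160622", TableName.valueOf("customer"));
      // Materialize it as a new table (the target table must not already exist).
      admin.cloneSnapshot("customer-snap-20160622", TableName.valueOf("customer_copy"));
    }
  }
}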

