Implementing secondary indexes in HBase with coprocessors | 邓的博客

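Starting with version 0.92, HBase provides coprocessors: a set of hooks that make features such as access control and secondary indexes easy to implement. There are two kinds of coprocessor. Observers behave much like triggers, and Endpoints resemble stored procedures. Only Observers are used here, so only they are covered; for a fuller introduction see https://blogs.apache.org/hbase/entry/coprocessor_introduction. There are three kinds of Observer:

RegionObserver: provides hooks for data manipulation events;

WALObserver: provides hooks for WAL (write-ahead log) operations;

MasterObserver: provides hooks for DDL operations.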

[HBase] HBase Coprocessors - 芒果先生Mango's Column - Blog Channel - CSDN.NET

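These are brief notes from the author's own study and will be expanded over time. The main reference is "HBase: The Definitive Guide".

Filters reduce the amount of data sent over the network from server to client, which improves efficiency. With HBase's coprocessor feature we can go further and move the computation to the nodes where the data lives.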

Introduction to Coprocessors

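Coprocessors let you run arbitrary code directly on each region server. More precisely, they provide event-triggered functionality that executes code on a per-region basis, much like procedures (stored procedures) in a relational database system.

To use a coprocessor, you create a class based on a specific interface and make it available to the region servers as a jar (for example, by placing the jar in $HBASE_HOME/lib/). Coprocessor classes can be loaded statically through the configuration file or dynamically from program code.

The coprocessor framework provides two kinds of coprocessor: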

1.Observer

This kind of coprocessor resembles a trigger: its callback functions run when specific events occur.

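RegionObserver

Handles data manipulation events; this kind of coprocessor is tightly bound to the regions of a table. It can be thought of as a DML coprocessor.

MasterObserver

Handles data management events and is a cluster-wide coprocessor. It can be thought of as a DDL coprocessor.

WALObserver

Handles write-ahead log processing events.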

2.Endpoint

The Coprocessor Class

Every coprocessor class must implement the org.apache.hadoop.hbase.Coprocessor interface.

1. Constants

The four static constants PRIORITY_HIGHEST, PRIORITY_SYSTEM, PRIORITY_USER, and PRIORITY_LOWEST express a coprocessor's priority. The lower the value, the higher the priority.

2. Methods

start(env) and stop(env): these two methods are called when the coprocessor is started and, eventually, when it is decommissioned.

The env parameter is used to retain state across the coprocessor's whole lifecycle.

package org.apache.hadoop.hbase;

import java.io.IOException;

/**
 * Coprocess interface.
 */
public interface Coprocessor {
    static final int VERSION = 1;

    /** Highest installation priority */
    static final int PRIORITY_HIGHEST = 0;
    /** High (system) installation priority */
    static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
    /** Default installation priority for user coprocessors */
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
    /** Lowest installation priority */
    static final int PRIORITY_LOWEST = Integer.MAX_VALUE;

    /**
     * Lifecycle state of a given coprocessor instance.
     */
    public enum State {
        UNINSTALLED,
        INSTALLED,
        STARTING,
        ACTIVE,
        STOPPING,
        STOPPED
    }

    // Interface
    void start(CoprocessorEnvironment env) throws IOException;
    void stop(CoprocessorEnvironment env) throws IOException;
}
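
To make the "lower value, higher priority" rule concrete, here is a minimal sketch in plain Java with no HBase dependency. The constant values are copied from the interface above; the `Loaded` class and the `executionOrder` method are hypothetical names invented for this illustration and are not part of HBase. It is a simplified model of the ordering, not the framework's actual host code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PriorityOrderSketch {
    // Values copied from the Coprocessor interface shown above.
    static final int PRIORITY_HIGHEST = 0;
    static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
    static final int PRIORITY_LOWEST = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a loaded coprocessor: a name plus its priority. */
    static final class Loaded {
        final String name;
        final int priority;

        Loaded(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    /** Returns the names sorted by ascending priority value (highest priority first). */
    static List<String> executionOrder(List<Loaded> loaded) {
        List<Loaded> sorted = new ArrayList<>(loaded);
        // List.sort is stable, so entries with equal priority keep their load order.
        sorted.sort(Comparator.comparingInt((Loaded l) -> l.priority));
        List<String> names = new ArrayList<>();
        for (Loaded l : sorted) {
            names.add(l.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Loaded> loaded = new ArrayList<>();
        loaded.add(new Loaded("userObserver", PRIORITY_USER));
        loaded.add(new Loaded("lowestObserver", PRIORITY_LOWEST));
        loaded.add(new Loaded("systemObserver", PRIORITY_SYSTEM));
        // systemObserver comes first, then userObserver, then lowestObserver.
        System.out.println(executionOrder(loaded));
    }
}
```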

Coprocessor Loading

Coprocessors can be loaded statically or dynamically.

Static loading: add configuration like the following to hbase-site.xml:

<property>
<name>hbase.coprocessor.region.classes</name>
<value>coprocessor.RegionObserverExample,coprocessor.AnotherCoprocessor</value>
</property>
<property>
<name>hbase.coprocessor.master.classes</name>
<value>coprocessor.MasterObserverExample</value>
</property>
<property>
<name>hbase.coprocessor.wal.classes</name>
<value>coprocessor.WALObserverExample,bar.foo.MyWALObserver</value>
</property>

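Dynamic loading: done through the interface offered by the table descriptor. The example below creates the table testtable and dynamically loads RegionObserverExample into that table's regions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import coprocessor.RegionObserverExample;

public class LoadWithTableDescriptorExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Location of the jar that contains the coprocessor
        Path path = new Path(fs.getUri() + Path.SEPARATOR + "test/coprocessor/" +
                "test.jar");
        // Describe the table and add one column family
        HTableDescriptor htd = new HTableDescriptor("testtable");
        htd.addFamily(new HColumnDescriptor("colfam1"));
        // Declare the coprocessor to load, in the form jarPath|className|priority
        htd.setValue("COPROCESSOR$1", path.toString() +
                "|" + RegionObserverExample.class.getCanonicalName() +
                "|" + Coprocessor.PRIORITY_USER);
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Create the table "testtable"
        admin.createTable(htd);
        System.out.println("end");
    }
}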
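
The value stored under COPROCESSOR$1 above is a pipe-separated string of the form jarPath|className|priority. As a rough illustration of that format only, the following plain-Java sketch builds and parses such a string. The class `CoprocessorSpecSketch`, its `Spec` type, and the `format` and `parse` methods are hypothetical helpers written for this note and are not HBase APIs. Falling back to PRIORITY_USER when the priority field is empty is an assumption of the sketch.

```java
public class CoprocessorSpecSketch {
    // Same value as Coprocessor.PRIORITY_USER in the interface shown earlier.
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;

    /** Hypothetical holder for the three fields of a table-level coprocessor attribute. */
    static final class Spec {
        final String jarPath;
        final String className;
        final int priority;

        Spec(String jarPath, String className, int priority) {
            this.jarPath = jarPath;
            this.className = className;
            this.priority = priority;
        }
    }

    /** Builds the attribute value the same way the example above concatenates it. */
    static String format(String jarPath, String className, int priority) {
        return jarPath + "|" + className + "|" + priority;
    }

    /** Splits jarPath|className|priority; extra trailing fields are ignored. */
    static Spec parse(String value) {
        // The -1 limit keeps empty fields, so "|Class||" still yields four parts.
        String[] parts = value.split("\\|", -1);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected jarPath|className|priority: " + value);
        }
        String className = parts[1].trim();
        if (className.isEmpty()) {
            throw new IllegalArgumentException("class name is required: " + value);
        }
        String rawPriority = parts[2].trim();
        // Assumption of this sketch: an empty priority means the user default.
        int priority = rawPriority.isEmpty() ? PRIORITY_USER : Integer.parseInt(rawPriority);
        return new Spec(parts[0].trim(), className, priority);
    }

    public static void main(String[] args) {
        String value = format("hdfs://master:9000/test/coprocessor/test.jar",
                "coprocessor.RegionObserverExample", PRIORITY_USER);
        Spec spec = parse(value);
        System.out.println(spec.jarPath + " / " + spec.className + " / " + spec.priority);
    }
}
```

An empty jar path is allowed by the sketch because a class that is already on the region server's classpath needs no jar location.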

Below is the implementation of the RegionObserverExample class. Once it compiles, package the class as test.jar and upload it to the hdfs://master:9000/test/coprocessor directory.

package coprocessor;

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionObserverExample extends BaseRegionObserver {
    public static final byte[] FIXED_ROW =
            Bytes.toBytes("@@@GETTIME@@@");

    // Purpose: when a get asks for the row "@@@GETTIME@@@",
    // return the system time as a byte array.
    @Override
    public void preGet(
            final ObserverContext<RegionCoprocessorEnvironment> e,
            final Get get, final List<KeyValue> results) throws IOException {
        if (Bytes.equals(get.getRow(), FIXED_ROW)) {
            // Row, family, and qualifier all reuse FIXED_ROW; the value is the current time.
            KeyValue kv = new KeyValue(get.getRow(), FIXED_ROW,
                    FIXED_ROW,
                    Bytes.toBytes(System.currentTimeMillis()));
            results.add(kv);
        }
    }

    public static void main(String args[]) {
        System.out.println("complete!");
    }
}

Endpoints

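The RegionObserver example above adds a computed column to a get request for a known row key. That might look sufficient for other features as well, such as an aggregation function that returns the sum of all values in a given column. RegionObserver cannot do this, however: the row key determines which region handles the request, so the computation request can only be sent to a single server.

To overcome this limitation of RegionObserver, the coprocessor framework provides a dynamic call implementation, known as the endpoint concept.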

The CoprocessorProtocol interface

The BaseEndpointCoprocessor class

Implementing an endpoint takes the following two steps:

1.Extend the CoprocessorProtocol interface

2.Extend the BaseEndpointCoprocessor class

Below is a small example in which the client uses remote calls to retrieve the number of rows and the number of KeyValues in each region.

1.RowCountProtocol interface, code:

public interface RowCountProtocol extends CoprocessorProtocol {
    // Returns the number of rows
    long getRowCount() throws IOException;
    // Returns the number of rows in the result set after applying the filter
    long getRowCount(Filter filter) throws IOException;
    // Returns the number of KeyValues
    long getKeyValueCount() throws IOException;
}

2.RowCountEndPoint class, code:

public class RowCountEndPoint extends BaseEndpointCoprocessor implements
        RowCountProtocol {

    @Override
    public long getRowCount() throws IOException {
        // FirstKeyOnlyFilter returns only the first KeyValue of each row
        return this.getRowCount(new FirstKeyOnlyFilter());
    }

    @Override
    public long getRowCount(Filter filter) throws IOException {
        return this.getRowCount(filter, false);
    }

    @Override
    public long getKeyValueCount() throws IOException {
        return this.getRowCount(null, true);
    }

    public long getRowCount(Filter filter, boolean countKeyValue) throws IOException {
        Scan scan = new Scan();
        scan.setMaxVersions(1);
        if (filter != null) {
            scan.setFilter(filter);
        }
        RegionCoprocessorEnvironment environment =
                (RegionCoprocessorEnvironment) this.getEnvironment();
        // Scan with the region's internal scanner.
        InternalScanner scanner = environment.getRegion().getScanner(scan);
        long result = 0;
        try {
            boolean hasMore;
            List<KeyValue> curValue = new ArrayList<KeyValue>();
            do {
                curValue.clear();
                // next() returns true while more rows remain after this one
                hasMore = scanner.next(curValue);
                // Skip an empty batch so that an empty region counts as 0
                if (!curValue.isEmpty()) {
                    result += countKeyValue ? curValue.size() : 1;
                }
            } while (hasMore);
        } finally {
            // Errors propagate to the caller instead of yielding a partial count
            scanner.close();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("success!");
    }
}
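
The counting loop above depends on one detail of the scanner contract: next() fills the list with one row and returns whether more rows remain, so the last row arrives together with false. The sketch below reproduces that loop in plain Java against a hypothetical in-memory scanner (the `RowScanner` interface and `scannerOver` helper are invented for this illustration and only imitate InternalScanner). It also skips an empty batch, which is one way to make an empty region count as zero rather than one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ScanLoopSketch {
    /**
     * Hypothetical stand-in for InternalScanner: next() appends the cells of one row
     * to out and returns true while more rows remain after that row.
     */
    interface RowScanner {
        boolean next(List<String> out);
    }

    /** Builds an in-memory scanner over rows, each row being a list of cell names. */
    static RowScanner scannerOver(List<List<String>> rows) {
        Iterator<List<String>> it = rows.iterator();
        return out -> {
            if (it.hasNext()) {
                out.addAll(it.next());
            }
            return it.hasNext();
        };
    }

    /** Counts rows, or cells when countCells is true, using a do-while scan loop. */
    static long count(RowScanner scanner, boolean countCells) {
        long result = 0;
        List<String> current = new ArrayList<>();
        boolean hasMore;
        do {
            current.clear();
            hasMore = scanner.next(current);
            // An empty batch means no row was returned, so nothing is counted.
            if (!current.isEmpty()) {
                result += countCells ? current.size() : 1;
            }
        } while (hasMore);
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("r1:a", "r1:b"),
                Arrays.asList("r2:a"));
        System.out.println("rows=" + count(scannerOver(rows), false));
        System.out.println("cells=" + count(scannerOver(rows), true));
        System.out.println("empty=" + count(scannerOver(new ArrayList<>()), false));
    }
}
```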

3.1 Package the classes above into my_coprocessor.jar and copy it to the $HBASE_HOME/lib directory on every RegionServer node;

3.2 Edit the $HBASE_HOME/conf/hbase-site.xml configuration file and add the following:

<property>
<name>hbase.coprocessor.region.classes</name>
<value>
coprocessor.RegionObserverExample,
coprocessor.RowCountEndPoint
</value>
</property>

3.3 Restart the HBase cluster

4. Call the endpoint coprocessor defined above from the client

public class EndPointExample {
/**
* @author mango_song
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
// TODO Auto-generated method stub
Configuration conf = HBaseConfiguration.create();
HTable table =new HTable(conf,"test");
try {
//
/* Description of the table.coprocessorExec function:
* <RowCountProtocol, Long> Map<byte[], Long> org.apache.hadoop.hbase.client.HTable.coprocessorExec(
* Class<RowCountProtocol> protocol,
* byte[] startKey, byte[] endKey,
* Call<RowCountProtocol, Long> callable)
* throws IOException, Throwable
Invoke the passed org.apache.hadoop.hbase.client.coprocessor.Batch.Call
against the CoprocessorProtocol instances running in the selected regions.
All regions beginning with the region containing the startKey row,
through to the region containing the endKey row (inclusive) will be used.
If startKey or endKey is null, the first and last regions in the table,
respectively, will be used in the range selection.
Specified by: coprocessorExec(...) in HTableInterface
Parameters:
protocol the CoprocessorProtocol implementation to call
startKey start region selection with region containing this row
endKey select regions up to and including the region containing this row
callable wraps the CoprocessorProtocol implementation method calls made per-region
Returns:
a Map of region names to org.apache.hadoop.hbase.client.coprocessor.Batch.Call.call(Object) return values
Throws:
IOException
Throwable
*/
Map<byte[], Long> results=table.coprocessorExec(
RowCountProtocol.class,
null,
null,
new Batch.Call<RowCountProtocol, Long>() {
@Override
public Long call(RowCountProtocol instance)
throws IOException {
// TODO Auto-generated method stub
return instance.getRowCount();
}
}
);
long total =0;
// Print the row count of each region and the total row count
for(Map.Entry<byte[], Long> entry:results.entrySet() ){
total += entry.getValue();
System.out.println("Region: "+Bytes.toString(entry.getKey()) +
", Count: "+entry.getValue());
}
System.out.println("Total Count: "+total);
} catch (Throwable e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

The output is shown below. The test table consists of three regions, which hold 9, 13, and 78 rows respectively.

13/01/26 18:59:53 INFO zookeeper.ClientCnxn: Opening socket connection to server master/172.21.15.21:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
13/01/26 18:59:53 INFO zookeeper.ClientCnxn: Socket connection established to master/172.21.15.21:2181, initiating session
13/01/26 18:59:53 INFO zookeeper.ClientCnxn: Session establishment complete on server master/172.21.15.21:2181, sessionid = 0x13c6a82639f000c, negotiated timeout = 40000
Region: test,,1358337586380.f3e04b8b43d073a509e9a374f643277a., Count: 9
Region: test,209,1358337769870.be5a99319eca6f2881ccd73789bfafb0., Count: 13
Region: test,222,1358337769870.94685f417a95e91d0c9185a95974f866., Count: 78
Total Count: 100

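The Batch class offers a more convenient way to reach a remote endpoint: Batch.forMethod() returns a preconfigured Batch.Call instance that can be passed to the remote region servers. EndPointExample is modified below to use it, which reads much more cleanly. Note that forMethod() is given the protocol interface (RowCountProtocol), because the call is invoked on a proxy of that interface rather than on the implementing class.

Batch.Call<RowCountProtocol, Long> call =
        Batch.forMethod(RowCountProtocol.class, "getKeyValueCount");
Map<byte[], Long> results = table.coprocessorExec(
        RowCountProtocol.class,
        null,
        null,
        call
);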

Implementing Batch.Call directly, however, is more flexible and powerful, because you can perform additional processing on the results. The following example retrieves the row count and the KeyValue count in a single call:

Map<byte[],Pair<Long,Long>> results=table.coprocessorExec(
RowCountProtocol.class,
null,
null,
new Batch.Call<RowCountProtocol,Pair<Long,Long>>() {
@Override
public Pair<Long, Long> call(RowCountProtocol instance)
throws IOException {
// TODO Auto-generated method stub
return new Pair<Long, Long>(
instance.getRowCount(),
instance.getKeyValueCount()
);
}
}
);
//
long totalRows=0;
long totalKeyValues=0;
for(Map.Entry<byte[], Pair<Long,Long>> entry:results.entrySet() ){
totalRows+=entry.getValue().getFirst();
totalKeyValues+=entry.getValue().getSecond();
System.out.println("region="+Bytes.toString(entry.getKey())+
" , rowCount="+entry.getValue().getFirst()+
" , keyValueCount="+entry.getValue().getSecond());
}
System.out.println("totalRows="+totalRows+
",totalKeyValues="+totalKeyValues);

We can also obtain a client-side proxy for the endpoint with the coprocessorProxy() method. Through the proxy you can run the operation on the region that holds a given row key (if the row key does not exist, the target is the region whose key range contains it).

RowCountProtocol protocol=table.coprocessorProxy(RowCountProtocol.class, Bytes.toBytes("202"));
long rowsInRegion = protocol.getRowCount();
System.out.println("Region Row Count: "+rowsInRegion);
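
To see why row key "202" resolves to a region even if no such row exists, the following plain-Java sketch models region lookup with a sorted map of region start keys. The start keys ("", "209", "222") are taken from the sample output earlier in this article; the class `RegionLookupSketch`, its method names, and the region labels are hypothetical. String ordering is used here as a stand-in for HBase's byte-wise key comparison, which is an assumption that holds for these ASCII keys.

```java
import java.util.TreeMap;

public class RegionLookupSketch {
    /** Sample regions keyed by start key; the labels are made up for the example. */
    static TreeMap<String, String> sampleRegions() {
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "region-1");    // rows before "209"
        regions.put("209", "region-2"); // rows from "209" up to, not including, "222"
        regions.put("222", "region-3"); // rows from "222" onward
        return regions;
    }

    /** Returns the region whose key range contains rowKey, whether or not the row exists. */
    static String regionFor(TreeMap<String, String> regionsByStartKey, String rowKey) {
        // The owning region is the one with the greatest start key <= rowKey.
        // The first region starts at "", so floorEntry never returns null here.
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = sampleRegions();
        System.out.println("202 -> " + regionFor(regions, "202"));
        System.out.println("215 -> " + regionFor(regions, "215"));
        System.out.println("300 -> " + regionFor(regions, "300"));
    }
}
```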

Another way to load a coprocessor dynamically is to alter the table with modifyTable:

public static void main(String[] args) throws MasterNotRunningException,
Exception {
// TODO Auto-generated method stub
byte[] tableName = Bytes.toBytes("userinfo");
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.disableTable(tableName);
HTableDescriptor htd = admin.getTableDescriptor(tableName);
htd.addCoprocessor(AggregateImplementation.class.getName(), new Path("hdfs://master68:8020/sharelib/aggregate.jar"), 1001, null);
//htd.removeCoprocessor(RowCountEndpoint.class.getName());
admin.modifyTable(tableName, htd);
admin.enableTable(tableName);
admin.close();
}


Tags: database, hadoop, hbase, java, big data


Posted by IT瘾 on December 9, 2014 at 01:10:00 AM #

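HBase Coprocessors: Analysis and Programming Practice - 林场 - 博客园

1. Motivation (Why HBase Coprocessor)

HBase is a column-family database, and the shortcomings it is most often criticized for are that "secondary indexes" are hard to build and that sums, counts, sorts, and similar operations are hard to run. For example, in older versions of HBase (before 0.92), counting the total rows of a table required the Counter approach and a MapReduce job. HBase does integrate MapReduce at the data storage layer, which works well for distributed computation over tables. In many cases, though, for simple additions or aggregations, running the computation directly on the server side reduces communication overhead and yields a substantial performance gain. HBase therefore introduced coprocessors in 0.92, enabling some exciting new features: easy secondary indexes, complex filters (predicate pushdown), access control, and more.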

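2. Source of Inspiration

HBase coprocessors were inspired by Jeff Dean's 2009 talk (pp. 66-67). Based on that talk, HBase implemented coprocessors similar to Bigtable's, with the following characteristics:

Arbitrary code can run at each tablet in a tablet server
A high-level call interface for clients (clients call row addresses directly, and calls spanning multiple rows are automatically split into multiple parallel RPCs)
A very flexible data model for building distributed services
Automatic scaling, load balancing, and request routing for applications

HBase's coprocessors are inspired by Bigtable, but the implementation details differ. HBase builds a framework that gives users a library and a runtime environment so that their code can be executed on the HBase region servers and master.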

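3. Implementation Details

Coprocessors come in two scopes: system coprocessors are loaded globally for all tables on the region servers, while table coprocessors let the user specify which table uses a coprocessor. For flexibility, the coprocessor framework provides two different kinds of plugin. One is the observer, which is like a trigger in a relational database. The other is the endpoint, a dynamic endpoint that is somewhat like a stored procedure.

3.1 Observers

The idea behind observers is that users override the coprocessor framework's upcall methods by inserting their own code, while the HBase core code invokes the callbacks when events fire. The framework handles all the details of invoking callbacks; the coprocessor itself only needs to supply the added or changed functionality.

Taking HBase 0.92 as an example, it provides three observer interfaces:

RegionObserver: provides hooks for client data manipulation events such as Get, Put, Delete, and Scan.
WALObserver: provides hooks for WAL-related operations.
MasterObserver: provides hooks for DDL-type operations such as creating, deleting, and modifying tables.

Several of these interfaces can be used in the same place at once and are executed in priority order. Users can build complex HBase feature layers on top of coprocessors. Many HBase events can trigger observer methods, and since HBase 0.92 these events and methods are part of the HBase API. The APIs may change for various reasons, and they differ considerably between versions; see the Javadoc for details.

Figure 1 shows how RegionObserver works. For more detail on observers, see section 9.6.3 of the HBase book.

Figure 1. How RegionObserver works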

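3.2 Endpoints

An endpoint is an interface for dynamic RPC plugins. Its implementation is installed on the server side and can then be invoked through HBase RPC. The client library offers convenient methods for calling these dynamic interfaces: the client can call an endpoint at any time, the implementation is executed remotely at the target region, and the result is returned to the client. Users can combine these plugin interfaces to add entirely new features to HBase. Using an endpoint involves the following steps:

Define a new protocol interface, which must extend CoprocessorProtocol.
Implement the endpoint interface; the implementation is loaded into the region environment and executed there.
Extend the abstract class BaseEndpointCoprocessor.
On the client, call the endpoint through one of two new HBase client APIs. For a single region: HTableInterface.coprocessorProxy(Class<T> protocol, byte[] row). For a range of regions: HTableInterface.coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Batch.Call<T,R> callable).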
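
The range form of the call can be pictured as "pick every region from the one containing startKey through the one containing endKey, run the call on each, and collect one result per region". The plain-Java sketch below models only that selection and fan-out; it performs no RPC and uses no HBase classes. The class `RegionFanOutSketch`, the `RowCounter` interface, and the `exec` and `total` methods are hypothetical names for this illustration, the per-region counts are illustrative values, and String ordering stands in for HBase's byte-wise key comparison.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class RegionFanOutSketch {
    /** Hypothetical per-region endpoint instance that can report its row count. */
    interface RowCounter {
        long getRowCount();
    }

    /**
     * Runs call on every region from the one containing startKey through the one
     * containing endKey (inclusive). A null key selects the first or last region.
     * Regions are keyed by start key, and the first start key is assumed to be "".
     */
    static <T, R> Map<String, R> exec(TreeMap<String, T> regionsByStartKey,
            String startKey, String endKey, Function<T, R> call) {
        String from = startKey == null
                ? regionsByStartKey.firstKey() : regionsByStartKey.floorKey(startKey);
        String to = endKey == null
                ? regionsByStartKey.lastKey() : regionsByStartKey.floorKey(endKey);
        Map<String, R> results = new LinkedHashMap<>();
        for (Map.Entry<String, T> e
                : regionsByStartKey.subMap(from, true, to, true).entrySet()) {
            results.put(e.getKey(), call.apply(e.getValue()));
        }
        return results;
    }

    /** Illustrative table with three regions and made-up row counts. */
    static TreeMap<String, RowCounter> sampleRegions() {
        TreeMap<String, RowCounter> regions = new TreeMap<>();
        regions.put("", () -> 9L);
        regions.put("209", () -> 13L);
        regions.put("222", () -> 78L);
        return regions;
    }

    /** Client-side aggregation: sums the per-region partial results. */
    static long total(Map<String, Long> perRegion) {
        long sum = 0;
        for (long value : perRegion.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> all = exec(sampleRegions(), null, null, RowCounter::getRowCount);
        System.out.println("regions=" + all.size() + ", total=" + total(all));
    }
}
```

Keeping the per-region results in a map, rather than returning only the sum, mirrors the shape of coprocessorExec's return value and leaves the final aggregation to the client.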

Figure 2 shows an example of the overall endpoint call flow:

Figure 2. Example of the endpoint call flow

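4. Programming Practice (Code Example)

In this example we count the rows of an HBase table to see how convenient and powerful coprocessors are. In older versions of HBase we had to write MapReduce code to total the rows of a table; in HBase 0.92 and later, client code alone is enough, which makes it well suited to wrapping in a web service.

4.1 Enabling the Aggregation Coprocessor (Enable Coprocessor Aggregation)

There are two ways. 1. Enable aggregation globally, so that it can operate on the data of all tables. This is done by editing hbase-site.xml and adding the following: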

<property>
   <name>hbase.coprocessor.user.region.classes</name>
   <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
 </property>

2. Enable aggregation for a single table, so that it takes effect only on that table. This is done through the HBase shell.

(1) Disable the table. hbase> disable 'mytable'

(2) Add aggregation. hbase> alter 'mytable', METHOD => 'table_att','coprocessor'=>'|org.apache.hadoop.hbase.coprocessor.AggregateImplementation||'

(3) Re-enable the table. hbase> enable 'mytable'

4.2 Row Count Code (Code Snippet)

public class MyAggregationClient {
    private static final byte[] TABLE_NAME = Bytes.toBytes("mytable");
    private static final byte[] CF = Bytes.toBytes("vent");

    public static void main(String[] args) throws Throwable {
        Configuration customConf = new Configuration();
        customConf.setStrings("hbase.zookeeper.quorum", "node0,node1,node2");
        // Raise the RPC timeout
        customConf.setLong("hbase.rpc.timeout", 600000);
        // Set the scan caching
        customConf.setLong("hbase.client.scanner.caching", 1000);
        Configuration configuration = HBaseConfiguration.create(customConf);
        AggregationClient aggregationClient = new AggregationClient(configuration);
        Scan scan = new Scan();
        // Specify the column family to scan; exactly one family
        scan.addFamily(CF);
        long rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);
        System.out.println("row count is " + rowCount);
    }
}

5. References

[1]Lai, et al.,(2012-02-01),"Coprocessor Introduction : Apache HBase".Available:https://blogs.apache.org/hbase/entry/coprocessor_introduction

[2]Apache.(2012-08-10),"The Apache HBase Reference Guide".Available:http://hbase.apache.org/book.html#coprocessors


Tags: database, hadoop, hbase, java, big data


Posted by IT瘾 on December 9, 2014 at 01:04:00 AM #