How HBase replication works

baiyangfu

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/baiyangfu_love/article/details/38682349

I read through the documentation to work out how HBase replication works. Some brief notes:

http://hbase.apache.org/book.html#cluster_replication

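HBase replication is master-push: the master (source) cluster pushes edits to its slaves. This works mainly because every region server (RS) has its own WAL. One master cluster can replicate to several slave clusters.

Replication is asynchronous, and the clusters may be located in different places, so the data on a slave cluster is not guaranteed to match the master at any given moment. The goal is eventual consistency.

The replication format is conceptually similar to MySQL's statement-based replication. Instead of SQL statements, the WAL edits themselves (inserts, deletes and cell modifications) are replicated to the slave cluster in order.

A WAL must be kept in HDFS until it has been replicated to every slave cluster.

Each region server records the last replicated position and resumes from there on the next round. The RS maintains a replication queue, and the position for each slave is tracked separately.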

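Life cycle of a WAL edit:

1. A client issues an insert or a delete.

2. The RS writes the operation to the WAL in a format that can be replayed.

3. If the affected cell falls within the replication scope, the operation is added to the replication queue.

4. If the slave RS does not respond, the master-cluster RS selects a new slave RS as the replication target and resends the buffered edits.

5. Meanwhile, WALs are rolled and stored in the queue in ZooKeeper. When an RS archives a log by moving it from its own log directory to a central log directory, the new path is updated in the in-memory queue of the replication thread.

6. When the slave cluster eventually recovers, the master replicates the logs that piled up during the outage through the normal replication process.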

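Replication internals:

All HBase replication state is stored in ZooKeeper, by default under /hbase/replication. This znode has two children: the peers znode and the rs znode.

Manually deleting the /hbase/replication znode can cause replication to lose data.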

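The peers znode:

It is stored in ZooKeeper under /hbase/replication/peers and lists every replication peer together with its state. The value of each peer is its cluster key.

The cluster key contains the peer cluster's ZooKeeper quorum, the ZooKeeper client port, and the base znode of HBase on that cluster: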

/hbase/replication/peers
  /1 [Value: zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase]
  /2 [Value: zk5.host.com,zk6.host.com,zk7.host.com:2181:/hbase]
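
To make the cluster key format concrete, here is a minimal Python sketch that splits a peer value like the ones listed above into its three parts. The names `ClusterKey` and `parse_cluster_key` are made up for this illustration and are not part of HBase.

```python
from typing import List, NamedTuple


class ClusterKey(NamedTuple):
    quorum: List[str]   # ZooKeeper hosts of the peer cluster
    port: int           # ZooKeeper client port
    znode_parent: str   # base znode of HBase on the peer cluster


def parse_cluster_key(value: str) -> ClusterKey:
    """Split 'host1,host2:port:/znode' into its three parts."""
    hosts, port, znode_parent = value.split(":", 2)
    return ClusterKey(hosts.split(","), int(port), znode_parent)


key = parse_cluster_key("zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase")
print(key.quorum)        # ['zk1.host.com', 'zk2.host.com', 'zk3.host.com']
print(key.port)          # 2181
print(key.znode_parent)  # /hbase
```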

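Each peer has a child znode that indicates whether replication to it is enabled. This znode has no children of its own and holds a single Boolean-like value: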

/hbase/replication/peers
  /1/peer-state [Value: ENABLED]
  /2/peer-state [Value: DISABLED]

The rs znode:

The rs znode lists the WALs that still need to be replicated. Each of its children is named after a region server: hostname, client port and start code.

/hbase/replication/rs
  /hostname.example.org,6020,1234
  /hostname2.example.org,6020,2856

Each region server znode contains one WAL replication queue per peer:

/hbase/replication/rs
  /hostname.example.org,6020,1234
    /1
    /2

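This means the WALs of hostname.example.org with start code 1234 must be replicated to peer 1 and peer 2.

Each queue has one child znode per WAL, whose value is the last replicated position in that WAL. The value is updated on every replication round.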

/hbase/replication/rs
  /hostname.example.org,6020,1234
    /1
      23522342.23422 [VALUE: 254]
      12340993.22342 [VALUE: 0]
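
The layout above can be mirrored with nested dictionaries, which makes it easier to see what "one position per WAL per peer" means. This is only an in-memory model for illustration: `pending_wals` and `record_progress` are hypothetical helpers, and the queue for peer 2 is left empty because its contents are not shown in the listing.

```python
# Nested dicts stand in for znodes: server -> peer id -> WAL name -> position.
rs_znodes = {
    "hostname.example.org,6020,1234": {
        "1": {"23522342.23422": 254, "12340993.22342": 0},
        "2": {},  # contents not shown in the listing above
    },
}


def pending_wals(tree, server, peer_id):
    """Return (wal_name, last_replicated_position) pairs queued for one peer."""
    return sorted(tree[server][peer_id].items())


def record_progress(tree, server, peer_id, wal, position):
    """Store the new offset after a batch read from `wal` has been shipped."""
    tree[server][peer_id][wal] = position


server = "hostname.example.org,6020,1234"
print(pending_wals(rs_znodes, server, "1"))
# [('12340993.22342', 0), ('23522342.23422', 254)]
record_progress(rs_znodes, server, "1", "23522342.23422", 512)
```

Because each peer has its own queue, progress recorded for peer 1 never changes what peer 2 still has to replicate.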

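Replication implementation details

Choosing the slave-cluster region servers to replicate to:

When a master-cluster region server needs to replicate data to a slave cluster, it first connects to the slave cluster's ZooKeeper, then scans the rs directory to find all available sinks and randomly picks a subset of them (10% by default). For example, if the slave cluster has 150 region servers, 15 of them are chosen to receive replicated data.

Every master-cluster RS places a ZooKeeper watcher on the slave cluster's ${zookeeper.znode.parent}/rs znode to monitor changes in the slave cluster. When a receiving RS on the slave cluster fails, the master-cluster RS selects a new set of receiving nodes.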
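
The subset selection can be sketched in a few lines of Python. `choose_sinks` is a hypothetical helper, and the rounding rule (round to nearest, at least one sink) is an assumption of this sketch rather than something the documentation states.

```python
import random


def choose_sinks(region_servers, ratio=0.1, rng=random):
    """Pick a random subset of slave region servers to act as sinks."""
    servers = list(region_servers)
    # Assumed rounding: nearest integer, but never fewer than one sink.
    count = max(1, round(len(servers) * ratio))
    return rng.sample(servers, count)


slaves = [f"rs{i}.example.org" for i in range(150)]
sinks = choose_sinks(slaves)
print(len(sinks))  # 15
```

When the watcher reports a change in the slave cluster, calling `choose_sinks` again on the fresh list of region servers would produce the new pool.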

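Keeping track of logs:

Every master-cluster RS has its own znode in ZooKeeper. It contains one znode per peer, and each of those holds a queue of WALs to process.

Each queue tracks the WALs produced by that RS.

When a source is instantiated, it contains the WAL currently being written. On log roll, the new file is added to the queue in every slave cluster's znode just before it becomes available. This guarantees that every source knows about a new log file before edits are appended to it, at the cost of making the roll more expensive. A queue item is discarded once no more entries can be read from its log file and other files are already waiting in the queue.

A log is archived when it is no longer in use or when the number of logs exceeds hbase.regionserver.maxlogs. When a log is archived, the replication source threads are notified. If the log is still in a replication queue, its path is updated in memory. If a source has already finished replicating the log, archiving does not affect its queue. If the log is being read at that moment, the path change is applied atomically so that the reader does not try to open a file that has already been moved.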

Reading, Filtering and Sending Edits.

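This is the most important part of the implementation.

By default, a source reads the WAL and ships entries to the replication stream as fast as it can. Throughput is limited by filtering: only edits that are scoped GLOBAL and do not belong to system (catalog) tables are kept. It is also limited by the size of the buffer of edits per slave, which is 64 MB by default. A master-cluster RS that replicates to three slave clusters therefore holds up to 192 MB of data awaiting replication, not counting the data that was filtered out.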
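
The filter-then-buffer step can be pictured as follows. This is a simplified sketch, not HBase's reader: the `Edit` class, its fields and the string encoding of the scope are all invented for the example, and only the 64 MB default comes from the text.

```python
from dataclasses import dataclass

BUFFER_LIMIT = 64 * 1024 * 1024  # default per-slave buffer size, in bytes


@dataclass
class Edit:
    table: str
    scope: str               # "GLOBAL" or "LOCAL" (hypothetical encoding)
    size: int                # size of the edit in bytes
    is_catalog: bool = False


def build_batch(edits, limit=BUFFER_LIMIT):
    """Collect replicable edits until the buffer limit is reached."""
    batch, used = [], 0
    for edit in edits:
        if edit.scope != "GLOBAL" or edit.is_catalog:
            continue  # filtered out, never shipped
        batch.append(edit)
        used += edit.size
        if used >= limit:
            break  # buffer full: stop reading and ship
    return batch, used


# Three slave clusters, one buffer each:
print(3 * BUFFER_LIMIT // (1024 * 1024))  # 192
```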

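Once this 64 MB buffer is full or the reader reaches the end of the WAL, the source thread stops reading and picks one RS at random from the previously chosen subset. It issues an RPC to that RS. If the RPC succeeds and the WAL file has been read completely, the source deletes the corresponding znode from the replication queue in ZooKeeper; otherwise it records the new offset in the log. If the RPC throws an exception, the source retries 10 times and, if every attempt fails, selects a different RS.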
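
A rough sketch of the ship-and-retry logic, under stated assumptions: `send` and `choose_new_sinks` are hypothetical callables supplied by the caller, an RPC failure is modelled as `ConnectionError`, and the sleep between retries is left out for brevity.

```python
import random

MAX_RETRIES = 10  # retry count given in the text


def ship(batch, sinks, send, choose_new_sinks, rng=random):
    """Send one batch; after 10 failed attempts, pick from a new sink subset."""
    while True:
        sink = rng.choice(sinks)
        for _ in range(MAX_RETRIES):
            try:
                send(sink, batch)
                return sink
            except ConnectionError:
                continue  # same sink, next attempt
        sinks = choose_new_sinks()


def after_ship(queue, wal, offset, reached_end):
    """Update the queue (WAL name -> position) after a successful RPC."""
    if reached_end:
        queue.pop(wal, None)   # file fully read: drop its entry
    else:
        queue[wal] = offset    # remember where to resume
```

Note that this loop never gives up: as long as every sink keeps failing, it keeps selecting new ones, which matches the idea that edits stay queued until the slave cluster comes back.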

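Cleaning logs:

Without replication, the cluster's log-cleaning thread deletes old logs according to the configured TTL. With replication configured, the TTL alone is not enough, because an archived log may have exceeded its TTL while still sitting in a replication queue. So when a log is past its TTL, the cleaning thread looks for it in every replication queue. If no queue contains it, the log is deleted. If a queue does contain it, that queue is remembered, and the next cleaning pass checks the remembered queues first.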
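
The lookup-with-cache behaviour might look roughly like this. It is a simplified sketch: `make_cleaner` and `list_queues` are hypothetical names, each queue is modelled as a plain set of log names, and the cached sets are snapshots that a real cleaner would have to refresh.

```python
def make_cleaner(list_queues):
    """Return a function that decides whether an expired log may be deleted.

    `list_queues()` is a hypothetical callable returning one set of log
    names per replication queue.
    """
    cached = []  # queues already found to contain an expired log

    def can_delete(log_name):
        # Check the queues remembered from earlier passes first.
        if any(log_name in queue for queue in cached):
            return False
        for queue in list_queues():
            if log_name in queue:
                cached.append(queue)
                return False
        return True  # no queue references the log any more

    return can_delete


can_delete = make_cleaner(lambda: [{"wal.1", "wal.2"}, {"wal.3"}])
print(can_delete("wal.1"))  # False
print(can_delete("wal.9"))  # True
```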

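Region server failover:

Every region server in the master cluster watches the other region servers, so it notices as soon as one of them fails. When an RS dies, the others race to create a znode in ZooKeeper that locks the dead RS's znode. The RS that succeeds copies the replication queues from the dead RS's znode into its own, because ZooKeeper does not support renaming queues. Once all queues have been copied, the old ones are deleted.

Next, the master-cluster RS creates one source thread for each copied queue, and each source thread follows the same read/filter/ship pattern. The main difference is that these queues never receive new data, since they do not belong to the new RS. When the reader reaches the end of the last log, the queue's znode is deleted and the RS shuts down that replication source thread.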
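
The queue transfer can be modelled with the same kind of nested dictionary used for znodes (server -> queue id -> WAL name -> position). This is an illustrative sketch only: `claim_queues` is a hypothetical function, a dictionary key stands in for the lock znode that ZooKeeper would create atomically, and the positions are invented. The renaming rule, peer id plus the dead server's name, follows the documentation example that comes next.

```python
def claim_queues(tree, dead_server, survivor):
    """Move the dead server's queues under the survivor if the lock is free."""
    dead = tree[dead_server]
    if "lock" in dead:
        return False             # another region server won the race
    dead["lock"] = survivor      # stand-in for creating the lock znode
    for queue_id, wals in list(dead.items()):
        if queue_id == "lock":
            continue
        # ZooKeeper cannot rename: copy under a new name, delete afterwards.
        tree[survivor][f"{queue_id}-{dead_server}"] = dict(wals)
    del tree[dead_server]        # old znodes go away once every copy exists
    return True


tree = {
    "1.1.1.2,60020,123456790": {"2": {"1.1.1.2,60020.1312": 0}},
    "1.1.1.3,60020,123456630": {"2": {"1.1.1.3,60020.1280": 0}},
}
claim_queues(tree, "1.1.1.2,60020,123456790", "1.1.1.3,60020,123456630")
print(sorted(tree["1.1.1.3,60020,123456630"]))
# ['2', '2-1.1.1.2,60020,123456790']
```

If the survivor later dies as well, applying the same rule again appends its name to the already recovered queue, which is how the long queue names in the example arise.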

Here is an example from the official documentation:

Given a master cluster with 3 region servers replicating to a single slave with id 2, the following hierarchy represents what the znodes layout could be at some point in time. The region servers' znodes all contain a peers znode which contains a single queue. The znode names in the queues represent the actual file names on HDFS in the form address,port.timestamp.

/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1234  (Contains a position)
      1.1.1.1,60020.1265
  1.1.1.2,60020,123456790/
    2/
      1.1.1.2,60020.1214  (Contains a position)
      1.1.1.2,60020.1248
      1.1.1.2,60020.1312
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1280  (Contains a position)

Assume that 1.1.1.2 loses its ZooKeeper session. The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins. It will then start transferring all the queues to its local peers znode by appending the name of the dead server. Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:

/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1234  (Contains a position)
      1.1.1.1,60020.1265
  1.1.1.2,60020,123456790/
    lock
    2/
      1.1.1.2,60020.1214  (Contains a position)
      1.1.1.2,60020.1248
      1.1.1.2,60020.1312
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1280  (Contains a position)

    2-1.1.1.2,60020,123456790/
      1.1.1.2,60020.1214  (Contains a position)
      1.1.1.2,60020.1248
      1.1.1.2,60020.1312

Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too. Some new logs were also created in the normal queues. The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues. The new layout will be:

/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1378  (Contains a position)

    2-1.1.1.3,60020,123456630/
      1.1.1.3,60020.1325  (Contains a position)
      1.1.1.3,60020.1401

    2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
      1.1.1.2,60020.1312  (Contains a position)
  1.1.1.3,60020,123456630/
    lock
    2/
      1.1.1.3,60020.1325  (Contains a position)
      1.1.1.3,60020.1401

    2-1.1.1.2,60020,123456790/
      1.1.1.2,60020.1312  (Contains a position)

Replication Metrics. The following metrics are exposed at the global region server level and (since HBase 0.95) at the peer level:

source.sizeOfLogQueue

number of WALs to process (excludes the one which is being processed) at the Replication source

source.shippedOps

number of mutations shipped

source.logEditsRead

number of mutations read from HLogs at the replication source

source.ageOfLastShippedOp

age of last batch that was shipped by the replication source