hbase.md
- http://www.postgres.cn/downfiles/pg2016conf_day2_s3_am1.pdf
- /home/hadoop/hbase/bin/hbase-daemon.sh start thrift
- -Dhbase.log.dir=/data/hbase/logs
Common data output models (see the client-API sketch after this list for how the first few map onto Get/Scan)
- Key-Value fast output: the simplest KV lookup; concurrency can be very high and latency must be low, e.g. risk control.
- Key-Map fast output: targeted output, e.g. fetching article detail data by article id; an upgraded KV lookup.
- MultiKey-Map batch output: e.g. a typical recommendation feed; an upgraded Key-Map lookup.
- C-List multi-dimensional query output: filter by multiple, possibly very flexible, conditions and page through the matching data. Very common, e.g. recommending products with a given tag or rating, or fetching the products a user bought in some past time window.
- G-Top ranking output: group by some dimensions and show a ranking, e.g. the top-10 hottest posts of a forum.
- G-Count statistical output: data-warehouse style statistical analysis.
- Multi-Table mixed output: multiple tables, each with different query conditions, e.g. a list page with mixed content.
- Term tokenized-search output
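How a couple of these map onto the HBase client API (a sketch only; the table row keys, family, and qualifier names are made up):
import org.apache.hadoop.hbase.client.{Get, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
// Key-Value / Key-Map fast output: a point Get on the row key
val get = new Get(Bytes.toBytes("article-123"))
get.addFamily(Bytes.toBytes("f"))
// MultiKey-Map / C-List output: a bounded prefix Scan with a filter and a page size
val scan = new Scan(Bytes.toBytes("user42-"), Bytes.toBytes("user42."))
scan.setFilter(new PrefixFilter(Bytes.toBytes("user42-")))
scan.setCaching(100)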
Data model -- sparse, distributed, persistent multidimensional sorted map, which is indexed by a row key, column key, and a timestamp
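A minimal sketch of that sorted-map model as Scala type aliases (illustrative only; real keys and values are byte arrays, and HBase orders versions by descending timestamp):
import scala.collection.immutable.SortedMap
// row key -> ("family:qualifier" -> (timestamp -> value))
type Value = Array[Byte]
type Versions = SortedMap[Long, Value]       // keyed by timestamp
type Columns = SortedMap[String, Versions]   // keyed by "family:qualifier"
type TableModel = SortedMap[String, Columns] // keyed by row key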
Basic components
- Table
  - A string name; has a corresponding directory in the filesystem
- Row
  - Within a table, the row determines where data is stored
  - Rows are identified by a unique row key, which is a byte array
- Column Family
  - Data within a row is grouped by column family
  - Column families affect how data is physically stored, so they must be defined up front and are not changed lightly
  - All rows of a table share the same set of column families
  - A row does not have to store data for every column family
  - Each column family has a corresponding path in the filesystem
  - The different families of a row live in the same region
  - Each family gets its own memstore, so more families consume more memory
  - Since HFiles are per-family, with multiple families the data is spread across more HFiles, reducing the chance that a split is triggered
- Column Qualifier
  - A column qualifier is added to a column family to provide an index into a piece of data
  - Columns in HBase are second-class: they can differ from row to row
- Cell
  - The combination of row key, column family, and column qualifier uniquely identifies a cell
- Timestamp
  - By default a cell's versions are identified by timestamp
HBase table creation
- The TTL timeout is measured from the time the column was last updated, not from when it was first created (see the Admin API sketch after this list).
- TTL (Time-To-Live): per-cell expiry (current time - time of last update)
- MIN_VERSIONS: if every stored version is older than the TTL, at least MIN_VERSIONS of the newest versions are still kept, so queries return results even when all the data has outlived the TTL.
- Cell-level TTL check (see the tag-reading loop in the bulkGet example below)
- HBase region pre-splitting
  - If no pre-split is specified at create time, the table starts with a single region, so parallel bulk loads perform poorly: all data funnels into one region, driving up the load on a single node and hurting ingest throughput. A good practice is to pre-split into several regions at creation time.
  - , {NUMREGIONS => 9, SPLITALGO => 'HexStringSplit'}
  - HBase ships with two pre-split algorithms
    - HexStringSplit => a good fit when row keys are prefixed with hexadecimal strings
    - UniformSplit => suits tables queried mostly at random; it splits over raw byte values (0x00~0xFF), right-padded with 00
- rowkey --https://blog.bcmeng.com/post/hbase-rowkey.html#%E7%83%AD%E7%82%B9
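A minimal Admin API sketch combining the TTL, MIN_VERSIONS, and pre-split settings above (the table name 'test:ttl_demo' and the split boundary keys are assumptions; splitting evenly between hex-string keys only approximates HexStringSplit):
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = connection.getAdmin
val desc = new HTableDescriptor(TableName.valueOf("test:ttl_demo"))
val fam = new HColumnDescriptor("f")
fam.setTimeToLive(86400) // seconds, counted from each cell's last update
fam.setMinVersions(1)    // keep at least one version even past the TTL
desc.addFamily(fam)
// pre-split into 9 regions spread evenly between the two boundary keys
admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), 9)
admin.close()
connection.close()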
Wide tables vs. tall tables -- https://www.jianshu.com/p/44ffeac1601a
- In HBase a wide table has many columns and few rows: each row holds a lot of data. A tall table has many rows and few columns: each row holds little data. See the sketch after this list.
- HBase's row key is the distributed index and also the basis for sharding.
- Query performance
  - Tall tables are better, because the query conditions live in the row key, which is part of the global distributed index. A tall-table row holds little data, so the BlockCache can cache more rows and row-oriented throughput is higher.
- Sharding
  - Tall tables shard at a finer granularity and the shards are more evenly sized, because a tall-table row is small, a wide-table row is large, and HBase shards by row.
- Metadata overhead
  - Tall tables have higher metadata overhead. More rows means more row keys and potentially more regions, so the -ROOT- and .META. tables grow. Excessive metadata can destabilize the cluster and put a heavier burden on the master.
- Transactions
  - Wide tables are better for transactions. A single-row write (Put) is atomic: either all columns of the row are written or none are. There is no transactional guarantee across rows.
- Compression ratio
  - If data is compressed within a row, wide tables achieve a higher compression ratio.
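A sketch of the two layouts for the same logical data (hypothetical row keys and qualifiers; 'f' is the family used throughout these notes):
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
// tall: one order per row; the order id sits in the row key, so it is indexed and shardable
val tallPut = new Put(Bytes.toBytes("user42-order001"))
tallPut.addColumn(Bytes.toBytes("f"), Bytes.toBytes("amount"), Bytes.toBytes("9.99"))
// wide: one user per row; each order is a qualifier, so all of a user's orders update atomically
val widePut = new Put(Bytes.toBytes("user42"))
widePut.addColumn(Bytes.toBytes("f"), Bytes.toBytes("order001"), Bytes.toBytes("9.99"))
widePut.addColumn(Bytes.toBytes("f"), Bytes.toBytes("order002"), Bytes.toBytes("19.99"))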
hbase test --http://debugo.com/hbase-shell-cmds/
- create_namespace 'test';
- list_namespace;
- create 'test:mt', 'f'
- describe 'test:mt'
- {NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
- create 'test:mt4', {NAME => 'f', COMPRESSION => 'SNAPPY',TTL => '200000000'}, {NAME=>'t',TTL => '86400'}, {NUMREGIONS => 2, SPLITALGO => 'HexStringSplit'}
- put 'test:mt2',123,'f:20141224',6 ,1432483200000
- scan 'test:mt2'
- put 'test:mt2',123,'f:20141224',6
- scan 'test:mt2'
- put 'test:mt2','a123-1','f:20141224',0
- put 'test:mt2','a123-2','f:20141224',0
- put 'test:mt2','b223','f:20141224',6
- scan 'test:mt2', { STARTROW => 'a'}
- scan 'test:mt2', { STARTROW => 'a',FILTER=>"PrefixFilter('a')"}
- scan 'test:mt2', {RAW => true}
- scan 'test:mt2', { STARTROW => 'a123-',STOPROW=> 'a123.', FILTER=>"PrefixFilter('a123-') AND ValueFilter(=,'binary:0')" }
- scan 'test:mt2' ,{LIMIT =>1 , FILTER=>"ValueFilter(=,'binary:6')"}
- get 'test:mt2' ,'a123'
- get 'test:mt2' ,'a123',FILTER=>"ValueFilter(=,'binary:6')"
- get 'test:mt2' ,'row-key-6',FILTER=>"SKIP ValueFilter(!=,'binary:7')"
- FILTER=>"ColumnPrefixFilter('birth') AND ValueFilter ValueFilter(=,'substring:1987')"
- scan 'test:mt2', { STARTROW => 'b223-',STOPROW=> 'b223.', FILTER=>"FirstKeyOnlyFilter() AND ValueFilter(>=,'binary:0') ",LIMIT => 3 }
- scan 'test:mt2', { STARTROW => 'row-key-',STOPROW=> 'row-key.', FILTER=>"SKIP ValueFilter(>=,'binary:0') ",LIMIT => 3 }
- scan 'test:mt2', { STARTROW => 'row-key-',STOPROW=> 'row-key.', FILTER=>" RowFilter( = ,'regexstring:y') ",LIMIT => 3 }
- scan 'test:mt2',{REVERSED => true}
- scan 'test:mt2', { STARTROW => 'row-key-',STOPROW=> 'row-key.', FILTER=>"SKIP ValueFilter(!=,'binary:-1') AND QualifierFilter(=,'binary:ugs')",LIMIT => 100 }
- put 'test:mt2','row-key-15','f:20141224',0 , {'TTL'=>10000}
- scan 'dyd:user_recommend_posts', { STARTROW => '1e156267adf0a575-',STOPROW=> '1e156267adf0a575.', FILTER=>" SingleColumnValueFilter('f','d',=,'binary:-1',true,true) AND QualifierFilter(=,'binary:cb')",LIMIT => 5000 }
- compact 'dyd:user_recommend_posts', 'f'
- major_compact 'dyd:user_recommend_posts', 'f'
- get 'test:mt2','3a811367adf0a575' , { FILTER=> "(ValueFilter(=,'substring:cus_cf') AND ColumnCountGetFilter(100) )OR (ValueFilter(=,'substring:ugs') AND ColumnCountGetFilter(100) ) "}
- disable 'test:mt2'
- drop 'test:mt2'
- delete 'test:mt2','a123-2', 'f:20141224'
- hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'dyd:user_recommend_posts'
filter
- https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hbase_filtering.html
- http://hbase.apache.org/0.94/book/thrift.html
- http://blog.csdn.net/cnweike/article/details/42920547
Storage structure -- http://blog.csdn.net/qq280929090/article/details/56302851
key
// MD5-hash a key (Java); the digest, or a prefix of it, can be prepended to the row key to avoid hotspotting
import java.security.MessageDigest;
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s)); // s is the original key; Bytes is org.apache.hadoop.hbase.util.Bytes
hbase spark
- https://hbase.apache.org/book.html#spark
- https://www.iteblog.com/archives/1891.html
- https://hbase.apache.org/book.html#_bulk_load
- https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/example/hbasecontext/HBaseBulkGetExample.scala
// load data: bulkGet a batch of row keys and print each cell, including its TTL tag when present
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.hadoop.hbase.{CellUtil, TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.client.Result
// sample row keys to fetch (note: the same key is appended three times)
val a = scala.collection.mutable.ArrayBuffer[Array[Byte]]()
for(i <- 0 to 2){
  val k = "cb1f97e101647575-ffc81547adf0a575"
  a.append(Bytes.toBytes(k))
}
val rdd = sc.parallelize(a)
val conf = HBaseConfiguration.create()
val hBaseContext = new HBaseContext(sc, conf)
val tableName = "dyd:user_recommend_posts"
val getRdd = hBaseContext.bulkGet[Array[Byte], String](
TableName.valueOf(tableName),
1000,
rdd,
record => {
val get = new Get(record)
get.addColumn(Bytes.toBytes("f"),Bytes.toBytes("test"))
get.addColumn(Bytes.toBytes("f"),Bytes.toBytes("ugs"))
},
(result: Result) => {
val b = new StringBuilder
if(result != null && result.listCells() != null) {
val it = result.listCells().iterator()
while (it.hasNext) {
val cell = it.next()
val i = CellUtil.tagsIterator(cell.getTagsArray(), cell.getTagsOffset(), cell.getTagsLength()) // cell tags carry the per-cell TTL, if one was set
b.append("cell ttl " + ":" + i.hasNext() +" ")
while (i.hasNext()) {
val t = i.next()
if (TagType.TTL_TAG_TYPE == t.getType()) {
val ts = cell.getTimestamp()
val ttl = Bytes.toLong(t.getBuffer(), t.getTagOffset(), t.getTagLength())
b.append("cell ttl " + ":" + ttl)
}
}
b.append(Bytes.toString(result.getRow) + ":")
val q = Bytes.toString(CellUtil.cloneQualifier(cell))
if (q.equals("counter")) {
b.append("(" + q + "," + Bytes.toLong(CellUtil.cloneValue(cell)) + ")")
} else {
b.append("(" + q + "," + Bytes.toString(CellUtil.cloneValue(cell)) + ")")
}
}
}
b.toString()
})
getRdd.collect().foreach(v => println(v))
//save data
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.hadoop.hbase.{TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.spark.SparkConf
val tableName = "dyd:user_recommend_posts"
val columnFamily = "f"
val rdd = sc.parallelize(Array(
(Bytes.toBytes("cb1f97e101647575-ffc81547adf0a575"),
Array((Bytes.toBytes(columnFamily), Bytes.toBytes("test343"), Bytes.toBytes("1"))))
))
val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, conf)
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
TableName.valueOf(tableName),
(putRecord) => {
val put = new Put(putRecord._1)
put.setTTL(5000) // per-cell TTL, in milliseconds
putRecord._2.foreach((putValue) =>
put.addColumn(putValue._1, putValue._2, putValue._3))
put
});
//--- UserPostIdArray and doResultSave below are project-specific helpers defined elsewhere
val df = sc.parallelize(Array(
UserPostIdArray(6293258517833034308L,Array(6293274501954834378L,6293274501954967548L,6293274501955275403L)),
UserPostIdArray(6293207802437871924L,Array(6293274501954834378L,6293274501954967548L,6293274501955275403L)),
UserPostIdArray(6293245510875449232L,Array(6293274501954834378L,6293274501954967548L,6293274501955275403L))
)).toDF()
doResultSave(df,spark,"20180119")
//======================== bulkload ========================
// writes one (row, column) KeyValue at a time
// rows must be emitted in row-key order, and the columns within a row must be sorted too
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat, LoadIncrementalHFiles, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
val conf = HBaseConfiguration.create()
val tableName = "test:mt2"
val table = new HTable(conf, tableName)
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
lazy val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
HFileOutputFormat.configureIncrementalLoad(job, table)
// Generate 10 sample rows:
val num = sc.parallelize(1 to 10)
val rdd = num.map(x=>{
val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "f".getBytes(), "c1".getBytes(), "value_xxx".getBytes() )
(new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})
// Directly bulk load into HBase/MapR-DB: write the HFiles to a staging dir, then load them
// (the output dir and the doBulkLoad path must be the same)
job.getConfiguration.set("mapred.output.dir", "/tmp/hbase/test/mt2")
rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
val bulkLoader = new LoadIncrementalHFiles(conf)
bulkLoader.doBulkLoad(new Path("/tmp/hbase/test/mt2"), table)
//======================== batch check and put ========================
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable, Put, RowMutations}
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
val tableName = "test:mt2"
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf(tableName)).asInstanceOf[HTable]
table.setAutoFlushTo(false)            // buffer writes client-side
table.setWriteBufferSize(64*1024*1024) // 64 MB write buffer
// checkAndPut: atomically write only if the current value passes the check
val put = new Put(Bytes.toBytes("row-key-6"))
put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("000132345"), Bytes.toBytes("8"))
table.checkAndPut(Bytes.toBytes("row-key-6"), Bytes.toBytes("f"), Bytes.toBytes("000132345"), CompareOp.NOT_EQUAL, Bytes.toBytes(0), put)
val put2 = new Put(Bytes.toBytes("row-key-6"))
put2.addColumn(Bytes.toBytes("f"), Bytes.toBytes("000212345"), Bytes.toBytes(4))
table.checkAndPut(Bytes.toBytes("row-key-6"), Bytes.toBytes("f"), Bytes.toBytes("000212345"), CompareOp.NOT_EQUAL, Bytes.toBytes(3), put2)
val put3 = new Put(Bytes.toBytes("row-key-16"))
put3.addColumn(Bytes.toBytes("f"), Bytes.toBytes("000212345"), Bytes.toBytes(3))
// a null comparison value means: put only if the column does not exist yet
table.checkAndPut(Bytes.toBytes("row-key-16"), Bytes.toBytes("f"), Bytes.toBytes("000212345"), CompareOp.EQUAL, null, put3)
// BufferedMutator has no checkAndMutate; use Table.checkAndMutate with RowMutations instead
val rm = new RowMutations(Bytes.toBytes("row-key-6"))
val put4 = new Put(Bytes.toBytes("row-key-6"))
put4.addColumn(Bytes.toBytes("f"), Bytes.toBytes("000132345"), Bytes.toBytes("5"))
rm.add(put4)
table.checkAndMutate(Bytes.toBytes("row-key-6"), Bytes.toBytes("f"), Bytes.toBytes("000132345"), CompareOp.NOT_EQUAL, Bytes.toBytes(3), rm)
hbase python --http://hbase.apache.org/
- sudo pip install thrift
- sudo pip install hbase-thrift
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase
from hbase.ttypes import *
transport = TSocket.TSocket('10.10.135.104', 9090)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
transport.open()
client.getTableNames()
tableName = 'test:mt2'
rowKey = 'a123-1'
result = client.getRow(tableName, rowKey)
for r in result:
    print('the row is ', r.row)
    print('the value is ', r.columns.get('f:20141224').value)
happybase-- https://github.com/wbolster/happybase/blob/master/doc/user.rst
- https://www.jianshu.com/p/d2a40f8fd4f6
- pip install happybase
import happybase
pool = happybase.ConnectionPool(size=3, host='10.10.135.104', port=9090, table_prefix=b'test', table_prefix_separator=b':')
with pool.connection() as connection:
    print(connection.tables())
with pool.connection() as connection:
    table = connection.table('mt2')
    row = table.row(b'a123-1')
with pool.connection() as connection:
    table = connection.table('mt2')
    for key, data in table.scan(row_start=b'b223-', row_stop=b'b223.', filter=b"SKIP ValueFilter(>=,'binary:0')", limit=6):
        print(key, data)
with pool.connection() as connection:
    table = connection.table('mt2')
    with table.batch() as b:
        b.put(b'row-key-1', {b'f:col1': b'1'})
        b.put(b'row-key-2', {b'f:col1': b'2'})
        b.put(b'row-key-3', {b'f:col1': b'3'})
        b.put(b'row-key-4', {b'f:col1': b'4'})
        b.put(b'row-key-5', {b'f:col1': b'5'})
HBase internals
scan http://forum.huawei.com/enterprise/zh/thread-327647-1-1.html
Issues
ValueFilter matched every version of the cells, so old values were returned along with the current ones
- put FirstKeyOnlyFilter() at the front of the filter list
Accessing HBase from Spark via the hbase-spark library, which is built against Spark 1.6.0 and Scala 2.10, does not compile against the production Spark 2.0.1 / Scala 2.11
- the production hbase-spark 1.2.0-cdh5.8.0 is also built against 1.6.0 / 2.10, yet it works when used directly from spark-shell
- compiling and packaging in the dev environment fails with: compiled against an incompatible version, org.apache.spark
Spark bulk writes are slow (writing to HBase from Spark in bulk mode)
- long GC pauses
hBaseContext.bulkGet ==> collectAsMap/collect
- ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred.
md5 - python
import hashlib
data = 'This is a md5 test!'
hash_md5 = hashlib.md5(data.encode())  # md5 needs bytes in Python 3
hash_md5.hexdigest()  # hex digest, 32 hex characters
md5 - java-scala
import java.security.MessageDigest
val digest = MessageDigest.getInstance("MD5")
val text = "MD5 this text!"
digest.digest(text.getBytes).map("%02x".format(_)).mkString
Storing application recommendation results
Goals
- by user id + date, fetch the top-N recommended post ids
- sorted by recommendation weight
- once a postid is dispatched, update the recommendation result so the next query filters it out
- recomputed results that were already dispatched and updated must not be returned either
- freshly recomputed results should sort first
rowkey
- userId
- postId
- unified per-batch weight
- time batch
Issues (a row key sketch follows this list)
- pulls must be sorted by batch + weight
- a dispatch update must be able to locate the row for the corresponding userid+postid
- save the key during the scan, so a dispatch can delete directly by key
- a recomputation must update the same postid in the same place
- compare against the existing postids and skip duplicates, guaranteeing a single record per user+postid
- read once
- compare and exclude
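A hedged sketch of one possible row key layout for these goals (the field order, separator, and inverted longs are assumptions, not the shipped design): userId first to group a user's rows, then the inverted batch timestamp so newer batches sort first, then the inverted weight so higher weights sort first, then postId for uniqueness.
import org.apache.hadoop.hbase.util.Bytes
def rowKey(userId: String, batchTs: Long, weight: Long, postId: Long): Array[Byte] =
  Bytes.add(
    Bytes.add(Bytes.toBytes(userId + "-"), Bytes.toBytes(Long.MaxValue - batchTs)),
    Bytes.add(Bytes.toBytes(Long.MaxValue - weight), Bytes.toBytes(postId)))
// scan with STARTROW => userId + "-" to pull newest-batch, highest-weight posts first;
// a dispatch deletes by the full key saved during the scan, matching the note above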