CarbonData: a columnar storage format with deep Spark integration, compatible with the Spark ecosystem (SQL, DataFrame, ML, etc.)
CarbonData file structure
CarbonData creates one file per data block; each file contains a File Header, Blocklets, and a File Footer.
- File Header: stores the format version and schema information
- Blocklet: holds the actual data, up to 64 MB each
- File Footer: stores the index and summary information for the data
Query performance is improved through two levels of indexing:
- Level 1: a file-level index that filters HDFS blocks, avoiding scans of unnecessary files
- Level 2: a blocklet index that filters blocklets inside each file
This greatly reduces unnecessary task launches and disk IO, at the cost of a lower compression ratio and longer load times.
Compared with ORCFile and Parquet, CarbonData invests much more effort in indexing, so queries are fast.
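To illustrate how the index is used, a hedged sketch: CarbonData sorts data on the columns listed in the SORT_COLUMNS table property and builds its index on them, so filters on those columns can prune both files (level 1) and blocklets (level 2). The database, table, and column names below are hypothetical, not from this project.

```scala
// Hypothetical table: declare sort columns so the file- and blocklet-level
// indexes are built on the columns we filter by most often.
carbon.sql(
  s"""
     |CREATE TABLE IF NOT EXISTS demo.events (
     |  event_time STRING,
     |  city STRING,
     |  amount int
     |) STORED BY 'carbondata'
     |TBLPROPERTIES('SORT_COLUMNS'='city,event_time')
   """.stripMargin)

// A filter on a sort column can skip whole HDFS blocks (level 1)
// and blocklets within the surviving files (level 2).
carbon.sql("SELECT count(*) FROM demo.events WHERE city = 'Shenzhen'").show()
```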
CarbonData table creation and data loading
carbon.sql("DROP TABLE IF EXISTS ods.useric")

carbon.sql(
  s"""
     |CREATE TABLE IF NOT EXISTS ods.useric (
     |  Id long,
     |  CustomerId long,
     |  IcBasicInfoId long,
     |  VIPCardId long,
     |  IsActive BOOLEAN,
     |  Remark STRING,
     |  IsDeleted BOOLEAN,
     |  DeleterUserId long,
     |  DeletionTime STRING,
     |  LastModificationTime STRING,
     |  LastModifierUserId long,
     |  CreationTime STRING,
     |  CreatorUserId long,
     |  ParkId int
     |) STORED BY 'carbondata'
     |TBLPROPERTIES('streaming'='true')
   """.stripMargin)

// Load a tab-delimited CSV from HDFS; bad records are redirected to /user/error
carbon.sql(
  s"""
     |LOAD DATA INPATH '/datatest/ods_app_personas/UserIC.csv'
     |INTO TABLE ods.useric
     |OPTIONS('HEADER'='true', 'DELIMITER'='\t', 'SKIP_EMPTY_LINE'='TRUE', 'MULTILINE'='true',
     |'DATEFORMAT'='yyyy-MM-dd', 'TIMESTAMPFORMAT'='yyyy-MM-dd HH:mm:ss',
     |'BAD_RECORDS_LOGGER_ENABLE'='true', 'BAD_RECORD_PATH'='/user/error',
     |'BAD_RECORDS_ACTION'='REDIRECT')
   """.stripMargin)
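After a load it is worth verifying what was actually written; a small sketch using CarbonData's SHOW SEGMENTS statement plus a row count:

```scala
// List the segments created by each load, with their status and load time
carbon.sql("SHOW SEGMENTS FOR TABLE ods.useric").show(false)

// Sanity-check the number of rows actually loaded
carbon.sql("SELECT count(*) FROM ods.useric").show()
```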
Using CarbonData
Maven dependencies (pom.xml):
<dependency>
  <groupId>org.apache.carbondata</groupId>
  <artifactId>carbondata-core</artifactId>
  <version>${carbon.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.carbondata</groupId>
  <artifactId>carbondata-spark2</artifactId>
  <version>${carbon.version}</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
    </exclusion>
  </exclusions>
</dependency>
Initialize the SparkSession
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// store_location: where CarbonData keeps table data; metastore: the Carbon metastore path
val store_location = "hdfs://cdh03:8020/carbon/data"
val metastore = "hdfs://cdh03:8020/carbon"

val carbon = SparkSession
  .builder()
  .appName(groupId) // groupId: application name defined elsewhere
  .master("local[*]")
  .getOrCreateCarbonSession(store_location, metastore)

carbon.sql("xxxx") // run SQL against the CarbonSession
// Write the DataFrame to a CarbonData table
dataDf.write
.format("carbondata")
.option("dbName", CommonConfig.BASIC_CB_DWD)
.option("tableName", TICKET_CLASSIC)
.option("compress", "true")
.mode(SaveMode.Overwrite)
.save()
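A written table can be read back through SQL on the same CarbonSession; a minimal sketch, assuming `CommonConfig.BASIC_CB_DWD` and `TICKET_CLASSIC` are the same database and table name constants passed to the writer above:

```scala
// Read back the table just written; dbName/tableName are the same
// values given to the DataFrame writer.
carbon.sql(s"SELECT * FROM ${CommonConfig.BASIC_CB_DWD}.${TICKET_CLASSIC} LIMIT 10").show()
```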
Submitting a jar as a YARN job
cd /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/spark2
bin/spark-submit xxxx
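A fuller submit command might look like the sketch below; the class name, jar name, and resource sizes are placeholders, not values from this project:

```shell
cd /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/spark2
# Placeholders: adjust --class, the jar, and resource sizes for your job
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.CarbonJob \
  --num-executors 4 \
  --executor-memory 4g \
  my-carbon-job.jar
```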
Using spark-shell
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://cdh03:8020/carbon/data","hdfs://cdh03:8020/carbon")
carbon.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show()
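Since the useric table above was created with 'streaming'='true', rows can also be ingested continuously through Spark Structured Streaming with CarbonData as the sink. A hedged sketch; the input directory and checkpoint path are hypothetical:

```scala
import org.apache.spark.sql.streaming.Trigger

// Hypothetical file source reusing the target table's schema; the CarbonData
// sink needs a checkpoint location plus the target db/table, and the table
// must have been created with 'streaming'='true'.
val query = carbon.readStream
  .format("csv")
  .option("header", "true")
  .schema(carbon.table("ods.useric").schema)
  .load("/datatest/streaming_in") // hypothetical input directory
  .writeStream
  .format("carbondata")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", "/carbon/checkpoints/useric") // hypothetical path
  .option("dbName", "ods")
  .option("tableName", "useric")
  .start()

query.awaitTermination()
```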