参考链接:
http://www.cnblogs.com/xd502djj/archive/2013/12/11/3470074.html
语句参考:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.max.created.files=500000;
set mapred.reduce.tasks = 3000;
INSERT OVERWRITE TABLE sany_online_hive_wj.ecc_wj partition(st_year,st_month,st_day)
select
a.st_pid,
a.st_loginid,
a.st_ma_serialno,
a.st_checkmark,
a.st_state,
a.st_logintime,
a.st_connecttime,
a.st_updatetime,
a.st_totalwktime,
a.st_rmntime,
a.st_berrorcode,
a.st_werrorcode,
a.st_balmcode,
a.st_walmcode,
a.st_longitude,
a.st_latitude,
a.st_saticunt,
a.st_steppos,
a.st_engv,
a.st_oillev,
a.st_batteryvol,
a.st_floatreserv33,
a.st_floatreserv34,
a.re_en_pid ,
a.st_wktime,
a.st_gpssta,
a.st_velocity,
a.st_orientation,
a.st_sgnlq,
a.st_errdealsta,
a.st_cmmctsch,
a.st_altitude,
a.st_uintreserv10,
a.st_uintreserv11,
a.st_uintreserv12,
a.st_uintreserv13,
a.st_uintreserv14,
a.st_uintreserv15,
a.st_uintreserv16,
a.st_uintreserv17,
a.st_uintreserv18,
a.st_uintreserv19,
a.st_uintreserv20,
a.st_uintreserv21,
a.st_uintreserv22,
a.st_uintreserv23,
a.st_uintreserv24,
a.st_uintreserv25,
a.st_uintreserv26,
a.st_uintreserv27,
a.st_uintreserv28,
a.st_uintreserv29,
a.st_uintreserv30,
a.st_uintreserv31,
a.st_uintreserv32,
a.st_floatreserv13,
a.st_floatreserv14,
a.st_floatreserv15,
a.st_floatreserv16,
a.st_floatreserv17,
a.st_floatreserv18,
a.st_floatreserv19,
a.st_floatreserv20,
a.st_floatreserv21,
a.st_floatreserv22,
a.st_floatreserv23,
a.st_floatreserv24,
a.st_floatreserv25,
a.st_floatreserv26,
a.st_floatreserv27,
a.st_floatreserv28,
a.st_floatreserv29,
a.st_floatreserv30,
a.st_floatreserv31,
a.st_floatreserv32,
substring(trim(a.st_updatetime),1,4) st_year,
substring(trim(a.st_updatetime),6,2) st_month,
substring(trim(a.st_updatetime),9,2) st_day
from sany_online_hive_wj.ecc_wj_distinct as a where
substring(trim(a.st_updatetime),1,7)='${YYYY-MM}'
distribute by st_year,st_month,st_day
不加distribute by之前,数据从hive任务的临时结果路径写入数据的分区路径下,速度特别慢,3,40分钟左右,加上后耗时3分钟左右。
具体原因,可以参考:http://blog.csdn.net/xiaolang85/article/details/11767297
另外,hive的表,考虑数据倾斜的情况,最好是将数据均分到表的文件中会好些。
语句参考:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.max.created.files=500000;
set mapred.reduce.tasks = 3000;
INSERT OVERWRITE TABLE sany_online_hive_wj.ecc_wj partition(st_year,st_month,st_day)
select
a.st_pid,
a.st_loginid,
a.st_ma_serialno,
a.st_checkmark,
a.st_state,
a.st_logintime,
a.st_connecttime,
a.st_updatetime,
a.st_totalwktime,
a.st_rmntime,
a.st_berrorcode,
a.st_werrorcode,
a.st_balmcode,
a.st_walmcode,
a.st_longitude,
a.st_latitude,
a.st_saticunt,
a.st_steppos,
a.st_engv,
a.st_oillev,
a.st_batteryvol,
a.st_floatreserv33,
a.st_floatreserv34,
a.re_en_pid ,
a.st_wktime,
a.st_gpssta,
a.st_velocity,
a.st_orientation,
a.st_sgnlq,
a.st_errdealsta,
a.st_cmmctsch,
a.st_altitude,
a.st_uintreserv10,
a.st_uintreserv11,
a.st_uintreserv12,
a.st_uintreserv13,
a.st_uintreserv14,
a.st_uintreserv15,
a.st_uintreserv16,
a.st_uintreserv17,
a.st_uintreserv18,
a.st_uintreserv19,
a.st_uintreserv20,
a.st_uintreserv21,
a.st_uintreserv22,
a.st_uintreserv23,
a.st_uintreserv24,
a.st_uintreserv25,
a.st_uintreserv26,
a.st_uintreserv27,
a.st_uintreserv28,
a.st_uintreserv29,
a.st_uintreserv30,
a.st_uintreserv31,
a.st_uintreserv32,
a.st_floatreserv13,
a.st_floatreserv14,
a.st_floatreserv15,
a.st_floatreserv16,
a.st_floatreserv17,
a.st_floatreserv18,
a.st_floatreserv19,
a.st_floatreserv20,
a.st_floatreserv21,
a.st_floatreserv22,
a.st_floatreserv23,
a.st_floatreserv24,
a.st_floatreserv25,
a.st_floatreserv26,
a.st_floatreserv27,
a.st_floatreserv28,
a.st_floatreserv29,
a.st_floatreserv30,
a.st_floatreserv31,
a.st_floatreserv32,
substring(trim(a.st_updatetime),1,4) st_year,
substring(trim(a.st_updatetime),6,2) st_month,
substring(trim(a.st_updatetime),9,2) st_day
from sany_online_hive_wj.ecc_wj_distinct as a where
substring(trim(a.st_updatetime),1,7)='${YYYY-MM}'
distribute by st_year,st_month,st_day
不加distribute by之前,数据从hive任务的临时结果路径写入数据的分区路径下,速度特别慢,3,40分钟左右,加上后耗时3分钟左右。
具体原因,可以参考:http://blog.csdn.net/xiaolang85/article/details/11767297
另外,hive的表,考虑数据倾斜的情况,最好是将数据均分到表的文件中会好些。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/31347383/viewspace-2125732/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/31347383/viewspace-2125732/