Using Hive dynamic partitions

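This post describes an optimization for inserting data with Hive dynamic partitions: adding a distribute by clause made the write much faster, cutting it from 30-40 minutes to about 3 minutes. It also touches on data skew.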


Reference link: http://www.cnblogs.com/xd502djj/archive/2013/12/11/3470074.html

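Reference statement:
-- Enable dynamic partition inserts.
set hive.exec.dynamic.partition=true;
-- nonstrict mode allows every partition column to be dynamic (strict requires at least one static partition).
set hive.exec.dynamic.partition.mode=nonstrict;
-- Upper limit on the dynamic partitions a single mapper or reducer may create.
set hive.exec.max.dynamic.partitions.pernode=100000;
-- Upper limit on the HDFS files that all mappers and reducers of one job may create.
set hive.exec.max.created.files=500000;
-- Number of reduce tasks for the job.
set mapred.reduce.tasks=3000;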
INSERT OVERWRITE TABLE sany_online_hive_wj.ecc_wj partition(st_year,st_month,st_day)
select
a.st_pid,
a.st_loginid,
a.st_ma_serialno,
a.st_checkmark,
a.st_state,
a.st_logintime,
a.st_connecttime,
a.st_updatetime,
a.st_totalwktime,
a.st_rmntime,
a.st_berrorcode,
a.st_werrorcode,
a.st_balmcode,
a.st_walmcode,
a.st_longitude,
a.st_latitude,
a.st_saticunt,
a.st_steppos,
a.st_engv,
a.st_oillev,
a.st_batteryvol,
a.st_floatreserv33,
a.st_floatreserv34,
a.re_en_pid ,
a.st_wktime,
a.st_gpssta,
a.st_velocity,
a.st_orientation,
a.st_sgnlq,
a.st_errdealsta,
a.st_cmmctsch,
a.st_altitude,
a.st_uintreserv10,
a.st_uintreserv11,
a.st_uintreserv12,
a.st_uintreserv13,
a.st_uintreserv14,
a.st_uintreserv15,
a.st_uintreserv16,
a.st_uintreserv17,
a.st_uintreserv18,
a.st_uintreserv19,
a.st_uintreserv20,
a.st_uintreserv21,
a.st_uintreserv22,
a.st_uintreserv23,
a.st_uintreserv24,
a.st_uintreserv25,
a.st_uintreserv26,
a.st_uintreserv27,
a.st_uintreserv28,
a.st_uintreserv29,
a.st_uintreserv30,
a.st_uintreserv31,
a.st_uintreserv32,
a.st_floatreserv13,
a.st_floatreserv14,
a.st_floatreserv15,
a.st_floatreserv16,
a.st_floatreserv17,
a.st_floatreserv18,
a.st_floatreserv19,
a.st_floatreserv20,
a.st_floatreserv21,
a.st_floatreserv22,
a.st_floatreserv23,
a.st_floatreserv24,
a.st_floatreserv25,
a.st_floatreserv26,
a.st_floatreserv27,
a.st_floatreserv28,
a.st_floatreserv29,
a.st_floatreserv30,
a.st_floatreserv31,
a.st_floatreserv32,
substring(trim(a.st_updatetime),1,4) st_year,
substring(trim(a.st_updatetime),6,2) st_month,
substring(trim(a.st_updatetime),9,2) st_day
from sany_online_hive_wj.ecc_wj_distinct as a where
substring(trim(a.st_updatetime),1,7)='${YYYY-MM}'
distribute by st_year,st_month,st_day;

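Before distribute by was added, writing the data from the Hive job's temporary result path into the partition paths was very slow, taking about 30 to 40 minutes. With distribute by, it takes about 3 minutes.
For the specific reason, see: http://blog.csdn.net/xiaolang85/article/details/11767297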
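
The post leaves the explanation to the link above. The commonly cited reason is that without distribute by, every task may write its own file into every partition it sees, so the final move step has to handle a very large number of small files. Distributing by the partition columns sends all rows of one partition to the same reducer, so each partition ends up with far fewer files. The sketch below shows the pattern on a small scale. The tables demo_src and demo_part and the sample rows are hypothetical and do not come from the original post.

```sql
-- Hypothetical tables for illustration; not part of the original post.
CREATE TABLE IF NOT EXISTS demo_src (id BIGINT, st_updatetime STRING);
CREATE TABLE IF NOT EXISTS demo_part (id BIGINT, st_updatetime STRING)
PARTITIONED BY (st_year STRING, st_month STRING, st_day STRING);

-- Sample rows (INSERT ... VALUES assumes Hive 0.14 or later).
INSERT INTO TABLE demo_src VALUES
  (1, '2016-09-01 08:00:00'),
  (2, '2016-09-01 09:30:00'),
  (3, '2016-09-02 10:15:00');

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition columns come last in the SELECT list, in the same
-- order as in the PARTITION clause. DISTRIBUTE BY routes all rows that share
-- a (year, month, day) key to the same reducer.
INSERT OVERWRITE TABLE demo_part PARTITION (st_year, st_month, st_day)
SELECT
  s.id,
  s.st_updatetime,
  substring(trim(s.st_updatetime), 1, 4) AS st_year,
  substring(trim(s.st_updatetime), 6, 2) AS st_month,
  substring(trim(s.st_updatetime), 9, 2) AS st_day
FROM demo_src s
DISTRIBUTE BY st_year, st_month, st_day;

-- List the partitions that the insert created.
SHOW PARTITIONS demo_part;
```

The trade-off is that one reducer now writes a whole partition, so a very large partition can become the slowest task of the job.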

In addition, with data skew in mind, it is better for a Hive table to have its data spread evenly across the table's files.
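
One common way to spread a skewed partition over several files is to add a salt to the distribute by key, so that one partition's rows are split across several reducers. This is a minimal sketch under assumptions, not the author's method: the tables skew_src and skew_part are hypothetical, and the bucket count 8 is an arbitrary illustration value that would need tuning to the actual data volume.

```sql
-- Hypothetical tables for illustration; not part of the original post.
CREATE TABLE IF NOT EXISTS skew_src (id BIGINT, st_updatetime STRING);
CREATE TABLE IF NOT EXISTS skew_part (id BIGINT, st_updatetime STRING)
PARTITIONED BY (st_year STRING, st_month STRING, st_day STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- pmod(hash(id), 8) yields a salt between 0 and 7. Adding it to the
-- distribution key lets the rows of one partition go to up to 8 reducers,
-- so each partition is written as up to 8 files instead of one.
INSERT OVERWRITE TABLE skew_part PARTITION (st_year, st_month, st_day)
SELECT
  s.id,
  s.st_updatetime,
  substring(trim(s.st_updatetime), 1, 4) AS st_year,
  substring(trim(s.st_updatetime), 6, 2) AS st_month,
  substring(trim(s.st_updatetime), 9, 2) AS st_day
FROM skew_src s
DISTRIBUTE BY st_year, st_month, st_day, pmod(hash(id), 8);
```

A hash of a row column is used for the salt here rather than rand(), because it gives the same value for the same row if a task is retried.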

From the ITPUB blog, link: http://blog.itpub.net/31347383/viewspace-2125732/ . If you repost this, please credit the source; otherwise legal action may be pursued.
