Goal:
Understand how Gnocchi's aggregation algorithm works.
Contents:
1 Main entry point
2 Analysis of the MetricProcessor service
3 Analysis of the measure aggregation algorithm: the _compute_and_store_timeseries method
4 Analysis of _get_unaggregated_timeserie_and_unserialize: fetching the unaggregated time series and unserializing it to rebuild a new time series
5 Analysis of ts.set_values: computing the aggregated time series
6 Analysis of _store_unaggregated_timeserie: updating the unaggregated time series
7 Summary
1 Main entry point
gnocchi/cli.py
The main entry point is:
def metricd():
    conf = cfg.ConfigOpts()
    conf.register_cli_opts([
        cfg.IntOpt("stop-after-processing-metrics",
                   default=0,
                   min=0,
                   help="Number of metrics to process without workers, "
                        "for testing purpose"),
    ])
    conf = service.prepare_service(conf=conf)

    if conf.stop_after_processing_metrics:
        metricd_tester(conf)
    else:
        MetricdServiceManager(conf).run()
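In a deployment this function is what the gnocchi-metricd command runs: with no options it starts the MetricdServiceManager below, while --stop-after-processing-metrics N takes the single-process metricd_tester path instead, which is only meant for testing.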
Analysis:
1.1) The key line here is
MetricdServiceManager(conf).run()
The corresponding code is:
class MetricdServiceManager(cotyledon.ServiceManager):
    def __init__(self, conf):
        super(MetricdServiceManager, self).__init__()
        oslo_config_glue.setup(self, conf)

        self.conf = conf
        self.queue = multiprocessing.Manager().Queue()

        self.add(MetricScheduler, args=(self.conf, self.queue))
        self.metric_processor_id = self.add(
            MetricProcessor, args=(self.conf, self.queue),
            workers=conf.metricd.workers)
        if self.conf.metricd.metric_reporting_delay >= 0:
            self.add(MetricReporting, args=(self.conf,))
        self.add(MetricJanitor, args=(self.conf,))

        self.register_hooks(on_reload=self.on_reload)

    def run(self):
        super(MetricdServiceManager, self).run()
        self.queue.close()
Analysis:
1.1.1) The manager registers the following services:
MetricScheduler: periodically pulls pending measures from the incoming storage and puts them on the shared multiprocessing queue.
MetricReporting: every 2 minutes logs the number of metrics and measures that are still waiting to be processed.
MetricJanitor: periodically cleans up the data of metrics that have already been deleted.
And, most importantly, the aggregation service MetricProcessor: it takes the pending work that MetricScheduler put on the multiprocessing queue and performs the final aggregation; a minimal sketch of this wiring is given below.
1.1.2) The rest of this analysis focuses on the MetricProcessor service.
See section 2 for the details.
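To make the wiring concrete, here is a minimal, self-contained sketch (not Gnocchi code; Scheduler, Processor and the batch payload are made-up stand-ins) of a cotyledon ServiceManager sharing a multiprocessing queue between one scheduler service and several processor workers:
import multiprocessing
import time

import cotyledon


class Scheduler(cotyledon.Service):
    """Stand-in for MetricScheduler: pushes batches of metric ids."""
    def __init__(self, worker_id, queue):
        super(Scheduler, self).__init__(worker_id)
        self.queue = queue

    def run(self):
        while True:
            # In Gnocchi the batch comes from the incoming storage driver.
            self.queue.put(["metric-id-1", "metric-id-2"])
            time.sleep(10)


class Processor(cotyledon.Service):
    """Stand-in for MetricProcessor: pops batches and aggregates them."""
    def __init__(self, worker_id, queue):
        super(Processor, self).__init__(worker_id)
        self.queue = queue

    def run(self):
        while True:
            batch = self.queue.get()
            print("worker %s got %s" % (self.worker_id, batch))


if __name__ == "__main__":
    queue = multiprocessing.Manager().Queue()
    sm = cotyledon.ServiceManager()
    sm.add(Scheduler, args=(queue,))
    sm.add(Processor, args=(queue,), workers=4)  # cf. conf.metricd.workers
    sm.run()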
2 Analysis of the MetricProcessor service
Code location: gnocchi/cli.py
The code is:
class MetricProcessor(MetricProcessBase):
    name = "processing"

    def __init__(self, worker_id, conf, queue):
        super(MetricProcessor, self).__init__(worker_id, conf, 0)
        self.queue = queue

    def _run_job(self):
        try:
            try:
                metrics = self.queue.get(block=True, timeout=10)
            except six.moves.queue.Empty:
                # NOTE(sileht): Allow the process to exit gracefully every
                # 10 seconds
                return
            self.store.process_background_tasks(self.index, metrics)
        except Exception:
            LOG.error("Unexpected error during measures processing",
                      exc_info=True)
Analysis:
2.1) The main logic of _run_job is to take a batch of metrics to process from the multiprocessing queue and then call process_background_tasks on self.store, which is the storage driver (Ceph in a typical deployment here); a standalone sketch of this consume loop follows.
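The 10-second timeout is what keeps a worker responsive: queue.get gives up regularly, so the surrounding service loop can notice a shutdown request between batches. A standalone approximation of that consume loop (illustrative names, not Gnocchi's MetricProcessBase):
import queue


def worker_loop(work_queue, process_batch, shutdown_event):
    """Consume batches until asked to stop, in the style of MetricProcessor."""
    while not shutdown_event.is_set():
        try:
            # Block at most 10 seconds so the shutdown flag is re-checked
            # regularly instead of waiting on the queue forever.
            batch = work_queue.get(block=True, timeout=10)
        except queue.Empty:
            continue
        try:
            # In Gnocchi: self.store.process_background_tasks(self.index, metrics)
            process_batch(batch)
        except Exception:
            # A bad batch must not kill the worker; Gnocchi logs it and moves on.
            pass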
2.2) Analysis of the process_background_tasks method
It is defined in
gnocchi/storage/__init__.py
as follows:
class StorageDriver(object):
    def __init__(self, conf, incoming):
        self.incoming = incoming

    def process_background_tasks(self, index, metrics, sync=False):
        """Process background tasks for this storage.

        This calls :func:`process_new_measures` to process new measures

        :param index: An indexer to be used for querying metrics
        :param block_size: number of metrics to process
        :param sync: If True, then process everything synchronously and raise
                     on error
        :type sync: bool
        """
        LOG.debug("Processing new measures")
        try:
            self.process_new_measures(index, metrics, sync)
        except Exception:
            if sync:
                raise
            LOG.error("Unexpected error during measures processing",
                      exc_info=True)
Analysis:
2.2.1) The key step is the call to the process_new_measures method;
see the analysis in 2.3.
2.3 Analysis of the process_new_measures method
gnocchi/storage/_carbonara.py
The code is:
class CarbonaraBasedStorage(storage.StorageDriver):
    def process_new_measures(self, indexer, metrics_to_process,
                             sync=False):
        metrics = indexer.list_metrics(ids=metrics_to_process)
        # This build the list of deleted metrics, i.e. the metrics we have
        # measures to process for but that are not in the indexer anymore.
        deleted_metrics_id = (set(map(uuid.UUID, metrics_to_process))
                              - set(m.id for m in metrics))
        for metric_id in deleted_metrics_id:
            # NOTE(jd): We need to lock the metric otherwise we might delete
            # measures that another worker might be processing. Deleting
            # measurement files under its feet is not nice!
            try:
                with self._lock(metric_id)(blocking=sync):
                    self.incoming.delete_unprocessed_measures_for_metric_id(
                        metric_id)
            except coordination.LockAcquireFailed:
                LOG.debug("Cannot acquire lock for metric %s, postponing "
                          "unprocessed measures deletion", metric_id)

        for metric in metrics:
            lock = self._lock(metric.id)
            # Do not block if we cannot acquire the lock, that means some other
            # worker is doing the job. We'll just ignore this metric and may
            # get back later to it if needed.
            if not lock.acquire(blocking=sync):
                continue
            try:
                locksw = timeutils.StopWatch().start()
                LOG.debug("Processing measures for %s", metric)
                # process_measure_for_metric(self, metric) returns the list of
                # pending measures for this metric; each element is a
                # (timestamp, value) pair, e.g.
                # [(Timestamp('2018-04-19 02:29:04.925214'), 4.785732057729687)]
                with self.incoming.process_measure_for_metric(metric) \
                        as measures:
                    self._compute_and_store_timeseries(metric, measures)
                LOG.debug("Metric %s locked during %.2f seconds",
                          metric.id, locksw.elapsed())
            except Exception:
                LOG.debug("Metric %s locked during %.2f seconds",
                          metric.id, locksw.elapsed())
                if sync:
                    raise
                LOG.error("Error processing new measures", exc_info=True)
            finally:
                lock.release()
Analysis:
2.3.1) The processing works as follows:
Step 1. From the set of metrics with pending measures, find those that have already been deleted from the indexer and remove their pending measures from the incoming storage.
Step 2. Iterate over the remaining metrics, fetch each metric's pending measures from the incoming storage, and call _compute_and_store_timeseries with the metric and its measures to compute and store the time series.
Step 2 breaks down into further sub-steps; its key code is:
            with self.incoming.process_measure_for_metric(metric) \
                    as measures:
                self._compute_and_store_timeseries(metric, measures)
A small sketch of the deleted-metric bookkeeping and the per-metric locking follows below;
the _compute_and_store_timeseries method itself is analyzed in section 3.
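Before moving to section 3, here is a minimal sketch of the two steps above (get_lock and handle are hypothetical stand-ins for Gnocchi's self._lock and the actual processing call):
import uuid


def find_deleted_metric_ids(metrics_to_process, indexed_metrics):
    """Step 1: ids we still have measures for, but that are gone from the
    indexer, are considered deleted (the same set difference as above)."""
    return (set(map(uuid.UUID, metrics_to_process))
            - set(m.id for m in indexed_metrics))


def process_metrics(indexed_metrics, get_lock, handle, sync=False):
    """Step 2: non-blocking per-metric locking; if another worker already
    holds the lock, skip the metric for now and let it be picked up later."""
    for metric in indexed_metrics:
        lock = get_lock(metric.id)
        if not lock.acquire(blocking=sync):
            continue
        try:
            handle(metric)  # in Gnocchi: _compute_and_store_timeseries(...)
        finally:
            lock.release()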
3 Analysis of the measure aggregation algorithm: the _compute_and_store_timeseries method
class CarbonaraBasedStorage(storage.StorageDriver):
    def _compute_and_store_timeseries(self, metric, measures):
        # NOTE(mnaser): The metric could have been handled by
        # another worker, ignore if no measures.
        if len(measures) == 0:
            LOG.debug("Skipping %s (already processed)", metric)
            return
        measures = sorted(measures, key=operator.itemgetter(0))
        agg_methods = list(metric.archive_policy.aggregation_methods)
        block_size = metric.archive_policy.max_block_size
        back_window = metric.archive_policy.back_window
        definition = metric.archive_policy.definition
        try:
            ts = self._get_unaggregated_timeserie_and_unserialize(
                metric, block_size=block_size, back_window=back_window)
        except storage.MetricDoesNotExist:
            try:
                self._create_metric(metric)
            except storage.MetricAlreadyExists:
                # Created in the mean time, do not worry
                pass
            ts = None
        except CorruptionError as e:
            LOG.error(e)
            ts = None
        if ts is None:
            # This is the first time we treat measures for this
            # metric, or data are corrupted, create a new one
            ts = carbonara.BoundTimeSerie(block_size=block_size,
                                          back_window=back_window)
            current_first_block_timestamp = None
        else:
            current_first_block_timestamp = ts.first_block_timestamp()
        # NOTE(jd) This is Python where you need such
        # hack to pass a variable around a closure,
        # sorry.
        computed_points = {"number": 0}
        '''
        _map_add_measures(bound_timeserie):
        bound_timeserie is the time series obtained by merging the existing
        unaggregated time series with the new measures to process. This
        function:
        1. iterates over the archive policy definitions and, for each sampling
           interval (granularity) and aggregation method,
        2. computes the aggregated time series of bound_timeserie,
        3. splits that aggregated series, computes each split's offset and its
           serialized value, and writes the serialized value into the
           corresponding Ceph object at that offset.
        Summary: this function computes the aggregated time series and writes
        them into Ceph objects.
        '''
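        # Illustration (not part of the Gnocchi source): with an archive policy
        # whose aggregation_methods are ['mean', 'max'] and whose definition
        # has two entries (say 300s and 3600s granularity), the callback below
        # fans out into 2 * 2 = 4 _add_measures calls per batch of new
        # measures, matching number_of_operations = len(agg_methods) *
        # len(definition) used for the performance log at the end of the method.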
        def _map_add_measures(bound_timeserie):
            # NOTE (gordc): bound_timeserie is entire set of
            # unaggregated measures matching largest
            # granularity. the following takes only the points
            # affected by new measures for specific granularity
            tstamp = max(bound_timeserie.first, measures[0][0])
            new_first_block_timestamp = bound_timeserie.first_block_timestamp()
            computed_points['number'] = len(bound_timeserie)
            for d in definition:
                '''
                group_serie(self, granularity, start=0):
                1. Given the start timestamp, filter the time series to the
                   points at or after it, and compute the grouped time indexes
                   by rounding each timestamp down to the sampling interval
                   (granularity).
                2. Assign those grouped indexes to the filtered series and
                   de-duplicate the indexes per sampling interval, producing
                   the new index list and the per-interval count list.
                '''
                ts = bound_timeserie.group_serie(
                    d.granularity, carbonara.round_timestamp(
                        tstamp, d.granularity * 10e8))
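                # Illustration (assumes carbonara.round_timestamp floors to the
                # start of the granularity bucket; not Gnocchi source): with
                # d.granularity = 300.0 (5 minutes),
                # round_timestamp(Timestamp('2018-04-19 02:29:04'), 300 * 10e8)
                # gives Timestamp('2018-04-19 02:25:00'); the factor 10e8 (1e9)
                # converts seconds to nanoseconds, the unit pandas timestamps use.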
                '''
                Given an aggregation method, an archive policy definition and
                the grouped time series, compute the aggregated time series and
                write it into Ceph objects.
                _add_measures(self, aggregation, archive_policy_def,
                              metric, grouped_serie,
                              previous_oldest_mutable_timestamp,
                              oldest_mutable_timestamp):
                1. Apply the given aggregation method to the index-grouped time
                   series to obtain the aggregated time series.
                2. Truncate that series according to the archive policy and use
                   the result to build an AggregatedTimeSerie object.
                3. Split the computed series (for example, each split keeps at
                   most 3600 points), compute each split's offset within its
                   target object and its serialized value, then write it to the
                   Ceph object.
                4. Repeat step 3 until every split has been written to Ceph.
                '''
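                # Illustration (arithmetic only, not Gnocchi source): with the
                # 3600-points-per-split figure mentioned above and a 300s
                # granularity, one split object covers 3600 * 300 s =
                # 1,080,000 s, i.e. 12.5 days of aggregated data; coarser
                # granularities cover proportionally longer spans per object.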
                self._map_in_thread(
                    self._add_measures,
                    ((aggregation, d, metric, ts,
                      current_first_block_timestamp,
                      new_first_block_timestamp)
                     for aggregation in agg_methods))

        with timeutils.StopWatch() as sw:
            '''
            set_values(self, values, before_truncate_callback=None,
                       ignore_too_old_timestamps=False):
            1. Starting from the last timestamp lastTime of the unaggregated
               time series, find the timestamp that is divisible by the largest
               sampling interval (e.g. 86400s) and closest to lastTime, and use
               it as the start time firstTime.
            2. From the measures to process, keep only those whose timestamp is
               >= firstTime.
            3. Build a time series from those measures (a list of (timestamp,
               value) tuples), check it for duplicates and monotonicity, then
               combine it with the existing unaggregated time series to obtain
               the newly merged series.
            '''
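            # Illustration (assumes back_window = 0; not Gnocchi source): if the
            # largest granularity is 86400s (one day) and the last point of the
            # unaggregated series is at 2018-04-19 02:29:04, then firstTime is
            # 2018-04-19 00:00:00; with ignore_too_old_timestamps=True, new
            # measures older than that boundary are dropped rather than raising
            # an error.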
            ts.set_values(measures,
                          before_truncate_callback=_map_add_measures,
                          ignore_too_old_timestamps=True)
            elapsed = sw.elapsed()
            number_of_operations = (len(agg_methods) * len(definition))
            perf = ""
            if elapsed > 0:
                perf = " (%d points/s, %d measures/s)" % (
                    ((number_of_operations * computed_points['number']) /
                     elapsed),
                    ((number_of_operations * len(measures)) / elapsed)