1. compute_near_duplicates
Usage
This is a function for detecting potential duplicates within a collection of samples.
Overview
The compute_near_duplicates function detects potentially duplicate or similar images in a given sample collection. It works by computing image embeddings and comparing their pairwise similarity.
Key parameters
Required
- samples: the FiftyOne sample collection containing the images to check for duplicates
Similarity control
- threshold (default 0.2): the similarity distance threshold used to decide duplicates. Recommended values lie in [0.1, 0.25].
  - threshold is the distance cutoff for judging whether two images are "duplicates"
  - If the distance between two images is < threshold, they are flagged as duplicates
  - A smaller distance means the images are more similar
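A toy illustration of the threshold rule (all values below are assumed):
dist = 0.12       # assumed cosine distance between two image embeddings
threshold = 0.2
is_duplicate = dist < threshold  # True: the pair is flagged as a near-duplicate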
Region of interest
- roi_field: optional; specifies a region of interest (ROI) within each image. Supports detection boxes, polylines, and similar label types.
Embeddings
- embeddings:
  - If no model is provided: the precomputed embeddings to use
  - If a model is provided: the name of the field in which to store the computed embeddings
- similarity_index: a precomputed similarity index to reuse
Model
- model: the model used to generate the embeddings
- model_kwargs: parameters passed to the model's configuration
- batch_size: batch size for computing embeddings
- num_workers: number of worker threads for loading images
ROI handling
- force_square: whether to force bounding boxes into squares
- alpha: an expansion/contraction factor for resizing the extracted regions (see the detailed explanation at the bottom)
Other
- skip_failures: whether to skip samples that fail
- progress: progress-bar display settings
Return value
The function returns a SimilarityIndex object, which provides:
- duplicate_ids: the list of duplicate IDs
- neighbors_map: a dict mapping IDs to their similar items
- duplicates_view(): returns a view containing all duplicates
Usage examples
from fiftyone.brain import compute_near_duplicates

# Basic usage
index = compute_near_duplicates(dataset, threshold=0.15)

# Get the duplicates
duplicate_ids = index.duplicate_ids
neighbors = index.neighbors_map

# View all duplicates
duplicates_view = index.duplicates_view()

# Use a pretrained model
index = compute_near_duplicates(
    dataset,
    model="clip-vit-base32-torch",
    threshold=0.2,
)

# Use an ROI
index = compute_near_duplicates(
    dataset,
    roi_field="detections",
    threshold=0.15,
    alpha=0.1,  # expand bounding boxes by 10%
)
How it works
- Embedding computation: uses the specified model or precomputed embeddings
- Similarity computation: computes the distances between samples
- Duplicate detection: identifies similar sample pairs based on the threshold
- Index construction: builds a similarity index for fast querying
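As a minimal sketch of this pipeline (illustrative only, not FiftyOne's implementation), the following substitutes random vectors for model-extracted embeddings and flags every pair under the threshold:
import numpy as np
from sklearn.metrics import pairwise_distances

embeddings = np.random.randn(6, 4)  # stand-in for model-extracted embeddings

# Pairwise cosine distances; mask self-distances with NaN
dists = pairwise_distances(embeddings, metric="cosine")
np.fill_diagonal(dists, np.nan)

# Any pair closer than the threshold is a near-duplicate candidate
threshold = 0.2
dup_pairs = np.argwhere(np.tril(dists < threshold))  # NaN compares as False
print(dup_pairs)  # (i, j) index pairs judged to be duplicates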
2. compute_near_duplicates
The computation flow, based on the source code:
2.1 Source code references
# Main entry point
# File: fiftyone/brain/__init__.py
def compute_near_duplicates(samples, threshold=0.2, ...)

# Core implementation
# File: fiftyone/brain/internal/core/duplicates.py
def compute_near_duplicates(samples, threshold=None, ...):
    """See ``fiftyone/brain/__init__.py``."""

# Similarity computation
# File: fiftyone/brain/internal/core/similarity.py
def compute_similarity(samples, patches_field, roi_field, embeddings, ...):
    """See ``fiftyone/brain/__init__.py``."""

# sklearn backend
# File: fiftyone/brain/internal/core/sklearn.py
class SklearnSimilarityIndex(SimilarityIndex, DuplicatesMixin):
    """Class for interacting with sklearn similarity indexes."""

# Duplicate-detection mixin
# File: fiftyone/brain/similarity.py
class DuplicatesMixin:
    def find_duplicates(self, thresh=None, fraction=None):
        """Queries the index to find near-duplicate examples..."""
2.2 Full computation flow: from embeddings to index
Stage 1: API entry and argument preparation
# File: fiftyone/brain/__init__.py
def compute_near_duplicates(samples, threshold=0.2, ...):
    # Delegate to the internal implementation
    import fiftyone.brain.internal.core.duplicates as fbd
    return fbd.compute_near_duplicates(samples, threshold, ...)
Stage 2: the main duplicate-detection flow
# File: fiftyone/brain/internal/core/duplicates.py
def compute_near_duplicates(samples, threshold=None, ...):
    # 2.1 Validate the sample collection
    fov.validate_collection(samples)

    # 2.2 Handle the embeddings argument
    if etau.is_str(embeddings):
        embeddings_field, embeddings_exist = fbu.parse_data_field(
            samples, embeddings, data_type="embeddings"
        )
        embeddings = None

    # 2.3 Load an existing similarity index, if a brain key was given
    if etau.is_str(similarity_index):
        similarity_index = samples.load_brain_results(similarity_index)

    # 2.4 Fall back to the default model
    if (model is None and embeddings is None
            and similarity_index is None and not embeddings_exist):
        model = _DEFAULT_MODEL  # "resnet18-imagenet-torch"

    # 2.5 Compute the similarity index
    if similarity_index is None:
        similarity_index = fb.compute_similarity(
            samples,
            backend="sklearn",
            roi_field=roi_field,
            embeddings=embeddings_field or embeddings,
            model=model,
            ...
        )

    # 2.6 Find the duplicates
    similarity_index.find_duplicates(thresh=threshold)

    return similarity_index
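Because step 2.3 accepts a brain key string and loads it via load_brain_results, a previously saved similarity run can be reused so that embeddings are not recomputed. A sketch, where the brain key "img_sim" is an assumed name:
import fiftyone.brain as fob

# One-time: compute and persist a similarity index under a brain key
fob.compute_similarity(dataset, backend="sklearn", brain_key="img_sim")

# Later: reuse the stored index for duplicate detection
index = fob.compute_near_duplicates(
    dataset, similarity_index="img_sim", threshold=0.2
)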
Stage 3: similarity computation in detail
# File: fiftyone/brain/internal/core/similarity.py
def compute_similarity(samples, patches_field, roi_field, embeddings, ...):
    # 3.1 Validate arguments
    fova.validate_collection(samples)
    if roi_field is not None:
        fova.validate_collection_label_fields(
            samples, roi_field, _ALLOWED_ROI_FIELD_TYPES
        )

    # 3.2 Handle the embeddings-field argument
    embeddings_field = kwargs.pop("embeddings_field", None)
    if embeddings_field is not None or etau.is_str(embeddings):
        if embeddings_field is None:
            embeddings_field = embeddings
            embeddings = None

        # Check whether the embeddings already exist on the dataset
        embeddings_field, embeddings_exist = fbu.parse_data_field(
            samples,
            embeddings_field,
            patches_field=patches_field or roi_field,
            data_type="embeddings",
        )

    # 3.3 Load the model
    if model is None and embeddings is None and not embeddings_exist:
        model = _DEFAULT_MODEL
        if batch_size is None:
            batch_size = _DEFAULT_BATCH_SIZE

    if etau.is_str(model):
        _model = foz.load_zoo_model(model, **_model_kwargs)

    # 3.4 Configure the backend
    config = _parse_config(
        backend,  # "sklearn"
        embeddings_field=embeddings_field,
        patches_field=patches_field,
        roi_field=roi_field,
        model=model,
        ...
    )
    brain_method = config.build()  # creates a SklearnSimilarity instance

    # 3.5 Initialize the index
    dataset = samples._root_dataset
    if brain_key is not None:
        brain_method.register_run(dataset, brain_key, overwrite=False)

    results = brain_method.initialize(dataset, brain_key)
    # results is a SklearnSimilarityIndex instance
Stage 4: computing the embeddings
# compute_similarity, continued
# 4.1 Decide whether embeddings must be computed
get_embeddings = embeddings is not False
if not results.is_external and results.total_index_size > 0:
    # The index already contains embeddings; no need to recompute
    get_embeddings = False

# 4.2 Compute/load the embeddings
if get_embeddings:
    # Special handling for the ROI case
    if roi_field is not None:
        handle_missing = "image"  # fall back to the whole image when an ROI is missing
        agg_fcn = lambda e: np.mean(e, axis=0)  # aggregate multiple ROIs
    else:
        handle_missing = "skip"
        agg_fcn = None

    # Fetch the embeddings
    embeddings, sample_ids, label_ids = fbu.get_embeddings(
        samples,
        model=_model,
        patches_field=patches_field or roi_field,
        embeddings=embeddings,
        embeddings_field=embeddings_field,
        force_square=force_square,
        alpha=alpha,
        handle_missing=handle_missing,
        agg_fcn=agg_fcn,
        batch_size=batch_size,
        num_workers=num_workers,
        skip_failures=skip_failures,
        progress=progress,
    )

# 4.3 Add the embeddings to the index
if embeddings is not None:
    results.add_to_index(embeddings, sample_ids, label_ids=label_ids)

# 4.4 Save the results
brain_method.save_run_results(dataset, brain_key, results)

return results
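Since get_embeddings honors precomputed vectors, embeddings can be generated once and reused across runs. A sketch, assuming the field name "embeddings":
import fiftyone.zoo as foz
import fiftyone.brain as fob

# Compute embeddings once and store them on the dataset
model = foz.load_zoo_model("resnet18-imagenet-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")

# compute_near_duplicates then reads the stored field instead of
# running the model again
index = fob.compute_near_duplicates(
    dataset, embeddings="embeddings", threshold=0.2
)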
Stage 5: building the sklearn similarity index
# File: fiftyone/brain/internal/core/sklearn.py
class SklearnSimilarityIndex(SimilarityIndex, DuplicatesMixin):
    def __init__(self, samples, config, brain_key, embeddings=None, ...):
        # 5.1 Parse the data
        embeddings, sample_ids, label_ids = self._parse_data(
            samples, config, embeddings, sample_ids, label_ids
        )

        # 5.2 Store the core data structures
        self._embeddings = embeddings  # N x D numpy array
        self._sample_ids = sample_ids  # array of sample IDs
        self._label_ids = label_ids    # array of label IDs
        self._neighbors_helper = None  # lazily initialized

    def add_to_index(self, embeddings, sample_ids, label_ids=None, ...):
        # 5.3 Grow the embeddings matrix dynamically
        n = self._embeddings.shape[0]
        m = max(jj) - n + 1
        if m > 0:
            self._embeddings = np.concatenate(
                (self._embeddings, np.empty((m, d), dtype=self._embeddings.dtype))
            )

        # 5.4 Update the embeddings
        self._embeddings[jj, :] = _embeddings
        self._sample_ids = _sample_ids
        self._label_ids = _label_ids

        # 5.5 Reset the cache
        self._neighbors_helper = None
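A standalone numpy sketch of the matrix-growth pattern in step 5.3 (shapes and indices below are assumed):
import numpy as np

embeddings = np.random.randn(3, 4)   # existing N x D matrix
new_vectors = np.random.randn(2, 4)  # two incoming embeddings
jj = np.array([3, 4])                # row indices assigned to the new data
d = embeddings.shape[1]

# Append just enough empty rows, then write the new vectors into place
m = jj.max() - embeddings.shape[0] + 1
if m > 0:
    embeddings = np.concatenate(
        (embeddings, np.empty((m, d), dtype=embeddings.dtype))
    )
embeddings[jj, :] = new_vectors
print(embeddings.shape)  # (5, 4)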
Stage 6: preparing the neighbor search
# File: fiftyone/brain/internal/core/sklearn.py
class NeighborsHelper:
    def __init__(self, embeddings, metric):
        self.embeddings = embeddings
        self.metric = metric  # defaults to "cosine"
        self._full_dists = None
        self._curr_neighbors = None

    def _build_dists(self, embeddings):
        # 6.1 Center the embeddings
        embeddings = np.asarray(embeddings)
        embeddings -= embeddings.mean(axis=0, keepdims=True)

        # 6.2 Compute the distance matrix
        dists = skm.pairwise_distances(embeddings, metric=self.metric)
        np.fill_diagonal(dists, np.nan)  # set self-distances to NaN

        return dists

    def _build_neighbors(self, embeddings):
        # 6.3 Special handling for cosine distance
        if metric == "cosine":
            # sklearn's NearestNeighbors does not support cosine distance,
            # so convert to Euclidean distance on normalized vectors
            embeddings = skp.normalize(embeddings, axis=1)
            metric = "euclidean"
            # Relationship (for unit vectors): cos_dist = euclidean_dist**2 / 2

        # 6.4 Build the nearest-neighbors searcher
        neighbors = skn.NearestNeighbors(metric=metric)
        neighbors.fit(embeddings)

        return neighbors
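The cosine-to-Euclidean conversion in step 6.3 can be checked numerically: for unit-norm vectors, ||a - b||^2 = 2(1 - cos(a, b)), so euclidean = sqrt(2 * cosine_distance). A quick verification:
import numpy as np
from sklearn.preprocessing import normalize

a = normalize(np.random.randn(5, 8), axis=1)  # unit-norm rows
b = normalize(np.random.randn(5, 8), axis=1)

cos_dist = 1.0 - np.sum(a * b, axis=1)   # cosine distance per row pair
euc_dist = np.linalg.norm(a - b, axis=1)

assert np.allclose(euc_dist, np.sqrt(2.0 * cos_dist))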
Stage 7: finding duplicates
# File: fiftyone/brain/similarity.py
class DuplicatesMixin:
    def find_duplicates(self, thresh=None, fraction=None):
        # 7.1 Get the currently active IDs
        if self.config.patches_field is not None:
            ids = self.current_label_ids  # ROI mode
        else:
            ids = self.current_sample_ids  # whole-image mode

        # 7.2 Run duplicate detection
        if fraction is not None:
            # Fraction-based: adjust the threshold automatically
            num_keep = int(round((1.0 - fraction) * len(ids)))
            unique_ids, thresh = self._remove_duplicates_count(num_keep, ids)
        else:
            # Threshold-based: use a fixed distance threshold
            unique_ids = self._remove_duplicates_thresh(thresh, ids)
Stage 8: threshold-based deduplication via _remove_duplicates_thresh
## File: fiftyone/brain/similarity.py
# 8.1 _remove_duplicates_thresh queries _radius_neighbors
def _remove_duplicates_thresh(self, thresh, ids):
    nearest_inds = self._radius_neighbors(thresh=thresh)

    n = len(ids)
    keep = set(range(n))
    for ind in range(n):
        if ind in keep:
            keep -= {i for i in nearest_inds[ind] if i > ind}

    return [ids[i] for i in keep]
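A toy walk-through of this greedy keep-set logic (neighbor lists below are assumed). Index 0 and 1 are within the threshold of each other, as are 1 and 2; the scan keeps 0, evicts 1, and keeps 2, because only lower-indexed survivors can evict later items:
nearest_inds = [[1], [0, 2], [1], []]  # in-threshold neighbors per index
ids = ["a", "b", "c", "d"]

keep = set(range(len(ids)))
for ind in range(len(ids)):
    if ind in keep:
        keep -= {i for i in nearest_inds[ind] if i > ind}

print([ids[i] for i in sorted(keep)])  # ['a', 'c', 'd']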
## File: fiftyone/brain/internal/core/sklearn.py
# 8.2 _radius_neighbors
def _radius_neighbors(self, query=None, thresh=None, return_dists=False):
    (
        query,
        query_inds,
        full_index,
        single_query,
    ) = self._parse_neighbors_query(query)

    can_use_dists = full_index or query_inds is not None
    neighbors, dists = self._get_neighbors(can_use_dists=can_use_dists)

    # When not using brute force, we approximate cosine distance by
    # computing Euclidean distance on unit-norm embeddings.
    # ED = sqrt(2 * CD), so we need to scale the threshold appropriately
    if getattr(neighbors, _COSINE_HACK_ATTR, False):
        thresh = np.sqrt(2.0 * thresh)

    if dists is not None:
        # Use pre-computed distances
        if query_inds is not None:
            _dists = dists[query_inds, :]
        else:
            _dists = dists

        # note: this must gracefully ignore nans
        inds = [np.nonzero(d <= thresh)[0] for d in _dists]

        if return_dists:
            dists = [d[i] for i, d in zip(inds, _dists)]
        else:
            dists = None
    else:
        if return_dists:
            dists, inds = neighbors.radius_neighbors(
                X=query, radius=thresh, return_distance=True
            )
        else:
            dists = None
            inds = neighbors.radius_neighbors(
                X=query, radius=thresh, return_distance=False
            )

    return self._format_output(
        inds, dists, full_index, single_query, return_dists
    )
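The fallback branch above maps directly onto sklearn. A standalone sketch of radius-based lookup with the rescaled threshold (data assumed):
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

X = normalize(np.random.randn(100, 16), axis=1)  # unit-norm embeddings

nn = NearestNeighbors(metric="euclidean").fit(X)

thresh = 0.2                     # cosine-distance threshold
radius = np.sqrt(2.0 * thresh)   # rescaled for Euclidean on unit vectors

inds = nn.radius_neighbors(X, radius=radius, return_distance=False)
print(len(inds[0]))  # neighbor count for the first sample (includes itself)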
Stage 9: assembling the results
## File: fiftyone/brain/similarity.py
def find_duplicates(self, thresh=None, fraction=None):
    # ... elided; see the excerpt in Stage 8 ...

    # 9.1 Separate unique and duplicate IDs
    _unique_ids = set(unique_ids)
    duplicate_ids = [_id for _id in ids if _id not in _unique_ids]

    # 9.2 Build the neighbors map
    if unique_ids and duplicate_ids:
        # For each duplicate, find its nearest unique item
        unique_view = self._samples.select(unique_ids)
        with self.use_view(unique_view):
            _sample_ids, _label_ids, dists = self._kneighbors(
                query=duplicate_ids, k=1, return_dists=True
            )

        # Build the map: unique ID -> [(duplicate ID, distance), ...]
        # (nearest_ids is _label_ids or _sample_ids depending on mode; elided)
        neighbors_map = defaultdict(list)
        for dup_id, _ids, _dists in zip(duplicate_ids, nearest_ids, dists):
            neighbors_map[_ids[0]].append((dup_id, _dists[0]))

        # Sort by distance
        neighbors_map = {
            k: sorted(v, key=lambda t: t[1])
            for k, v in neighbors_map.items()
        }

    # 9.3 Store the results
    self._thresh = thresh
    self._unique_ids = unique_ids
    self._duplicate_ids = duplicate_ids
    self._neighbors_map = neighbors_map
3. Summary of the core computation
Data flow
Input images → model extracts embeddings → build index → compute distances → find duplicates → return results
Key optimizations
- Smart caching: the full distance matrix is precomputed for small datasets, while large datasets use nearest-neighbor search
- Incremental updates: samples can be added/removed dynamically
- ROI support: similarity search can operate on image regions
- Flexible thresholding: supports a fixed threshold or automatic adjustment to hit a target duplicate fraction
Usage example
# Basic usage
index = compute_near_duplicates(dataset, threshold=0.2)

# Get the results
unique_ids = index.unique_ids
duplicate_ids = index.duplicate_ids
neighbors_map = index.neighbors_map  # maps each unique ID to its near-duplicates

# Visualization
duplicates_view = index.duplicates_view()
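For reference, neighbors_map maps each retained unique ID to a distance-sorted list of (duplicate ID, distance) tuples (per Stage 9), so it can be walked like this:
for unique_id, dups in index.neighbors_map.items():
    print(unique_id, "has", len(dups), "near-duplicates")
    for dup_id, dist in dups:
        print("  ", dup_id, "at distance", dist)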
Supplementary usage
A brief guide to ROI handling
roi_field (region of interest)
Specifies a particular region of each image to compare, rather than the whole image.
Supported types:
- Detection/Detections: rectangular detection boxes
- Polyline/Polylines: polygonal regions
Examples:
# Only compare the similarity of face regions
compute_near_duplicates(dataset, roi_field="faces")

# Only compare the detected objects
compute_near_duplicates(dataset, roi_field="detections")
alpha (region resizing)
Resizes the ROI, expanding or contracting it by a percentage.
Effect of the value:
- alpha > 0: expands the region (e.g., 0.1 = 10% larger)
- alpha < 0: contracts the region (e.g., -0.1 = 10% smaller)
- alpha = 0: keeps the original size
Examples:
# Expand detection boxes by 20% to include more context
compute_near_duplicates(dataset, roi_field="faces", alpha=0.2)

# Contract detection boxes by 10% to focus on the core region
compute_near_duplicates(dataset, roi_field="objects", alpha=-0.1)
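For intuition, here is a minimal sketch (not FiftyOne's actual implementation) of how an alpha factor might expand or contract a relative [x, y, w, h] bounding box about its center:
def resize_bbox(bbox, alpha):
    # bbox is [x, y, w, h] in relative coordinates
    x, y, w, h = bbox
    dw, dh = w * alpha, h * alpha
    return [x - dw / 2, y - dh / 2, w + dw, h + dh]

print(resize_bbox([0.4, 0.4, 0.2, 0.2], 0.2))   # 20% larger
print(resize_bbox([0.4, 0.4, 0.2, 0.2], -0.1))  # 10% smaller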
Typical applications:
- Face deduplication: compare only the face regions
- Product deduplication: ignore the background and compare only the products
- Boundary expansion: include more surrounding context to improve accuracy