Paper Reading: Cross and Learn: Cross-Modal Self-supervision

This paper proposes a new method that uses cross-modal information as a source of supervision, training strong feature representations with a cross-modal loss and a diversity loss. The authors use a two-stream network architecture, aiming for features that are sensitive to cross-modal information while invariant to modality-specific content. This post summarizes how the two losses are combined and the results presented at GCPR 2018.


Contents

Contributions

Method

Cross-Modal Loss

Diversity Loss

Combining Both Loss Contributions

Results

 


 

Paper: Cross and Learn: Cross-Modal Self-supervision (GCPR 2018: German Conference on Pattern Recognition)

Authors: Nawid Sayed, Biagio Brattoli, and Björn Ommer

Link: https://link.springer.com/chapter/10.1007/978-3-030-12939-2_17

 


 

Contributions

In this paper, we use cross-modal information as an alternative source of supervision and propose a new method to effectively exploit mutual information to train powerful feature representations for both modalities. The main motivation of our approach is derived from the following observation: information shared across modalities has a much higher semantic meaning than modality-specific information. Our goal is therefore to obtain feature representations that are sensitive to cross-modal information while being invariant to modality-specific content. These conditions are fulfilled by feature representations that are similar within a pair and dissimilar across different pairs. To this end we utilize a trainable two-stream architecture with one network per modality, similar to the classic two-stream network, as the backbone of the proposed framework. To achieve the former we propose a cross-modal loss L_cross, and to achieve the latter we utilize a diversity loss L_div; both act directly in feature space and thus promise better training signals.

 


 

Method

Our method requires paired data from two different modalities x ∈ X and y ∈ Y, which is available in many use cases, e.g. RGB frames and optical flow. We utilize a two-stream architecture with trainable CNNs in order to obtain our feature representations f(x) and g(y). With the exception of the first layer, the networks share the same architecture but do not share weights. To calculate both loss contributions we need a tuple of pairs (x_i, y_i) and (x_j, y_j) from our dataset.
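The two-stream backbone is simple to set up. Below is a minimal PyTorch sketch of the idea, assuming an AlexNet-style backbone from torchvision as a stand-in for the CaffeNet variant used in the paper; the number of stacked flow channels and the choice of fc-layer features are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    """Two-stream feature extractor: one CNN per modality.

    Both streams share the same architecture but not the weights; only the
    first conv layer differs, because stacked optical flow has a different
    number of input channels than RGB.
    """

    def __init__(self, flow_channels: int = 20):
        super().__init__()
        # RGB stream: standard 3-channel input.
        self.rgb_net = models.alexnet(weights=None)
        # Flow stream: same architecture, but the first layer takes
        # `flow_channels` inputs (e.g. 10 stacked x/y flow maps).
        self.flow_net = models.alexnet(weights=None)
        self.flow_net.features[0] = nn.Conv2d(
            flow_channels, 64, kernel_size=11, stride=4, padding=2
        )
        # Drop the final classification layer and use the 4096-d fc
        # activations as feature representations (an assumption here).
        self.rgb_net.classifier = nn.Sequential(
            *list(self.rgb_net.classifier.children())[:-1]
        )
        self.flow_net.classifier = nn.Sequential(
            *list(self.flow_net.classifier.children())[:-1]
        )

    def forward(self, x_rgb: torch.Tensor, y_flow: torch.Tensor):
        f_x = self.rgb_net(x_rgb)    # f(x): RGB features
        g_y = self.flow_net(y_flow)  # g(y): optical-flow features
        return f_x, g_y
```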

 

 

 

Cross-Modal Loss

In order to enforce cross-modal similarity between f and g, we enforce the feature representations of a pair to be close in feature space via some distance d. Solving this task requires the networks to ignore information which is only present in either x or y.

We utilize the bounded cosine distance for d, which is given by
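The formula image did not survive extraction. From the surrounding text, the bounded cosine distance and the resulting cross-modal loss for a tuple of pairs (x_i, y_i), (x_j, y_j) can be written as follows (a reconstruction of the standard form; the paper's exact normalization constants may differ):

$$
d(v, w) = \frac{1}{2}\left(1 - \frac{\langle v, w \rangle}{\lVert v \rVert_2\, \lVert w \rVert_2}\right) \in [0, 1],
\qquad
\mathcal{L}_{\mathrm{cross}}(i, j) = \frac{1}{2}\Big(d\big(f(x_i), g(y_i)\big) + d\big(f(x_j), g(y_j)\big)\Big)
$$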

 

 

Diversity Loss

We obtain diversity by enforcing the feature representations of both modalities to be distant across pairs, with respect to the same distance d as before. This spreads the features of different pairs apart in feature space. Due to the cross-modal loss these features mostly encode cross-modal information, thus ensuring feature representations that are sensitive to this content. The distance across pairs therefore contributes negatively to the loss.
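Again the formula image is missing; with the same distance d, a diversity loss of the described form reads (reconstruction, same caveat as above):

$$
\mathcal{L}_{\mathrm{div}}(i, j) = -\frac{1}{2}\Big(d\big(f(x_i), f(x_j)\big) + d\big(g(y_i), g(y_j)\big)\Big)
$$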

 

 

Combining Both Loss Contributions

Given our observations, we weight both loss contributions equally, which yields our final loss:
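Up to an overall constant, equal weighting gives (reconstruction; the paper only states that both contributions are weighted equally):

$$
\mathcal{L}(i, j) = \frac{1}{2}\Big(\mathcal{L}_{\mathrm{cross}}(i, j) + \mathcal{L}_{\mathrm{div}}(i, j)\Big)
$$

A compact PyTorch sketch of the combined objective is given below; the batch-level reduction and the exact overall scaling are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_distance(v: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Bounded cosine distance in [0, 1], one value per row."""
    return 0.5 * (1.0 - F.cosine_similarity(v, w, dim=1))

def cross_and_learn_loss(f_xi, g_yi, f_xj, g_yj):
    """Equally weighted cross-modal + diversity loss for a tuple of pairs.

    f_* are RGB-stream features, g_* are flow-stream features.
    """
    # Pull the two modalities of the same pair together in feature space ...
    l_cross = 0.5 * (cosine_distance(f_xi, g_yi) + cosine_distance(f_xj, g_yj))
    # ... and push different pairs apart within each modality.
    l_div = -0.5 * (cosine_distance(f_xi, f_xj) + cosine_distance(g_yi, g_yj))
    # Equal weighting of both contributions, averaged over the batch.
    return (0.5 * (l_cross + l_div)).mean()
```

In practice, the second pair (x_j, y_j) can simply be another example drawn from the same mini-batch.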

 


 

Results

 

 


 

 
