Multi-modal Paper Reading Series: Index
1.ABSTRACT
Multimodal Emotion Recognition is an important research area for developing human-centric applications, especially in the context of video platforms. Most existing models develop sophisticated fusion techniques to integrate heterogeneous features from different modalities. However, these fusion methods can hurt performance, since not every modality helps establish the semantic alignment needed for emotion prediction. We observed that, for an existing fusion model, predictions improve for 8.0% of the misclassified instances when one of the input modalities is masked. Based on this observation, we propose a representation learning method called Cross-modal DynAmic Transfer learning (CDaT), which dynamically filters the low-confidence modality and complements it with the high-confidence modality using uni-modal masking and cross-modal representation transfer learning. We train an auxiliary network that learns model confidence scores to determine which modality is low-confidence and how much transfer should occur from the other modalities. Furthermore, CDaT can be used with any fusion model in a model-agnostic way, because it transfers low-level uni-modal information via a probabilistic knowledge transfer loss. Experiments demonstrate the effect of CDaT with four different state-of-the-art fusion models on the CMU-MOSEI and IEMOCAP datasets for emotion recognition.
2.INDEX TERMS
Affective computing, cross-modal knowledge transfer, model confidence, multimodal emotion recognition.
3.INTRODUCTION
- In recent years, research on multimodal understanding for emotion recognition and sentiment analysis has advanced rapidly, driven by progress in machine learning and information fusion and by applications to video platforms (e.g., YouTube, Twitch, TikTok) [1], [2], [3]. Multimodal Emotion Recognition (MER) aims to understand human emotions by integrating multiple sources such as language, facial expression, and speech information [4], [5], [6], [7], [8]. There are two main challenges: 1) leveraging the complementarity between different modalities and 2) fusing them so that they point in the same direction, bridging the gap between modalities.
- Previous studies mainly focused on fusion methods that minimize the heterogeneity gap between different modalities. The tensor-based fusion network [9] was proposed to capture the two types of gap (intra-modality and inter-modality). More recently, more sophisticated attention-based approaches [10], [11], [12], [13] have been proposed to capture the complementary information across modalities. However, these methods did not consider the possibility of semantic misalignment between modalities, which could affect the model's performance. In Figure 1, we observed that, among the instances that the previous fusion model [8] predicts incorrectly, masking a particular modality improves accuracy for 8.0% of them. For example, given the text "That really depends on how chronic or severe your condition is. . ." together with the visual cue of a frowning face and the acoustic information of a high, strong voice, using all modalities results in poor emotion recognition because the audio information is misaligned, whereas masking that information makes the emotion much easier to recognize. We found that this situation, in which masking a modality corrects an error, is worse in fusion models that use more complex networks. This observation led us to the following questions: 1) Misalignment: Does an unusually misaligned modality hinder multimodal fusion learning for emotion prediction? 2) Modality confidence: Is there a clue that identifies the misaligned modality? 3) Knowledge transfer: Is it possible to correct the misalignment using the remaining modalities at the level of each instance?
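The masking observation above can be reproduced in miniature. The following is only a sketch: `toy_predict` is a hypothetical stand-in for a trained fusion model, and masking by zero-filling is an assumption, since this section does not say how the paper masks a modality.

```python
from typing import Callable, Dict, List

Features = Dict[str, List[float]]


def mask_modality(features: Features, name: str) -> Features:
    """Return a copy of `features` with one modality replaced by zeros (assumed masking scheme)."""
    masked = {k: list(v) for k, v in features.items()}
    masked[name] = [0.0] * len(features[name])
    return masked


def corrected_by_masking(
    predict: Callable[[Features], int],
    features: Features,
    label: int,
) -> List[str]:
    """List the modalities whose masking turns a wrong prediction into a correct one."""
    if predict(features) == label:
        return []  # already correct, nothing to fix
    return [m for m in features if predict(mask_modality(features, m)) == label]


def toy_predict(features: Features) -> int:
    """Hypothetical fusion model: class 1 if the summed features are positive, else class 0."""
    total = sum(sum(v) for v in features.values())
    return 1 if total > 0 else 0


# The audio features disagree with text and visual, so the full-input prediction is wrong.
sample = {"text": [0.6, 0.2], "visual": [0.3], "audio": [-2.0]}
fixed_by = corrected_by_masking(toy_predict, sample, label=1)
print(fixed_by)
```

Running this probe over every misclassified instance of a real model and counting the non-empty results would give a statistic of the same kind as the 8.0% figure quoted above.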
- To answer these questions, we propose a representation learning method called Cross-modal DynAmic Transfer learning (CDaT) that dynamically adjusts a misaligned modality. The proposed approach builds on existing fusion models and learns cross-modal knowledge transfer in a model-agnostic way. Based on the results of inference with a masked modality, we hypothesize that any change in the logit output or class probability when a particular modality is masked is evidence of misalignment. To capture this change, we propose a two-stage method: 1) misaligned modality detection and 2) modality knowledge transfer. First, we introduce the Misaligned Modality Filtering (MMF) stage, which trains an additional network to estimate instance-level modality confidence. It proportionally adjusts irrelevant modalities using the other, high-confidence modalities. To make these adjustments dynamic for each instance, the model is jointly trained with a Probabilistic Knowledge Transfer (PKT) divergence loss between the features extracted from each modality encoder. The advantage of the PKT loss is that it requires no additional parameters for knowledge transfer between modalities. Unlike general KT methods, it can be used without setting specific hyperparameters (e.g., temperature), even when the feature dimensions differ [14].
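To make the PKT idea concrete, here is a minimal sketch of a probabilistic knowledge transfer loss between two feature batches. It assumes the commonly used formulation with a cosine-similarity kernel and a KL-style divergence over pairwise conditional probabilities; the function names are illustrative and this is not the CDaT authors' implementation. Because only the sample-by-sample similarity matrices are compared, the two feature sets may have different dimensions, and no temperature or extra parameters are needed.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine_kernel(feats: Matrix) -> Matrix:
    """Pairwise cosine similarity between samples, shifted from [-1, 1] into [0, 1]."""
    norms = [math.sqrt(sum(x * x for x in row)) or 1.0 for row in feats]
    unit = [[x / n for x in row] for row, n in zip(feats, norms)]
    return [[(sum(a * b for a, b in zip(u, v)) + 1.0) / 2.0 for v in unit] for u in unit]


def conditional_probs(kernel: Matrix) -> Matrix:
    """Normalize each row into a distribution over the *other* samples."""
    probs = []
    for i, row in enumerate(kernel):
        total = sum(k for j, k in enumerate(row) if j != i) or 1.0
        probs.append([0.0 if j == i else k / total for j, k in enumerate(row)])
    return probs


def pkt_loss(student: Matrix, teacher: Matrix, eps: float = 1e-7) -> float:
    """Divergence between teacher and student pairwise-similarity distributions."""
    p = conditional_probs(cosine_kernel(teacher))
    q = conditional_probs(cosine_kernel(student))
    n = len(p)
    return sum(
        p[i][j] * math.log((p[i][j] + eps) / (q[i][j] + eps))
        for i in range(n)
        for j in range(n)
        if i != j
    ) / n


# Three samples; the teacher has 2-d features, both students have 3-d features.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
same_geometry = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
different = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [-1.0, 0.0, 0.0]]

print(pkt_loss(same_geometry, teacher))  # close to zero: same pairwise structure
print(pkt_loss(different, teacher))      # clearly positive: structure differs
```

The loss depends only on how samples relate to each other within a batch, which is why a student with a different feature dimension can still match the teacher exactly.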
- We conduct experiments on four baseline models to demonstrate the effectiveness of our proposed model-agnostic framework. The Naive Fusion model simply concatenates all input-level modalities, without any representation learning for modality fusion. TFN [9] is an end-to-end approach that, for the first time, poses multimodal sentiment analysis as modeling intra- and inter-modality dynamics. MISA [11] is a representation learning method that encodes modality-shared and modality-distinct spaces separately to overcome heterogeneity between modalities. TAILOR [8] is similar to MISA but improves emotion recognition performance using a hierarchical cross-modal encoder and a label-guided decoder based on the Transformer architecture. We implemented these four state-of-the-art models on the CMU-MOSEI and IEMOCAP datasets, showed the overall performance improvement obtained by applying CDaT on top of them, and experimentally analyzed the impact of different confidence measures.
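The difference between the two simplest baselines can be shown in a few lines. This is a hedged sketch: `naive_fusion` follows the concatenation described above, while `tensor_fusion` assumes the usual description of TFN, in which a constant 1 is added to each modality vector before a three-way outer product so that the result contains unimodal, bimodal, and trimodal terms. Neither function is taken from the baselines' actual code.

```python
from typing import List


def naive_fusion(text: List[float], visual: List[float], audio: List[float]) -> List[float]:
    """Concatenate the modality vectors; no interaction terms are created."""
    return text + visual + audio


def tensor_fusion(text: List[float], visual: List[float], audio: List[float]) -> List[float]:
    """Flattened three-way outer product of the modality vectors, each extended with a constant 1."""
    t = [1.0] + text
    v = [1.0] + visual
    a = [1.0] + audio
    return [ti * vi * ai for ti in t for vi in v for ai in a]


text, visual, audio = [2.0, 3.0], [5.0], [7.0]
concat = naive_fusion(text, visual, audio)
fused = tensor_fusion(text, visual, audio)

print(len(concat), len(fused))
```

The concatenated vector grows additively with the modality dimensions, while the tensor grows multiplicatively, which is one reason later fusion models look for cheaper ways to capture cross-modal interactions.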
- The novel contributions of this work can be summarized as follows:
(1) We propose CDaT, a novel multimodal emotion recognition method based on cross-modal confidence scores. The MMF stage addresses the modality misalignment problem by training an additional network to estimate the misalignment level within each modality.
(2) We also introduce a dynamic PKT for mitigating the effects of a semantically misaligned modality. The transfer model compares the outcome probability values of two modalities and selectively trains the features of the lower-confidence modality to follow the feature distribution of the higher-confidence modality.
(3) Experiments on CMU-MOSEI and IEMOCAP, two publicly available datasets for MER tasks, demonstrate consistent performance gains over state-of-the-art fusion models, showing the effectiveness of our model-agnostic approach.
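The selective transfer in contribution (2) can be pictured as choosing, for each instance, which modality teaches which. The sketch below is an illustration only: using the confidence gap as the transfer weight is an assumption made here, and `transfer_weights` is a hypothetical helper, since this section does not give the paper's exact weighting rule.

```python
from itertools import permutations
from typing import Dict, Tuple


def transfer_weights(confidence: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Map (teacher, student) modality pairs to a positive transfer weight.

    A pair is kept only when the teacher is more confident than the student,
    and the weight is the confidence gap (an assumed rule, for illustration).
    """
    weights = {}
    for teacher, student in permutations(confidence, 2):
        gap = confidence[teacher] - confidence[student]
        if gap > 0:
            weights[(teacher, student)] = gap
    return weights


# Hypothetical per-instance confidence scores from the auxiliary network.
conf = {"text": 0.9, "visual": 0.6, "audio": 0.2}
plan = transfer_weights(conf)

for (teacher, student), w in plan.items():
    print(f"{teacher} -> {student}: weight {w:.2f}")
```

In a training loop, each weight would scale a feature-level transfer loss from the teacher modality to the student modality, so that confident modalities are left untouched and the least confident one is adjusted the most.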