[ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding

Paper: 01133.pdf

Code: GitHub - weihaox/UMBRAE: [ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality

The English is typed entirely by hand! It summarizes and paraphrases the original paper, so some unavoidable spelling and grammar mistakes may slip in; corrections in the comments are welcome. This post is more of a set of notes, so read with that in mind.

目录

1. Thoughts

2. Section-by-Section Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. UMBRAE

2.4.1. Architecture

2.4.2. Cross-Subject Alignment

2.4.3. Multimodal Alignment

2.4.4. Brain Prompting Interface

2.5. Experiments

2.5.1. Implementation Details

2.5.2. BrainHub

2.5.3. Brain Captioning

2.5.4. Brain Grounding

2.5.5. Brain Retrieval

2.5.6. Visual Decoding

2.5.7. Weakly-Supervised Adaptation

2.6. Ablation Study

2.6.1. Architectural Improvements

2.6.2. Training Strategies

2.7. Conclusion

1. Thoughts

(1) Uh...

2. Section-by-Section Reading of the Paper

2.1. Abstract

        ①Challenges addressed: extracting spatial information from brain signals, and decoding across subjects

 granularity  n. level of detail; (geology) grain size

2.2. Introduction

        ①Target users of brain-signal decoding: people with cognitive or physical disabilities, or even locked-in patients

        ②⭐Challenges: a) single-modality decoding loses brain information; b) text-based decoding ignores spatial information

2.3. Related Works

        ①Mentions generative models, LLM-based models, and alignment models

2.4. UMBRAE

        ①UMBRAE stands for Unified Multimodal Brain Decoding

        ②Overall framework of UMBRAE:

(Huh, at a glance this looks just like OneLLM)

2.4.1. Architecture

        ①Brain encoder: lightweight Transformer

        ②The brain signal s\in\mathbb{R}^{1\times L_{s}} for each person comes from the subject set \mathcal{S}_{\Omega}, where the length L_s is arbitrary (it differs across subjects)

        ③The tokenizer transforms s\in\mathbb{R}^{1\times L_{s}} into \mathbf{s}_{k}\in\mathbb{R}^{M\times D} with M tokens of dimension D (are these the small green blocks in the figure?)

        ④Brain tokens \mathbf{x}\in\mathbb{R}^{L\times D} (what exactly are these? "brain tokens" appear everywhere in the text but not in the figure, which only shows subject tokens; from the text they seem to be the small purple blocks)

        ⑤Universal Perceive Encoder: a cross-attention module (see the sketch below)

prepend  v. to add or attach at the beginning
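A minimal sketch of the two-stage encoder described in ①–⑤, assuming the simplest possible tokenizer (the paper's is a lightweight Transformer) and standard multi-head cross-attention for the Universal Perceive Encoder; all class and parameter names here are my own, not the authors' code:

```python
import torch
import torch.nn as nn

class SubjectTokenizer(nn.Module):
    """Per-subject tokenizer: maps s in R^{1 x L_s} (L_s varies by subject)
    to s_k in R^{M x D}. A single linear layer stands in for the real,
    more elaborate tokenizer."""
    def __init__(self, L_s: int, M: int = 256, D: int = 1024):
        super().__init__()
        self.proj = nn.Linear(L_s, M * D)
        self.M, self.D = M, D

    def forward(self, s: torch.Tensor) -> torch.Tensor:  # (B, L_s)
        return self.proj(s).view(-1, self.M, self.D)     # (B, M, D)

class PerceiveEncoder(nn.Module):
    """Shared cross-attention module: L learnable queries attend to the M
    subject tokens, yielding subject-agnostic brain tokens x in R^{L x D}."""
    def __init__(self, L: int = 256, D: int = 1024, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(L, D))
        self.attn = nn.MultiheadAttention(D, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, M, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        x, _ = self.attn(q, tokens, tokens)                   # (B, L, D)
        return x
```

The key design point this illustrates: only the tokenizer is subject-specific; the cross-attention encoder (and everything downstream) is shared, which is what makes cross-subject training possible.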

2.4.2. Cross-Subject Alignment

        ①Each batch mixes subjects: a subject is sampled with probability p_k proportional to its dataset size:

p_k=\frac{\|\mathcal{S}_k\|}{\sum_{n=1}^K\|\mathcal{S}_n\|}

In other words, if the batch size is B, a subject \mathcal{S}_k is drawn with probability p_k and \theta B samples are taken from \mathcal{S}_k; the remaining (1-\theta)B samples are drawn uniformly from the other subjects (see the sketch below).
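A minimal sketch of this sampling rule (my reading of it, not the authors' code; `datasets` maps a subject id to that subject's samples, and sampling is with replacement for simplicity):

```python
import random

def sample_batch(datasets: dict, B: int = 256, theta: float = 0.5):
    sizes = {k: len(v) for k, v in datasets.items()}
    total = sum(sizes.values())
    # draw the main subject with probability p_k = |S_k| / sum_n |S_n|
    main = random.choices(list(sizes), weights=[sizes[k] / total for k in sizes])[0]
    n_main = int(theta * B)
    batch = random.choices(datasets[main], k=n_main)       # theta*B from S_k
    others = [x for k, v in datasets.items() if k != main for x in v]
    batch += random.choices(others, k=B - n_main)          # rest, uniform over others
    return batch
```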

2.4.3. Multimodal Alignment

        ①Instead of mapping all modalities into one shared space, they align brain-signal features element by element to pretrained image features

        ②To align the brain response s\in\mathbb{R}^{1\times L_{s}} with the image v\in\mathbb{R}^{W\times H\times C}, they minimize the loss between the outputs of the brain encoder \mathcal{B} and the image encoder \mathcal{V}:

\mathcal{L}_{\mathrm{rec}}=\mathbb{E}_{(s,v)}\left[\|\mathcal{V}(v)-\mathcal{B}(s)\|_2^2\right]
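A minimal PyTorch sketch of this objective, assuming both encoders emit token grids of the same shape and the image encoder is frozen:

```python
import torch.nn.functional as F

def rec_loss(brain_feats, image_feats):
    # brain_feats = B(s), image_feats = V(v), both of shape (batch, L, D);
    # the image encoder is frozen, so its features carry no gradient
    return F.mse_loss(brain_feats, image_feats.detach())
```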

2.4.4. Brain Prompting Interface

        ①Prompt template for the MLLM:

For brain captioning, <instruction> is defined as: 'Describe this image <image> as simply as possible.'; for the brain grounding task, <instruction> is: 'Locate <expr> in <image> and provide its coordinates, please.', where <expr> is the referring expression (see the sketch below).
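A sketch of how these templates might be assembled (only the two template strings are quoted from the paper; the helper and its names are hypothetical). At inference, the <image> placeholder is filled with brain tokens rather than real image tokens:

```python
# Hypothetical prompt builder; only the template strings come from the paper.
CAPTION_TMPL = "Describe this image <image> as simply as possible."
GROUND_TMPL = "Locate <expr> in <image> and provide its coordinates, please."

def build_instruction(task: str, expr: str | None = None) -> str:
    if task == "captioning":
        return CAPTION_TMPL
    if task == "grounding":
        assert expr is not None, "grounding needs a referring expression"
        return GROUND_TMPL.replace("<expr>", expr)
    raise ValueError(f"unknown task: {task}")

# e.g. build_instruction("grounding", "the brown dog")
# -> "Locate the brown dog in <image> and provide its coordinates, please."
```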

2.5. Experiments

2.5.1. Implementation Details

        ①Visual encoder: CLIP ViT-L/14

        ②LLM: Vicuna-7B/13B

        ③Image features: \mathbf{T}\in\mathbb{R}^{16\times16\times1024}, taken from the second-to-last layer of the transformer encoder and projected to \mathbf{T}^{\prime}\in\mathbb{R}^{256\times D}, where D=4{,}096 for Vicuna-7B and D=5{,}120 for Vicuna-13B (see the sketch after this list)

        ④Epoch: 240 

        ⑤Batch size: 256

        ⑥Training time: 12 hours on one A100 GPU

        ⑦Optimizer: AdamW with \beta_1=0.9, \beta_2=0.95, weight decay of 0.01, learning rate of 3e-4

        ⑧\theta=0.5, meaning that in each batch of 256 samples, 128 come from one sampled subject and the remaining 128 are drawn uniformly from the other subjects
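A minimal sketch of the feature projection in ③ (the layer type is an assumption): the 16x16 CLIP ViT-L/14 feature map is flattened into 256 tokens and mapped to the LLM width D:

```python
import torch
import torch.nn as nn

D = 4096                                  # 4,096 for Vicuna-7B, 5,120 for Vicuna-13B
proj = nn.Linear(1024, D)

T = torch.randn(1, 16, 16, 1024)          # second-to-last-layer CLIP features
T_prime = proj(T.view(1, 256, 1024))      # flatten 16x16 -> 256 tokens, project to D
print(T_prime.shape)                      # torch.Size([1, 256, 4096])
```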

2.5.2. BrainHub

        ①Dataset: NSD (Natural Scenes Dataset)

        ②Tasks: brain captioning, brain grounding, brain retrieval, visual decoding

2.5.3. Brain Captioning

        ①Brain captioning performance comparison table:

where -S1 denotes a model trained only on subject S1

2.5.4. Brain Grounding

        ①Example of brain captioning and brain grounding tasks:

        ②Performance of brain grounding:

2.5.5. Brain Retrieval

        ①Forward retrieval, backward retrieval, and exemplar retrieval performance:

where the model must identify the correct image embedding (forward), brain embedding (backward), or image itself (exemplar)
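A generic sketch of how forward retrieval accuracy can be computed (my illustration, not the paper's evaluation code): cosine-score each brain embedding against all candidate image embeddings and count top-1 hits; backward retrieval is the same with the two roles swapped:

```python
import torch
import torch.nn.functional as F

def forward_retrieval_acc(brain_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
    # brain_emb, image_emb: (N, D); row i of each comes from the same stimulus
    sims = F.normalize(brain_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T
    pred = sims.argmax(dim=-1)                     # best-matching image per brain signal
    return (pred == torch.arange(len(sims))).float().mean().item()
```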

2.5.6. Visual Decoding

        ①Image reconstruction:

2.5.7. Weakly-Supervised Adaptation

        ①Performance with different training data on S7:

2.6. Ablation Study

2.6.1. Architectural Improvements

        ①UMBRAE has fewer parameters

2.6.2. Training Strategies

        ①Module ablation:

2.7. Conclusion

        ~
