[ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding

Paper: 01133.pdf

Code: GitHub - weihaox/UMBRAE: [ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality

The English is typed entirely by hand! It summarizes and paraphrases the original paper, so some unavoidable spelling and grammar mistakes may slip in; corrections in the comments are welcome. This post is more of a set of notes, so read with that in mind.

目录

1. Thoughts

2. Section-by-Section Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. UMBRAE

2.4.1. Architecture

2.4.2. Cross-Subject Alignment

2.4.3. Multimodal Alignment

2.4.4. Brain Prompting Interface

2.5. Experiments

2.5.1. Implementation Details

2.5.2. BrainHub

2.5.3. Brain Captioning

2.5.4. Brain Grounding

2.5.5. Brain Retrieval

2.5.6. Visual Decoding

2.5.7. Weakly-Supervised Adaptation

2.6. Ablation Study

2.6.1. Architectural Improvements

2.6.2. Training Strategies

2.7. Conclusion

1. Thoughts

(1) Uh...

2. Section-by-Section Reading of the Paper

2.1. Abstract

        ①Challenges addressed: extracting spatial information from brain signals, and decoding across subjects

 granularity  n. level of detail; (geology) grain size

2.2. Introduction

        ①Target users of brain-signal decoding: people with cognitive or physical disabilities, or even locked-in patients

        ②⭐Challenges: a) single-modality decoding loses brain information; b) text-based decoding ignores spatial information

2.3. Related Works

        ①Mentions generative models, LLM-based models, and alignment models

2.4. UMBRAE

        ①UMBRAE stands for Unified Multimodal Brain Decoding

        ②Overall framework of UMBRAE:

(Huh, at a glance this looks just like OneLLM)

2.4.1. Architecture

        ①Brain encoder: lightweight Transformer

        ②The brain signal s\in\mathbb{R}^{1\times L_{s}} for each person comes from the subject set \mathcal{S}_{\Omega}, where the length L_s is arbitrary (it differs across subjects)

        ③The tokenizer transforms s\in\mathbb{R}^{1\times L_{s}} into \mathbf{s}_{k}\in\mathbb{R}^{M\times D} with M tokens of dimension D (are these the small green blocks in the figure?)

        ④Brain tokens \mathbf{x}\in\mathbb{R}^{L\times D} (what exactly are these? "brain tokens" appear everywhere in the text but not in the figure, which only shows subject tokens; from the text they seem to be the small purple blocks)

        ⑤Universal Perceive Encoder: a cross-attention module (see the sketch below)

prepend  v. to add or attach at the beginning
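A minimal sketch of the two-stage encoder described in ①–⑤, assuming the simplest possible tokenizer (the paper's is a lightweight Transformer) and standard multi-head cross-attention for the Universal Perceive Encoder; all class and parameter names here are my own, not the authors' code:

```python
import torch
import torch.nn as nn

class SubjectTokenizer(nn.Module):
    """Per-subject tokenizer: maps s in R^{1 x L_s} (L_s varies by subject)
    to s_k in R^{M x D}. A single linear layer stands in for the real,
    more elaborate tokenizer."""
    def __init__(self, L_s: int, M: int = 256, D: int = 1024):
        super().__init__()
        self.proj = nn.Linear(L_s, M * D)
        self.M, self.D = M, D

    def forward(self, s: torch.Tensor) -> torch.Tensor:  # (B, L_s)
        return self.proj(s).view(-1, self.M, self.D)     # (B, M, D)

class PerceiveEncoder(nn.Module):
    """Shared cross-attention module: L learnable queries attend to the M
    subject tokens, yielding subject-agnostic brain tokens x in R^{L x D}."""
    def __init__(self, L: int = 256, D: int = 1024, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(L, D))
        self.attn = nn.MultiheadAttention(D, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, M, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        x, _ = self.attn(q, tokens, tokens)                   # (B, L, D)
        return x
```

The key design point this illustrates: only the tokenizer is subject-specific; the cross-attention encoder (and everything downstream) is shared, which is what makes cross-subject training possible.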

2.4.2. Cross-Subject Alignment

        ①Each batch mixes subjects: a subject is sampled with probability p_k proportional to its dataset size:

p_k=\frac{\|\mathcal{S}_k\|}{\sum_{n=1}^K\|\mathcal{S}_n\|}

In other words, if the batch size is B, a subject \mathcal{S}_k is drawn with probability p_k and \theta B samples are taken from \mathcal{S}_k; the remaining (1-\theta)B samples are drawn uniformly from the other subjects (see the sketch below).
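A minimal sketch of this sampling rule (my reading of it, not the authors' code; `datasets` maps a subject id to that subject's samples, and sampling is with replacement for simplicity):

```python
import random

def sample_batch(datasets: dict, B: int = 256, theta: float = 0.5):
    sizes = {k: len(v) for k, v in datasets.items()}
    total = sum(sizes.values())
    # draw the main subject with probability p_k = |S_k| / sum_n |S_n|
    main = random.choices(list(sizes), weights=[sizes[k] / total for k in sizes])[0]
    n_main = int(theta * B)
    batch = random.choices(datasets[main], k=n_main)       # theta*B from S_k
    others = [x for k, v in datasets.items() if k != main for x in v]
    batch += random.choices(others, k=B - n_main)          # rest, uniform over others
    return batch
```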

2.4.3. Multimodal Alignment

        ①Instead of mapping all modalities into one shared space, they align brain-signal features element by element to pretrained image features

        ②To align the brain response s\in\mathbb{R}^{1\times L_{s}} with the image v\in\mathbb{R}^{W\times H\times C}, they minimize the loss between the outputs of the brain encoder \mathcal{B} and the image encoder \mathcal{V}:

\mathcal{L}_{\mathrm{rec}}=\mathbb{E}_{(s,v)}\left[\|\mathcal{V}(v)-\mathcal{B}(s)\|_2^2\right]
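A minimal PyTorch sketch of this objective, assuming both encoders emit token grids of the same shape and the image encoder is frozen:

```python
import torch.nn.functional as F

def rec_loss(brain_feats, image_feats):
    # brain_feats = B(s), image_feats = V(v), both of shape (batch, L, D);
    # the image encoder is frozen, so its features carry no gradient
    return F.mse_loss(brain_feats, image_feats.detach())
```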

2.4.4. Brain Prompting Interface

        ①Prompt template for the MLLM:

For brain captioning, <instruction> is defined as: 'Describe this image <image> as simply as possible.'; for the brain grounding task, <instruction> is: 'Locate <expr> in <image> and provide its coordinates, please.', where <expr> is the referring expression (see the sketch below).
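A sketch of how these templates might be assembled (only the two template strings are quoted from the paper; the helper and its names are hypothetical). At inference, the <image> placeholder is filled with brain tokens rather than real image tokens:

```python
# Hypothetical prompt builder; only the template strings come from the paper.
CAPTION_TMPL = "Describe this image <image> as simply as possible."
GROUND_TMPL = "Locate <expr> in <image> and provide its coordinates, please."

def build_instruction(task: str, expr: str | None = None) -> str:
    if task == "captioning":
        return CAPTION_TMPL
    if task == "grounding":
        assert expr is not None, "grounding needs a referring expression"
        return GROUND_TMPL.replace("<expr>", expr)
    raise ValueError(f"unknown task: {task}")

# e.g. build_instruction("grounding", "the brown dog")
# -> "Locate the brown dog in <image> and provide its coordinates, please."
```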

2.5. Experiments

2.5.1. Implementation Details

        ①Visual encoder: CLIP ViT-L/14

        ②LLM: Vicuna-7B/13B

        ③Image features: \mathbf{T}\in\mathbb{R}^{16\times16\times1024}, taken from the second-to-last layer of the transformer encoder and projected to \mathbf{T}^{\prime}\in\mathbb{R}^{256\times D}, where D=4{,}096 for Vicuna-7B and D=5{,}120 for Vicuna-13B (see the sketch after this list)

        ④Epoch: 240 

        ⑤Batch size: 256

        ⑥Training time: 12 hours on one A100 GPU

        ⑦Optimizer: AdamW with \beta_1=0.9, \beta_2=0.95, weight decay of 0.01, learning rate of 3e-4

        ⑧\theta=0.5, meaning that in each batch of 256 samples, 128 come from one sampled subject and the remaining 128 are drawn uniformly from the other subjects
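A minimal sketch of the feature projection in ③ (the layer type is an assumption): the 16x16 CLIP ViT-L/14 feature map is flattened into 256 tokens and mapped to the LLM width D:

```python
import torch
import torch.nn as nn

D = 4096                                  # 4,096 for Vicuna-7B, 5,120 for Vicuna-13B
proj = nn.Linear(1024, D)

T = torch.randn(1, 16, 16, 1024)          # second-to-last-layer CLIP features
T_prime = proj(T.view(1, 256, 1024))      # flatten 16x16 -> 256 tokens, project to D
print(T_prime.shape)                      # torch.Size([1, 256, 4096])
```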

2.5.2. BrainHub

        ①Dataset: NSD (Natural Scenes Dataset)

        ②Tasks: brain captioning, brain grounding, brain retrieval, visual decoding

2.5.3. Brain Captioning

        ①Brain captioning performance comparison table:

where -S1 denotes a model trained only on subject S1

2.5.4. Brain Grounding

        ①Example of brain captioning and brain grounding tasks:

        ②Performance of brain grounding:

2.5.5. Brain Retrieval

        ①Forward retrieval, backward retrieval, and exemplar retrieval performance:

where the model must identify the correct image embedding (forward), brain embedding (backward), or image itself (exemplar)
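A generic sketch of how forward retrieval accuracy can be computed (my illustration, not the paper's evaluation code): cosine-score each brain embedding against all candidate image embeddings and count top-1 hits; backward retrieval is the same with the two roles swapped:

```python
import torch
import torch.nn.functional as F

def forward_retrieval_acc(brain_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
    # brain_emb, image_emb: (N, D); row i of each comes from the same stimulus
    sims = F.normalize(brain_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T
    pred = sims.argmax(dim=-1)                     # best-matching image per brain signal
    return (pred == torch.arange(len(sims))).float().mean().item()
```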

2.5.6. Visual Decoding

        ①Image reconstruction:

2.5.7. Weakly-Supervised Adaptation

        ①Performance with different training data on S7:

2.6. Ablation Study

2.6.1. Architectural Improvements

        ①UMBRAE has fewer parameters

2.6.2. Training Strategies

        ①Module ablation:

2.7. Conclusion

        ~
