[ACM MM 2024] Lite-Mind: Towards Efficient and Robust Brain Representation Learning

Paper link: Lite-Mind: Towards Efficient and Robust Brain Representation Learning | Proceedings of the 32nd ACM International Conference on Multimedia

The English here is typed entirely by hand and summarizes/paraphrases the original paper, so some spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This article leans toward personal notes, so read with care.

Contents

1. 心得

2. 论文逐段精读

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. Brain Visual Decoding

2.3.2. Fourier Transform in Deep Learning

2.4. Lite-Mind

2.4.1. Overview

2.4.2. DFT Backbone

2.4.3. Retrieval Pipeline

2.5. Experiments

2.5.1. Dataset

2.5.2. Implementation details

2.6. Results

2.6.1. fMRI/image retrieval

2.6.2. LAION-5B retrieval

2.6.3. GOD zero-shot classification

2.6.4. Ablations and visualization

2.7. Limitations

2.8. Conclusion

1. Takeaways

(1) When reconstruction falls short, retrieval is also a viable path

2. Section-by-section close reading of the paper

2.1. Abstract

        ①Limitations of fMRI-based image retrieval: scarce data, low signal-to-noise ratio, and individual variation

2.2. Introduction

        ①Page-constrained conferences and long journal papers both like to include a bit of related work in the introduction

        ②The authors aim to design a specific lightweight model for each subject:

2.3. Related Work

2.3.1. Brain Visual Decoding

        ①Lists Mindreader, BrainClip, Mind-Vis, and MindEye, pointing out that none of them considered lightweight networks

2.3.2. Fourier Transform in Deep Learning

        ①Introduces how the Fourier Transform is used in the digital signal processing field

2.4. Lite-Mind

2.4.1. Overview

        ①The overview of Lite-Mind:

where (a) is the backbone of MindEye, (b) represents Lite-Mind

2.4.2. DFT Backbone

        ①fMRI-image pair: (x,y)

        ②Dataset: D

(1)fMRI Spectrum Compression

        ①Divide the fMRI voxels x into n non-overlapping patches x=\left [ x_1,x_2,...,x_n \right ], using zero padding as needed

        ②Apply positional encoding to the patches to obtain tokens t=\left [ t_1,t_2,...,t_n \right ]; the tokens are then transformed by the 1D Discrete Fourier Transform (DFT):

X[k]=F(t)=\sum_{i=1}^{n}t_{i}e^{-j(2\pi/n)ki}

where X\in\mathbb{C}^{n\times d} is a complex tensor, 2\pi k/n denotes the frequency, i is the token index, and j is the imaginary unit
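The two steps above (zero-padded patching, then a 1D DFT along the token axis) can be sketched in numpy; the function names and toy sizes are mine, not the paper's, and the positional encoding is omitted:

```python
import numpy as np

def patchify(x, n):
    """Split a 1-D voxel vector into n non-overlapping patches, zero-padding the tail."""
    d = int(np.ceil(len(x) / n))      # per-patch length
    padded = np.zeros(n * d)
    padded[:len(x)] = x
    return padded.reshape(n, d)       # tokens t = [t_1, ..., t_n]

def dft_tokens(t):
    """1-D DFT over the token axis: X[k] = sum_i t_i * exp(-j * (2*pi/n) * k * i)."""
    return np.fft.fft(t, axis=0)      # complex tensor X in C^{n x d}

x = np.arange(10.0)                   # toy stand-in for one fMRI sample
t = patchify(x, n=4)                  # 4 tokens, zero-padded to length 12
X = dft_tokens(t)
print(t.shape, X.shape)               # (4, 3) (4, 3)
```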

        ③For M filters \mathbf{K}=[\mathbf{k}_{1},\mathbf{k}_{2},...,\mathbf{k}_{M}], the features can be extracted by:

\hat{X}=\sum_{m=1}^{M}\frac{1}{n}|X|^{2}\odot\mathbf{k}_{m}\cos\left(\frac{(2m-1)\pi}{2M}\right)

where \hat{X}\in\mathbb{C}^{n\times d}, \odot denotes element-wise multiplication, and |X|^{2} is the power spectrum of X
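A numpy sketch of this filter-bank step, assuming the M filters are learnable tensors with the same shape as X (my reading of the equation; names and sizes are hypothetical):

```python
import numpy as np

def filter_bank(X, K):
    """X_hat = sum_{m=1}^{M} (1/n) * |X|^2 (elem-wise) k_m * cos((2m-1)*pi / 2M)."""
    n, M = X.shape[0], len(K)
    power = np.abs(X) ** 2                               # power spectrum |X|^2
    X_hat = np.zeros(X.shape)
    for m, k_m in enumerate(K, start=1):                 # m runs 1..M
        X_hat += (power * k_m) / n * np.cos((2 * m - 1) * np.pi / (2 * M))
    return X_hat

rng = np.random.default_rng(0)
X = np.fft.fft(rng.standard_normal((4, 3)), axis=0)      # spectrum from the DFT step
K = [rng.standard_normal((4, 3)) for _ in range(2)]      # M = 2 learnable filters
X_hat = filter_bank(X, K)
print(X_hat.shape)                                       # (4, 3)
```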

        ④Convert the spectrum back into the spatial domain by Inverse Discrete Fourier Transform (IDFT):

\hat{t}\leftarrow F^{-1}(\hat{X})

(2)Frequency Projector

        ①Align voxel and image embeddings via FreMLP:

X^{\prime}=\sigma(\hat{X}^{T}\mathcal{W}+\mathcal{B})^{T}

where \mathcal{W}\in\mathbb{C}^{n\times n^{\prime}} is a complex-valued weight matrix, \mathcal{B}\in\mathbb{C}^{n^{\prime}} is a complex-valued bias, X^{\prime}\in\mathbb{C}^{n^{\prime}\times d} is the final output, and \sigma denotes the activation function. This can be expanded to:

\begin{aligned} X^{\prime} & =(\sigma(Re(\hat{X}^{T})\mathcal{W}_{r}-Im(\hat{X}^{T})\mathcal{W}_{i}+\mathcal{B}_{r}) \\ & +j\sigma(Re(\hat{X}^{T})\mathcal{W}_{i}+Im(\hat{X}^{T})\mathcal{W}_{r}+\mathcal{B}_{i}))^{T} \end{aligned}

where Re\left ( \cdot \right ) and Im\left ( \cdot \right ) take the real and imaginary parts, with \mathcal{W}=\mathcal{W}_{r}+j\mathcal{W}_{i} and \mathcal{B}=\mathcal{B}_{r}+j\mathcal{B}_{i}
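A minimal numpy sketch of the FreMLP layer as expanded above, keeping the real and imaginary weight matrices explicit; tanh stands in for the unspecified activation \sigma, and all sizes are toy values:

```python
import numpy as np

def fremlp(X_hat, Wr, Wi, Br, Bi, act=np.tanh):
    """FreMLP: complex linear layer W = Wr + j*Wi, B = Br + j*Bi, with the
    activation applied to the real and imaginary parts separately."""
    Xt = X_hat.T                                   # (d, n)
    real = act(Xt.real @ Wr - Xt.imag @ Wi + Br)   # Re part of sigma(X^T W + B)
    imag = act(Xt.real @ Wi + Xt.imag @ Wr + Bi)   # Im part
    return (real + 1j * imag).T                    # X' in C^{n' x d}

rng = np.random.default_rng(0)
n, n_out, d = 4, 6, 3
X_hat = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
Wr, Wi = rng.standard_normal((n, n_out)), rng.standard_normal((n, n_out))
Br, Bi = rng.standard_normal(n_out), rng.standard_normal(n_out)
X_prime = fremlp(X_hat, Wr, Wi, Br, Bi)
print(X_prime.shape)                               # (6, 3)
```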

        ②Employ IDFT again:

t^{\prime}\leftarrow F^{-1}(X^{\prime})

and f is the voxel embedding

2.4.3. Retrieval Pipeline

        ①Optimization objective:

\omega^{*}=\underset{\omega}{\operatorname{argmax}}\sum_{(x,y)\in D}SIM(DFT(x;\omega),CLIP(y))

where \omega denotes the weights of the DFT backbone and SIM\left ( \cdot \right ) denotes cosine similarity

        ②For LAION-5B retrieval, the voxel embedding f is further mapped to a predicted image embedding by a diffusion model:

\mathcal{V}^{\prime}=Diffusion(f)

        ③Contrastive loss:

L_{contr}=-\frac{1}{|B|}\sum_{s=1}^{|B|}\log\frac{\exp(f_{s}^{\top}\cdot V_{s}/\tau)}{\sum_{i=1}^{|B|}\exp(f_{s}^{\top}\cdot V_{i}/\tau)}

where |B| denotes the batch size and \tau is the temperature factor

        ④MSE loss to constrain the image generation:

L_{mse}=\frac{1}{|B|}\sum_{s=1}^{|B|}\|V_{s}-V_{s}^{\prime}\|_{2}^{2}

        ⑤Final loss:

L=L_{contr}+\alpha L_{mse}
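The training losses above can be sketched in numpy as follows; the temperature \tau and the weight \alpha here are placeholder values, not the paper's settings:

```python
import numpy as np

def contrastive_loss(f, V, tau=0.07):
    """InfoNCE: -(1/|B|) sum_s log( exp(f_s.V_s/tau) / sum_i exp(f_s.V_i/tau) )."""
    f = f / np.linalg.norm(f, axis=1, keepdims=True)     # unit-normalize embeddings
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    logits = f @ V.T / tau                               # (|B|, |B|) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal

def mse_loss(V, V_pred):
    """(1/|B|) sum_s ||V_s - V'_s||_2^2."""
    return np.mean(np.sum((V - V_pred) ** 2, axis=1))

rng = np.random.default_rng(0)
f, V, V_pred = (rng.standard_normal((8, 16)) for _ in range(3))
alpha = 0.5                                              # assumed weight, not from the paper
total = contrastive_loss(f, V) + alpha * mse_loss(V, V_pred)
print(float(total) > 0)                                  # True
```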

        ⑥Tasks: test set retrieval, LAION-5B retrieval, zero-shot classification

2.5. Experiments

2.5.1. Dataset

        ①Dataset: Natural Scenes Dataset (NSD)

        ②Samples: subjects 1, 2, 5, and 7, each with 10000 images

        ③Data split: 8859 image stimuli (24980 trials) for training and 982 image stimuli (2770 trials) for testing

        ④Voxel counts per subject: 15724, 14278, 13039, and 12682

2.5.2. Implementation details

        ①Trained on a V100 32GB GPU

2.6. Results

2.6.1. fMRI/image retrieval

        ①Retrieval performance:

2.6.2. LAION-5B retrieval

        ①Retrieval performance on LAION-5B:

        ②Retrieval results on LAION-5B:

2.6.3. GOD zero-shot classification

        ①Performance:

2.6.4. Ablations and visualization

        ①Ablation of different depth of DFT backbone:

        ②Module ablation:

        ③Retrieval performance with different cerebral cortex for Subject 1 on the NSD dataset:

        ④t-SNE for embedding visualization:

2.7. Limitations

        ①The limited amount of training data

2.8. Conclusion

        ~
