[ACM MM 2024] Lite-Mind: Towards Efficient and Robust Brain Representation Learning

Paper link: Lite-Mind: Towards Efficient and Robust Brain Representation Learning | Proceedings of the 32nd ACM International Conference on Multimedia

The English here is typed entirely by hand and summarizes/paraphrases the original paper, so some spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This article leans toward personal notes, so read with care.

Contents

1. 心得

2. 论文逐段精读

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. Brain Visual Decoding

2.3.2. Fourier Transform in Deep Learning

2.4. Lite-Mind

2.4.1. Overview

2.4.2. DFT Backbone

2.4.3. Retrieval Pipeline

2.5. Experiments

2.5.1. Dataset

2.5.2. Implementation details

2.6. Results

2.6.1. fMRI/image retrieval

2.6.2. LAION-5B retrieval

2.6.3. GOD zero-shot classification

2.6.4. Ablations and visualization

2.7. Limitations

2.8. Conclusion

1. Takeaways

(1) When reconstruction falls short, retrieval is also a viable path

2. Section-by-section close reading of the paper

2.1. Abstract

        ①Limitations of fMRI-based image retrieval: scarce data, low signal-to-noise ratio, and individual variation

2.2. Introduction

        ①Page-constrained conferences and long journal papers both like to include a bit of related work in the introduction

        ②The authors aim to design a specific lightweight model for each subject:

2.3. Related Work

2.3.1. Brain Visual Decoding

        ①Lists Mindreader, BrainClip, Mind-Vis, and MindEye, pointing out that none of them considered lightweight networks

2.3.2. Fourier Transform in Deep Learning

        ①Introduces how the Fourier Transform is used in the digital signal processing field

2.4. Lite-Mind

2.4.1. Overview

        ①The overview of Lite-Mind:

where (a) is the backbone of MindEye, (b) represents Lite-Mind

2.4.2. DFT Backbone

        ①fMRI-image pair: (x,y)

        ②Dataset: D

(1)fMRI Spectrum Compression

        ①Divide the fMRI voxels x into n non-overlapping patches x=\left [ x_1,x_2,...,x_n \right ], using zero padding as needed

        ②Apply positional encoding to the patches to obtain tokens t=\left [ t_1,t_2,...,t_n \right ]; the tokens are then transformed by the 1D Discrete Fourier Transform (DFT):

X[k]=F(t)=\sum_{i=1}^{n}t_{i}e^{-j(2\pi/n)ki}

where X\in\mathbb{C}^{n\times d} is a complex tensor, 2\pi k/n denotes the frequency, i is the token index, and j is the imaginary unit
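The two steps above (zero-padded patching, then a 1D DFT along the token axis) can be sketched in numpy; the function names and toy sizes are mine, not the paper's, and the positional encoding is omitted:

```python
import numpy as np

def patchify(x, n):
    """Split a 1-D voxel vector into n non-overlapping patches, zero-padding the tail."""
    d = int(np.ceil(len(x) / n))      # per-patch length
    padded = np.zeros(n * d)
    padded[:len(x)] = x
    return padded.reshape(n, d)       # tokens t = [t_1, ..., t_n]

def dft_tokens(t):
    """1-D DFT over the token axis: X[k] = sum_i t_i * exp(-j * (2*pi/n) * k * i)."""
    return np.fft.fft(t, axis=0)      # complex tensor X in C^{n x d}

x = np.arange(10.0)                   # toy stand-in for one fMRI sample
t = patchify(x, n=4)                  # 4 tokens, zero-padded to length 12
X = dft_tokens(t)
print(t.shape, X.shape)               # (4, 3) (4, 3)
```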

        ③For M filters \mathbf{K}=[\mathbf{k}_{1},\mathbf{k}_{2},...,\mathbf{k}_{M}], the features can be extracted by:

\hat{X}=\sum_{m=1}^{M}\frac{1}{n}|X|^{2}\odot\mathbf{k}_{m}\cos\left(\frac{(2m-1)\pi}{2M}\right)

where \hat{X}\in\mathbb{C}^{n\times d}, \odot denotes element-wise multiplication, and |X|^{2} is the power spectrum of X
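A numpy sketch of this filter-bank step, assuming the M filters are learnable tensors with the same shape as X (my reading of the equation; names and sizes are hypothetical):

```python
import numpy as np

def filter_bank(X, K):
    """X_hat = sum_{m=1}^{M} (1/n) * |X|^2 (elem-wise) k_m * cos((2m-1)*pi / 2M)."""
    n, M = X.shape[0], len(K)
    power = np.abs(X) ** 2                               # power spectrum |X|^2
    X_hat = np.zeros(X.shape)
    for m, k_m in enumerate(K, start=1):                 # m runs 1..M
        X_hat += (power * k_m) / n * np.cos((2 * m - 1) * np.pi / (2 * M))
    return X_hat

rng = np.random.default_rng(0)
X = np.fft.fft(rng.standard_normal((4, 3)), axis=0)      # spectrum from the DFT step
K = [rng.standard_normal((4, 3)) for _ in range(2)]      # M = 2 learnable filters
X_hat = filter_bank(X, K)
print(X_hat.shape)                                       # (4, 3)
```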

        ④Convert the spectrum back into the spatial domain by Inverse Discrete Fourier Transform (IDFT):

\hat{t}\leftarrow F^{-1}(\hat{X})

(2)Frequency Projector

        ①Align voxel and image embeddings via FreMLP:

X^{\prime}=\sigma(\hat{X}^{T}\mathcal{W}+\mathcal{B})^{T}

where \mathcal{W}\in\mathbb{C}^{n\times n^{\prime}} is a complex-valued weight matrix, \mathcal{B}\in\mathbb{C}^{n^{\prime}} is a complex-valued bias, X^{\prime}\in\mathbb{C}^{n^{\prime}\times d} is the final output, and \sigma denotes the activation function. This can be expanded to:

\begin{aligned} X^{\prime} & =(\sigma(Re(\hat{X}^{T})\mathcal{W}_{r}-Im(\hat{X}^{T})\mathcal{W}_{i}+\mathcal{B}_{r}) \\ & +j\sigma(Re(\hat{X}^{T})\mathcal{W}_{i}+Im(\hat{X}^{T})\mathcal{W}_{r}+\mathcal{B}_{i}))^{T} \end{aligned}

where Re\left ( \cdot \right ) and Im\left ( \cdot \right ) take the real and imaginary parts, with \mathcal{W}=\mathcal{W}_{r}+j\mathcal{W}_{i} and \mathcal{B}=\mathcal{B}_{r}+j\mathcal{B}_{i}
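A minimal numpy sketch of the FreMLP layer as expanded above, keeping the real and imaginary weight matrices explicit; tanh stands in for the unspecified activation \sigma, and all sizes are toy values:

```python
import numpy as np

def fremlp(X_hat, Wr, Wi, Br, Bi, act=np.tanh):
    """FreMLP: complex linear layer W = Wr + j*Wi, B = Br + j*Bi, with the
    activation applied to the real and imaginary parts separately."""
    Xt = X_hat.T                                   # (d, n)
    real = act(Xt.real @ Wr - Xt.imag @ Wi + Br)   # Re part of sigma(X^T W + B)
    imag = act(Xt.real @ Wi + Xt.imag @ Wr + Bi)   # Im part
    return (real + 1j * imag).T                    # X' in C^{n' x d}

rng = np.random.default_rng(0)
n, n_out, d = 4, 6, 3
X_hat = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
Wr, Wi = rng.standard_normal((n, n_out)), rng.standard_normal((n, n_out))
Br, Bi = rng.standard_normal(n_out), rng.standard_normal(n_out)
X_prime = fremlp(X_hat, Wr, Wi, Br, Bi)
print(X_prime.shape)                               # (6, 3)
```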

        ②Employ IDFT again:

t^{\prime}\leftarrow F^{-1}(X^{\prime})

and f is the voxel embedding

2.4.3. Retrieval Pipeline

        ①Optimization objective:

\omega^{*}=\underset{\omega}{\operatorname{argmax}}\sum_{(x,y)\in D}SIM(DFT(x;\omega),CLIP(y))

where \omega denotes the weights of the DFT backbone and SIM\left ( \cdot \right ) denotes cosine similarity

        ②For LAION-5B retrieval, the voxel embedding f is further mapped to a predicted image embedding by a diffusion model:

\mathcal{V}^{\prime}=Diffusion(f)

        ③Contrastive loss:

L_{contr}=-\frac{1}{|B|}\sum_{s=1}^{|B|}\log\frac{\exp(f_{s}^{\top}\cdot V_{s}/\tau)}{\sum_{i=1}^{|B|}\exp(f_{s}^{\top}\cdot V_{i}/\tau)}

where |B| denotes the batch size and \tau is the temperature factor

        ④MSE loss to constrain the image generation:

L_{mse}=\frac{1}{|B|}\sum_{s=1}^{|B|}\|V_{s}-V_{s}^{\prime}\|_{2}^{2}

        ⑤Final loss:

L=L_{contr}+\alpha L_{mse}
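The training losses above can be sketched in numpy as follows; the temperature \tau and the weight \alpha here are placeholder values, not the paper's settings:

```python
import numpy as np

def contrastive_loss(f, V, tau=0.07):
    """InfoNCE: -(1/|B|) sum_s log( exp(f_s.V_s/tau) / sum_i exp(f_s.V_i/tau) )."""
    f = f / np.linalg.norm(f, axis=1, keepdims=True)     # unit-normalize embeddings
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    logits = f @ V.T / tau                               # (|B|, |B|) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal

def mse_loss(V, V_pred):
    """(1/|B|) sum_s ||V_s - V'_s||_2^2."""
    return np.mean(np.sum((V - V_pred) ** 2, axis=1))

rng = np.random.default_rng(0)
f, V, V_pred = (rng.standard_normal((8, 16)) for _ in range(3))
alpha = 0.5                                              # assumed weight, not from the paper
total = contrastive_loss(f, V) + alpha * mse_loss(V, V_pred)
print(float(total) > 0)                                  # True
```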

        ⑥Tasks: test set retrieval, LAION-5B retrieval, zero-shot classification

2.5. Experiments

2.5.1. Dataset

        ①Dataset: Natural Scenes Dataset (NSD)

        ②Samples: subjects 1, 2, 5, and 7, each with 10000 images

        ③Data split: 8859 image stimuli (24980 trials) for training and 982 image stimuli (2770 trials) for testing

        ④Voxel counts per subject: 15724, 14278, 13039, and 12682

2.5.2. Implementation details

        ①Trained on a V100 32GB GPU

2.6. Results

2.6.1. fMRI/image retrieval

        ①Retrieval performance:

2.6.2. LAION-5B retrieval

        ①Retrieval performance on LAION-5B:

        ②Retrieval results on LAION-5B:

2.6.3. GOD zero-shot classification

        ①Performance:

2.6.4. Ablations and visualization

        ①Ablation of different depth of DFT backbone:

        ②Module ablation:

        ③Retrieval performance with different cerebral cortex for Subject 1 on the NSD dataset:

        ④t-SNE for embedding visualization:

2.7. Limitations

        ①The limited amount of training data

2.8. Conclusion

        ~
