[arXiv 2023] UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

Paper link: [2308.07428] UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

The English is typed entirely by hand and summarizes/paraphrases the original paper. Unavoidable spelling and grammar mistakes may appear; corrections in the comments are welcome! This post reads more like personal notes, so take it with a grain of salt.

Table of Contents

1. Takeaways

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Methodology

2.3.1. Latent Diffusion Models

2.3.2. Image Reconstruction and Captioning

2.4. Experiments

2.4.1. Dataset

2.4.2. Implementation

2.4.3. Evaluation Metric

2.5. Results

2.5.1. Image Reconstruction Results

2.5.2. Image Captioning Results

2.5.3. Ablation Experiments

2.5.4. ROI Analysis

2.6. Conclusion

1. Takeaways

(1) Among the arXiv papers I have read recently, this one is written quite clearly

2. Section-by-Section Reading

2.1. Abstract

        ①Challenge: generating realistic captions and images with both low-level details and high-level semantics

        ②They proposed UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

2.2. Introduction

        ①⭐Caption generation helps to understand the visual experiences of people with mental disorders or communication difficulties

        ②Novelty: they reconstruct captions with a diffusion model:

2.3. Methodology

2.3.1. Latent Diffusion Models

        ①Noise added input at time t:

x_{t}=\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon_{t}

where \alpha_{t} is a weighting factor and \epsilon_{t} denotes Gaussian noise

        ②A U-Net is used to learn the noise, and the noise-prediction loss is:

L_{DM}=E_{x_0,\epsilon\sim\mathcal{N}(0,1),t}\left[||\epsilon-\epsilon_\theta(x_t,t)||^2\right]

where \epsilon_{\theta}\left(\cdot\right) is the model (U-Net) trained to predict the noise, and t\in\{1,2,...,T\} indexes the time steps
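The forward noising step and the noise-prediction loss above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the linear \alpha schedule is made up, and a zero predictor stands in for the U-Net \epsilon_\theta.

```python
# Toy sketch of the DDPM forward process and noise-prediction loss.
# The alpha schedule and the zero "predictor" are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# cumulative alpha_t, decaying from ~1 to ~0 (illustrative linear schedule)
alpha = np.linspace(0.9999, 0.0001, T)

def add_noise(x0, t):
    """x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps"""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha[t]) * x0 + np.sqrt(1.0 - alpha[t]) * eps
    return x_t, eps

def dm_loss(eps, eps_pred):
    """L_DM = E[ ||eps - eps_theta(x_t, t)||^2 ]"""
    return float(np.mean((eps - eps_pred) ** 2))

x0 = rng.standard_normal((4, 64, 64))
x_t, eps = add_noise(x0, t=500)
# a real eps_theta is a trained U-Net; a zero predictor stands in here,
# so the loss is just the mean squared magnitude of the noise (close to 1)
loss = dm_loss(eps, np.zeros_like(eps))
```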

        ③To avoid such an expensive computing cost, LDMs compress the input x_0 into a latent z_{0}=E(x_{0}) with an encoder E. The noise-adding process thus becomes:

z_{t}=\sqrt{\alpha_{t}}z_{0}+\sqrt{1-\alpha_{t}}\epsilon_{t}

and the loss function also changes:

L_{LDM}=E_{z_{0},c,\epsilon\sim\mathcal{N}(0,1),t}\left[\|\epsilon-\epsilon_{\theta}(z_{t},t,\tau_{\theta}(c))\|^{2}\right]

where \tau_{\theta}(c) is the conditioning input (e.g., labels, captions, images, and semantic maps) fed to the U-Net through its cross-attention blocks
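The latent-space variant only moves the same computation behind an encoder and adds a conditioning argument. In this sketch the encoder, the conditional predictor, and the conditioning embedding are all hypothetical stand-ins (a real LDM uses a VAE encoder and a cross-attention U-Net):

```python
# Sketch of the LDM step: encode x0 to a latent z0, add noise in latent
# space, and score a conditioned predictor. All networks are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.linspace(0.9999, 0.0001, 1000)

def encode(x0):
    """Stand-in for the VAE encoder E(x0); real LDMs map images to 4x64x64 latents."""
    return x0[:, ::4, ::4]          # crude strided downsampling as a placeholder

def eps_theta(z_t, t, cond):
    """Stand-in for the conditional U-Net; cond plays the role of tau_theta(c)."""
    return np.zeros_like(z_t)

x0 = rng.standard_normal((4, 256, 256))
z0 = encode(x0)                     # z0 = E(x0), here 4x64x64
t = 500
eps = rng.standard_normal(z0.shape)
z_t = np.sqrt(alpha[t]) * z0 + np.sqrt(1.0 - alpha[t]) * eps

cond = rng.standard_normal((77, 768))   # e.g. a caption embedding (hypothetical)
loss = float(np.mean((eps - eps_theta(z_t, t, cond)) ** 2))
```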

        ④They employ the Versatile Diffusion (VD) model, trained on Laion2B-en and COYO-700M

2.3.2. Image Reconstruction and Captioning

        ①Framework of UniBrain:

where the Z's are low-level features and the C's are high-level ones (Z_I \in \mathbb{R}^{4 \times 64 \times 64}, Z_T \in \mathbb{R}^{1 \times 768}, C_I \in \mathbb{R}^{257 \times 768}, C_T \in \mathbb{R}^{77 \times 768})

        ②They train four regressors, one per embedding, to align fMRI signals with the four embeddings

        ③Z_I and Z_T are the inputs of the diffusion model, while C_I and C_T guide the reverse diffusion process

        ④C_I and C_T are mixed by linear interpolation in the U-Net's cross-attention blocks
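The linear-interpolation mixing of the two conditionings can be sketched as below. Note this is a simplification: truncating the image tokens so the two streams share a shape is my illustrative shortcut, since VD actually handles the image and text streams inside its own cross-attention layers.

```python
# Sketch of mixing image and text conditionings by linear interpolation
# before cross-attention. Shapes follow the paper; the truncation of C_I
# to 77 tokens is an illustrative simplification, not VD's mechanism.
import numpy as np

rng = np.random.default_rng(2)
C_I = rng.standard_normal((257, 768))   # image conditioning (CLIP vision tokens)
C_T = rng.standard_normal((77, 768))    # text conditioning (CLIP text tokens)

def mix_conditioning(c_a, c_b, rate):
    """Linear interpolation: rate * c_a + (1 - rate) * c_b (shapes must match)."""
    return rate * c_a + (1.0 - rate) * c_b

# 0.6 is the paper's mix rate for image reconstruction (0.9 for captioning)
mixed = mix_conditioning(C_I[:77], C_T, rate=0.6)
```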

2.4. Experiments

2.4.1. Dataset

        ①Dataset: NSD

        ②Subjects: 1, 2, 3, 7

        ③Images: 8859 for training / 982 for testing

2.4.2. Implementation

        ①Regressor: Ridge regression

        ②Diffusion steps: 50

        ③Diffusion strength: 0.75 for both text and image

        ④Mix rate: 0.6 for image reconstruction and 0.9 for image captioning
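The ridge regressors that map fMRI voxels to the target embeddings can be sketched with the closed-form ridge solution on synthetic data. All sizes here are toy values, not NSD's actual voxel counts, and the regularization strength is arbitrary:

```python
# Sketch of one of the four regressors: ridge regression from fMRI voxels
# to a target embedding matrix (e.g. a 768-d CLIP text embedding per image).
# Sizes and lambda are illustrative; data are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n_train, n_voxels, emb_dim = 200, 300, 768   # toy sizes

X = rng.standard_normal((n_train, n_voxels))                # fMRI responses
W_true = rng.standard_normal((n_voxels, emb_dim)) / np.sqrt(n_voxels)
Y = X @ W_true                                              # synthetic embeddings

lam = 1.0
# closed-form ridge: W = (X^T X + lam * I)^{-1} X^T Y
W = np.linalg.solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Y)
Y_pred = X @ W      # predicted embeddings, fed to the diffusion model
```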

2.4.3. Evaluation Metric

(1)Vision Metric

        ①Low-level: PixCorr, SSIM, AlexNet

        ②High-level: Inception, CLIP, EffNet, SwAV

(2)Text Metric

        ①Low-level: Meteor, Rouge

        ②High-level: CLIP
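Of the metrics above, PixCorr is the simplest to illustrate: the Pearson correlation between flattened ground-truth and reconstructed pixels. This is a simplified version; the paper's evaluation pipeline may resize images before comparison:

```python
# Simplified PixCorr: Pearson correlation over flattened pixel values.
import numpy as np

def pixcorr(img_a, img_b):
    a, b = img_a.ravel(), img_b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(4)
gt = rng.random((64, 64, 3))                # stand-in ground-truth image
identical = pixcorr(gt, gt)                 # perfect reconstruction -> 1.0
noisy = pixcorr(gt, gt + 0.1 * rng.standard_normal(gt.shape))
```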

2.5. Results

        ①Performance of image and text reconstruction:

2.5.1. Image Reconstruction Results

        ①Comparison with other image reconstruction models:

        ②Quantitative comparison between image reconstruction models:

2.5.2. Image Captioning Results

        ①Captioning performance comparison table:

2.5.3. Ablation Experiments

        ①Module ablation performance of image reconstruction:

        ②Image reconstruction with module ablation:

        ③Module ablation performance of image captioning:

2.5.4. ROI Analysis

        ①When the fMRI signal is clipped to specific ROI features, the image reconstruction performance becomes:

(Some contrasts would actually be nice here, e.g., what happens when the Place ROI is used to generate people, etc.)

2.6. Conclusion

        ~
