[arXiv 2023] UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

Paper link: [2308.07428] UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

The English is typed entirely by hand and summarizes/paraphrases the original paper. Unavoidable spelling and grammar mistakes may appear; corrections in the comments are welcome! This post reads more like personal notes, so take it with a grain of salt.

Table of Contents

1. Takeaways

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Methodology

2.3.1. Latent Diffusion Models

2.3.2. Image Reconstruction and Captioning

2.4. Experiments

2.4.1. Dataset

2.4.2. Implementation

2.4.3. Evaluation Metric

2.5. Results

2.5.1. Image Reconstruction Results

2.5.2. Image Captioning Results

2.5.3. Ablation Experiments

2.5.4. ROI Analysis

2.6. Conclusion

1. Takeaways

(1) Among the arXiv papers I have read recently, this one is written quite clearly

2. Section-by-Section Reading

2.1. Abstract

        ①Challenge: generating realistic captions and images with both low-level details and high-level semantics

        ②They proposed UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

2.2. Introduction

        ①⭐Caption generation helps to understand the visual experiences of people with mental disorders or communication difficulties

        ②Novelty: they reconstruct captions with a diffusion model:

2.3. Methodology

2.3.1. Latent Diffusion Models

        ①Noise added input at time t:

x_{t}=\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon_{t}

where \alpha_{t} is a weighting factor and \epsilon_{t} denotes Gaussian noise

        ②A U-Net is used to learn the noise, and the noise-prediction loss is:

L_{DM}=E_{x_0,\epsilon\sim\mathcal{N}(0,1),t}\left[||\epsilon-\epsilon_\theta(x_t,t)||^2\right]

where \epsilon_{\theta}\left(\cdot\right) is the model (U-Net) trained to predict the noise, and t\in\{1,2,...,T\} indexes the time steps
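The forward noising step and the noise-prediction loss above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the linear \alpha schedule is made up, and a zero predictor stands in for the U-Net \epsilon_\theta.

```python
# Toy sketch of the DDPM forward process and noise-prediction loss.
# The alpha schedule and the zero "predictor" are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# cumulative alpha_t, decaying from ~1 to ~0 (illustrative linear schedule)
alpha = np.linspace(0.9999, 0.0001, T)

def add_noise(x0, t):
    """x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps"""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha[t]) * x0 + np.sqrt(1.0 - alpha[t]) * eps
    return x_t, eps

def dm_loss(eps, eps_pred):
    """L_DM = E[ ||eps - eps_theta(x_t, t)||^2 ]"""
    return float(np.mean((eps - eps_pred) ** 2))

x0 = rng.standard_normal((4, 64, 64))
x_t, eps = add_noise(x0, t=500)
# a real eps_theta is a trained U-Net; a zero predictor stands in here,
# so the loss is just the mean squared magnitude of the noise (close to 1)
loss = dm_loss(eps, np.zeros_like(eps))
```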

        ③To avoid such an expensive computing cost, LDMs compress the input x_0 into a latent z_{0}=E(x_{0}) with an encoder E. The noise-adding process thus becomes:

z_{t}=\sqrt{\alpha_{t}}z_{0}+\sqrt{1-\alpha_{t}}\epsilon_{t}

and the loss function also changes:

L_{LDM}=E_{z_{0},c,\epsilon\sim\mathcal{N}(0,1),t}\left[\|\epsilon-\epsilon_{\theta}(z_{t},t,\tau_{\theta}(c))\|^{2}\right]

where \tau_{\theta}(c) is the conditioning input (e.g., labels, captions, images, and semantic maps) fed to the U-Net through its cross-attention blocks
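The latent-space variant only moves the same computation behind an encoder and adds a conditioning argument. In this sketch the encoder, the conditional predictor, and the conditioning embedding are all hypothetical stand-ins (a real LDM uses a VAE encoder and a cross-attention U-Net):

```python
# Sketch of the LDM step: encode x0 to a latent z0, add noise in latent
# space, and score a conditioned predictor. All networks are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.linspace(0.9999, 0.0001, 1000)

def encode(x0):
    """Stand-in for the VAE encoder E(x0); real LDMs map images to 4x64x64 latents."""
    return x0[:, ::4, ::4]          # crude strided downsampling as a placeholder

def eps_theta(z_t, t, cond):
    """Stand-in for the conditional U-Net; cond plays the role of tau_theta(c)."""
    return np.zeros_like(z_t)

x0 = rng.standard_normal((4, 256, 256))
z0 = encode(x0)                     # z0 = E(x0), here 4x64x64
t = 500
eps = rng.standard_normal(z0.shape)
z_t = np.sqrt(alpha[t]) * z0 + np.sqrt(1.0 - alpha[t]) * eps

cond = rng.standard_normal((77, 768))   # e.g. a caption embedding (hypothetical)
loss = float(np.mean((eps - eps_theta(z_t, t, cond)) ** 2))
```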

        ④They employ the Versatile Diffusion (VD) model, trained on Laion2B-en and COYO-700M

2.3.2. Image Reconstruction and Captioning

        ①Framework of UniBrain:

where the Z's are low-level features and the C's are high-level ones (Z_I \in \mathbb{R}^{4 \times 64 \times 64}, Z_T \in \mathbb{R}^{1 \times 768}, C_I \in \mathbb{R}^{257 \times 768}, C_T \in \mathbb{R}^{77 \times 768})

        ②They train four regressors, one per embedding, to align fMRI signals with the four embeddings

        ③Z_I and Z_T are the inputs of the diffusion model, while C_I and C_T guide the reverse diffusion process

        ④C_I and C_T are mixed by linear interpolation in the U-Net's cross-attention blocks
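The linear-interpolation mixing of the two conditionings can be sketched as below. Note this is a simplification: truncating the image tokens so the two streams share a shape is my illustrative shortcut, since VD actually handles the image and text streams inside its own cross-attention layers.

```python
# Sketch of mixing image and text conditionings by linear interpolation
# before cross-attention. Shapes follow the paper; the truncation of C_I
# to 77 tokens is an illustrative simplification, not VD's mechanism.
import numpy as np

rng = np.random.default_rng(2)
C_I = rng.standard_normal((257, 768))   # image conditioning (CLIP vision tokens)
C_T = rng.standard_normal((77, 768))    # text conditioning (CLIP text tokens)

def mix_conditioning(c_a, c_b, rate):
    """Linear interpolation: rate * c_a + (1 - rate) * c_b (shapes must match)."""
    return rate * c_a + (1.0 - rate) * c_b

# 0.6 is the paper's mix rate for image reconstruction (0.9 for captioning)
mixed = mix_conditioning(C_I[:77], C_T, rate=0.6)
```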

2.4. Experiments

2.4.1. Dataset

        ①Dataset: NSD

        ②Subjects: 1, 2, 3, 7

        ③Images: 8859 for training / 982 for testing

2.4.2. Implementation

        ①Regressor: Ridge regression

        ②Diffusion steps: 50

        ③Diffusion strength: 0.75 for both text and image

        ④Mix rate: 0.6 for image reconstruction and 0.9 for image captioning
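The ridge regressors that map fMRI voxels to the target embeddings can be sketched with the closed-form ridge solution on synthetic data. All sizes here are toy values, not NSD's actual voxel counts, and the regularization strength is arbitrary:

```python
# Sketch of one of the four regressors: ridge regression from fMRI voxels
# to a target embedding matrix (e.g. a 768-d CLIP text embedding per image).
# Sizes and lambda are illustrative; data are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n_train, n_voxels, emb_dim = 200, 300, 768   # toy sizes

X = rng.standard_normal((n_train, n_voxels))                # fMRI responses
W_true = rng.standard_normal((n_voxels, emb_dim)) / np.sqrt(n_voxels)
Y = X @ W_true                                              # synthetic embeddings

lam = 1.0
# closed-form ridge: W = (X^T X + lam * I)^{-1} X^T Y
W = np.linalg.solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Y)
Y_pred = X @ W      # predicted embeddings, fed to the diffusion model
```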

2.4.3. Evaluation Metric

(1)Vision Metric

        ①Low-level: PixCorr, SSIM, AlexNet

        ②High-level: Inception, CLIP, EffNet, SwAV

(2)Text Metric

        ①Low-level: Meteor, Rouge

        ②High-level: CLIP
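Of the metrics above, PixCorr is the simplest to illustrate: the Pearson correlation between flattened ground-truth and reconstructed pixels. This is a simplified version; the paper's evaluation pipeline may resize images before comparison:

```python
# Simplified PixCorr: Pearson correlation over flattened pixel values.
import numpy as np

def pixcorr(img_a, img_b):
    a, b = img_a.ravel(), img_b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(4)
gt = rng.random((64, 64, 3))                # stand-in ground-truth image
identical = pixcorr(gt, gt)                 # perfect reconstruction -> 1.0
noisy = pixcorr(gt, gt + 0.1 * rng.standard_normal(gt.shape))
```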

2.5. Results

        ①Performance of image and text reconstruction:

2.5.1. Image Reconstruction Results

        ①Comparison with other image reconstruction models:

        ②Quantitative comparison between image reconstruction models:

2.5.2. Image Captioning Results

        ①Captioning performance comparison table:

2.5.3. Ablation Experiments

        ①Module ablation performance of image reconstruction:

        ②Image reconstruction with module ablation:

        ③Module ablation performance of image captioning:

2.5.4. ROI Analysis

        ①When the fMRI signal is clipped to specific ROI features, the image reconstruction performance becomes:

(Some contrasts would actually be nice here, e.g., what happens when the Place ROI is used to generate people, etc.)

2.6. Conclusion

        ~
