The English here is typed entirely by hand! It summarizes and paraphrases the original paper. Unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments. This post is more of a reading note, so take it with a grain of salt.
1. Takeaways
(1)Among the arXiv papers I have read recently, this one is written quite clearly
2. Paragraph-by-Paragraph Close Reading of the Paper
2.1. Abstract
①Challenge: generating realistic captions and images with both low-level details and high-level semantics
②They proposed UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity
2.2. Introduction
①⭐Caption generation helps to understand the visual experiences of people with mental disorders or communication difficulties
②Their novel contribution is reconstructing captions with a diffusion model:
2.3. Methodology
2.3.1. Latent Diffusion Models
①Noise is added to the input $x_0$ at time step $t$:

$$x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$$

where $\bar{\alpha}_t$ is the weight factor and $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ denotes Gaussian noise
②A U-Net is used to learn the noise, and the noise-prediction loss is:

$$\mathcal{L}_{DM}=\mathbb{E}_{x,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\|\epsilon-\epsilon_\theta(x_t,t)\|_2^2\Big]$$

where $\epsilon_\theta$ is the model (U-Net) trained to predict the noise and $t$ is the time step, uniformly sampled from $\{1,\dots,T\}$
③Facing such an expensive computing cost, LDMs compress the input with an encoder, $z=\mathcal{E}(x)$. Thus the noise-adding process becomes:

$$z_t=\sqrt{\bar{\alpha}_t}\,z_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$$

and the loss function also changes:

$$\mathcal{L}_{LDM}=\mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\|\epsilon-\epsilon_\theta(z_t,t,c)\|_2^2\Big]$$

where $c$ is the conditioning input (e.g. labels, captions, images and semantic maps) fed to the U-Net through the cross-attention blocks (see the sketch below)
④They employed the Versatile Diffusion (VD) model trained on Laion2B-en and COYO-700M
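A minimal PyTorch sketch of the forward-noising step and the conditional LDM loss above. `unet` is a hypothetical noise predictor standing in for the pretrained Versatile Diffusion U-Net, and the linear noise schedule is an assumption, not taken from the paper:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative weight factor \bar{alpha}_t

def add_noise(z0, t, eps):
    # Forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps
    abar = alphas_bar[t].view(-1, 1, 1, 1)       # z0 assumed shaped (B, C, H, W)
    return abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps

def ldm_loss(unet, z0, c):
    # L_LDM = E_{z, eps, t} || eps - eps_theta(z_t, t, c) ||^2
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    z_t = add_noise(z0, t, eps)
    return F.mse_loss(unet(z_t, t, c), eps)      # unet: noise predictor conditioned on c
```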
2.3.2. Image Reconstruction and Captioning
①Framework of UniBrain: four embeddings are regressed from fMRI, where the image and text latents are the low-level features and the CLIP image and text embeddings are the high-level features
②They trained 4 regressors to respectively align fMRI with these 4 embeddings
③The predicted image and text latents are the inputs of the diffusion model, while the predicted CLIP image and text embeddings guide the backward diffusion process
④The CLIP image and text embeddings are mixed by linear interpolation in the cross-attention process of the U-Net (see the sketch below)
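A minimal sketch of the decoding pipeline described above, assuming scikit-learn Ridge regressors, flattened embeddings, and a hypothetical regularization strength. The mixing is simplified here to a direct interpolation of the predicted CLIP embeddings, whereas the paper performs the interpolation inside the U-Net cross-attention:

```python
import numpy as np
from sklearn.linear_model import Ridge

# One regressor per target: image latent, text latent, CLIP image emb., CLIP text emb.
TARGETS = ["latent_img", "latent_txt", "clip_img", "clip_txt"]
regressors = {name: Ridge(alpha=1e4) for name in TARGETS}   # alpha is an assumption

def fit(fmri_train, emb_train):
    # emb_train: dict mapping each target name to an (n_samples, dim) array
    for name in TARGETS:
        regressors[name].fit(fmri_train, emb_train[name])

def decode(fmri_test, mix_rate=0.6):
    pred = {name: regressors[name].predict(fmri_test) for name in TARGETS}
    # Low-level latents initialise the diffusion input; high-level CLIP embeddings
    # (blended by linear interpolation) guide the denoising via cross-attention.
    clip_mix = mix_rate * pred["clip_img"] + (1.0 - mix_rate) * pred["clip_txt"]
    return pred["latent_img"], pred["latent_txt"], clip_mix
```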
2.4. Experiments
2.4.1. Dataset
①Dataset: NSD
②Subjects: 1, 2, 3, 7
③Images: 8859 for training / 982 for testing
2.4.2. Implementation
①Regressor: Ridge regression
②Diffusion steps: 50
③Diffusion strength: 0.75 for both text and image
④Mix rate: 0.6 for image reconstruction and 0.9 for image captioning (see the sketch below for how these settings fit together)
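A rough sketch (not the authors' code) of how the listed settings could plug together: a 50-step DDIM-style sampler where strength 0.75 means denoising starts from the 75% noise level of the predicted latent rather than from pure noise. `unet`, `cond`, and the linear schedule are assumptions:

```python
import torch

def sample(z_pred, unet, cond, num_steps=50, strength=0.75, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
    abar = torch.cumprod(1.0 - betas, dim=0)
    # Keep only the first `strength` fraction of the chain, then walk it backwards.
    steps = torch.linspace(0, T - 1, num_steps).long()[: int(num_steps * strength)].flip(0)
    t0 = steps[0]
    z = abar[t0].sqrt() * z_pred + (1 - abar[t0]).sqrt() * torch.randn_like(z_pred)
    for i, t in enumerate(steps):
        eps = unet(z, t, cond)                       # predicted noise
        z0_hat = (z - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()
        abar_prev = abar[steps[i + 1]] if i + 1 < len(steps) else torch.tensor(1.0)
        z = abar_prev.sqrt() * z0_hat + (1 - abar_prev).sqrt() * eps   # DDIM, eta = 0
    return z
```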
2.4.3. Evaluation Metrics
(1)Vision Metrics
①Low-level: PixCorr, SSIM, AlexNet (a sketch of PixCorr and SSIM follows this list)
②High-level: Inception, CLIP, EffNet, SwAV
(2)Text Metrics
①Low-level: METEOR, ROUGE
②High-level: CLIP
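A rough sketch of the two low-level vision metrics, with assumed preprocessing (grayscale conversion, pixel values scaled to [0, 1]; the paper may resize or normalize differently). PixCorr is the Pearson correlation of flattened pixels and SSIM uses scikit-image:

```python
import numpy as np
from skimage.metrics import structural_similarity

def pixcorr(gt, rec):
    # Pearson correlation between flattened ground truth and reconstruction
    return np.corrcoef(gt.reshape(-1), rec.reshape(-1))[0, 1]

def ssim_score(gt, rec):
    # SSIM on grayscale images with values in [0, 1]
    gray = lambda img: img.mean(axis=-1) if img.ndim == 3 else img
    return structural_similarity(gray(gt), gray(rec), data_range=1.0)
```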
2.5. Results
①Performance of image and text reconstruction:
2.5.1. Image Reconstruction Results
①Compared with other image reconstruction models:
②Quantitative comparison between image reconstruction models:
2.5.2. Image Captioning Results
①Captioning performance comparison table:
2.5.3. Ablation Experiments
①Module ablation performance of image reconstruction:
②Image reconstruction with module ablation:
③Module ablation performance of image captioning:
2.5.4. ROI Analysis
①Restricting the fMRI signal to specific ROI features, the image reconstruction performance is:
(Some contrasts would actually have been nice here, e.g. what the place ROI produces when generating people, etc.)
2.6. Conclusion
~