[IEEE ICIP 2020] Generation of Viewed Image Captions From Human Brain Activity Via Unsupervised Text Latent Space

Paper link: Generation of Viewed Image Captions From Human Brain Activity Via Unsupervised Text Latent Space | IEEE Conference Publication | IEEE Xplore

The English below is typed by hand! It is my summarizing and paraphrasing of the original paper, so occasional spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome. This post leans toward personal notes, so read it with that in mind.

Contents

1. Takeaways

2. Close Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Generation of Viewed Image Captions From Human Brain Activity

2.3.1. Construction of Image Captioning Model

2.3.2. Conversion of fMRI data into Text Features

2.3.3. Text Feature Transformation with unlabeled images

2.4. Experiments

2.4.1. Experimental Settings

2.4.2. Experimental Results

2.5. Conclusion

3. Reference

1. Takeaways

(1) Why is the writing in the last few papers I've read so opaque?

2. Close Reading of the Paper

2.1. Abstract

        ①They incorporate richer semantic features, via an unsupervised text latent space, to generate captions of viewed images from brain activity

2.2. Introduction

        ①Because paired training data are limited, they adopt an unsupervised approach to capture features from unlabeled data

2.3. Generation of Viewed Image Captions From Human Brain Activity

        ①An overview of their method:

2.3.1. Construction of Image Captioning Model

        ①For each image I^{i}\ (i=1,2,...,N_{c}), the image embedding v^{i}\in\mathbb{R}^{D_{v}} is obtained by a pretrained DNN (where exactly do the images appear in the main figure?? Where is the DNN, and where is the linear layer?):

v^i=\mathrm{DNN}(I^i)

        ②The dimension of v^{i} is reduced by a linear layer to v^{\prime i}\in\mathbb{R}^{D_{v}^{\prime}} with D_{v}^{\prime}<D_{v}:

v^{\prime i}=W_\mathrm{linear}v^i

        ③The image captioning network consists of LSTM units l^{j}(\cdot)\ (j=0,1,...,N_{l}), and words are converted to vectors by word2vec. For the caption words S_{n}^{i}\ (n=0,1,...,N_{s}^{i}), generation starts from:

t^i=l^0(v^{\prime i},\mathrm{word2vec}(S_0))

        ④Caption training uses cross-entropy (CE) loss and the Adam optimizer. A minimal sketch of steps ①-③ follows this list.
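As a reading aid, here is a minimal PyTorch sketch of steps ①-③ under stated assumptions: the backbone choice (ResNet-50), every dimension, and the fusion of v'^i with the word2vec vector by concatenation are mine, not the paper's; the paper only fixes the symbols above. Training per step ④ would wrap the LSTM outputs in a word-level CE loss optimized with Adam.

```python
import torch
import torch.nn as nn
from torchvision import models

# Step ①: pretrained DNN as feature extractor, v^i = DNN(I^i).
# A ResNet-50 with its classifier removed is an assumption for illustration.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()                     # output dimension D_v = 2048
backbone.eval()

D_v, D_v_prime, D_w, D_t = 2048, 512, 300, 512  # hypothetical dims, D'_v < D_v

# Step ②: linear dimensionality reduction, v'^i = W_linear v^i.
W_linear = nn.Linear(D_v, D_v_prime, bias=False)

# Step ③: the first LSTM unit l^0 takes v'^i and word2vec(S_0), outputs t^i.
l0 = nn.LSTMCell(input_size=D_v_prime + D_w, hidden_size=D_t)

with torch.no_grad():
    I = torch.randn(1, 3, 224, 224)             # stand-in for an image I^i
    v = backbone(I)                             # v^i ∈ R^{D_v}
    v_prime = W_linear(v)                       # v'^i = W_linear v^i ∈ R^{D'_v}
    w_s0 = torch.randn(1, D_w)                  # stand-in for word2vec(S_0)
    t_i, _ = l0(torch.cat([v_prime, w_s0], dim=1))  # t^i = l^0(v'^i, word2vec(S_0))
```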

2.3.2. Conversion of fMRI data into Text Features

        ①Previous methods first convert fMRI data into image features and then convert those image features into text. The authors argue that this two-stage conversion is cumbersome and loses information, so they use a single-stage conversion instead

        ②For fMRI data x^{l}\in\mathbb{R}^{D_{f}}\ (l=1,2,...,N_{f}), the target text feature comes from the first LSTM unit:

t^l=l^0(v^{\prime l},S_0)

        ③The regression that predicts the text feature from fMRI data:

\hat{t}^l=W^\top x^l+b

(The paper writes the regression output as t^l again, the same symbol as the LSTM target in ②; the two are only approximately equal, which is what ④ below enforces, so a hat is added here to disambiguate.)

        ④W and b are fitted by ridge regression (a sketch follows this list):

\min_{\boldsymbol{W},\boldsymbol{b}}\sum_{l=1}^{N_f}\|\boldsymbol{t}^l-(\boldsymbol{W}^\top\boldsymbol{x}^l+\boldsymbol{b})\|_2^2+\alpha\|\boldsymbol{W}\|_2^2

(So ② and ③ are not asserted to be equal: the objective minimizes the squared distance between the LSTM target t^l and the prediction W^\top x^l+b.)
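A minimal sketch of the fMRI-to-text regression, assuming scikit-learn's Ridge (whose objective, squared error plus \alpha\|W\|_2^2, matches the formula above); all array sizes are placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge

N_f, D_f, D_t = 1200, 4000, 512     # hypothetical numbers of samples / dims
X = np.random.randn(N_f, D_f)       # rows: fMRI samples x^l
T = np.random.randn(N_f, D_t)       # rows: LSTM text-feature targets t^l

reg = Ridge(alpha=1.0)              # alpha = regularization weight in the objective
reg.fit(X, T)                       # learns W (reg.coef_) and b (reg.intercept_)

x_test = np.random.randn(1, D_f)
t_hat = reg.predict(x_test)         # predicted text feature W^T x_test + b
```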

2.3.3. Text Feature Transformation with unlabeled images

        ①Text features \tilde{t}^{m}\in\mathbb{R}^{D_{t}} are extracted from unlabeled images \tilde{I}^{m}\ (m=1,2,...,N_{a})

        ②The embedding of the test fMRI data x_{\mathrm{test}}:

z=W^\top x_{\mathrm{test}}+b

        ③Calculate the Euclidean distance d^{m} between each \tilde{t}^{m} and z, then combine z with its k nearest neighbors to obtain the refined feature (the sum runs over those k neighbors; a sketch follows):

y=\beta z+\frac{1-\beta}{k}\sum_{m=1}^k\tilde{t}^m
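A minimal NumPy sketch of steps ②-③; \beta, k, and all dimensions are placeholder values chosen for illustration:

```python
import numpy as np

def refine(z, t_tilde, k=5, beta=0.5):
    """z: (D_t,) decoded fMRI feature; t_tilde: (N_a, D_t) unlabeled-image text features."""
    d = np.linalg.norm(t_tilde - z, axis=1)   # Euclidean distances d^m
    nn_idx = np.argsort(d)[:k]                # indices of the k nearest neighbors
    # y = beta * z + (1 - beta)/k * sum of the k nearest features (i.e. their mean)
    return beta * z + (1.0 - beta) * t_tilde[nn_idx].mean(axis=0)

D_t = 512
z = np.random.randn(D_t)                      # z = W^T x_test + b from step ②
t_tilde = np.random.randn(38532, D_t)         # \tilde{t}^m from the unlabeled images
y = refine(z, t_tilde)                        # refined text feature y
```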

2.4. Experiments

2.4.1. Experimental Settings

        ①Dataset:

T. Horikawa and Y. Kamitani, "Generic decoding of seen and imagined objects using hierarchical visual features," Nature Communications, vol. 8, p. 15037, 2017.

        ②Image categories: 150 for training / 50 unseen categories for testing

        ③Number of images: 1,200 for training / 50 for testing

        ④Unlabeled images: 38,532

        ⑤Image captions: from MSCOCO

        ⑥Caption evaluation: cosine similarity between Sent2Vec embeddings (sketch after this list)
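A minimal sketch of the cosine-similarity evaluation; the vectors are assumed to be Sent2Vec embeddings of the generated and reference captions, obtained elsewhere, and the embedding dimension is a placeholder:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = a·b / (|a||b|); small epsilon guards against zero vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# e_gen, e_ref stand in for Sent2Vec embeddings of generated / reference captions
e_gen = np.random.randn(600)              # placeholder embedding dimension
e_ref = np.random.randn(600)
score = cosine_similarity(e_gen, e_ref)   # closer to 1 = more similar captions
```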

2.4.2. Experimental Results

        ①An example of a generated caption:

        ②Performance comparison:

2.5. Conclusion

        ~

3. Reference

@INPROCEEDINGS{9191262,
  author={Takada, Saya and Togo, Ren and Ogawa, Takahiro and Haseyama, Miki},
  booktitle={2020 IEEE International Conference on Image Processing (ICIP)}, 
  title={Generation of Viewed Image Captions From Human Brain Activity Via Unsupervised Text Latent Space}, 
  year={2020},
  volume={},
  number={},
  pages={2521-2525},
  keywords={Functional magnetic resonance imaging;Semantics;Training;Feature extraction;Brain modeling;Computer architecture;Image captioning;deep neural network (DNN);neuroscience;functional magnetic resonance imaging (fMRI).},
  doi={10.1109/ICIP40778.2020.9191262}}
