DecoFuse: Decomposing and Fusing the “What”, “Where”, and “How” for Brain-Inspired fMRI-to-Video Decoding

Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, and Jianfeng Feng
Fudan University
[email protected]

Abstract

Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components—semantic, spatial, and motion—then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.

Refer to caption — Figure 1: Diagram of DecoFuse framework. Inspired by the brain’s two-streams hypothesis [11], the DecoFuse pipeline decomposes video into three components: semantic (“what”), spatial (“where”), and motion (“how”). Neural features are extracted by an fMRI encoder and decomposed to semantic, spatial and motion embeddings. These components are then fused to generate video. Additionally, neural encoding analyzes the differential contribution of semantic and spatial embeddings in predicting signals from the brain’s dorsal and ventral streams, confirming alignment with the two-streams hypothesis [11].

1 Introduction

Visual input is the brain’s primary source of information, making the accurate decoding of visual signals and understanding their encoding processes key challenges in neuroscience and AI. Functional magnetic resonance imaging (fMRI), a non-invasive method for recording whole-brain activity, has become increasingly popular for decoding applications [26]. Meanwhile, advances in techniques like Stable Diffusion (SD)[24] have driven major progress in fMRI-based decoding for images [22, 1, 21, 15, 16, 13], videos [14, 2, 10, 5], and 3D objects [7]. These breakthroughs have delivered remarkable results, bringing the idea of “mind reading” closer to reality.

However, decoding fMRI into video is still inherently challenging! Neuroscience research has shown that different brain regions process various aspects of visual information. The two-streams hypothesis [11, 19] suggests two main pathways for visual processing: the “what” pathway (ventral stream) for object recognition and the “where/how” pathway (dorsal stream) for tracking location and movement. These three components—semantic (what), spatial (where), and motion (how)—are fundamental to video perception. However, fMRI-to-video decoding has mainly focused on semantic information, while decoding spatial and motion aspects, which are crucial for visual experiences, remains a significant yet underexplored challenge [10].

MinD-Video [2] was the first to use Stable Diffusion for fMRI-to-video decoding, aligning fMRI features with text embeddings to reconstruct semantically accurate videos. Several studies have since followed this approach, focusing on semantic alignment [14, 25]. Yeung et al. [31] took a different approach by successfully decoding visual motion information. Particularly, recent works have also explored spatial decoding by predicting the variational autoencoder (VAE) latent of Stable Diffusion as an initial estimate for UNet’s noise input [17, 6, 10]. Despite these efforts, evaluations mostly rely on semantic or pixel-level metrics like classification accuracy and SSIM. How well spatial and motion information can be independently decoded from fMRI remains an open question.

To address these issues, we introduce DecoFuse, a novel brain-inspired framework that decomposes video into three key components—semantic, spatial, and motion information. They are separately decoded and then fused to reconstruct the video in Fig. 1. Aligned with two-streams hypothesis, the learned components are expected to reflect their biological counterparts in the brain as three stages:
Stage 1: A pretrained fMRI encoder extracts semantic, spatial, and motion embeddings. Semantic and spatial embeddings then condition an image generator, defining “what” the object is and “where” it is located, producing a static initial frame.
Stage 2: The motion decoder predicts optical flow using the neural motion embedding and the initial frame, simulating how the brain processes object movement.
Stage 3: A motion-conditioned video generator animates the static frame using the predicted optical flow.

DecoFuse offers two main advantages: (1) It simplifies fMRI-to-video decoding by breaking it into manageable sub-tasks, enhancing performance, and (2) its biologically inspired modular design supports ablation studies, allowing assessment of how well semantic, spatial, and motion information can be independently decoded from fMRI signals.

In our experiments, we evaluated decoding accuracy for each of the three components (semantic, spatial, and motion) and demonstrated superior performance compared to existing SOTA methods [2, 22, 14, 31, 13]. For semantic information, we conducted a classification task on the generated and ground truth (GT) images, achieving 20.8% 50-way accuracy—an improvement of 20.9% over MinD-Video [2]. For spatial information, we applied foreground detection using DINOv2 [20] and obtained a 70.6% accuracy for foreground consistency between generated and GT images, surpassing the previous SOTA performance of 68.7% in NeuroPictor [13]. Regarding motion information, we measured the cosine similarity between the predicted and GT optical flow, achieving a score of 0.212, significantly better than the 0.174 reported by [31]. Moreover, we also assessed the quality of the generated videos, showing 50-way classification accuracy of 21.9%, which outperforms current SOTA methods [2, 22, 14]. We also conducted ablation studies for each component, all of which showed a significant drop in their respective metrics, emphasizing the correspondence between our learned representations and their biological counterparts. Finally, leveraging our brain-inspired decomposition in DecoFuse, we conducted neural encoding of “what” and “where” embeddings, demonstrating alignment with the two-streams hypothesis [11].

In summary, we have these contributions: (1) Novel Brain Decoding Framework: This paper proposes DecoFuse, a novel framework for fMRI-to-video decoding that addresses the challenge of reconstructing videos from brain activity by decomposing the video into three key components: semantic, spatial, and motion information. (2) Novel Designs of Various Encoders and Decoders. Our DecoFuse nontrivially improves upon previous works, featuring novel fMRI, semantic, spatial, and motion encoders. (3) Biologically Plausible Design: DecoFuse’s modular approach closely aligns with the two-streams hypothesis. Our ablation studies demonstrate a strong correlation between the learned representations and their biological counterparts. (4) Differential Neural Encoding: Investigates the alignment of decoded embeddings with the brain’s dorsal and ventral streams; and uses PCA and ridge regression to predict fMRI signals from semantic and spatial embeddings. Essentially, it supports the well established neuroscience theories. (5) Superior Performance: DecoFuse significantly outperforms state-of-the-art methods in decoding semantic, spatial, and motion components.

2 Related Work

fMRI-to-vision reconstruction. Recent advances in fMRI-based decoding have made significant strides in extracting visual information from brain activity, particularly in decoding images, videos, and 3D objects using techniques like Stable Diffusion (SD) [1, 13, 14, 2, 10, 7]. However, fMRI-to-video decoding remains underexplored, especially in terms of spatial and motion components. Early works [2, 14, 25] focused primarily on semantic decoding, while more recent approaches [17, 6, 10, 31] have incorporated VAE latent or motion-specific decoders. Nonetheless, evaluations have typically concentrated on semantic or pixel-level metrics, leaving the reliable decoding of spatial and motion information as an ongoing challenge.

Visual pathways in brain. Numerous studies in neuroscience have explored how the brain processes visual information. The two-streams hypothesis [11, 19] proposes that visual processing is divided into two pathways: the “what” pathway (ventral stream) for object recognition, and the “where”/“how” pathway (dorsal stream) for tracking object location and movement. These pathways correspond to the three components of video—semantic (what), spatial (where), and motion (how)—which are crucial for reconstructing realistic video content.

3 Method

Overview. We decompose the task into semantic, spatial, and motion decoding, respectively. In data preprocessing, raw fMRI frames are aligned with an anatomical brain template [8] to create single-channel images. These fMRI frames are then fed into a large-scale fMRI Pretrained Transformer Encoder (fMRI-PTE) [23], which is pretrained on the UKB [18] dataset. Next, two independent modules separately decode the semantic and spatial embeddings, producing a single image via Stable Diffusion [24]. Finally, using both the fMRI data and the generated image, a motion decoder predicts optical flow, and DragNUWA [32] animates the static object in the image to generate the video.

Generally, combining the two-streams hypothesis (“what” and “where” concepts) with a brain decoding model offers new insights. Building on this, our brain-inspired method links deep learning embeddings to the brain’s encoding process, helping us analyze brain signals more effectively by separating different variables.

3.1 Data Pre-processing

fMRI preprocessing. Some decoding methods flatten each frame and intentionally filter subject-specific activated voxels [30, 2]. In contrast, we align the fMRI data to the fs_LR_32k brain surface space using anatomical structures [8] and unfold the cortical surface to create a 2D image, ensuring a standardized and unified representation across subjects while preserving spatial relationships between adjacent voxels. Given that visual tasks primarily activate specific brain regions [12], we concentrate on early and higher visual cortical Regions of Interest (ROIs) covering 8,405 vertices, as defined by the HCP-MMP atlas [9] in the fs_LR_32k space. Each fMRI frame is then transformed into a one-channel 256 $\times$ 256 image, followed by voxel-wise z-transformation. Additionally, temporally aligned fMRI frames from different runs with the same video stimulus are averaged. Finally, we apply an approximate 6-second temporal shift to the fMRI series considering the inherent time lag between the stimulus input and the peak of the BOLD signal due to the hemodynamic response.

fMRI-stimuli paired data. We follow the MinD-Video [2] and use a sliding window approach to split the CC2017 dataset [30] into fMRI-video paired samples. Specifically, the fMRI-to-video decoding task is reformulated as generating a $T$ -second video from $\alpha T$ -seconds of fMRI data. Additionally, inspired by the two-streams hypothesis [11], which suggests that “what”, “where”, and “how” information is primarily encoded by different brain regions, we decompose the video to semantic, spatial and motion components. These are represented by the initial frame (semantic and spatial) and optical flow (motion).

Assuming there are $n$ frames of fMRI $\mathbf{F}_{i}\in\mathbb{R}^{n\times H_{f}\times W_{f}}$ and $m$ frames of video $\mathbf{V}_{i}\in\mathbb{R}^{m\times 3\times H_{v}\times W_{v}}$ in the $i$ -th window, where $H_{f},W_{f}$ and $H_{v},W_{v}$ denotes the height and width of the unfolded fMRI image and video. Each optical flow $\mathbf{O}_{i}^{k}\in\mathbb{R}^{H_{v}\times W_{v}\times 2}$ is then generated by MemFlow [4] using the initial frame $\mathbf{V}_{i}^{k}$ and the future frame $\mathbf{V}_{i}^{k+\lfloor\frac{m}{2}\rfloor}$ , which can be formulated as

\displaystyle\mathbf{O}_{i}^{k}=\mathrm{MemFlow}(\mathbf{V}_{i}^{k},\mathbf{V}% _{i}^{k+\lfloor\frac{m}{2}\rfloor})

(1)

where $1\leq k\leq\lfloor\frac{m}{2}\rfloor$ .

3.2 DecoFuse pipeline

Visual input is essential for the brain, and many studies have explored how it processes this information. The well-known two-streams hypothesis suggests that the brain processes visual information through two distinct pathways: the “what” pathway (ventral stream) for recognizing objects and the “where/how” pathway (dorsal stream) for tracking their location and movement [11, 19]. Motivated by this, we propose a brain-inspired fMRI-to-video framework, DecoFuse, which decomposes a video into three components: semantic (“what”), spatial (“where”), and motion (“how”), separately decodes each component, and finally fuses them to generate the video.

fMRI encoder. To reduce information loss when encoding high-dimensional fMRI signals into a compact feature space, we apply fMRI-PTE [23], a ViT-based autoencoder pretrained on a large-scale fMRI dataset [18], as our encoder. Unlike those ViT-based encoders that flatten and patchify voxels without preserving spatial information, this approach retains local structure [1, 2]. Each 2D fMRI frame $\mathbf{F}_{i}^{t}\in\mathbb{R}^{H_{f}\times W_{f}}$ is divided into $p$ square patches, where each patch represents a token that captures the spatial relationships between neighboring voxels. These patchified fMRI images are then transformed into token embeddings $\mathbf{F}^{t}_{emb,i}\in\mathbb{R}^{(p+1)\times D_{f}}$ through a series of spatial attention blocks, with $D_{f}$ representing the embedding dimension. The model achieves high-precision reconstruction using only the [CLS] token, yielding an encoder that effectively retains the main information.

Stage 1: semantic and spatial decoding. In this stage, we decode semantic and spatial information from fMRI data to reconstruct static keyframes. Recent advances in image editing [33] demonstrate that high-level semantic latent codes can guide the semantic content of generated images, while updating feature maps allows precise control over spatial composition. Building on this insight, as illustrated in Fig. 2, we employ an fMRI-to-image pipeline that integrates semantic guidance and spatial control to enhance Stable Diffusion (SD) [24]. Based on the high-level and low-level framework from NeuroPictor [13], our approach further deepens the encoding process and augment the semantic encoder to improve decoding performance.

For high-level semantic decoding, we use a semantic encoder $\mathcal{E}_{sem}$ to transform fMRI features $\mathbf{F}_{emb}$ into semantic embeddings $\mathbf{E}_{sem}=\mathcal{E}_{sem}(\mathbf{F}_{emb})$ , replacing the typical text embeddings $\mathbf{E}_{txt}$ in Stable Diffusion, where $\mathbf{E}_{sem},\mathbf{E}_{txt}\in\mathbb{R}^{L_{T}\times D_{T}}$ . Unlike NeuroPictor, which uses convolutional layers and MLPs in its encoder, we use transformer layers to capture semantic information related to the visual stimulus. This helps guide the diffusion model, ensuring the generated image accurately reflects the perceived objects and scene context.

For spatial decoding, we use a spatial encoder $\mathcal{E}_{spa}$ to directly adjust the feature maps in the U-Net architecture of the diffusion model. The spatial embeddings are derived as $\mathbf{E}_{spa}=\mathcal{E}_{spa}(\mathbf{F}_{emb})$ , where $\mathbf{E}_{spa}=\{\mathbf{E}_{spa,(i)}\mid i=1,\ldots,13\}$ , with $\mathbf{E}_{spa,(i)}$ representing the feature map from the $i$ -th encoder block. The spatial encoder applies channel-wise convolutions, MLPs, and transformer layers to refine U-Net feature maps at various levels. The resulting spatial embeddings are processed through zero convolution layers and combined with the intermediate outputs of the SD model using a residual connection:

\displaystyle\tilde{\mathbf{E}}_{spa}=\mathbf{E}_{SD}+\alpha\mathcal{Z}(% \mathbf{E}_{spa})

(2)

where $\mathcal{Z}$ is the zero convolution layer, $\mathbf{E}_{SD}$ represents the latent codes of the SD U-Net, and $\alpha$ is a hyperparameter balancing high-level semantic guidance and fine-grained spatial details. This method effectively controls detailed spatial features, such as object positioning and structural layout.

By combining the semantic guidance $\mathbf{E}_{sem}$ and spatial guidance $\tilde{\mathbf{E}}_{spa}$ derived from fMRI, we can finely control the generated outputs, achieving both semantic and spatial reconstruction of static images.

Stage 2: motion decoding. Previous work [31] has demonstrated that motion information, such as optical flow, can be decoded from fMRI. Therefore, we propose a motion decoder that predicts optical flow of a video based on fMRI and its first frame. In other word, motion decoder functions by “asking” the frozen brain (fMRI) how objects in the first frame are moving in the viewed video. Moreover, we suggest that in a short video (e.g., a 2-second clip), only coarse movement can be reliably encoded in fMRI due to its low temporal and spatial resolution. As a result, our motion decoder $\mathcal{D}_{M}$ predicts only a single frame of low-resolution optical flow for each sample.

\displaystyle\hat{\mathbf{O}}_{i}^{k}=\mathcal{D}_{M}(\mathbf{V}_{i}^{k},% \mathbf{F}_{i})

(3)

To accurately decode motion information, we follow prior image-to-motion work [28], which showed that optical flow classification outperforms direct prediction. First, we flatten the vectors from all optical flow $\mathbf{O}_{i}$ in training set and apply K-means clustering to obtain a codebook $\mathbf{B}\in\mathbb{R}^{N_{vec}\times 2}$ , where $N_{vec}$ is the number of clusters. Each vector in the optical flow $\mathbf{O}_{i}$ is then quantized by its nearest vector in the codebook. The quantized optical flow $\tilde{\mathbf{O}}$ is defined as:

\displaystyle\tilde{\mathbf{O}}_{i,h,w}^{k}=\mathbf{B}_{c^{\star}},\ \ c^{% \star}=\arg\min_{c}\parallel\mathbf{O}_{i,h,w}-\mathbf{B}_{c}\parallel_{2}^{2}

(4)

More specifically, $\mathbf{V}_{i}^{k}$ and $\mathbf{F}_{i}$ are sent to their corresponding pretrained encoders, DINOv2 [20] and fMRI-PTE [23] to generate token-level embeddings. As shown in Fig. 2 (Stage 2), after separate CNN processing, the two embeddings are concatenated and passed through a CNN and softmax layer to predict probability distribution $\mathbf{P}_{i}^{k}\in\mathbb{R}^{H_{o}\times W_{o}\times N_{vec}}$ of vectors in codebook. The final prediction of optical flow is then given by $\hat{\mathbf{O}}_{i}^{k}=\mathbf{P}_{i}^{k}\mathbf{B}$ .

Stage 3: video generation. Based on the pre-generated image and optical flow, we reconstruct the video using DragNUWA [32], a pretrained video diffusion model conditioned on motion. First, to ensure more stable video generation, we mask the optical flow using foreground detection from DINOv2 [20]. Additionally, to generate an $N_{f}$ -frame video, we extend the single-frame optical flow by linearly dividing the vector to $N_{f}-1$ sub-vectors.

3.3 Differential neural encoding

Since DecoFuse is inspired by the two-streams hypothesis [11], we conduct neural encoding to examine whether and how the decoded embeddings differentially align with the two streams identified in biological studies. To prevent overfitting, we first apply Principal Component Analysis (PCA) to reduce the dimension of the semantic embedding $\mathbf{E}_{sem}$ and spatial embedding $\mathbf{E}_{spa}$ , resulting in $\mathbf{E}^{PCA}_{sem},\mathbf{E}^{PCA}_{spa}\in\mathbb{R}^{T\times D}$ , where $T$ is the number of time points in the fMRI volumes, and $D$ is the reduced dimension. We then use ridge regression to predict the Gaussian-smoothed and flattened fMRI data $\mathbf{F}\in\mathbb{R}^{T\times N_{v}}$ based on semantic or spatial embeddings, where $N_{v}$ represents the number of voxels.

\displaystyle\hat{\mathbf{F}}_{X}=\mathrm{RidgeRegressor}(\mathbf{E}^{PCA}_{X}% ),X\in\{sem,spa\}

(5)

Next, we compute the average temporal correlation $\mathbf{r}_{X}$ for the predicted and GT fMRI signals over a window size $T_{w}$ :

	$\displaystyle\mathbf{r}_{X}=\frac{1}{N_{w}}\sum_{t=1}^{N_{w}}\mathrm{corr}(% \mathbf{F}_{X,t:t+T_{w}},\hat{\mathbf{F}}_{X,t:t+T_{w}}),$
	$\displaystyle X\in\{sem,spa\}$		(6)

Following metrics in [3], we differentiate the relative contributions of the semantic and spatial embeddings in predicting the brain’s dorsal and ventral streams for individual voxels:

\displaystyle\mathbf{p}_{spa}=\frac{\mathbf{r}_{spa}^{2}}{\mathbf{r}_{spa}^{2}% +\mathbf{r}_{sem}^{2}}-0.5

(7)

Here, $\mathbf{p}_{spa}$ ranges from -0.5 to 0.5. A value of $p_{spa}>0$ indicates that the spatial embedding better predicts the voxel, while $p_{spa}<0$ suggests that semantic embedding provides a better prediction.

3.4 Training Strategy

We perform training in both Stage 1 and Stage 2.

Stage 1. We freeze the SD model to retain its strong image synthesis capabilities, while finetuning the semantic, spatial, and fMRI encoders to extract semantic and spatial information from fMRI data. Since the image represents static information, we use a single fMRI frame for image decoding. The pipeline is trained with fMRI-image pairs $(\mathbf{F}_{i}^{1},\mathbf{V}_{i}^{k})$ , where $1\leq k\leq\lfloor\frac{m}{2}\rfloor$ denotes data augmentation for random initial frame.

Specifically, the input image $\mathbf{V}_{i}^{k}$ is first encoded into a latent representation $z_{0}$ . The diffusion process then progressively adds noise to $z_{0}$ over $t$ time steps, resulting in a noisy latent $z_{t}$ . During the denoising stage, the frozen U-Net predicts a denoised version of $z_{t}$ , conditioned on the time step $t$ , semantic embedding $\mathbf{E}_{emb}$ , and spatial embedding $E_{spa}$ . The denoising loss for optimizing the SD latent is defined as follow:

\displaystyle\mathcal{L}_{s1}=\mathbb{E}_{z_{0},t,\mathbf{F}_{emb},\epsilon% \sim\mathcal{N}(0,1)}\Big{[}\parallel\epsilon-\epsilon_{\theta}(z_{t},t,% \mathbf{F}_{emb},\mathbf{E}_{emb})\parallel^{2}_{2}\Big{]}

(8)

Stage 2. For training the motion decoder, we use fMRI-image-motion paired data $(\mathbf{F}_{i},\mathbf{V}_{i}^{k},\mathbf{O}_{i}^{k})$ , where $1\leq k\leq\lfloor\frac{m}{2}\rfloor$ denotes data augmentation for random frame. We combine cross-entropy loss $\mathcal{L}_{entropy}$ and mean squared error (MSE) loss $\mathcal{L}_{MSE}$ to form the total loss $\mathcal{L}_{s2}=\mathcal{L}_{entropy}+\lambda_{2}\mathcal{L}_{MSE}$ for training the motion decoder $\mathcal{D}_{M}$ .

	$\displaystyle\mathcal{L}_{entropy}$	$\displaystyle=\mathrm{CrossEntropy}(\mathbf{P}_{i}^{k},\mathbf{c}_{i}^{k})$		(9)
	$\displaystyle\mathcal{L}_{MSE}$	$\displaystyle=\parallel\mathbf{O}_{i}^{k}-\hat{\mathbf{O}}_{i}^{k}\parallel_{2% }^{2}$		(10)

where $\mathbf{c}_{i}^{k}$ is codebook label for optical flow $\mathbf{O}_{i}^{k}$ .

4 Experiments

Pre-training Dataset. The UK Biobank (UKB) [18] is a large-scale biomedical resource that gathers extensive genetic and health-related data from roughly 500,000 individuals across the UK. A subset of this repository is utilized, specifically the resting-state fMRI data from approximately 39,630 participants. Each participant provides a single session consisting of 490 time-point volumes.

Paired fMRI-video Dataset. Experiments used the CC2017 dataset [30], which pairs fMRI data with video stimuli. It includes data from three participants, with fMRI frames captured using a 3T MRI scanner at a 2-second repetition time (TR). The dataset covers about 3 hours of video and provides around 5,500 fMRI-stimulus pairs per subject.

Vision metrics. 1) Semantic-level. Following Mind-Video [2], we use both image-based and video-based classification metrics to assess semantic-level performance. For image classification, we rely on ImageNet classifier. For the video-based metrics, we apply a similar classification framework, utilizing VideoMAE [27]. In both cases, the N-way top-K accuracy metric is employed, where for video the top-3 predicted classes are compared against the ground truth (GT) class. Specifically, N candidates include the ground truth class along with N-1 randomly selected classes from the classifier’s full class set. This approach is consistent with the methodology used in MinD-Video. 2) Spatial-level. We evaluate spatial performance by calculating the ratio of foreground-background matching between the ground truth and decoded images. Foreground detection is performed using DINOv2 [20]. Let the matrix $\mathbf{M}\in\{0,1\}^{H\times W}$ represent the foreground mask, where $\mathbf{M}_{i,j}=1$ indicates pixel $(i,j)$ is detected as foreground, and $\mathbf{M}_{i,j}=0$ indicates background. The matching ratio $r_{m}$ is then calculated as follows:

\displaystyle r_{m}=1-\frac{\parallel\mathbf{M}_{GT}-\mathbf{M}_{pred}% \parallel_{0}}{H\times W}

(11)

where $\mathbf{M}_{GT},\mathbf{M}_{pred}$ represent foreground mask metrics of GT and predicted images, respectively. The value of $r_{m}$ ranges from 0 to 1, with a value closer to 1 indicating better matching of the foreground and background between ground truth and predicted images, and a value closer to 0 indicating worse matched. 3) Pixel-level. We use the structural similarity index measure (SSIM) [29] to assess pixel-level decoding performance. For video evaluation, SSIM is computed for each frame of both the ground truth and reconstructed videos, with results averaged across frames.

Motion metrics. We evaluate motion decoding performance using cosine similarity between the ground truth and decoded optical flow vectors. To handle scene changes that may produce invalid optical flow, we apply scene-change detection to remove such samples. Additionally, we mask all predicted optical flow using foreground detection and also mask ground truth values close to the zero vector to reduce noise. Specifically, the shortest cluster in the quantized codebook is set to the zero vector.

Implementation Details. For CC2017 [30], we used an fMRI window of $\alpha$ T = 2s to generate videos lasting T = 2s and videos were downsampled to 8 FPS. All training and inference processes were conducted on a single NVIDIA A100 GPU. Please refer to the Supplementary for detailed hyperparameter configurations and additional training information. Codes&Models will be released.

4.1 Verifying the ‘What’ and ‘Where’ Factor

	Methods	Semantic-level		Spatial-level
	Methods	2-way	50-way	$r_{m}$
sub1	MinD-Video [2]	0.792	0.172	0.660
	fMRI-PTE-video [14]	0.793	0.169	0.652
	NeuroPictor [13]	0.808	0.195	0.687
	DecoFuse(w/o what)	0.774	0.130	0.704
	DecoFuse(w/o where)	0.792	0.171	0.668
	DecoFuse(1 frame)	0.816	0.201	0.690
	DecoFuse	0.824	0.208	0.706
sub2	MinD-Video	0.784	0.158	0.669
	fMRI-PTE-video	0.780	0.159	0.648
	NeuroPictor	0.785	0.169	0.679
	DecoFuse	0.802	0.190	0.692
sub3	MinD-Video	0.812	0.193	0.662
	fMRI-PTE-video	0.799	0.173	0.637
	NeuroPictor	0.803	0.194	0.671
	DecoFuse	0.816	0.215	0.689

Table 1: Results of fMRI-to-image decoding. Evaluations of semantic and spatial metrics are presented for all three subjects, with bolded results indicating performance surpassing all baselines.

To isolate the “What” and “Where” components in decoded images from fMRI signals, we compare our method, DecoFuse, with other established fMRI-to-video decoding approaches, including MinD-Video [2], fMRI-PTE-video [14], and NeuroPictor [13].

Our findings, as in Fig. 3 and Tab. 1, show results across three subjects with both semantic and spatial metrics. (1) DecoFuse consistently outperforms the other methods in these metrics, capturing detailed semantic content and accurately decoding spatial locations. This shows DecoFuse’s ability to better align “What” (semantic content) and “Where” (spatial arrangement) from brain activity, setting a new benchmark in fMRI-to-video decoding. (2) At the semantic level, DecoFuse achieves significantly higher accuracy than the other methods. For example, in subject 1, DecoFuse achieves a 50-way accuracy of 0.208, compared to 0.172 for MinD-Video, 0.169 for fMRI-PTE-video, and 0.195 for NeuroPictor. This trend holds across subjects, with DecoFuse leading in both 2-way and 50-way accuracy, demonstrating its effectiveness in capturing semantic content from fMRI data. (3) At the spatial level, DecoFuse excels in preserving spatial locations. For instance, in subject 1, DecoFuse achieves a matching ratio of 0.706, outperforming MinD-Video (0.660), fMRI-PTE-video (0.652), and NeuroPictor (0.687), indicating better object localization.

To evaluate the impact of semantic and spatial features, we ablate these embeddings in DecoFuse respectively. DecoFuse(w/o where), which excludes spatial features, shows a clear drop in spatial metrics, confirming their importance. DecoFuse(w/o what), which removes semantic conditioning, experiences a significant decline in semantic accuracy but retains a high spatial score of 0.704. Additionally, to reduce randomness, DecoFuse generates 20 frames and selects the one with the least deviation (see Supplementary for details), while DecoFuse(1 frame) generates only a single frame. The results show that filtering one frame from multiple frames improves performance by reducing generation variance. Overall, DecoFuse excels in both semantic and spatial decoding, capturing fine fMRI details and generating high-quality visual reconstructions, surpassing previous methods.

4.2 Verifying the ‘How’ factor

	Methods	20%	30%	40%	50%	60%
	Methods	cosine similarity
sub1	F2M [31]	0.174
	DecoFuse(w/o fMRI)	0.051	0.049	0.026	0.016	-0.042
	DecoFuse	0.139	0.147	0.150	0.212	0.179
sub2	F2M	0.085
sub2	DecoFuse	0.045	0.052	0.055	-0.028	-0.137
sub3	F2M	0.110
sub3	DecoFuse	0.106	0.129	0.153	0.144	0.050

Table 2: Results of fMRI-to-motion decoding. The details of how F2M [31] computes cosine similarity are not provided. Therefore, we evaluate our method on optical flow where the foreground occupies more than various ratios.

‘Disclaimer’. Since there is no direct way to make a fair comparison for the “How” factor, we adapt optical flow metrics for evaluation. However, optical flow is highly sensitive to various factors—occlusions, rapid motion and motion blur, changes in illumination, and even noise or artifacts—all of which commonly appear in generated images of all methods. As a result, it is challenging to quantify the exact impact these sensitivities might have on our comparisons. Nonetheless, optical flow still provides a useful baseline metric, offering a general gauge for assessing the effectiveness of each method.

To assess motion decoding performance, we measure cosine similarity between predicted and ground truth optical flow vectors across varying foreground coverage levels. In Tab. 2, each percentile (e.g., 20%, 30%, etc.) represents the proportion of the scene occupied by the foreground, offering insights into how well each model decodes motion with emphasis on larger, more prominent objects. This approach reflects the human tendency to focus on movement associated with larger scene elements.

The motion decoding results in Fig. 4 and Tab. 2 demonstrate DecoFuse’s capabilities relative to the fMRI-to-motion (F2M) method [31], using cosine similarity across these foreground thresholds. Although exact comparisons are limited by the F2M algorithm’s incomplete details, DecoFuse presents a notable edge. For instance, our method’s computation of optical flow at one-second intervals introduces added complexity, yet DecoFuse still demonstrates strong performance. In particular, DecoFuse excels in capturing motion within larger foreground regions, outperforming F2M. This pattern supports our hypothesis that DecoFuse aligns closely with human perceptual biases, effectively prioritizing motion decoding for visually dominant areas. These results affirm DecoFuse’s robust motion decoding ability, especially in challenging conditions that require precision with significant scene elements.

We also tested optical flow prediction after ablating fMRI input, which is equivalent to optical flow prediction based only on images. The results show that predictions based solely on images perform much worse compared to predictions made with both fMRI and images. This suggests that the model successfully learns motion information from the fMRI data.

4.3 More Ablation Study

Other impacting factors in decoding videos. We further evaluate the direct decoding of videos from fMRI by semantic-level accuracy and structural similarity (SSIM), following the metrics used in [2]. For each subject, we report both 2-way and 50-way semantic accuracy. As shown in Tab. 3, DecoFuse demonstrates best performance on most cases, highlighting the improved accuracy of our decoded videos. These results affirm DecoFuse’s effectiveness in preserving both semantic and structural details from fMRI data. We also provide visualizations of the decoded frames in Fig. 5, highlighting the clarity and fidelity of our approach. Additionally, we assess video decoding (DecoFuse(NeuroPictor)) based on images generated by NeuroPictor [13], showing a significant decrease in semantic metrics, which further proves the improvement of our fMRI-to-image decoding pipeline.

	Methods	Semantic-level		Pixel-level
	Methods	2-way	50-way	SSIM
sub1	LEA [22]	0.825	0.149	0.137
	MinD-Video [2]	0.853	0.202	0.171
	fMRI-PTE-video [14]	0.851	0.214	0.193
	DecoFuse(NeuroPictor)	0.839	0.204	0.370
	DecoFuse	0.855	0.219	0.339
sub2	LEA	0.826	0.148	0.145
	MinD-Video	0.841	0.173	0.171
	fMRI-PTE-video	0.834	0.192	0.182
	DecoFuse	0.846	0.193	0.306
sub3	LEA	0.834	0.160	0.137
	MinD-Video	0.846	0.216	0.187
	fMRI-PTE-video	0.851	0.225	0.176
	DecoFuse	0.856	0.218	0.314

Table 3: Results of fMRI-to-video decoding. Our method outperforms the baselines in most cases, with bolded results highlighting superior performance over all baselines.

Differential Neural Encoding. We explore how the brain encodes semantic and spatial information through distinct pathways using differential neural encoding,as visualized in Fig. 6. It highlights the contributions of semantic and spatial features in predicting fMRI responses. So our findings support the two-streams hypothesis [11]. In the primary visual cortex, when $p_{spa}$ approaches 0, both types of information are encoded equally. As processing progresses through the dorsal and ventral pathways, a bias emerges, favoring spatial or semantic cues, respectively. In higher-order regions, such as the frontal lobe, this distinction diminishes, supporting the idea that our approach decodes brain activity in a manner consistent with biological encoding processes.

5 Conclusion

This paper introduces DecoFuse, a novel fMRI-to-video decoding framework that separates video into semantic, spatial, and motion components. By independently decoding these aspects, DecoFuse provides a more accurate reconstruction of visual experiences, addressing the brain’s “what”, “where”, and “how” pathways. Unlike existing methods focused on semantic information, DecoFuse incorporates spatial and motion components for more realistic video reconstruction.

References

Chen et al. [2023a] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22710–22720, 2023a.
Chen et al. [2023b] Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity. arXiv preprint arXiv:2305.11675, 2023b.
Choi et al. [2023] Minkyu Choi, Kuan Han, Xiaokai Wang, Yizhen Zhang, and Zhongming Liu. A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains. Advances in Neural Information Processing Systems, 36:50408–50428, 2023.
Dong and Fu [2024] Qiaole Dong and Yanwei Fu. Memflow: Optical flow estimation and prediction with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19068–19078, 2024.
Fosco et al. [a] Camilo Fosco, Benjamin Lahner, Bowen Pan, Alex Andonian, Emilie Josephs, Alex Lascelles, and Aude Oliva. Brain netflix: Scaling data to reconstruct videos from brain signals. a.
Fosco et al. [b] Camilo Fosco, Benjamin Lahner, Bowen Pan, Alex Andonian, Emilie Josephs, Alex Lascelles, and Aude Oliva. Brain netflix: Scaling data to reconstruct videos from brain signals. b.
Gao et al. [2023] Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, and Yanwei Fu. Mind-3d: Reconstruct high-quality 3d objects in human brain. arXiv preprint arXiv:2312.07485, 2023.
Glasser et al. [2013] Matthew F Glasser, Stamatios N Sotiropoulos, J Anthony Wilson, Timothy S Coalson, Bruce Fischl, Jesper L Andersson, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R Polimeni, et al. The minimal preprocessing pipelines for the human connectome project. Neuroimage, 80:105–124, 2013.
Glasser et al. [2016] Matthew F. Glasser, Timothy S. Coalson, Emma Claire Robinson, Carl D. Hacker, John W. Harwell, Essa Yacoub, Kâmil Uğurbil, Jesper L. R. Andersson, Christian F. Beckmann, Mark Jenkinson, Stephen M. Smith, and David C. Van Essen. A multi-modal parcellation of human cerebral cortex. Nature, 536:171 – 178, 2016.
Gong et al. [2024] Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Changwei Wang, Rongtao Xu, Liang Hu, et al. Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction. arXiv preprint arXiv:2410.19452, 2024.
Goodale and Milner [1992] Melvyn A. Goodale and A.David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992.
Huang et al. [2021] Shuo Huang, Wei Shao, Mei-Ling Wang, and Dao-Qiang Zhang. fmri-based decoding of visual information from human brain activity: A brief review. International Journal of Automation and Computing, 18(2):170–184, 2021.
Huo et al. [2025] Jingyang Huo, Yikai Wang, Yun Wang, Xuelin Qian, Chong Li, Yanwei Fu, and Jianfeng Feng. Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation. In European Conference on Computer Vision, pages 56–73. Springer, 2025.
Li et al. [2024] Chong Li, Xuelin Qian, Yun Wang, Jingyang Huo, Xiangyang Xue, Yanwei Fu, and Jianfeng Feng. Enhancing cross-subject fmri-to-video decoding with global-local functional alignment. In European Conference on Computer Vision, pages 353–369. Springer, 2024.
Lin et al. [2022] Sikun Lin, Thomas Sprague, and Ambuj K Singh. Mind reader: Reconstructing complex images from brain activities. Advances in Neural Information Processing Systems, 35:29624–29636, 2022.
Liu et al. [2023] Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding from fmri. arXiv preprint arXiv:2302.12971, 2023.
Lu et al. [2024] Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, and Huiguang He. Animate your thoughts: Decoupled reconstruction of dynamic natural vision from slow brain activity. arXiv preprint arXiv:2405.03280, 2024.
Miller et al. [2016] Karla L. Miller, Fidel Alfaro-Almagro, Neal Kepler Bangerter, David L. Thomas, Essa Yacoub, Junqian Xu, Andreas J. Bartsch, Saâd Jbabdi, Stamatios N. Sotiropoulos, Jesper L. R. Andersson, Ludovica Griffanti, Gwenaëlle Douaud, Thomas W. Okell, Peter J. Weale, Iulius Dragonu, Steve Garratt, Sarah Hudson, Rory Collins, Mark Jenkinson, Paul M. Matthews, and Stephen M. Smith. Multimodal population brain imaging in the uk biobank prospective epidemiological study. Nature neuroscience, 19:1523 – 1536, 2016.
Milner and Goodale [2006] David Milner and Mel Goodale. The visual brain in action. Oup Oxford, 2006.
Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
Ozcelik and VanRullen [2023] Furkan Ozcelik and Rufin VanRullen. Brain-diffuser: Natural scene reconstruction from fmri signals using generative latent diffusion. arXiv preprint arXiv:2303.05334, 2023.
Qian et al. [2023a] Xuelin Qian, Yikai Wang, Yanwei Fu, Xinwei Sun, Xiangyang Xue, and Jianfeng Feng. Joint fmri decoding and encoding with latent embedding alignment. arXiv preprint arXiv:2303.14730, 2023a.
Qian et al. [2023b] Xuelin Qian, Yun Wang, Jingyang Huo, Jianfeng Feng, and Yanwei Fu. fmri-pte: A large-scale fmri pretrained transformer encoder for multi-subject brain activity decoding. arXiv preprint arXiv:2311.00342, 2023b.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Sun et al. [2024] Jingyuan Sun, Mingxiao Li, Zijiao Chen, and Marie-Francine Moens. Neurocine: Decoding vivid video sequences from human brain activties. arXiv preprint arXiv:2402.01590, 2024.
Tong and Pratte [2012] Frank Tong and Michael S Pratte. Decoding patterns of human brain activity. Annual review of psychology, 63:483–509, 2012.
Tong et al. [2022] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems, 2022.
Walker et al. [2015] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In Proceedings of the IEEE international conference on computer vision, pages 2443–2451, 2015.
Wang et al. [2004] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
Wen et al. [2017] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, and Zhongming Liu. Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. Cerebral Cortex, 28(12):4136–4160, 2017.
Yeung et al. [2024] Jacob Yeung, Andrew F Luo, Gabriel Sarch, Margaret M Henderson, Deva Ramanan, and Michael J Tarr. Neural representations of dynamic visual stimuli. arXiv preprint arXiv:2406.02659, 2024.
Yin et al. [2023] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.