DecoFuse: Decomposing and Fusing the “What”, “Where”, and “How” for Brain-Inspired fMRI-to-Video Decoding

Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, and Jianfeng Feng
Fudan University
Abstract

Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, all three aspects are essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components (semantic, spatial, and motion), then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterparts, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.

Figure 1: Diagram of the DecoFuse framework. Inspired by the brain’s two-streams hypothesis [11], the DecoFuse pipeline decomposes video into three components: semantic (“what”), spatial (“where”), and motion (“how”). Neural features are extracted by an fMRI encoder and decomposed into semantic, spatial, and motion embeddings. These components are then fused to generate video. Additionally, a neural encoding analysis examines the differential contribution of semantic and spatial embeddings in predicting signals from the brain’s dorsal and ventral streams, confirming alignment with the two-streams hypothesis [11].
Figure 2: Details of the DecoFuse framework. Neural features are extracted by an fMRI encoder and decomposed into semantic, spatial, and motion embeddings through three independent encoders. These components are then fused to generate video in three stages: (1) fMRI-to-image decoding, which uses Stable Diffusion and ControlNet to generate static images based on high-level semantic and low-level spatial embeddings; (2) fMRI-to-motion decoding, which predicts optical flow using an image- and fMRI-based motion decoder to capture dynamic elements of the video; (3) fMRI-to-video decoding, where the decoded image and optical flow are combined to generate the final video using a motion-conditioned video diffusion model.

1 Introduction

Visual input is the brain’s primary source of information, making the accurate decoding of visual signals and understanding their encoding processes key challenges in neuroscience and AI. Functional magnetic resonance imaging (fMRI), a non-invasive method for recording whole-brain activity, has become increasingly popular for decoding applications [26]. Meanwhile, advances in techniques like Stable Diffusion (SD) [24] have driven major progress in fMRI-based decoding for images [22, 1, 21, 15, 16, 13], videos [14, 2, 10, 5], and 3D objects [7]. These breakthroughs have delivered remarkable results, bringing the idea of “mind reading” closer to reality.

Decoding fMRI into video nevertheless remains inherently challenging. Neuroscience research has shown that different brain regions process different aspects of visual information. The two-streams hypothesis [11, 19] suggests two main pathways for visual processing: the “what” pathway (ventral stream) for object recognition and the “where/how” pathway (dorsal stream) for tracking location and movement. These three components, semantic (what), spatial (where), and motion (how), are fundamental to video perception. However, fMRI-to-video decoding has mainly focused on semantic information, while decoding spatial and motion aspects, which are crucial for visual experiences, remains a significant yet underexplored challenge [10].

MinD-Video [2] was the first to use Stable Diffusion for fMRI-to-video decoding, aligning fMRI features with text embeddings to reconstruct semantically accurate videos. Several studies have since followed this approach, focusing on semantic alignment [14, 25]. Yeung et al. [31] took a different approach by successfully decoding visual motion information. Recent works have also explored spatial decoding by predicting the variational autoencoder (VAE) latent of Stable Diffusion as an initial estimate for the U-Net’s noise input [17, 6, 10]. Despite these efforts, evaluations mostly rely on semantic or pixel-level metrics such as classification accuracy and SSIM. How well spatial and motion information can be independently decoded from fMRI remains an open question.

To address these issues, we introduce DecoFuse, a novel brain-inspired framework that decomposes video into three key components: semantic, spatial, and motion information. As shown in Fig. 1, these components are decoded separately and then fused to reconstruct the video. In line with the two-streams hypothesis, the learned components are expected to reflect their biological counterparts in the brain. The pipeline proceeds in three stages:
Stage 1: A pretrained fMRI encoder extracts semantic, spatial, and motion embeddings. Semantic and spatial embeddings then condition an image generator, defining “what” the object is and “where” it is located, producing a static initial frame.
Stage 2: The motion decoder predicts optical flow using the neural motion embedding and the initial frame, simulating how the brain processes object movement.
Stage 3: A motion-conditioned video generator animates the static frame using the predicted optical flow.

DecoFuse offers two main advantages: (1) It simplifies fMRI-to-video decoding by breaking it into manageable sub-tasks, enhancing performance, and (2) its biologically inspired modular design supports ablation studies, allowing assessment of how well semantic, spatial, and motion information can be independently decoded from fMRI signals.

In our experiments, we evaluated decoding accuracy for each of the three components (semantic, spatial, and motion) and demonstrated superior performance compared to existing SOTA methods [2, 22, 14, 31, 13]. For semantic information, we conducted a classification task on the generated and ground truth (GT) images, achieving 82.4% accuracy, an improvement of 20.9% over MinD-Video [2]. For spatial information, we applied foreground detection using DINOv2 [20] and obtained 70.6% accuracy for foreground consistency between generated and GT images, surpassing the previous SOTA performance of 68.7% in NeuroPictor [13]. Regarding motion information, we measured the cosine similarity between the predicted and GT optical flow, achieving a score of 0.212, significantly better than the 0.174 reported by [31]. We also assessed the quality of the generated videos, obtaining a 50-way classification accuracy of 21.9%, which outperforms current SOTA methods [2, 22, 14]. In addition, we conducted ablation studies for each component, all of which showed a significant drop in their respective metrics, emphasizing the correspondence between our learned representations and their biological counterparts. Finally, leveraging the brain-inspired decomposition in DecoFuse, we conducted neural encoding of the “what” and “where” embeddings, demonstrating alignment with the two-streams hypothesis [11].

In summary, our contributions are as follows. (1) Novel Brain Decoding Framework: We propose DecoFuse, a novel framework for fMRI-to-video decoding that addresses the challenge of reconstructing videos from brain activity by decomposing the video into three key components: semantic, spatial, and motion information. (2) Novel Designs of Various Encoders and Decoders: DecoFuse nontrivially improves upon previous works, featuring novel fMRI, semantic, spatial, and motion encoders. (3) Biologically Plausible Design: DecoFuse’s modular approach closely aligns with the two-streams hypothesis, and our ablation studies demonstrate a strong correspondence between the learned representations and their biological counterparts. (4) Differential Neural Encoding: We investigate the alignment of decoded embeddings with the brain’s dorsal and ventral streams, using PCA and ridge regression to predict fMRI signals from semantic and spatial embeddings. The results support well-established neuroscience theories. (5) Superior Performance: DecoFuse significantly outperforms state-of-the-art methods in decoding semantic, spatial, and motion components.

2 Related Work

fMRI-to-vision reconstruction. Recent advances in fMRI-based decoding have made significant strides in extracting visual information from brain activity, particularly in decoding images, videos, and 3D objects using techniques like Stable Diffusion (SD) [1, 13, 14, 2, 10, 7]. However, fMRI-to-video decoding remains underexplored, especially in terms of spatial and motion components. Early works [2, 14, 25] focused primarily on semantic decoding, while more recent approaches [17, 6, 10, 31] have incorporated VAE latent or motion-specific decoders. Nonetheless, evaluations have typically concentrated on semantic or pixel-level metrics, leaving the reliable decoding of spatial and motion information as an ongoing challenge.

Visual pathways in brain. Numerous studies in neuroscience have explored how the brain processes visual information. The two-streams hypothesis [11, 19] proposes that visual processing is divided into two pathways: the “what” pathway (ventral stream) for object recognition, and the “where”/“how” pathway (dorsal stream) for tracking object location and movement. These pathways correspond to the three components of video—semantic (what), spatial (where), and motion (how)—which are crucial for reconstructing realistic video content.

3 Method

Overview. We decompose the task into semantic, spatial, and motion decoding. In data preprocessing, raw fMRI frames are aligned with an anatomical brain template [8] to create single-channel images. These fMRI frames are then fed into a large-scale fMRI Pretrained Transformer Encoder (fMRI-PTE) [23], which is pretrained on the UKB [18] dataset. Next, two independent modules separately decode the semantic and spatial embeddings, which together condition Stable Diffusion [24] to produce a single image. Finally, using both the fMRI data and the generated image, a motion decoder predicts optical flow, and DragNUWA [32] animates the static object in the image to generate the video.

More broadly, combining the two-streams hypothesis (the “what” and “where” concepts) with a brain decoding model offers new insights. By tying each deep learning embedding to a distinct part of the brain’s encoding process, our brain-inspired method lets us analyze brain signals more effectively, because the different variables are separated.

3.1 Data Pre-processing

fMRI preprocessing. Some decoding methods flatten each frame and select subject-specific activated voxels [30, 2]. In contrast, we align the fMRI data to the fs_LR_32k brain surface space using anatomical structures [8] and unfold the cortical surface to create a 2D image, ensuring a standardized and unified representation across subjects while preserving spatial relationships between adjacent voxels. Given that visual tasks primarily activate specific brain regions [12], we concentrate on early and higher visual cortical Regions of Interest (ROIs) covering 8,405 vertices, as defined by the HCP-MMP atlas [9] in the fs_LR_32k space. Each fMRI frame is then transformed into a one-channel $256 \times 256$ image, followed by voxel-wise z-transformation. Additionally, temporally aligned fMRI frames from different runs with the same video stimulus are averaged. Finally, we shift the fMRI series by approximately 6 seconds to account for the hemodynamic delay between stimulus onset and the peak of the BOLD signal.
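The last three preprocessing steps (z-transformation, averaging across repeated runs, and the temporal shift) can be sketched for a single vertex as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the repetition time `tr_seconds=2.0` is an assumed value that the text does not state, and z-scoring each run before averaging is one plausible reading of the order described above.

```python
from statistics import fmean, pstdev


def zscore(series):
    """Standardize one vertex's time series to zero mean and unit variance."""
    mu = fmean(series)
    sd = pstdev(series)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in series]


def preprocess_voxel(runs, tr_seconds=2.0, lag_seconds=6.0):
    """Z-score each run, average repeated runs, then drop the lagged frames.

    runs: list of time series (one per repeated run) for a single vertex.
    tr_seconds is an assumed repetition time; it is not given in the text.
    """
    z_runs = [zscore(run) for run in runs]
    # Average temporally aligned frames across runs of the same stimulus.
    averaged = [fmean(values) for values in zip(*z_runs)]
    shift = round(lag_seconds / tr_seconds)
    # After the shift, entry j is paired with the stimulus shown at j * TR.
    return averaged[shift:]
```

With these assumed values the 6-second lag corresponds to dropping the first three fMRI frames, so that each remaining frame is paired with the stimulus shown about 6 seconds earlier.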

fMRI-stimuli paired data. We follow MinD-Video [2] and use a sliding window approach to split the CC2017 dataset [30] into fMRI-video paired samples. Specifically, the fMRI-to-video decoding task is reformulated as generating a $T$-second video from $\alpha T$ seconds of fMRI data. Additionally, inspired by the two-streams hypothesis [11], which suggests that “what”, “where”, and “how” information is primarily encoded by different brain regions, we decompose the video into semantic, spatial, and motion components. These are represented by the initial frame (semantic and spatial) and optical flow (motion).

Assume there are $n$ frames of fMRI $\mathbf{F}_i \in \mathbb{R}^{n \times H_f \times W_f}$ and $m$ frames of video $\mathbf{V}_i \in \mathbb{R}^{m \times 3 \times H_v \times W_v}$ in the $i$-th window, where $H_f, W_f$ and $H_v, W_v$ denote the height and width of the unfolded fMRI image and of the video, respectively.
Each optical flow $\mathbf{O}_i^k \in \mathbb{R}^{H_v \times W_v \times 2}$ is then generated by MemFlow [4] using the initial frame $\mathbf{V}_i^k$ and the future frame $\mathbf{V}_i^{k+\lfloor \frac{m}{2} \rfloor}$, which can be formulated as

$$\mathbf{O}_i^k = \mathrm{MemFlow}\left(\mathbf{V}_i^k, \mathbf{V}_i^{k+\lfloor \frac{m}{2} \rfloor}\right) \tag{1}$$

where $1 \leq k \leq \lfloor \frac{m}{2} \rfloor$.
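To make the indexing in Eq. (1) concrete, the following small sketch enumerates the (initial frame, future frame) pairs that would be passed to the flow estimator for an $m$-frame window. The function name is hypothetical, and indices are 1-based to match the notation above.

```python
def flow_frame_pairs(m):
    """Return 1-indexed (initial, future) frame index pairs for an m-frame clip."""
    half = m // 2
    # k runs over 1..floor(m/2); the future frame is floor(m/2) frames later.
    return [(k, k + half) for k in range(1, half + 1)]
```

Because $k \leq \lfloor m/2 \rfloor$, the future index $k + \lfloor m/2 \rfloor$ never exceeds $m$, so every pair stays inside the window.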

3.2 DecoFuse pipeline

Visual input is essential for the brain, and many studies have explored how it processes this information. The well-known two-streams hypothesis suggests that the brain processes visual information through two distinct pathways: the “what” pathway (ventral stream) for recognizing objects and the “where/how” pathway (dorsal stream) for tracking their location and movement [11, 19]. Motivated by this, we propose a brain-inspired fMRI-to-video framework, DecoFuse, which decomposes a video into three components: semantic (“what”), spatial (“where”), and motion (“how”), separately decodes each component, and finally fuses them to generate the video.

fMRI encoder. To reduce information loss when encoding high-dimensional fMRI signals into a compact feature space, we apply fMRI-PTE [23], a ViT-based autoencoder pretrained on a large-scale fMRI dataset [18], as our encoder. Unlike ViT-based encoders that flatten and patchify voxels without preserving spatial information [1, 2], this approach retains local structure. Each 2D fMRI frame $\mathbf{F}_i^t \in \mathbb{R}^{H_f \times W_f}$ is divided into $p$ square patches, where each patch represents a token that captures the spatial relationships between neighboring voxels. These patchified fMRI images are then transformed into token embeddings $\mathbf{F}^t_{emb,i} \in \mathbb{R}^{(p+1) \times D_f}$ through a series of spatial attention blocks, with $D_f$ representing the embedding dimension. The model achieves high-precision reconstruction using only the [CLS] token, yielding an encoder that effectively retains the main information.
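The patchification step can be illustrated with a short sketch. It only shows how a 2D frame is cut into $p$ non-overlapping square patches in row-major order; the patch size is an assumption (the text does not give it), the function name is hypothetical, and the attention blocks of fMRI-PTE are not modeled.

```python
def patchify(frame, patch):
    """Split an H x W frame (list of rows) into flattened square patches.

    Patches are returned in row-major order, so neighboring vertices of the
    unfolded cortical image stay together inside one token.
    """
    h, w = len(frame), len(frame[0])
    if h % patch or w % patch:
        raise ValueError("frame size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([frame[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens
```

For example, with a $256 \times 256$ frame and an assumed patch size of 16, this gives $p = 256$ patch tokens; the extra token in the $(p+1)$ dimension is presumably the [CLS] token mentioned above.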

Stage 1: semantic and spatial decoding. In this stage, we decode semantic and spatial information from fMRI data to reconstruct static keyframes. Recent advances in image editing [33] demonstrate that high-level semantic latent codes can guide the semantic content of generated images, while updating feature maps allows precise control over spatial composition. Building on this insight, as illustrated in Fig. 2, we employ an fMRI-to-image pipeline that integrates semantic guidance and spatial control to enhance Stable Diffusion (SD) [24]. Based on the high-level and low-level framework from NeuroPictor [13], our approach further deepens the encoding process and augments the semantic encoder to improve decoding performance.

For high-level semantic decoding, we use a semantic encoder $\mathcal{E}_{sem}$ to transform fMRI features $\mathbf{F}_{emb}$ into semantic embeddings $\mathbf{E}_{sem} = \mathcal{E}_{sem}(\mathbf{F}_{emb})$, replacing the typical text embeddings $\mathbf{E}_{txt}$ in Stable Diffusion, where $\mathbf{E}_{sem}, \mathbf{E}_{txt} \in \mathbb{R}^{L_T \times D_T}$. Unlike NeuroPictor, which uses convolutional layers and MLPs in its encoder, we use transformer layers to capture semantic information related to the visual stimulus. This helps guide the diffusion model, ensuring the generated image accurately reflects the perceived objects and scene context.

For spatial decoding, we use a spatial encoder $\mathcal{E}_{spa}$ to directly adjust the feature maps in the U-Net architecture of the diffusion model. The spatial embeddings are derived as $\mathbf{E}_{spa} = \mathcal{E}_{spa}(\mathbf{F}_{emb})$, where $\mathbf{E}_{spa} = \{\mathbf{E}_{spa,(i)} \mid i = 1, \ldots, 13\}$, with $\mathbf{E}_{spa,(i)}$ representing the feature map from the $i$-th encoder block. The spatial encoder applies channel-wise convolutions, MLPs, and transformer layers to refine U-Net feature maps at various levels. The resulting spatial embeddings are processed through zero convolution layers and combined with the intermediate outputs of the SD model using a residual connection:

$$\tilde{\mathbf{E}}_{spa} = \mathbf{E}_{SD} + \alpha \mathcal{Z}(\mathbf{E}_{spa}) \tag{2}$$

where $\mathcal{Z}$ is the zero convolution layer, $\mathbf{E}_{SD}$ represents the latent codes of the SD U-Net, and $\alpha$ is a hyperparameter balancing high-level semantic guidance and fine-grained spatial details. This method effectively controls detailed spatial features, such as object positioning and structural layout.

By combining the semantic guidance $\mathbf{E}_{sem}$ and spatial guidance $\tilde{\mathbf{E}}_{spa}$ derived from fMRI, we can finely control the generated outputs, achieving both semantic and spatial reconstruction of static images.
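The residual fusion in Eq. (2) can be illustrated at a single spatial location. The sketch below assumes that the zero convolution is a $1 \times 1$ convolution whose weights and bias are initialized to zero, as in ControlNet; the function names and channel sizes are hypothetical. Since a $1 \times 1$ convolution acts independently at each location, one location is enough to show the behavior.

```python
def zero_conv_init(c_out, c_in):
    """Weights and bias of a 1x1 convolution, both initialized to zero."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


def fuse(e_sd, e_spa, weight, bias, alpha):
    """Compute e_sd + alpha * Z(e_spa) at a single spatial location.

    e_sd: c_out channel values from the SD U-Net; e_spa: c_in channel values
    from the spatial encoder; weight/bias: parameters of the 1x1 convolution Z.
    """
    z = [sum(w * x for w, x in zip(row, e_spa)) + b
         for row, b in zip(weight, bias)]
    return [s + alpha * v for s, v in zip(e_sd, z)]
```

At initialization the fused features equal the original SD features, so the frozen diffusion model is undisturbed at the start of training; the spatial signal only enters once the zero convolution has learned non-zero weights.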

Stage 2: motion decoding. Previous work [31] has demonstrated that motion information, such as optical flow, can be decoded from fMRI. Therefore, we propose a motion decoder that predicts the optical flow of a video based on fMRI and its first frame. In other words, the motion decoder functions by “asking” the frozen brain (fMRI) how objects in the first frame are moving in the viewed video. Moreover, we suggest that in a short video (e.g., a 2-second clip), only coarse movement can be reliably encoded in fMRI due to its low temporal and spatial resolution. As a result, our motion decoder $\mathcal{D}_M$ predicts only a single frame of low-resolution optical flow for each sample:

$$\hat{\mathbf{O}}_i^k = \mathcal{D}_M(\mathbf{V}_i^k, \mathbf{F}_i) \tag{3}$$

To accurately decode motion information, we follow prior image-to-motion work [28], which showed that optical flow classification outperforms direct prediction. First, we collect the flow vectors from all optical flows $\mathbf{O}_i$ in the training set and apply K-means clustering to obtain a codebook $\mathbf{B} \in \mathbb{R}^{N_{vec} \times 2}$, where $N_{vec}$ is the number of clusters. Each vector in the optical flow $\mathbf{O}_i$ is then quantized to its nearest vector in the codebook. The quantized optical flow $\tilde{\mathbf{O}}$ is defined as:

$$\tilde{\mathbf{O}}_{i,h,w}^k = \mathbf{B}_{c^\star}, \quad c^\star = \arg\min_c \left\| \mathbf{O}_{i,h,w} - \mathbf{B}_c \right\|_2^2 \tag{4}$$

More specifically, $\mathbf{V}_i^k$ and $\mathbf{F}_i$ are sent to their corresponding pretrained encoders, DINOv2 [20] and fMRI-PTE [23], to generate token-level embeddings. As shown in Fig. 2 (Stage 2), after separate CNN processing, the two embeddings are concatenated and passed through a CNN and a softmax layer to predict the probability distribution $\mathbf{P}_i^k \in \mathbb{R}^{H_o \times W_o \times N_{vec}}$ over the vectors in the codebook. The final prediction of optical flow is then given by $\hat{\mathbf{O}}_i^k = \mathbf{P}_i^k \mathbf{B}$.
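The two codebook operations described above, nearest-vector quantization (Eq. 4) and the probability-weighted prediction $\hat{\mathbf{O}} = \mathbf{P}\mathbf{B}$, can be sketched for a single pixel as follows. The codebook is assumed to be given (the K-means step is omitted), and the function names are hypothetical.

```python
def quantize(vec, codebook):
    """Return (index, codeword) of the codebook entry nearest to a 2D flow vector."""
    idx = min(range(len(codebook)),
              key=lambda c: (vec[0] - codebook[c][0]) ** 2
                            + (vec[1] - codebook[c][1]) ** 2)
    return idx, codebook[idx]


def expected_flow(probs, codebook):
    """Probability-weighted average of codewords for one pixel (one row of P times B)."""
    dx = sum(p * b[0] for p, b in zip(probs, codebook))
    dy = sum(p * b[1] for p, b in zip(probs, codebook))
    return dx, dy
```

The index returned by `quantize` would serve as the classification target during training, while `expected_flow` turns the softmax output back into a continuous flow vector at inference.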

Stage 3: video generation. Based on the pre-generated image and optical flow, we reconstruct the video using DragNUWA [32], a pretrained video diffusion model conditioned on motion. First, to ensure more stable video generation, we mask the optical flow using foreground detection from DINOv2 [20]. Additionally, to generate an $N_f$-frame video, we extend the single-frame optical flow by linearly dividing each vector into $N_f - 1$ sub-vectors.
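One possible reading of the masking and linear division steps is sketched below: background vectors are set to zero, and each remaining vector is divided evenly so that the $N_f - 1$ per-step flows sum to the original flow. The function name and data layout are hypothetical, and how the resulting sub-flows are actually passed to DragNUWA is not specified here.

```python
def split_flow(flow, n_frames, foreground=None):
    """Split one flow field into n_frames - 1 equal per-step flow fields.

    flow: H x W grid of (dx, dy) tuples.
    foreground: optional H x W grid of bools; background vectors are zeroed.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    steps = n_frames - 1
    step_flow = []
    for r, row in enumerate(flow):
        new_row = []
        for c, (dx, dy) in enumerate(row):
            keep = foreground is None or foreground[r][c]
            # Foreground vectors are divided evenly across the steps.
            new_row.append((dx / steps, dy / steps) if keep else (0.0, 0.0))
        step_flow.append(new_row)
    # Every step carries the same sub-vector, so the steps sum to the masked flow.
    return [[list(row) for row in step_flow] for _ in range(steps)]
```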

3.3 Differential neural encoding

Since DecoFuse is inspired by the two-streams hypothesis [11], we conduct neural encoding to examine whether and how the decoded embeddings differentially align with the two streams identified in biological studies. To prevent overfitting, we first apply Principal Component Analysis (PCA) to reduce the dimension of the semantic embedding $\mathbf{E}_{sem}$ and spatial embedding $\mathbf{E}_{spa}$, resulting in $\mathbf{E}^{PCA}_{sem}, \mathbf{E}^{PCA}_{spa} \in \mathbb{R}^{T \times D}$, where $T$ is the number of time points in the fMRI volumes, and $D$ is the reduced dimension. We then use ridge regression to predict the Gaussian-smoothed and flattened fMRI data $\mathbf{F} \in \mathbb{R}^{T \times N_v}$ based on semantic or spatial embeddings, where $N_v$ represents the number of voxels.

$$\hat{\mathbf{F}}_X = \mathrm{RidgeRegressor}(\mathbf{E}^{PCA}_X), \quad X \in \{sem, spa\} \tag{5}$$
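For readers unfamiliar with ridge regression, the sketch below shows the closed-form solution $W = (E^\top E + \lambda I)^{-1} E^\top F$ that a regressor of this kind computes. It is a dependency-free illustration only: the function names and the penalty `lam` are hypothetical, no intercept is fitted (inputs are assumed to be centered), the PCA step is omitted, and in practice a numerical library would be used instead.

```python
def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam * I) W = X^T Y by Gauss-Jordan elimination.

    X: T x D list of rows (reduced embeddings); Y: T x V list of rows (voxels).
    Returns W as a D x V list of rows. Use lam > 0 so the system is invertible.
    """
    T, D, V = len(X), len(X[0]), len(Y[0])
    A = [[sum(X[t][a] * X[t][b] for t in range(T)) + (lam if a == b else 0.0)
          for b in range(D)] for a in range(D)]
    B = [[sum(X[t][a] * Y[t][v] for t in range(T)) for v in range(V)]
         for a in range(D)]
    M = [A[r] + B[r] for r in range(D)]  # augmented matrix [A | B]
    for col in range(D):
        # Partial pivoting for numerical stability.
        piv = max(range(col, D), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(D):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[D:] for row in M]


def ridge_predict(X, W):
    """Predict voxel responses for each row of X using the weights W."""
    D, V = len(W), len(W[0])
    return [[sum(x[d] * W[d][v] for d in range(D)) for v in range(V)] for x in X]
```

Larger values of `lam` shrink the weights toward zero, which is what limits overfitting when the number of time points is small relative to the embedding dimension.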

Next, we compute the average temporal correlation $\mathbf{r}_X$ for the predicted and GT fMRI signals over a window size $T_w$:

$$\mathbf{r}_X = \frac{1}{N_w} \sum_{t=1}^{N_w} \mathrm{corr}\left(\mathbf{F}_{X,t:t+T_w}, \hat{\mathbf{F}}_{X,t:t+T_w}\right), \quad X \in \{sem, spa\} \tag{6}$$
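A per-voxel version of Eq. (6) can be sketched as follows. The sketch assumes that windows slide with a stride of one time point, so that the number of windows is $T - T_w + 1$; the text does not state the stride, and the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def windowed_corr(true, pred, window):
    """Average correlation between GT and predicted signals over sliding windows."""
    n_windows = len(true) - window + 1
    if n_windows < 1:
        raise ValueError("window is longer than the time series")
    scores = [pearson(true[t:t + window], pred[t:t + window])
              for t in range(n_windows)]
    return fmean(scores)
```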

Following the metric in [3], we differentiate the relative contributions of the semantic and spatial embeddings in predicting the brain’s dorsal and ventral streams for individual voxels:

$$\mathbf{p}_{spa}=\frac{\mathbf{r}_{spa}^{2}}{\mathbf{r}_{spa}^{2}+\mathbf{r}_{sem}^{2}}-0.5 \tag{7}$$

Here, $\mathbf{p}_{spa}$ ranges from -0.5 to 0.5. A value of $p_{spa}>0$ indicates that the spatial embedding better predicts the voxel, while $p_{spa}<0$ suggests that the semantic embedding provides a better prediction.
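As a small worked example of Eq. (7), the per-voxel preference can be computed as below. The function name `spatial_preference` and the convention of returning 0 when both correlations are exactly zero are assumptions for illustration; the correlation values are made up.

```python
import numpy as np

def spatial_preference(r_spa, r_sem):
    """Eq. (7): p_spa = r_spa^2 / (r_spa^2 + r_sem^2) - 0.5, per voxel."""
    r_spa2 = np.square(np.asarray(r_spa, dtype=float))
    r_sem2 = np.square(np.asarray(r_sem, dtype=float))
    total = r_spa2 + r_sem2
    safe_total = np.where(total > 0, total, 1.0)
    # Assumption: a voxel predicted by neither embedding gets no preference (0).
    return np.where(total > 0, r_spa2 / safe_total - 0.5, 0.0)

# Three illustrative voxels: spatially driven, semantically driven, balanced.
p_spa = spatial_preference([0.4, 0.1, 0.3], [0.1, 0.4, 0.3])
```

The first voxel gets a positive value, the second a negative value of the same magnitude, and the third a value of 0, matching the interpretation given above.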

3.4 Training Strategy

We perform training in both Stage 1 and Stage 2.

Stage 1. We freeze the SD model to retain its strong image synthesis capabilities, while finetuning the semantic, spatial, and fMRI encoders to extract semantic and spatial information from fMRI data. Since the image represents static information, we use a single fMRI frame for image decoding. The pipeline is trained with fMRI-image pairs $(\mathbf{F}_{i}^{1},\mathbf{V}_{i}^{k})$, where the frame index $k$, with $1\leq k\leq\lfloor\frac{m}{2}\rfloor$, is chosen at random as data augmentation over the initial frame.

Specifically, the input image $\mathbf{V}_{i}^{k}$ is first encoded into a latent representation $z_{0}$. The diffusion process then progressively adds noise to $z_{0}$ over $t$ time steps, resulting in a noisy latent $z_{t}$. During the denoising stage, the frozen U-Net predicts a denoised version of $z_{t}$, conditioned on the time step $t$, semantic embedding $\mathbf{E}_{emb}$, and spatial embedding $\mathbf{E}_{spa}$. The denoising loss for optimizing the SD latent is defined as follows:

$$\mathcal{L}_{s1}=\mathbb{E}_{z_{0},t,\mathbf{F}_{emb},\epsilon\sim\mathcal{N}(0,1)}\Big[\parallel\epsilon-\epsilon_{\theta}(z_{t},t,\mathbf{F}_{emb},\mathbf{E}_{emb})\parallel^{2}_{2}\Big] \tag{8}$$
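To make the objective in Eq. (8) concrete, the sketch below draws one Monte Carlo sample of the loss in NumPy. Several pieces are assumptions rather than details from the paper: the linear noise schedule with 1000 steps, the latent shape, and `dummy_eps_theta`, a hypothetical stand-in for the conditioned U-Net that ignores its conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a simple linear beta schedule; the schedule of the frozen SD
# model is not specified in this section.
num_steps = 1000
betas = np.linspace(1e-4, 2e-2, num_steps)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(z0, t, eps):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def denoising_loss(eps_theta, z0, f_emb, e_emb):
    """One-sample estimate of Eq. (8): squared error between true and predicted noise."""
    t = int(rng.integers(0, num_steps))   # random time step
    eps = rng.normal(size=z0.shape)       # eps ~ N(0, 1)
    z_t = add_noise(z0, t, eps)
    eps_pred = eps_theta(z_t, t, f_emb, e_emb)
    return float(np.sum((eps - eps_pred) ** 2))

def dummy_eps_theta(z_t, t, f_emb, e_emb):
    """Hypothetical noise predictor: always predicts zero noise."""
    return np.zeros_like(z_t)

z0 = rng.normal(size=(4, 8, 8))           # assumed latent shape
loss = denoising_loss(dummy_eps_theta, z0, f_emb=None, e_emb=None)
```

In training, the gradient of this quantity would flow only into the encoders that produce the conditioning inputs, since the U-Net itself is frozen in Stage 1.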

Stage 2. For training the motion decoder, we use fMRI-image-motion paired data $(\mathbf{F}_{i},\mathbf{V}_{i}^{k},\mathbf{O}_{i}^{k})$, where $1\leq k\leq\lfloor\frac{m}{2}\rfloor$ denotes data augmentation for random frame. We combine cross-entropy loss $\mathcal{L}_{entropy}$ and mean squared error (MSE) loss $\mathcal{L}_{MSE}$ to form the total loss $\mathcal{L}_{s2}=\mathcal{L}_{entropy}+\lambda_{2}\mathcal{L}_{MSE}$ for training the motion decoder $\mathcal{D}_{M}$.

$$\mathcal{L}_{entropy}=\mathrm{CrossEntropy}(\mathbf{P}_{i}^{k},\mathbf{c}_{i}^{k}) \tag{9}$$

$$\mathcal{L}_{MSE}=\parallel\mathbf{O}_{i}^{k}-\hat{\mathbf{O}}_{i}^{k}\parallel_{2}^{2} \tag{10}$$

where $\mathbf{c}_{i}^{k}$ is the codebook label for optical flow $\mathbf{O}_{i}^{k}$.
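The combined Stage 2 objective of Eqs. (9) and (10) can be illustrated with a short NumPy sketch. It assumes that the prediction is a vector of logits over codebook entries with a single integer label, and it uses an arbitrary weight `lambda2`; neither the granularity of the labels nor the value of $\lambda_{2}$ is given in this section, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy between codebook logits of shape (K,) and an integer label."""
    shifted = logits - logits.max()                 # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def stage2_loss(logits, label, flow_pred, flow_gt, lambda2=1.0):
    """L_s2 = L_entropy + lambda2 * L_MSE, with L_MSE the squared L2 norm of Eq. (10)."""
    l_entropy = cross_entropy(logits, label)
    l_mse = float(np.sum((flow_gt - flow_pred) ** 2))
    return l_entropy + lambda2 * l_mse

# Illustrative inputs: uniform logits over 4 codebook entries, 2-D flow vectors.
total = stage2_loss(np.zeros(4), 2, np.zeros(2), np.ones(2), lambda2=0.5)
```

With uniform logits the cross-entropy term equals $\log 4$, and the flow error contributes $0.5\times 2$, so the two terms can be inspected separately when tuning the weight.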

4 Experiments

Pre-training Dataset. The UK Biobank (UKB) [18] is a large-scale biomedical resource that gathers extensive genetic and health-related data from roughly 500,000 individuals across the UK. A subset of this repository is utilized, specifically the resting-state fMRI data from approximately 39,630 participants. Each participant provides a single session consisting of 490 time-point volumes.

Paired fMRI-video Dataset. Experiments used the CC2017 dataset [30], which pairs fMRI data with video stimuli. It includes data from three participants, with fMRI frames captured using a 3T MRI scanner at a 2-second repetition time (TR). The dataset covers about 3 hours of video and provides around 5,500 fMRI-stimulus pairs per subject.

Vision metrics. 1) Semantic-level. Following MinD-Video [2], we use both image-based and video-based classification metrics to assess semantic-level performance. For image classification, we rely on an ImageNet classifier. For the video-based metrics, we apply a similar classification framework, utilizing VideoMAE [27]. In both cases, the N-way top-K accuracy metric is employed, where for video the top-3 predicted classes are compared against the ground truth (GT) class. Specifically, the N candidates include the ground truth class along with N-1 randomly selected classes from the classifier’s full class set. This approach is consistent with the methodology used in MinD-Video. 2) Spatial-level. We evaluate spatial performance by calculating the ratio of foreground-background matching between the ground truth and decoded images. Foreground detection is performed using DINOv2 [20]. Let the matrix $\mathbf{M}\in\{0,1\}^{H\times W}$ represent the foreground mask, where $\mathbf{M}_{i,j}=1$ indicates pixel $(i,j)$ is detected as foreground, and $\mathbf{M}_{i,j}=0$ indicates background. The matching ratio $r_{m}$ is then calculated as follows:

$$r_{m}=1-\frac{\parallel\mathbf{M}_{GT}-\mathbf{M}_{pred}\parallel_{0}}{H\times W} \tag{11}$$

where $\mathbf{M}_{GT},\mathbf{M}_{pred}$ represent the foreground mask matrices of the GT and predicted images, respectively. The value of $r_{m}$ ranges from 0 to 1, with a value closer to 1 indicating better matching of the foreground and background between ground truth and predicted images, and a value closer to 0 indicating worse matching. 3) Pixel-level. We use the structural similarity index measure (SSIM) [29] to assess pixel-level decoding performance. For video evaluation, SSIM is computed for each frame of both the ground truth and reconstructed videos, with results averaged across frames.
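The N-way top-K accuracy and the matching ratio of Eq. (11) described above can be sketched as follows. The function names are hypothetical, the masks and class probabilities are toy values, and the sketch runs a single N-way trial; in practice one would presumably repeat the trial with fresh random distractors and average the outcomes.

```python
import numpy as np

def matching_ratio(M_gt, M_pred):
    """Eq. (11): fraction of pixels whose foreground/background label agrees."""
    M_gt = np.asarray(M_gt, dtype=bool)
    M_pred = np.asarray(M_pred, dtype=bool)
    H, W = M_gt.shape
    mismatches = np.count_nonzero(M_gt != M_pred)   # L0 norm of the difference
    return 1.0 - mismatches / (H * W)

def n_way_top_k(probs, gt_class, n, k, rng):
    """One N-way top-K trial: the GT class competes with n-1 random distractors."""
    others = np.delete(np.arange(len(probs)), gt_class)
    distractors = rng.choice(others, size=n - 1, replace=False)
    candidates = np.concatenate(([gt_class], distractors))
    # Rank the candidates by classifier probability and keep the top k.
    top_k = candidates[np.argsort(probs[candidates])[::-1][:k]]
    return bool(gt_class in top_k)

r_m = matching_ratio([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # one of four pixels differs

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.1, 0.1, 0.1])
hit = n_way_top_k(probs, gt_class=0, n=2, k=1, rng=rng)
```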

Motion metrics. We evaluate motion decoding performance using cosine similarity between the ground truth and decoded optical flow vectors. To handle scene changes that may produce invalid optical flow, we apply scene-change detection to remove such samples. Additionally, we mask all predicted optical flow using foreground detection and also mask ground truth values close to the zero vector to reduce noise. Specifically, the shortest cluster in the quantized codebook is set to the zero vector.
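A minimal sketch of the masked cosine similarity described above is shown below. It assumes that similarity is computed per pixel and then averaged over valid pixels, and it uses a hypothetical norm threshold `min_norm` to decide when a ground-truth vector is close to zero; the paper instead ties this to the shortest cluster of its quantized codebook, so this is an approximation.

```python
import numpy as np

def masked_flow_cosine(flow_gt, flow_pred, fg_mask, min_norm=1e-3):
    """Mean cosine similarity between GT and predicted flow over valid pixels.

    flow_gt, flow_pred: arrays of shape (H, W, 2); fg_mask: boolean (H, W).
    """
    gt_norm = np.linalg.norm(flow_gt, axis=-1)
    pred_norm = np.linalg.norm(flow_pred, axis=-1)
    # Keep foreground pixels whose GT flow is not (near) zero.
    valid = fg_mask & (gt_norm > min_norm) & (pred_norm > 0)
    if not valid.any():
        return float("nan")
    dots = (flow_gt * flow_pred).sum(axis=-1)
    return float(np.mean(dots[valid] / (gt_norm[valid] * pred_norm[valid])))

flow = np.ones((4, 4, 2))
mask = np.ones((4, 4), dtype=bool)
sim_same = masked_flow_cosine(flow, flow, mask)     # identical flow fields
sim_opp = masked_flow_cosine(flow, -flow, mask)     # opposite directions
```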

Implementation Details. For CC2017 [30], we used an fMRI window of $\alpha T$ = 2s to generate videos lasting $T$ = 2s, and videos were downsampled to 8 FPS. All training and inference processes were conducted on a single NVIDIA A100 GPU. Please refer to the Supplementary for detailed hyperparameter configurations and additional training information. Code and models will be released.

4.1 Verifying the ‘What’ and ‘Where’ Factor

| Subject | Method | Semantic-level 2-way | Semantic-level 50-way | Spatial-level $r_{m}$ |
|---|---|---|---|---|
| sub1 | MinD-Video [2] | 0.792 | 0.172 | 0.660 |
| sub1 | fMRI-PTE-video [14] | 0.793 | 0.169 | 0.652 |
| sub1 | NeuroPictor [13] | 0.808 | 0.195 | 0.687 |
| sub1 | DecoFuse(w/o what) | 0.774 | 0.130 | 0.704 |
| sub1 | DecoFuse(w/o where) | 0.792 | 0.171 | 0.668 |
| sub1 | DecoFuse(1 frame) | 0.816 | 0.201 | 0.690 |
| sub1 | DecoFuse | 0.824 | 0.208 | 0.706 |
| sub2 | MinD-Video | 0.784 | 0.158 | 0.669 |
| sub2 | fMRI-PTE-video | 0.780 | 0.159 | 0.648 |
| sub2 | NeuroPictor | 0.785 | 0.169 | 0.679 |
| sub2 | DecoFuse | 0.802 | 0.190 | 0.692 |
| sub3 | MinD-Video | 0.812 | 0.193 | 0.662 |
| sub3 | fMRI-PTE-video | 0.799 | 0.173 | 0.637 |
| sub3 | NeuroPictor | 0.803 | 0.194 | 0.671 |
| sub3 | DecoFuse | 0.816 | 0.215 | 0.689 |
Table 1: Results of fMRI-to-image decoding. Evaluations of semantic and spatial metrics are presented for all three subjects, with bolded results indicating performance surpassing all baselines.
Figure 3: Results of fMRI-to-image reconstruction. Our model successfully generates images that align well with the ground truth in both semantic and spatial aspects. By comparing the results with and without the semantic (“what”) and spatial (“where”) embeddings, we demonstrate that both embeddings significantly enhance the model’s ability to accurately reconstruct and localize objects within the image.

To isolate the “What” and “Where” components in decoded images from fMRI signals, we compare our method, DecoFuse, with other established fMRI-to-video decoding approaches, including MinD-Video [2], fMRI-PTE-video [14], and NeuroPictor [13].

Our findings, as in Fig. 3 and Tab. 1, show results across three subjects with both semantic and spatial metrics. (1) DecoFuse consistently outperforms the other methods in these metrics, capturing detailed semantic content and accurately decoding spatial locations. This shows DecoFuse’s ability to better align “What” (semantic content) and “Where” (spatial arrangement) from brain activity, setting a new benchmark in fMRI-to-video decoding. (2) At the semantic level, DecoFuse achieves significantly higher accuracy than the other methods. For example, in subject 1, DecoFuse achieves a 50-way accuracy of 0.208, compared to 0.172 for MinD-Video, 0.169 for fMRI-PTE-video, and 0.195 for NeuroPictor. This trend holds across subjects, with DecoFuse leading in both 2-way and 50-way accuracy, demonstrating its effectiveness in capturing semantic content from fMRI data. (3) At the spatial level, DecoFuse excels in preserving spatial locations. For instance, in subject 1, DecoFuse achieves a matching ratio of 0.706, outperforming MinD-Video (0.660), fMRI-PTE-video (0.652), and NeuroPictor (0.687), indicating better object localization.

To evaluate the impact of semantic and spatial features, we ablate these embeddings in DecoFuse respectively. DecoFuse(w/o where), which excludes spatial features, shows a clear drop in spatial metrics, confirming their importance. DecoFuse(w/o what), which removes semantic conditioning, experiences a significant decline in semantic accuracy but retains a high spatial score of 0.704. Additionally, to reduce randomness, DecoFuse generates 20 frames and selects the one with the least deviation (see Supplementary for details), while DecoFuse(1 frame) generates only a single frame. The results show that filtering one frame from multiple frames improves performance by reducing generation variance. Overall, DecoFuse excels in both semantic and spatial decoding, capturing fine fMRI details and generating high-quality visual reconstructions, surpassing previous methods.

4.2 Verifying the ‘How’ Factor

Figure 4: Results of fMRI-to-motion decoding. Our model effectively predicts optical flow based on fMRI and image data, demonstrating accurate motion decoding performance.
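Cosine similarity, by minimum foreground ratio (F2M reports a single value with no threshold):

| Subject | Method | Single reported value | 20% | 30% | 40% | 50% | 60% |
|---|---|---|---|---|---|---|---|
| sub1 | F2M [31] | 0.174 | - | - | - | - | - |
| sub1 | DecoFuse(w/o fMRI) | - | 0.051 | 0.049 | 0.026 | 0.016 | -0.042 |
| sub1 | DecoFuse | - | 0.139 | 0.147 | 0.150 | 0.212 | 0.179 |
| sub2 | F2M | 0.085 | - | - | - | - | - |
| sub2 | DecoFuse | - | 0.045 | 0.052 | 0.055 | -0.028 | -0.137 |
| sub3 | F2M | 0.110 | - | - | - | - | - |
| sub3 | DecoFuse | - | 0.106 | 0.129 | 0.153 | 0.144 | 0.050 |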
Table 2: Results of fMRI-to-motion decoding. The details of how F2M [31] computes cosine similarity are not provided. Therefore, we evaluate our method on optical flow samples in which the foreground occupies more than a given ratio of the frame (20% to 60%).

‘Disclaimer’. Since there is no direct way to make a fair comparison for the “How” factor, we adapt optical flow metrics for evaluation. However, optical flow is highly sensitive to various factors—occlusions, rapid motion and motion blur, changes in illumination, and even noise or artifacts—all of which commonly appear in generated images of all methods. As a result, it is challenging to quantify the exact impact these sensitivities might have on our comparisons. Nonetheless, optical flow still provides a useful baseline metric, offering a general gauge for assessing the effectiveness of each method.

To assess motion decoding performance, we measure cosine similarity between predicted and ground truth optical flow vectors across varying foreground coverage levels. In Tab. 2, each percentage (e.g., 20%, 30%) is a threshold on the proportion of the scene occupied by the foreground: only samples whose foreground exceeds that proportion are evaluated. Higher thresholds therefore show how well each model decodes motion for larger, more prominent objects. This approach reflects the human tendency to focus on movement associated with larger scene elements.
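The thresholded evaluation can be sketched in a few lines of plain Python. The function name, the per-sample inputs, and the toy values below are assumptions for illustration; the sketch simply averages the cosine similarity over the samples whose foreground ratio exceeds each threshold.

```python
def mean_similarity_by_threshold(fg_ratios, similarities,
                                 thresholds=(0.2, 0.3, 0.4, 0.5, 0.6)):
    """Average cosine similarity over samples whose foreground ratio exceeds each threshold."""
    results = {}
    for th in thresholds:
        kept = [s for r, s in zip(fg_ratios, similarities) if r > th]
        # None marks thresholds for which no sample qualifies.
        results[th] = sum(kept) / len(kept) if kept else None
    return results

# Toy data: three samples with increasing foreground coverage.
by_threshold = mean_similarity_by_threshold([0.25, 0.45, 0.65], [0.1, 0.2, 0.3])
```

Because higher thresholds keep fewer samples, the averages at 50% and 60% rest on smaller subsets and can be noisier than those at 20%.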

The motion decoding results in Fig. 4 and Tab. 2 compare DecoFuse with the fMRI-to-motion (F2M) method [31], using cosine similarity across these foreground thresholds. Exact comparisons are limited because the details of the F2M evaluation are not available, and our method computes optical flow at one-second intervals, which adds difficulty. Even so, DecoFuse exceeds the reported F2M score at larger foreground thresholds for two of the three subjects: for subject 1 at the 50% and 60% thresholds (0.212 and 0.179 versus 0.174), and for subject 3 at the 30% to 50% thresholds (0.129 to 0.153 versus 0.110). For subject 2, DecoFuse remains below the F2M score of 0.085 at every threshold. The pattern in subjects 1 and 3 is consistent with our hypothesis that DecoFuse aligns with human perceptual biases, prioritizing motion decoding for visually dominant areas, and it indicates that DecoFuse decodes motion most reliably for significant scene elements.

We also tested optical flow prediction after ablating fMRI input, which is equivalent to optical flow prediction based only on images. The results show that predictions based solely on images perform much worse compared to predictions made with both fMRI and images. This suggests that the model successfully learns motion information from the fMRI data.

4.3 More Ablation Study

Other impacting factors in decoding videos. We further evaluate the direct decoding of videos from fMRI by semantic-level accuracy and structural similarity (SSIM), following the metrics used in [2]. For each subject, we report both 2-way and 50-way semantic accuracy. As shown in Tab. 3, DecoFuse achieves the best performance in most cases, highlighting the improved accuracy of our decoded videos. These results affirm DecoFuse’s effectiveness in preserving both semantic and structural details from fMRI data. We also provide visualizations of the decoded frames in Fig. 5, highlighting the clarity and fidelity of our approach. Additionally, we assess video decoding based on images generated by NeuroPictor [13], denoted DecoFuse(NeuroPictor). For subject 1, this variant shows lower semantic metrics (2-way 0.839 versus 0.855; 50-way 0.204 versus 0.219), which further supports the improvement of our fMRI-to-image decoding pipeline.

Figure 5: Our fMRI-to-video decoding. Our model shows accurate decoding performance at both the semantic and pixel levels.
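| Subject | Method | Semantic-level 2-way | Semantic-level 50-way | Pixel-level SSIM |
|---|---|---|---|---|
| sub1 | LEA [22] | 0.825 | 0.149 | 0.137 |
| sub1 | MinD-Video [2] | 0.853 | 0.202 | 0.171 |
| sub1 | fMRI-PTE-video [14] | 0.851 | 0.214 | 0.193 |
| sub1 | DecoFuse(NeuroPictor) | 0.839 | 0.204 | 0.370 |
| sub1 | DecoFuse | 0.855 | 0.219 | 0.339 |
| sub2 | LEA | 0.826 | 0.148 | 0.145 |
| sub2 | MinD-Video | 0.841 | 0.173 | 0.171 |
| sub2 | fMRI-PTE-video | 0.834 | 0.192 | 0.182 |
| sub2 | DecoFuse | 0.846 | 0.193 | 0.306 |
| sub3 | LEA | 0.834 | 0.160 | 0.137 |
| sub3 | MinD-Video | 0.846 | 0.216 | 0.187 |
| sub3 | fMRI-PTE-video | 0.851 | 0.225 | 0.176 |
| sub3 | DecoFuse | 0.856 | 0.218 | 0.314 |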
Table 3: Results of fMRI-to-video decoding. Our method outperforms the baselines in most cases, with bolded results highlighting superior performance over all baselines.
Figure 6: Results of differential neural encoding. The differential encoding distribution for “what” and “where” is represented by $p_{spa}$ and visualized on the medial view of the brain surface. Red indicates regions that encode “where” information, while blue indicates regions that encode “what” information. These results align with the two-streams hypothesis [11].

Differential Neural Encoding. We explore how the brain encodes semantic and spatial information through distinct pathways using differential neural encoding, as visualized in Fig. 6, which highlights the relative contributions of semantic and spatial features in predicting fMRI responses. Our findings support the two-streams hypothesis [11]. In the primary visual cortex, where $p_{spa}$ approaches 0, both types of information are encoded equally. As processing progresses through the dorsal and ventral pathways, a bias emerges, favoring spatial or semantic cues, respectively. In higher-order regions, such as the frontal lobe, this distinction diminishes, supporting the idea that our approach decodes brain activity in a manner consistent with biological encoding processes.

5 Conclusion

This paper introduces DecoFuse, a novel fMRI-to-video decoding framework that separates video into semantic, spatial, and motion components. By independently decoding these aspects, DecoFuse provides a more accurate reconstruction of visual experiences, addressing the brain’s “what”, “where”, and “how” pathways. Unlike existing methods focused on semantic information, DecoFuse incorporates spatial and motion components for more realistic video reconstruction.

References

  • Chen et al. [2023a] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22710–22720, 2023a.
  • Chen et al. [2023b] Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity. arXiv preprint arXiv:2305.11675, 2023b.
  • Choi et al. [2023] Minkyu Choi, Kuan Han, Xiaokai Wang, Yizhen Zhang, and Zhongming Liu. A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains. Advances in Neural Information Processing Systems, 36:50408–50428, 2023.
  • Dong and Fu [2024] Qiaole Dong and Yanwei Fu. Memflow: Optical flow estimation and prediction with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19068–19078, 2024.
  • Fosco et al. [a] Camilo Fosco, Benjamin Lahner, Bowen Pan, Alex Andonian, Emilie Josephs, Alex Lascelles, and Aude Oliva. Brain netflix: Scaling data to reconstruct videos from brain signals. a.
  • Fosco et al. [b] Camilo Fosco, Benjamin Lahner, Bowen Pan, Alex Andonian, Emilie Josephs, Alex Lascelles, and Aude Oliva. Brain netflix: Scaling data to reconstruct videos from brain signals. b.
  • Gao et al. [2023] Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, and Yanwei Fu. Mind-3d: Reconstruct high-quality 3d objects in human brain. arXiv preprint arXiv:2312.07485, 2023.
  • Glasser et al. [2013] Matthew F Glasser, Stamatios N Sotiropoulos, J Anthony Wilson, Timothy S Coalson, Bruce Fischl, Jesper L Andersson, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R Polimeni, et al. The minimal preprocessing pipelines for the human connectome project. Neuroimage, 80:105–124, 2013.
  • Glasser et al. [2016] Matthew F. Glasser, Timothy S. Coalson, Emma Claire Robinson, Carl D. Hacker, John W. Harwell, Essa Yacoub, Kâmil Uğurbil, Jesper L. R. Andersson, Christian F. Beckmann, Mark Jenkinson, Stephen M. Smith, and David C. Van Essen. A multi-modal parcellation of human cerebral cortex. Nature, 536:171 – 178, 2016.
  • Gong et al. [2024] Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Changwei Wang, Rongtao Xu, Liang Hu, et al. Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction. arXiv preprint arXiv:2410.19452, 2024.
  • Goodale and Milner [1992] Melvyn A. Goodale and A.David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992.
  • Huang et al. [2021] Shuo Huang, Wei Shao, Mei-Ling Wang, and Dao-Qiang Zhang. fmri-based decoding of visual information from human brain activity: A brief review. International Journal of Automation and Computing, 18(2):170–184, 2021.
  • Huo et al. [2025] Jingyang Huo, Yikai Wang, Yun Wang, Xuelin Qian, Chong Li, Yanwei Fu, and Jianfeng Feng. Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation. In European Conference on Computer Vision, pages 56–73. Springer, 2025.
  • Li et al. [2024] Chong Li, Xuelin Qian, Yun Wang, Jingyang Huo, Xiangyang Xue, Yanwei Fu, and Jianfeng Feng. Enhancing cross-subject fmri-to-video decoding with global-local functional alignment. In European Conference on Computer Vision, pages 353–369. Springer, 2024.
  • Lin et al. [2022] Sikun Lin, Thomas Sprague, and Ambuj K Singh. Mind reader: Reconstructing complex images from brain activities. Advances in Neural Information Processing Systems, 35:29624–29636, 2022.
  • Liu et al. [2023] Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding from fmri. arXiv preprint arXiv:2302.12971, 2023.
  • Lu et al. [2024] Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, and Huiguang He. Animate your thoughts: Decoupled reconstruction of dynamic natural vision from slow brain activity. arXiv preprint arXiv:2405.03280, 2024.
  • Miller et al. [2016] Karla L. Miller, Fidel Alfaro-Almagro, Neal Kepler Bangerter, David L. Thomas, Essa Yacoub, Junqian Xu, Andreas J. Bartsch, Saâd Jbabdi, Stamatios N. Sotiropoulos, Jesper L. R. Andersson, Ludovica Griffanti, Gwenaëlle Douaud, Thomas W. Okell, Peter J. Weale, Iulius Dragonu, Steve Garratt, Sarah Hudson, Rory Collins, Mark Jenkinson, Paul M. Matthews, and Stephen M. Smith. Multimodal population brain imaging in the uk biobank prospective epidemiological study. Nature neuroscience, 19:1523 – 1536, 2016.
  • Milner and Goodale [2006] David Milner and Mel Goodale. The visual brain in action. Oup Oxford, 2006.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Ozcelik and VanRullen [2023] Furkan Ozcelik and Rufin VanRullen. Brain-diffuser: Natural scene reconstruction from fmri signals using generative latent diffusion. arXiv preprint arXiv:2303.05334, 2023.
  • Qian et al. [2023a] Xuelin Qian, Yikai Wang, Yanwei Fu, Xinwei Sun, Xiangyang Xue, and Jianfeng Feng. Joint fmri decoding and encoding with latent embedding alignment. arXiv preprint arXiv:2303.14730, 2023a.
  • Qian et al. [2023b] Xuelin Qian, Yun Wang, Jingyang Huo, Jianfeng Feng, and Yanwei Fu. fmri-pte: A large-scale fmri pretrained transformer encoder for multi-subject brain activity decoding. arXiv preprint arXiv:2311.00342, 2023b.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Sun et al. [2024] Jingyuan Sun, Mingxiao Li, Zijiao Chen, and Marie-Francine Moens. Neurocine: Decoding vivid video sequences from human brain activties. arXiv preprint arXiv:2402.01590, 2024.
  • Tong and Pratte [2012] Frank Tong and Michael S Pratte. Decoding patterns of human brain activity. Annual review of psychology, 63:483–509, 2012.
  • Tong et al. [2022] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems, 2022.
  • Walker et al. [2015] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In Proceedings of the IEEE international conference on computer vision, pages 2443–2451, 2015.
  • Wang et al. [2004] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Wen et al. [2017] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, and Zhongming Liu. Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. Cerebral Cortex, 28(12):4136–4160, 2017.
  • Yeung et al. [2024] Jacob Yeung, Andrew F Luo, Gabriel Sarch, Margaret M Henderson, Deva Ramanan, and Michael J Tarr. Neural representations of dynamic visual stimuli. arXiv preprint arXiv:2406.02659, 2024.
  • Yin et al. [2023] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.