Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering


Qian Tao [1,*], Xiaoyang Fan [1], Yong Xu [1], Xingquan Zhu [2], and Yufei Tang [2]
[1] South China University of Technology, Guangzhou, China
[2] Florida Atlantic University, Boca Raton, Florida, USA
* Corresponding author: [email protected]

arXiv:2501.12697v1 [cs.CV] 22 Jan 2025

ABSTRACT

Zero-shot visual question answering (ZS-VQA), an emerging critical research area, aims to answer visual questions without
being provided with training samples. Existing research in ZS-VQA has proposed leveraging either knowledge graphs or large
language models (LLMs) as external information sources to help VQA models comprehend images and questions. However,
LLMs often struggle to accurately interpret the specific meaning of a question. Meanwhile, although knowledge graphs have rich
entity relationships, it is challenging to effectively connect entities to individual image content for visual question answering.
In this paper, we propose a novel design that combines knowledge graphs and LLMs for zero-shot visual question answering. Our
approach uses LLMs' powerful understanding capabilities to accurately interpret image content through a strategic question
search mechanism. Meanwhile, the knowledge graph is used to expand and connect users' queries to the image content
for better visual question answering. An optimization algorithm is further used to determine the optimal weights for the loss
functions derived from different information sources, towards a globally optimal set of candidate answers. Experimental results
on two benchmark datasets demonstrate that our model achieves state-of-the-art (SOTA) performance. Both source code and
benchmark data will be released for public access.

Visual question answering (VQA) tasks present significant challenges. In these tasks, a model is given an image and a
corresponding question, and it must generate an answer based on the information in the image [1]. Unlike image captioning,
which involves describing visible content, VQA requires the model to interpret and respond to questions that often involve
understanding context, reasoning, and inferring details not explicitly present in the image. Researchers have developed
various approaches to train models for VQA using labeled training data. A typical approach is to employ attention
mechanisms to better fuse multimodal features [2, 3]. However, although these methods have improved the model's reasoning
and generalization abilities to a certain extent, they still necessitate retraining the model from scratch whenever new
objects, questions, or answers are introduced. To address this limitation, Zero-Shot Visual Question Answering (ZS-VQA)
has been proposed. ZS-VQA enables models to predict answers about objects, questions, or answers that were not present in
the training samples. These methods leverage a wide range of knowledge, from common sense to encyclopedic information
about specific elements within the image [4]. For example, some methods incorporate external knowledge, such as
Wikipedia [5–7] and ConceptNet [8–10], to enhance the information extraction capability of models, while others utilize
knowledge graphs to improve reasoning ability [11–14]. However, the space of open-domain images is vast, and external
knowledge sources cannot cover all possible information. Additionally, existing models predominantly focus on
understanding the known world and struggle to fully comprehend and reason about images or questions that are not
represented in the dataset.

In recent years, large language models (LLMs) like GPT-3 [15] have demonstrated strong generalization capabilities in
natural language processing tasks, such as information extraction and logical reasoning [16–18], due to their advanced
intellectual reasoning abilities [19]. However, current LLMs primarily rely on their own internal understanding to resolve
ambiguities and perform question reasoning [20]. This reliance can introduce unexpected biases, as the models may not fully
comprehend the underlying meaning of objects in images. Furthermore, LLMs are less adept at handling uncertain questions
and can be easily disrupted by noise in both images and text.

To address the aforementioned limitations, this paper proposes a novel model that combines knowledge graphs and large
language models for enhanced zero-shot visual question answering. The model comprises two main components: the answer
generation component powered by LLMs, and the answer selection component utilizing knowledge graphs. The LLM component
employs an image captioning model to convert images into corresponding textual descriptions and utilizes a question search
strategy to diversify the question set appropriately. These image descriptions and questions are then fed into an LLM,
which generates a candidate set of answers based on specific semantic representations. The knowledge graph answer
selection component models the [entity, relation, question] triplets derived from a knowledge graph. It defines three loss
functions related to answer generation, entity recognition, and relation identification. This component then refines the
candidate answers generated by the LLM by integrating them with the knowledge graph-based candidate set, enhancing the
overall accuracy and relevance of the answers.
Figure 1. Comparison between existing methods using a knowledge graph vs. our approach using LLMs and a knowledge graph
on VQA tasks. It is very common for an image to match the user's query at the semantic level, but not at the word level. For
example, the "pet" in the query actually refers to the "cat" in the image. Our approach combines LLMs and a knowledge graph
to benefit VQA: (1) LLMs precisely interpret the image content, and (2) the knowledge graph helps connect the user's query to
the image content through its rich entity relationships.

As illustrated in Fig. 1, the baseline method relies solely on a question classifier to establish an initial correspondence
between the image-related targets in the question, but it fails to accurately match the entities in the knowledge graph with
the targets in the image. In contrast, our method leverages LLMs to precisely extract target entities and identify
corresponding answers within the knowledge graph in a more detailed manner. To achieve effective integration and knowledge
fusion, Particle Swarm Optimization is employed to calculate the weights for different loss functions, and a scoring function
is used to select the optimal ratio for generating the highest-scoring candidate answer set. Overall, our proposed model
can produce visual question answers that are more aligned with real-world scenarios and exhibit higher accuracy.
In summary, we make the following contributions:

• We introduce a novel framework for zero-shot visual question answering that integrates large language models
and knowledge graphs effectively.

• By diversifying the input questions appropriately, we enhance the understanding capability of large language
models towards textual inputs, consequently improving the accuracy of answer predictions.

• Through optimization of the loss function weights associated with both the large language model and knowledge
graph, we achieve a more effective integration of the two modules, resulting in a candidate set of answers with
increased confidence.

Results

Comparison with the State-of-the-arts
The performance of each comparative method is measured by evaluating the results of unsupervised image captioning.
Table 1 and Table 2 present the VQA results on the ZS-F-VQA and F-VQA datasets, respectively. The best results are
indicated in bold. The analysis of the experimental results is as follows:
Analysis of the comparative experiments based on ZS-F-VQA: We selected the ZS-F-VQA dataset and conducted
two experimental settings during the testing phase. Table 1 indicates that models designed for traditional VQA tasks, such
as Up-Down, are not applicable to the ZS-F-VQA dataset, as their Hit@1 scores under the GZSL setting are less than 1%.
This suggests that the predicted answers are fundamentally incorrect.
Analysis of the comparative experiments based on F-VQA: Table 2 compares our method with state-of-the-art
models based on the F-VQA dataset. In all these settings, the results indicate that our model outperforms the corresponding
classifier-based or mapping-based models to varying degrees. The stable performance improvement achieved by our model
suggests that incorporating our method into other end-to-end frameworks in the context of generalized knowledge VQA can
also lead to consistent performance improvements.

Ablation study
We conducted ablation studies to validate each component of our model; the results are shown in Table 3.
Effect of LLMs: The model leverages LLMs by inputting both the generated image description and the question into
the LLM, enabling it to generate the corresponding answer. Experimental results demonstrate that, compared to the baseline
method, incorporating LLMs into the model significantly enhances performance. This improvement is likely due to
the model's ability to better utilize the extensive knowledge base inherent in LLMs, thereby enriching the composition of
the answer candidate set. However, since LLMs are designed for language processing and cannot directly handle images,
the accuracy of the generated answers relies on the quality of the image description generation. Therefore, it is crucial to
impose constraints on the input and output of LLMs to ensure accurate results.
Effect of QS: In this part, we restrict the input to the LLMs by appropriately expanding the question. Specifically, we

Table 1. Overall results (% for Hit@K) on the ZS-F-VQA dataset under the setting of ZSL/GZSL.

                 GZSL                                         ZSL
Methods          Hit@1↑   Hit@3↑   Hit@10↑  MRR↑     MR↓      Hit@1↑   Hit@3↑   Hit@10↑  MRR↑     MR↓
Up-down [21]     0.00     2.67     16.48    -        -        13.88    25.87    45.15    -        -
BAN [22]         0.22     4.18     18.55    -        -        13.14    26.92    46.90    -        -
MLP              0.07     4.07     27.40    -        -        18.84    37.85    59.88    -        -
SAN [23]         0.11     6.27     31.66    0.096    48.18    20.42    37.20    62.24    0.331    19.14
ZS-F-VQA [24]    29.39    43.71    62.17    0.401    29.52    46.87    62.00    78.14    0.572    12.22
R-ZS-F-VQA [14]  49.04    61.88    73.61    0.577    23.90    59.39    72.45    82.49    0.676    11.27
Ours             45.92    59.11    74.66    0.581    22.11    60.36    74.11    82.99    0.712    10.96

Table 2. Comparative results on the F-VQA dataset.

Methods          Hit@1↑        Hit@3↑        Hit@10↑       MRR↑          MR↓
Up-down [21]     34.81         50.13         64.37         -             -
BAN [22]         44.02         58.92         71.34         -             -
MLP              34.12         52.26         69.11         -             -
SAN [23]         41.62         58.17         72.69         0.605         14.75
ZS-F-VQA [24]    58.27         75.2          86.4          0.685         11.72
R-ZS-F-VQA [14]  66.81         80.3          88.66         -             -
F-VQA [12]       58.76         -             -             -             -
HQIPV [25]       43.14         59.44         72.20         -             -
Ours             67.52 ± 1.25  79.58 ± 2.12  89.12 ± 1.58  0.699 ± 0.01  10.63 ± 0.57
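
The metrics reported in these tables can be computed from the rank that each model assigns to the ground-truth answer. The following is a minimal sketch, not the authors' evaluation code: Hit@K is the share of questions whose ground truth appears within the top K predictions (as stated in the Evaluation section), MRR is the mean of the reciprocal ranks, and MR is the mean rank. The function names and the convention of assigning rank `len + 1` to a missing answer are assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence


def rank_of(truth: str, ranked_answers: Sequence[str]) -> int:
    """Return the 1-based rank of the ground truth in a ranked candidate list.

    Assumption for this sketch: an answer that is absent from the list
    receives rank len(ranked_answers) + 1.
    """
    for position, answer in enumerate(ranked_answers, start=1):
        if answer == truth:
            return position
    return len(ranked_answers) + 1


def rank_metrics(ranks: Iterable[int], ks: Sequence[int] = (1, 3, 10)) -> Dict[str, float]:
    """Compute Hit@K (in %), MRR and MR from 1-based ground-truth ranks."""
    ranks = list(ranks)
    if not ranks:
        raise ValueError("at least one rank is required")
    n = len(ranks)
    # Hit@K: percentage of questions whose ground truth is ranked within the top K.
    metrics = {f"Hit@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # MRR: mean reciprocal rank (higher is better).
    metrics["MRR"] = sum(1.0 / r for r in ranks) / n
    # MR: mean rank (lower is better).
    metrics["MR"] = sum(ranks) / n
    return metrics


if __name__ == "__main__":
    # Hypothetical ranks for four questions.
    print(rank_metrics([1, 2, 5, 20]))
```

Working from ranks rather than raw scores keeps the three metrics consistent with one another, since all of them are functions of the same ranking.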

Table 3. Ablation experiments on F-VQA.

Baseline LLM QS PSO Hit@1↑ Hit@3↑ Hit@10↑ MRR↑ MR↓


✓ 38.64 43.71 63.14 0.401 21.55
✓ ✓ 44.50 50.12 70.89 0.450 14.62
✓ ✓ ✓ 50.25 62.46 78.53 0.515 13.11
✓ ✓ 46.21 56.23 80.12 0.566 12.64
✓ ✓ ✓ 59.98 74.92 82.17 0.636 11.34
✓ ✓ ✓ ✓ 62.52 79.58 89.12 0.699 10.63

extract the words most relevant to the image from the question and search for the top-k nearest neighboring words in the
entire corpus. We then substitute these words into the question and screen the generated questions based on the overall
fluency of the sentence. Finally, these k+1 questions are input into the LLM. Experimental results show that, compared to
the LLM alone, LLM+QS performs better, with the Hit@k metric on the F-VQA dataset improving by an average of 8%. This
further demonstrates the necessity of imposing restrictions on the large language model to achieve better results.

Effect of weights optimization: In this part, due to the existence of two loss functions in the LLM and three loss
functions in the knowledge graph, it is not feasible to determine the values of ($\lambda_1, \lambda_2, \lambda_3,
\lambda_4, \lambda_5$) solely through parameter tuning. Therefore, we use the Particle Swarm Optimization (PSO) method to
calculate the weights of the different losses. By evaluating the scoring function of the answer candidate set, we search
for the optimal weights that best match the image and question. Experimental results show that using PSO for optimization
alone improves the model by an average of 10 percentage points in the Hit@k metric compared to the baseline.

Parameter sensitivity analysis
The parameter sensitivity analysis experiments were conducted to analyze two hyper-parameters: the search threshold
µ for question generation and the number of iterations k in the PSO.
Question search threshold µ: As shown in Table 4, we controlled the search threshold µ for question generation to
determine which similar questions could be selected into the question candidate set. Through parameter analysis, we found
that the model performs best when the search threshold for questions is set to 0.7.
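
The question search step and its threshold µ can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the three-dimensional `TOY_EMBEDDINGS` are invented for the example (the paper reports using 300-dimensional GloVe vectors), the keyword is taken to be the question word with the highest cosine similarity to any detected object, the neighbor threshold `delta` and `top_k` values are arbitrary, and the fluency screening step is omitted.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical toy embeddings, invented for this example only.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "pet": [0.9, 0.1, 0.0],
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.95, 0.05, 0.0],
    "animal": [0.8, 0.3, 0.0],
    "sofa": [0.0, 1.0, 0.0],
    "table": [0.1, 0.9, 0.1],
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_question(
    words: Sequence[str],
    image_objects: Sequence[str],
    embeddings: Dict[str, List[float]],
    mu: float = 0.7,
    delta: float = 0.9,
    top_k: int = 2,
) -> List[str]:
    """Return the original question plus up to top_k single-word variants."""
    words = list(words)
    original = " ".join(words)

    # Relevance of each question word: best similarity to any detected object.
    relevance: Dict[int, float] = {}
    for idx, word in enumerate(words):
        if word not in embeddings:
            continue
        sims = [
            cosine(embeddings[word], embeddings[obj])
            for obj in image_objects
            if obj in embeddings
        ]
        if sims:
            relevance[idx] = max(sims)

    if not relevance:
        return [original]

    # Replace only one word: the most image-relevant one, if it exceeds mu.
    key_idx = max(relevance, key=relevance.get)
    if relevance[key_idx] <= mu:
        return [original]
    keyword = words[key_idx]

    # Rank all other vocabulary words by similarity to the keyword.
    neighbours = sorted(
        (
            (cosine(embeddings[keyword], vec), candidate)
            for candidate, vec in embeddings.items()
            if candidate != keyword
        ),
        reverse=True,
    )

    variants = [
        " ".join(words[:key_idx] + [candidate] + words[key_idx + 1:])
        for sim, candidate in neighbours[:top_k]
        if sim > delta
    ]
    return [original] + variants


if __name__ == "__main__":
    question = ["what", "is", "the", "pet", "doing"]
    for q in expand_question(question, ["cat", "sofa"], TOY_EMBEDDINGS):
        print(q)
```

Raising `mu` makes the strategy more conservative: fewer questions are expanded, which is consistent with the sensitivity to µ reported in Table 4.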

Figure 2. Typical examples of VQA generated by our model.

Table 4. Sensitivity analysis on the search threshold µ.

        ZSL                        GZSL
µ       Hit@1   Hit@3   Hit@10     Hit@1   Hit@3   Hit@10
0.2     34.81   50.13   64.37      33.81   49.13   70.42
0.4     44.02   58.92   71.34      41.42   56.82   69.85
0.7     58.27   75.2    86.4       45.92   59.11   70.66
0.8     41.62   58.17   72.69      41.82   58.77   69.22
0.9     34.12   52.26   69.11      34.57   52.23   69.14

Figure 3. Sensitivity analysis for the hyperparameter k.

The number of iterations k in PSO: The hyper-parameter k in the Particle Swarm Optimization (PSO) algorithm acts as a
threshold: it takes effect when the score function S remains below the optimal score for k consecutive iterations.
Fig. 3 illustrates the effect of selecting different thresholds k on the model for different datasets. When k = 0, the
model does not use PSO evolutionary computation. The results show that better performance can be achieved when k = 3.

Qualitative results
In this section, we compared VQA using different methods and visualized their performance, as shown in Fig. 2. We
selected ZS-F-VQA as the VQA dataset and generated image descriptions using an image description method tailored for
VQA. Unlike other methods, our answer candidate set includes answers generated by the LLMs and those generated
based on the knowledge graph. The final answer candidate set consists of the union of LLM-generated answers and
knowledge graph answers. From the figure, it can be observed that the answers generated by LLMs rely heavily on the
image description, which inevitably depends on the pre-trained model for image description. Conversely, by using the
knowledge graph to extract entities and relationships relevant to the image and question, the model can accurately reflect
the relevance between the current image and the question. This dual approach ensures a more comprehensive and accurate
generation of answers.

Discussion

VQA based on external knowledge
Wang et al. [11] initially identified relevant concepts from images by integrating image data with knowledge base concepts,
matching them with semantic concepts in the knowledge base to generate queries for natural language questions. However,
this approach did not address the issue of answer bias: many answers might not have been encountered during training (i.e.,
unseen answers). To improve upon this, Wang et al. [12] introduced the FVQA method. This method employs LSTM
and image-question mapping to pinpoint significant content in images and generate queries based on both the image and the
knowledge base. Chen et al. [13] introduced a new answer-based zero-shot VQA segmentation dataset, ZS-VQA, designed for
the F-VQA dataset. Most recently, Wu et al. [14] focused on enhancing the accuracy of extracting entities and relationships
from the knowledge base. They achieved this by employing a specialized pre-training model for representing input
information and utilizing a contrastive learning-based common

feature space for information retrieval. Similarly, Fei et al. [26] developed a large-scale multimodal foundation model
utilizing cross-modal contrastive learning. They demonstrated that training data with weak semantic correlation helped
improve the generality and cognition of the model when performing the VQA task.

VQA based on LLMs
Unlike traditional VQA training tasks, the VQA paradigm utilizing Large Language Models (LLMs) directly relies on
natural language as an intermediate representation of images, thus eliminating the need for expensive pre-training. Liang
et al. [16] introduced TOA, where LLMs make an initial hypothesis based on their knowledge and then actively gather
the visual evidence needed to verify this hypothesis. Guo et al. [27] proposed Img2LLM, a plug-and-play module that
converts images into synthesized question-answer pairs derived solely from the current question image. Additionally, Lan
et al. [20] developed a question-guided visual question answering reasoning model to consider sentence fluency, semantic
completeness, and syntactic invariance. Other researchers have explored the task planning capabilities of LLMs. For
example, Gupta et al. [28] and Surís et al. [29] suggest that LLMs can generate programs for subtasks executed by predefined
sub-modules. This approach is supported by works such as those by Khot et al. [30], Huang et al. [31], and Wang et al. [?], who
have shown the effectiveness of decomposing complex tasks into manageable subtasks that LLMs can handle efficiently.

Conclusion
In this paper, we propose a novel method that combines a knowledge graph with LLMs for enhanced zero-shot visual
question answering. To address traditional large language models' lack of robustness to noise and their inability to
accurately comprehend the specific meaning of a question, our model consists of two main components: the knowledge graph
for answer aggregation and the LLMs for answer generation. The knowledge graph extracts entities and relationships from
images and questions, while the LLMs refine these entities, enabling more accurate image recognition and question
answering. Experimental results demonstrate that our model surpasses state-of-the-art methods on benchmark datasets.

Limitation
Although the evolutionary computation-based zero-shot visual question answering proposed in this paper effectively
enhances the model's ability to understand external knowledge by leveraging large language models, it still requires other
tools to bridge the gap between images and text. Therefore, effectively transferring image features to large language
models, for example through image captioning, will be an important direction for future improvements of the knowledge
question answering model proposed in this paper. In this paper, we only utilized OPT as the large language model. However,
employing the latest GPT-4 model would potentially yield higher performance for the model.

Societal impacts
In real-life scenarios, visual question answering models can greatly assist individuals with visual impairments
in enhancing their understanding of the natural environment. Additionally, they can provide them with additional learning
resources and tools to better comprehend the world around them. Furthermore, due to the significant cost of manually
annotating the large-scale image-text paired datasets required for training large models, zero-shot visual question
answering can effectively provide annotated datasets for multimodal large models. However, if a large number of images are
used in visual question answering models, it may lead to copyright infringement issues and the potential for malicious
tampering.

Methods

To address the issue of noise in generating answers with traditional large language models, we have developed a joint
framework that combines a large language model with a knowledge graph to generate a set of answer candidates, as illustrated
in Fig. 4. To further improve the model's granularity in image recognition, we utilized the particle swarm optimization
algorithm to evaluate the influence of various external knowledge sources on the answers.

Answer generation through large language model and question search strategy
Since LLMs are inherently designed for natural language processing, they cannot directly process image data as input.
Therefore, it is necessary to establish a connection between images and text. Inspired by Zhang et al. [32], we use VinVL
as a pre-trained model for image caption generation, which produces informative descriptions of the current image. These
descriptions are then used as supplementary inputs to the LLMs. However, due to the extensive information contained
in images, directly generating descriptions can result in sentences that may not be relevant to the question. To address
this issue, we tokenize each word of the question and input the tokens alongside the image vector:

$$(O, f_i) = \text{Faster-RCNN}(I_i), \quad i = 1, \dots, k \tag{1}$$

$$Y = \text{VinVL}((Q, O), f_i) \tag{2}$$

where $k$ represents the number of images, $f_i \in \mathbb{R}^m$ denotes the distribution representation in the
high-dimensional latent space obtained from the Faster-RCNN model, $O$ represents the detected objects, $Q$ represents the
question vector associated with the current image to be answered, and $Y$ represents the generated image description. To
ensure that each image has a unique caption, we incorporate visual concepts generated by a convolutional neural network as
part of the question and input them together with the image features into the pre-trained model for generating image
descriptions. Additionally, we repeat this process $M$ times, filter out duplicate sentences, and obtain $k$ distinct and
relevant captions for the images.

Figure 4. Overview of our proposed model. The model includes a VQA training module, an external knowledge module (LLMs
and knowledge graph), and a PSO training module.
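
The role of the PSO module in this overview, searching for the five loss weights $\lambda_1, \dots, \lambda_5$ by maximizing a score, can be illustrated with a generic particle swarm sketch. This is a textbook PSO under stated assumptions, not the authors' implementation: the paper does not report the swarm size, inertia, or acceleration coefficients, so the values below are placeholders; `toy_score` is a hypothetical stand-in for the answer-candidate scoring function; and the `patience` argument mirrors the idea of stopping once the best score has not improved for k consecutive iterations.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical target weights, used only to give the toy score a known optimum.
_TARGET = (0.2, 0.3, 0.1, 0.25, 0.15)


def toy_score(weights: Sequence[float]) -> float:
    """Stand-in score: highest (0.0) when weights equal the hypothetical target."""
    return -sum((w - t) ** 2 for w, t in zip(weights, _TARGET))


def pso_search(
    score_fn: Callable[[Sequence[float]], float],
    dim: int = 5,
    n_particles: int = 12,
    max_iters: int = 50,
    patience: int = 3,
    inertia: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Maximize score_fn over [0, 1]^dim; return (best weights, best score, history)."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [score_fn(p) for p in pos]
    best_idx = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[best_idx][:], pbest_score[best_idx]
    history = [gbest_score]
    stale = 0  # consecutive iterations without improvement of the global best

    for _ in range(max_iters):
        improved = False
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                vel[i][d] = (
                    inertia * vel[i][d]
                    + c1 * r1 * (pbest[i][d] - pos[i][d])
                    + c2 * r2 * (gbest[d] - pos[i][d])
                )
                # Clamp each weight to [0, 1] (an assumption of this sketch).
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            score = score_fn(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
            if score > gbest_score:
                gbest, gbest_score = pos[i][:], score
                improved = True
        history.append(gbest_score)
        stale = 0 if improved else stale + 1
        if stale >= patience:
            break

    return gbest, gbest_score, history


if __name__ == "__main__":
    weights, score, _ = pso_search(toy_score)
    print([round(w, 3) for w in weights], score)
```

In the actual model the score would come from evaluating the answer candidate set, which is far more expensive than `toy_score`; an early-stopping rule of this kind limits how many such evaluations are spent once the swarm has stopped improving.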

However, in LLMs, there are instances where the model may fail to fully grasp the true meaning of an object in an image.
Additionally, LLMs handle ambiguous questions poorly and can be easily disrupted by noise in the images or text. This
means that the model may not accurately identify which region of the image corresponds to the question or the specific
intent of the question. To address this, we employ a Question Search (QS) strategy. In this strategy, keywords in the
question are replaced with various alternatives to enable the LLMs to develop a more comprehensive understanding of the
question. To ensure that irrelevant words are not replaced, we calculate the similarity between each word in the image
description and the objects present in the image, allowing us to distinguish them:

$$P(q) = \arg\max\big(\mathrm{sim}(q, o_i)\big), \quad i = 1, \dots, m \tag{3}$$

where $m$ represents the number of targets extracted from each image, the $\mathrm{sim}$ function denotes the cosine
similarity between words, and $q$ represents the generated word. We set a threshold $\mu$, and if $P(q) > \mu$, we search
for the top-$k$ neighboring words of the current word for diverse replacements. To prevent changes in sentence structure
and semantics, we ensure that only one word in a question is replaced:

$$\hat{q} = q, q_1, q_2, \dots, q_k \quad \text{if } P(q) > \mu \text{ and } \mathrm{sim}(q_i, q) > \delta \tag{4}$$

where $q_i$ represents the neighboring word, and $\delta$ represents the similarity interval. Therefore, we can utilize the
generated question and the original question to construct a loss function $L_{se}$ to measure the diversity of the
generated questions and assess their quality:

$$L_{se} = -\sum_{i=1}^{m} \sum_{j=1}^{l} P(q_j)\, p(Q_i) \log \frac{p(Q_i)}{q(Q_i)} \tag{5}$$

where $m$ represents the number of questions, $P(q_j)$ denotes the probability of the $j$-th word being replaced in the
$i$-th question, $p(Q_i)$ represents the distribution of the original question for the $i$-th image, and $q(Q_i)$
represents the distribution of the generated target question for the $i$-th image.
We sequentially input the $k$ question prompts into the LLM and perform greedy decoding for each prompt. The LLM
is tasked with outputting a set of answer candidates along with their confidence scores. By inputting various image
descriptions and their corresponding questions into the frozen LLM, we obtain the respective answers and their associated
probabilities as follows:

$$\hat{A}_i, P(\hat{A}_i) = \mathrm{LLM}(Q_i, Y_i), \quad i = 1, \dots, n \tag{6}$$

where $n$ represents the number of generated answers. To ensure the fluency of the generated answers in terms of sentence
structure, the above model incorporates a probabilistic language model (LM) to assess whether the generated answers adhere
to grammatical and logical rules. By training a joint probability scoring function, the aim is to maximize the likelihood of
generating sentences. This can be expressed as follows:

$$F_{LM}(\hat{A}) = \ln P_{LM}(\hat{A}) = \ln \prod_{i=1}^{T} P(w_i \mid w_{i-1}, \dots, w_1) \tag{7}$$

where $w_i$ represents the $i$-th word of the answer $\hat{A}$, and $T$ represents the length of the answer. This approach
allows us

to obtain the scoring function for the answer $\hat{A}$. Utilizing the joint probability $F_{LM}(\hat{A})$ and the
probability of the currently generated answer, we then construct the loss function for the LLM:

$$L_{LLM} = \sum_{i=1}^{m} P(\hat{A}_i)\, F_{LM}(\hat{A}_i) \tag{8}$$

Enhanced answering through knowledge graph
In VQA, the knowledge graph can be represented as a triplet $G = [E, R, A]$, where $E$ denotes the entities, $R$ denotes
the relationships between entities, and $A$ denotes the answers. The knowledge graph provides relationships and entities
that can be used to compare corresponding components in images and sentences, as well as to provide answer candidates.
Since the question itself includes entities and relationships, we utilize a Bi-LSTM to extract contextual information from
the question, resulting in a representation of the question, denoted as $Q$. This allows us to better identify the specific
location of entities and relationships in the question. Furthermore, because the dimensions of the entity set, relationship
set, and question representation in the knowledge graph are different, we need to align the dimensions. This is achieved
using a forward propagation network consisting of fully connected layers and regularization. The process can be
represented using the following formulas:

$$Q = \mathrm{BiLSTM}(w_i) \tag{9}$$

$$R(y) = \mathrm{FN}(y) \tag{10}$$

where $y$ represents any set from the entity set, relationship set, and question set. This way, we obtain entity and
relationship representations from the knowledge graph and question in the same dimension. We aim to train the model to
ensure that the entities and relationships from the knowledge graph can be paired one-to-one with the corresponding sets in
the question. Therefore, we use a loss function commonly used in contrastive learning, the InfoNCE loss [33], as follows:

$$L_e = -\log \frac{\exp\big(Q \cdot R(e)^{+} / \tau\big)}{\sum_{i=1}^{k} \exp\big(Q \cdot R(e)_i / \tau\big)} \tag{11}$$

$$L_r = -\log \frac{\exp\big(Q \cdot R(r)^{+} / \tau\big)}{\sum_{i=1}^{k} \exp\big(Q \cdot R(r)_i / \tau\big)} \tag{12}$$

where $L_e$ and $L_r$ represent the loss functions for entities and relationships, respectively, and $\tau$ is a
hyper-parameter. However, the answer set in the knowledge graph has not been aligned with features. Therefore, we utilize
a pre-trained multimodal model such as LXMERT [34] to fuse the question features and image features, obtaining a shared
feature $F_{iq}$ for the question and image:

$$F_{iq} = \mathrm{LXMERT}(F_{img}, w_i) \tag{13}$$

Then, we pair this feature with the dimension-transformed answer set and obtain the corresponding loss function:

$$L_a = -\log \frac{\exp\big(F_{iq} \cdot R(a)^{+} / \tau\big)}{\sum_{i=1}^{k} \exp\big(F_{iq} \cdot R(a)_i / \tau\big)} \tag{14}$$

In this way, the model obtains the loss function for the answer set. At the same time, the initial candidate set for
answers can be obtained using the following:

$$\hat{a} = R(a)^{T} F_{iq} \tag{15}$$

Having obtained the loss functions $L_{se}$ and $L_{LLM}$ for generating answers using the LLM, as well as the loss
functions $L_e$, $L_r$, and $L_a$ for entities, relationships, and answers based on the knowledge graph, the final loss
function of the model can be expressed as:

$$L = \lambda_1 L_{se} + \lambda_2 L_{LLM} + \lambda_3 L_e + \lambda_4 L_r + \lambda_5 L_a \tag{16}$$

where the $\lambda_i$ are the weights to be optimized. See Appendix A.

Dataset
In this paper, we utilized two primary datasets to evaluate our model's performance. The first dataset, the F-VQA
dataset, is designed for supervised visual question answering. It consists of 2,190 images, 5,286 questions, and a knowledge
graph containing 193,449 facts. Each triplet [image, question, answer] in this dataset is constructed using information from
various public knowledge bases, requiring external knowledge for effective answering. The second dataset, the ZS-F-VQA
dataset, is used for zero-shot visual question answering. Derived from the F-VQA dataset, it is divided into five parts
with no overlap between the answers in the training and test sets. This careful partitioning allows for the creation of
zero-shot and fact-based evaluation scenarios, ensuring that the model's ability to generalize to unseen questions and
answers is rigorously tested.

Evaluation
We followed the settings of ZS-F-VQA and evaluated the performance of the models based on the comparison results
using the Hit@1, Hit@3, and Hit@10 metrics. Additionally, in the ablation experiments, we used Mean Reciprocal Rank
(MRR) and Mean Rank (MR) as additional evaluation metrics. Here, the Hit@k metric indicates whether the ground
truth value is ranked within the top k predicted answers.

Implementation details
The model utilizes a ResNet-152 [35] pre-trained on ImageNet [36] to extract visual features. Additionally, a Faster-RCNN [37]
model trained on the COCO dataset [38] is used to obtain object and visual features from the images. For the feature
extraction of questions and answers, GloVe vectors (https://nlp.stanford.edu/projects/glove/) are employed to transform
textual features into 300-dimensional embeddings, which are then input into a Bi-LSTM [39]. As for the large language
model, OPT [40] with 2 billion parameters is chosen. The Adam optimizer and a progressive learning rate warm-up strategy are
used for all models (https://github.com/huggingface/transformers/blob/v4.19.0/src/transformers/models/opt). The learning
rate for the first

five epochs is set as (epoch + 1) × 10⁻³, and thereafter the decay rate is set to 0.7 for subsequent epochs. The temperature parameter τ in the InfoNCE loss function is set to 0.01. To allow comparative experiments, we followed the experimental settings of the model in ref. 24 and used the standard dataset configurations (see footnote 3). These include 5 data splits based on images; for the experiment, we specified the top 500 candidate answers, which account for 94.30% of the total. We use 2 × RTX TITAN GPUs, and the memory requirement is 48 GB.

3 https://github.com/China-UK-ZSL/ZS-F-VQA

Because traditional VQA datasets are limited in comprehending information from images and questions and in adequately handling answers that do not appear in the training samples, we studied two configurations for the testing phase of ZS-VQA: ZSL and GZSL. In the ZSL setting, the answer candidate set for the test samples (i, q, a) consists of answers not present in the training set, i.e., the unseen dataset A_u. In the GZSL setting, the answer candidate set includes all answers from the training and test sets, i.e., A_u and A_s, where A_u ∩ A_s = ∅.

Particle Swarm Optimization to find the optimal weights
Selecting the weights for the corresponding loss functions is an important consideration for the model. If we train the model directly with the weights of the loss functions frozen, it is difficult to obtain the optimal solution, and the generated answer candidates may in turn affect the weights of the corresponding loss functions, ultimately impacting the accuracy of the answers. To address this, we designed a combination of multiple loss functions as the training objective. During training, we use the Particle Swarm Optimization (PSO) algorithm to select an adaptively weighted combination of loss functions. In each training iteration, the PSO algorithm calculates the optimal weights through evolutionary computation. PSO requires a scoring function to control the quality of the currently generated answers.

Firstly, for the knowledge graph, we can set a similarity score to measure the similarity between the question and the entities/relationships:

\[ \mathrm{sim}(\hat{y}) = \frac{R(\hat{y})^{T} Q}{|R(\hat{y})|\,|Q|} \tag{17} \]

Then, we can use the similarity scores for relationships and entities as a constraint to compute the candidate answer set based on the knowledge graph:

\[ A^{*} = \{\, a \mid \mathrm{sim}(\hat{e}) + \mathrm{sim}(\hat{r}) > \delta \,\wedge\, (\hat{e}, \hat{a}, \hat{r}) \in G \,\wedge\, \hat{e} \in E \,\wedge\, \hat{r} \in R \,\} \tag{18} \]

where ê and r̂ represent the entities and relationships mentioned in the question, â represents the initial candidate answer set, sim(ê) and sim(r̂) represent the similarity scores for entities and relationships, E represents the entity set, R represents the relationship set, and G represents the set of triplets that includes the three sets. δ is the threshold used to control the similarity score. Additionally, since the large language model (LLM) also generates corresponding answer candidate sets, we combine the candidate sets from the knowledge graph and the LLM to form the final answer candidate set.

Therefore, the final scoring function can be determined by both the predictions from the LLMs and the answers computed from the knowledge graph. If the predicted answer is in the answer candidate set A+, where A+ = A∗ ∪ Â, we increase the score; otherwise, we decrease the score. The scoring function for the predictions from the LLMs can be defined as follows:

\[ S_{LLM} = \lambda_{1} P(a) f_{LM}(a) \tag{19} \]

Since the newly generated questions are already involved in the answer generation process of the LLMs, and the loss functions L_LLM and L_se are not completely independent, we did not select a separate scoring function for question search. The scoring function for the answers computed from the knowledge graph is:

\[ S_{G} = R(a)^{T} F_{iq} + \beta \left( \mathrm{sim}(e) + \mathrm{sim}(r) \right) \tag{20} \]

Since the generated answers can come from both the predictions of the LLMs, denoted as Â, and the predictions of the knowledge graph, denoted as A∗, the overall score of the model is defined case by case:

\[ \mathrm{score}(a) = \begin{cases} S_{LLM}, & a \in \hat{A} \wedge a \notin A^{*} \\ S_{G}, & a \notin \hat{A} \wedge a \in A^{*} \\ S_{LLM} + S_{G}, & a \in \hat{A} \wedge a \in A^{*} \\ R(a)^{T} F_{iq} - b, & \text{otherwise} \end{cases} \tag{21} \]

where b is a fixed value subtracted from the score of predicted answers that do not belong to the answer set. During training, we evaluate the performance of each iteration by scoring the answer candidates it forms. If the value of the scoring function stays below the maximum value for k consecutive iterations, we consider that the current training phase is not progressing towards a better solution. In such cases, PSO is activated to search for the optimal objective function:

\[ S_{best} = \max(\mathrm{score}(a), S_{best}) \tag{22} \]

After that, we use a fitness function to score the weight combination of each particle during training, and the best result is saved at the end of the PSO training. In terms of quality, the higher the prediction on generated data, the better the quality; the fitness function measures the gap between the generated samples and the real samples.

Data availability
Data related to this paper can be downloaded from: https://github.com/China-UK-ZSL/ZS-F-VQA, https://github.com/wangpengnorman/FVQA?tab=readme-ov-file.
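The similarity score of Eq. (17) and the case-by-case score of Eq. (21) can be sketched as below. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names `cosine_sim` and `score`, the argument `base` (a stand-in for the term R(a)^T F_iq), the default penalty `b=1.0`, and all numbers in the usage lines are hypothetical, since the paper does not report a value for b.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors, as in Eq. (17)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def score(answer, s_llm, s_g, base, llm_set, kg_set, b=1.0):
    """Piecewise score of Eq. (21).

    s_llm   -- S_LLM from Eq. (19), assumed to be precomputed
    s_g     -- S_G from Eq. (20), assumed to be precomputed
    base    -- stand-in for R(a)^T F_iq
    llm_set -- candidate answers proposed by the LLM
    kg_set  -- candidate answers retrieved from the knowledge graph
    b       -- fixed penalty (hypothetical default)
    """
    in_llm = answer in llm_set
    in_kg = answer in kg_set
    if in_llm and in_kg:
        return s_llm + s_g
    if in_llm:
        return s_llm
    if in_kg:
        return s_g
    # answer is in neither candidate set: penalise it
    return base - b


# Toy usage with hypothetical candidate sets and scores
llm_answers = {"cat", "dog"}
kg_answers = {"dog", "bird"}
for a in ("cat", "bird", "dog", "fish"):
    print(a, score(a, s_llm=0.4, s_g=0.5, base=0.3,
                   llm_set=llm_answers, kg_set=kg_answers))
```

An answer supported by both sources receives the sum of the two scores, so agreement between the LLM and the knowledge graph is rewarded, while an answer found in neither set is pushed down by the penalty b.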

Algorithm 1: PSO-VQA for finding the optimal weights
Input: image feature F_img, question embedding Q, batch, epoch, particle_num, iteration, knowledge graph [e, r, a]
Output: the optimized answer candidate set and the corresponding weight parameters λ_i, i = 1, ..., 5

Initialize weight parameters (λ_1, λ_2, λ_3, λ_4, λ_5) with λ_1 + λ_2 + λ_3 + λ_4 + λ_5 = 1;
for i ← 1 to epoch do
    for v ← 1 to batch do
        L_e ← L(Q, e);
        L_r ← L(Q, r);
        L_e ← L(Q, e, F_img);
    end
    A+ ← update(A+, L(λ_1, λ_2, λ_3, λ_4, λ_5));
    S_LLM, S_G ← evaluate(r, e, a);
    score(a) ← (S_LLM, S_G);
    S_best ← max(S_best, score(a));
    if score(a) < S_best for K times continuously then
        for j ← 1 to iteration do
            for h ← 1 to particle_num do
                training with PSO;
                A+ ← update(A+, L(λ_1^{j,h}, λ_2^{j,h}, λ_3^{j,h}, λ_4^{j,h}, λ_5^{j,h}));
                S^{j,h} ← score(a)^{j,h};
            end
        end
    end
    {S^{j_1,h_1}, S^{j_2,h_2}, ...} ← sort({S^{j,h}});
    (λ_1, λ_2, λ_3, λ_4, λ_5) ← (λ_1^{j,h}, λ_2^{j,h}, λ_3^{j,h}, λ_4^{j,h}, λ_5^{j,h});
end
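The PSO step of Algorithm 1 can be sketched in plain Python as follows. This is a generic particle swarm search under stated assumptions, not the authors' implementation: the function names (`stalled`, `normalize`, `pso_weights`, `toy_fitness`), the inertia and acceleration coefficients, the clip-and-rescale step used to keep the weights summing to 1, and the toy fitness function are all hypothetical. In the paper the fitness would come from scoring the answer candidates; here a simple distance to made-up target weights stands in for it.

```python
import random


def stalled(scores, k):
    """True if the most recent k scores all stayed below the running best."""
    best, streak = float("-inf"), 0
    for s in scores:
        if s < best:
            streak += 1
        else:
            best, streak = s, 0
    return streak >= k


def normalize(w):
    """Keep weights positive and rescale them so that they sum to 1."""
    w = [max(x, 1e-12) for x in w]
    total = sum(w)
    return [x / total for x in w]


def pso_weights(fitness, dim=5, particle_num=10, iterations=30,
                inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for loss weights (lambda_1..lambda_dim) that maximise `fitness`."""
    rng = random.Random(seed)
    pos = [normalize([rng.random() for _ in range(dim)])
           for _ in range(particle_num)]
    vel = [[0.0] * dim for _ in range(particle_num)]
    pbest = [p[:] for p in pos]                 # best position per particle
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(particle_num), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best position overall

    for _ in range(iterations):
        for h in range(particle_num):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # standard PSO velocity update
                vel[h][d] = (inertia * vel[h][d]
                             + c1 * r1 * (pbest[h][d] - pos[h][d])
                             + c2 * r2 * (gbest[d] - pos[h][d]))
            pos[h] = normalize([p + v for p, v in zip(pos[h], vel[h])])
            fit = fitness(pos[h])
            if fit > pbest_fit[h]:
                pbest[h], pbest_fit[h] = pos[h][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[h][:], fit
    return gbest, gbest_fit


def toy_fitness(weights):
    """Hypothetical fitness: negative squared distance to made-up weights."""
    target = [0.4, 0.3, 0.1, 0.1, 0.1]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


# PSO is only triggered once the score has stalled for K evaluations
history = [0.50, 0.60, 0.55, 0.58, 0.59]
if stalled(history, k=3):
    weights, fit = pso_weights(toy_fitness)
    print(weights, fit)
```

Rescaling after every move is one simple way to respect the constraint that the five weights sum to 1; other projections onto the simplex would serve equally well.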

Code availability
The code to reproduce the experiments is available at https://github.com/Alexlabcode/Fan.

References
1. Wu, Q. et al. Visual question answering: A survey of methods and datasets. Comput. Vis. Image Underst. 163, 21–40 (2017).
2. Dhruv, S., Sanjay, P. & Chandan, R., K. Medfusenet: An attention-based multimodal deep learning model for visual question answering in the medical domain. Sci. Reports (2021).
3. Yalin, M., Shuyun, H., WenFang, C., Guodong, L. & Meng, T. Research on visual question answering based on dynamic memory network model of multiple attention mechanisms. Sci. Reports (2022).
4. Antol, S. et al. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, 2425–2433 (2015).
5. Wu, J., Lu, J., Sabharwal, A. & Mottaghi, R. Multi-modal answer validation for knowledge-based vqa. In Proceedings of the AAAI conference on artificial intelligence, vol. 36, 2712–2721 (2022).
6. Gui, L. et al. Kat: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 956–968 (2022).
7. Lin, Y. et al. Revive: Regional visual representation matters in knowledge-based visual question answering. Adv. Neural Inf. Process. Syst. 35, 10560–10571 (2022).
8. Liu, H. & Singh, P. Conceptnet—a practical commonsense reasoning tool-kit. BT technology journal 22, 211–226 (2004).
9. Gardères, F., Ziaeefard, M., Abeloos, B. & Lecue, F. Conceptbert: Concept-aware representation for visual question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, 489–498 (2020).
10. Marino, K., Rastegari, M., Farhadi, A. & Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, 3195–3204 (2019).
11. Wang, P., Wu, Q., Shen, C., Hengel, A. v. d. & Dick, A. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 1290–1296 (2017).
12. Wang, P., Wu, Q., Shen, C., Dick, A. & Van Den Hengel, A. Fvqa: Fact-based visual question answering. IEEE transactions on pattern analysis machine intelligence 40, 2413–2427 (2017).
13. Chen, Z. et al. Zero-shot visual question answering using knowledge graph. In The Semantic Web–ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings 20, 146–162 (Springer, 2021).
14. Wu, S., Zhao, G. & Qian, X. Resolving zero-shot and fact-based visual question answering via enhanced fact retrieval. IEEE Transactions on Multimed. (2023).
15. Brown, T. et al. Language models are few-shot learners. Adv. neural information processing systems 33, 1877–1901 (2020).
16. Liang, M., Wu, Y. et al. Toa: Task-oriented active vqa. Adv. Neural Inf. Process. Syst. 36 (2024).
17. Yeonghun, K. & Jihan, K. Chatmof: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models. Nat. Commun. (2024).
18. Shanqing, C. et al. Using large language models to accelerate communication for eye gaze typing users with als. Nat. Commun. (2024).
19. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. neural information processing systems 35, 24824–24837 (2022).
20. Lan, Y. et al. Improving zero-shot visual question answering via large language models with reasoning question prompts. In Proceedings of the 31st ACM International Conference on Multimedia, 4389–4400 (2023).
21. Anderson, P. et al. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6077–6086 (2018).
22. Kim, J.-H., Jun, J. & Zhang, B.-T. Bilinear attention networks. Adv. neural information processing systems 31 (2018).
23. Yang, Z., He, X., Gao, J., Deng, L. & Smola, A. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 21–29 (2016).
24. Ramesh, A. et al. Zero-shot text-to-image generation. In International conference on machine learning, 8821–8831 (Pmlr, 2021).
25. Lu, J., Yang, J., Batra, D. & Parikh, D. Hierarchical question-image co-attention for visual question answering. Adv. neural information processing systems 29 (2016).
26. Nanyi, F. et al. Towards artificial general intelligence via a multimodal foundation model. Nat. Commun. (2022).
27. Guo, J. et al. From images to textual prompts: Zero-shot visual question answering with frozen large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10867–10877 (2023).
28. Gupta, T. & Kembhavi, A. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14953–14962 (2023).
29. Surís, D., Menon, S. & Vondrick, C. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11888–11898 (2023).
30. Khot, T. et al. Decomposed prompting: A modular approach for solving complex tasks. In Decomposed prompting: A modular approach for solving complex tasks (2023).
31. Huang, W. et al. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of Machine Learning Research, 1769–1782 (2022).
32. Zhang, P. et al. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5579–5588 (2021).
33. He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738 (2020).
34. Tan, H. & Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. In Empirical Methods in Natural Language Processing, 5100–5111 (2019).
35. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
36. Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (Ieee, 2009).
37. Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. neural information processing systems 28 (2015).
38. Lin, T.-Y. et al. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755 (Springer, 2014).
39. Huang, Z., Xu, W. & Yu, K. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
40. Zhang, S. et al. Opt: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068 3, 19–0 (2023).

Competing interests
The authors declare no competing interests.
