
Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual Softmax Operation for MediaEval NewsImages 2023
Antonios Leventakis1,*, Damianos Galanopoulos1,* and Vasileios Mezaris1
1 Information Technologies Institute / Centre for Research and Technology Hellas, Thessaloniki, Greece

Abstract
Matching images to articles is challenging and can be considered a special version of the cross-media
retrieval problem. This notebook paper presents our solution for the MediaEval NewsImages 2023
benchmarking task. We investigate the performance of pre-trained cross-modal networks; specifically, we examine two pre-trained CLIP model variants and fine-tune one of them for domain adaptation. Additionally,
we utilize a data augmentation technique and a method for revising the similarities produced by either
one of the networks, i.e., a dual softmax operation, to improve our solutions’ performance. We report
the official results for our submitted runs and additional experiments we conducted to evaluate our runs
internally. We conclude that fine-tuning benefits the performance, and it is important to consider the
data’s nature when selecting the appropriate pre-trained CLIP model.

1. Introduction
In this paper, we deal with the text-to-image retrieval task adapted for the needs of the MediaEval
NewsImages 2023 task [1]. Nowadays, news sites publish multimedia content alongside their online news articles to better convey the articles’ message to readers. Consequently, associating news articles with multimedia content is crucial for several research tasks, such as cross-modal retrieval and disinformation detection. Our participation [2] in the NewsImages
2022 task showed that cross-modal networks trained on large sets of data, such as CLIP [3],
perform optimally. Based on that outcome, to deal with image retrieval using textual articles,
this year’s approach is based on pre-trained versions of CLIP [3]. To further adapt them
to this specific task, we fine-tune them with extra news article-based datasets to improve
the performance. Moreover, similarly to our previous works [2, 4], we adopt a dual-softmax
operation (DS) to recalculate the initially computed title-image similarities, an approach that in
some cases leads to improved performance. Lastly, we apply a data augmentation technique to the textual part of the data, both to increase the amount of training data and to improve model robustness through the diversity that the augmentation introduces.

2. Related Work
Text-image association is a challenging task that has gained a lot of interest in recent years.
The task has been extensively examined in the multimedia research community, e.g. see [5, 6], and there is consensus that the evolution of deep learning methods has boosted performance. Indicative relevant methods include VinVL [7], where an object detector is pre-trained to encode images and the visual objects within them, and a cross-modal model is trained to associate visual and
textual features. Regarding the NewsImages 2021 participations, HCMUS [8] proposed a solution
based on the pre-trained model CLIP [3] along with sophisticated text preprocessing, which
achieved the best performance. In NewsImages 2022 the best-performing approach [2] explored
CLIP’s capabilities alongside a trainable cross-modal network; and concluded that using CLIP
was, by a small margin, better than training a custom cross-modal network. Therefore, utilizing
the power of CLIP models seems to be the most suitable approach for the task.

3. Approach
3.1. Data, pre-processing and augmentation
To adapt the CLIP model to the specific needs of the task, we explore the fine-tuning capabilities
of this model. We preprocess the training, evaluation, and official test textual data in order to fully exploit our approach's power. We gathered around 4.8 million image-title pairs from the news domain to fine-tune the pre-trained CLIP model. Specifically, we utilize
the NYTimes800k [9], N24News [10] and BreakingNews [11] datasets along with data publicly
available in [Link] from news websites including Al Jazeera1, CNN2, BBC3, HuffPost News4
and Bloomberg5 to fine-tune the model. To internally evaluate our approach, we merge last year's NewsImages training data [12] and use it as an internal evaluation set. For each of these datasets we utilize a data augmentation technique to double the
amount of data available. Specifically, we exploit the paraphrasing ability of the Text-to-Text
Transformer [13] to create diverse but semantically similar text titles for every image. This
approach not only enables us to have more training data but also lets us compute the image-title
similarities of the evaluation and test datasets from both the original and the generated text
titles for each image. The two similarity values obtained for each article-image pair are then mean-pooled to produce our final predictions.
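As an illustration of this augmentation step, the sketch below generates a title paraphrase with a T5-style sequence-to-sequence model via the transformers library; the checkpoint name and the generation settings are placeholders and not necessarily the exact configuration we used.

```python
# Illustrative sketch of the title-paraphrasing augmentation (Sec. 3.1).
# "t5-paraphrase-checkpoint" is a placeholder name for a T5-based
# paraphrasing model; the actual checkpoint and settings are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-paraphrase-checkpoint"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase_title(title: str) -> str:
    """Generate one semantically similar rewrite of a news title."""
    inputs = tokenizer("paraphrase: " + title, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=5, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Every (image, title) pair then yields a second pair
# (image, paraphrase_title(title)), doubling the available data.
```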

3.2. Pre-trained models


As pre-trained cross-modal networks, we utilize two different implementations of the CLIP [3]
model in order to examine their performance. More specifically, we utilize the “ViT-L/14@336px”,
the largest version of the CLIP model currently available to the public by OpenAI, and as a second
variation, we utilize the “ViT-H/14” model of openCLIP [14], the open-source implementation
of CLIP. We use these models to calculate text and image feature representations. For a given
article, in order to retrieve the most relevant images from the test set, we calculate the cosine
similarity between the article’s title CLIP embedding and the embeddings of all test images, and
the top-100 most relevant images are returned as a ranked list, ordered from most to least relevant.
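A minimal sketch of this retrieval step, assuming the OpenAI clip Python package and placeholder inputs (titles, test_images), is given below:

```python
# Minimal retrieval sketch (Sec. 3.2) with the OpenAI "clip" package.
# `titles` and `test_images` are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

titles = ["Example article headline"]        # placeholder article titles
test_images = [Image.open("example.jpg")]    # placeholder test-set images

with torch.no_grad():
    image_input = torch.stack([preprocess(im) for im in test_images]).to(device)
    image_feats = model.encode_image(image_input)
    text_feats = model.encode_text(clip.tokenize(titles, truncate=True).to(device))

# Cosine similarity = dot product of L2-normalised embeddings.
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
similarity = text_feats @ image_feats.T      # shape: (num_articles, num_images)

# Ranked list of the (up to) 100 most relevant images per article.
top_k = min(100, similarity.shape[1])
ranked = similarity.topk(k=top_k, dim=1).indices
```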

3.3. Fine-tuned model


We also examine fine-tuning the “ViT-L/14@336px” CLIP model using the aforementioned
training datasets to improve its performance. We choose to keep the image encoder of the
model frozen and only train the text encoder’s parameters for one epoch with a batch size of 480
(performing gradient accumulation to handle GPU memory limitations). The Adam optimizer is
employed while the learning rate is set to 3e-7.
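A sketch of this fine-tuning setup is given below, assuming the standard CLIP contrastive (symmetric cross-entropy) objective and a hypothetical train_loader that yields preprocessed image tensors and tokenized titles; the gradient-accumulation split shown is likewise an assumption.

```python
# Sketch of the fine-tuning setup (Sec. 3.3): image encoder frozen, text
# encoder trained with Adam at lr 3e-7 and gradient accumulation.
# `train_loader` is a hypothetical DataLoader of (image_tensor, token_ids)
# batches; the accumulation split and the contrastive loss are assumptions.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

# Freeze the image encoder so that only the text side is updated.
for p in model.visual.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-7)

ACCUM_STEPS = 8  # assumption: effective batch size of 480 reached via accumulation

model.train()
for step, (images, token_ids) in enumerate(train_loader):
    img_f = model.encode_image(images.to(device))
    txt_f = model.encode_text(token_ids.to(device))
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * txt_f @ img_f.T
    labels = torch.arange(logits.shape[0], device=device)
    # Symmetric InfoNCE loss over the in-batch image-title pairs.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    (loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```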

1 [Link]   2 [Link]   3 [Link]   4 [Link]   5 [Link]
3.4. Dual-softmax similarity revision
At the retrieval stage, we calculate the similarities between all images from the test set and
all testing articles, resulting in a similarity matrix Z ∈ ℝ^{C×D}, where C is the number of testing article queries and D the number of test images. Following [2, 4], to revise the calculated similarities, we apply two cross-dimension softmax operations, one over the query dimension (dim = 0) and one over the image dimension (dim = 1), as follows: Z* = Softmax(Z, dim = 0) ⊙ Softmax(Z, dim = 1), where ⊙ denotes the element-wise product.
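A short PyTorch sketch of this revision, assuming the similarity matrix is available as a tensor Z, is:

```python
# Dual softmax revision (Sec. 3.4) of a similarity matrix Z of shape
# (num_article_queries, num_test_images), as used in the "_ds" runs.
import torch

def dual_softmax(Z: torch.Tensor) -> torch.Tensor:
    """Element-wise product of softmaxes over the query and image dimensions."""
    return torch.softmax(Z, dim=0) * torch.softmax(Z, dim=1)

# Usage: Z_star = dual_softmax(similarity)
```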

3.5. Inference-stage scores aggregation


As mentioned before, we also augment the test data’s textual part, resulting in two article-image
pairs for each original pair contained in the dataset. So, in all our runs (i.e. regardless of whether we use a pre-trained CLIP model or a fine-tuned one), we end up with two article-image similarity scores.
To aggregate these scores, we experimented with different aggregation methods (not presented
here for brevity), and we chose to perform mean pooling to obtain our final prediction.
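Assuming two similarity matrices, one computed from the original titles and one from the paraphrased titles, the mean-pooling aggregation reduces to:

```python
# Mean pooling (Sec. 3.5) of the similarity matrices obtained from the
# original and the paraphrased article titles (illustrative variable names).
import torch

sim_original = torch.rand(3, 5)      # placeholder: similarities from original titles
sim_paraphrased = torch.rand(3, 5)   # placeholder: similarities from paraphrased titles
final_similarity = torch.stack([sim_original, sim_paraphrased]).mean(dim=0)
```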

4. Submitted Runs and Results


We submitted five runs for each testing dataset (GDELT-P1, GDELT-P2, RT), as detailed below:

• Run #1 (ViT-H/14_ds): This uses the text and image embeddings of the “ViT-H/14” pre-
trained openCLIP model and calculates the cosine similarity between the embedding of
an article and all images. Then, the dual-softmax revision method is used to recalculate
the similarities. Finally, for each article, the 100 most relevant images are selected.
• Run #2 (ViT-L/14@336px): This uses the text and image embeddings of the “ViT-
L/14@336px” pre-trained CLIP model and calculates the cosine similarity between the
embedding of an article and all images. Then for each article, the 100 most relevant
images are selected.
• Run #3 (ViT-L/14@336px_ds): Similar to Run #2, but additionally applying the dual softmax revision to the computed similarities.
• Run #4 (ViT-L/14@336px_ft): We fine-tune the “ViT-L/14@336px” pre-trained model using the original and the augmented data from the collected datasets, and then retrieve images as in Run #2 using the fine-tuned model.
• Run #5 (ViT-L/14@336px_ft_ds): Similar to Run #4, but additionally applying the dual softmax revision to the computed similarities.

We present the official results on the three testing datasets and results from the internal
experiments we conducted in order to evaluate our methods and select our final runs. Recall@K,
where K = 5, 10, 50, 100, and Mean Reciprocal Rank (MRR) are used as evaluation metrics.
Table 1 (A) presents the results on the three testing datasets evaluated officially by the task
organizers. Run #1 (ViT-H/14 + DS) performs the best on the GDELT-P2 dataset on all metrics.
Run #4 (ViT-L/14@336px_ft) and Run #5 (ViT-L/14@336px_ft_ds) perform the best in MRR
terms on GDELT-P1 and RT respectively, while in Recall@K terms the results are mixed. The
dual softmax operation is beneficial on the RT dataset but not on GDELT-P1 and GDELT-P2, while CLIP fine-tuning (comparing Run #2 with Run #4) is beneficial on all datasets for the majority of the metrics but achieves the best results only on GDELT-P1.
The above official results contrast with the findings of our internal experiments, conducted
prior to the release of the official results. Table 1 (B) presents our internal results on the dataset
we used for selecting our best models and examining our runs’ performance. From these
Table 1
Evaluation results for the five submitted runs.

A. Official evaluation results on the three testing datasets.

Test dataset   Run      R@5       R@10      R@50      R@100     MRR
GDELT-P1       Run #1   0.76733   0.84000   0.93533   0.96000   0.62368
               Run #2   0.77800   0.85133   0.94267   0.96867   0.62431
               Run #3   0.76933   0.84467   0.93933   0.97067   0.62380
               Run #4   0.77933   0.84867   0.94533   0.97067   0.62972
               Run #5   0.76933   0.84400   0.93733   0.96867   0.62716
GDELT-P2       Run #1   0.69067   0.77600   0.90133   0.93200   0.56156
               Run #2   0.64133   0.73533   0.86933   0.92267   0.52082
               Run #3   0.63867   0.72667   0.87067   0.91533   0.51986
               Run #4   0.64400   0.73267   0.87800   0.92867   0.52615
               Run #5   0.64267   0.73200   0.87333   0.91933   0.52025
RT             Run #1   0.34400   0.43800   0.63333   0.71300   0.26153
               Run #2   0.33467   0.41100   0.60033   0.68633   0.24712
               Run #3   0.34733   0.43267   0.63000   0.71300   0.26048
               Run #4   0.33967   0.41700   0.60900   0.69300   0.25292
               Run #5   0.35400   0.43633   0.63300   0.71933   0.26162

B. Results on our internal evaluation dataset (NewsImages 2022 training data).

Run      R@5       R@10      R@50      R@100     MRR
Run #1   0.43720   0.51466   0.6919    0.75926   0.343
Run #2   0.45129   0.53137   0.71286   0.77548   0.354
Run #3   0.45503   0.53711   0.71261   0.77959   0.356
Run #4   0.44917   0.53561   0.71373   0.78047   0.356
Run #5   0.45603   0.5401    0.71673   0.78358   0.357

preliminary experiments, we concluded that Run #5 consistently outperforms the rest of the runs, i.e. the use of the “ViT-L/14@336px” model, our fine-tuning and the dual softmax revision all seemed to be beneficial for performance.
The contrast between our findings and the official results on the GDELT-P2 dataset is probably explained by the significant proportion (80%) of generated images in that dataset. Our
results suggest that the “ViT-H/14” model is more capable of handling such synthetic data than
the “ViT-L/14@336px”, but the reasons for this need to be further investigated.

5. Conclusion
In this work we proposed a solution for the MediaEval NewsImages task using state-of-the-art
text and image representations calculated from a pre-trained cross-modal network, a fine-
tuned cross-modal network and a similarity revision approach. We concluded from the official
evaluation results that for generated images the “ViT-H/14” model is more suitable for the task, while the “ViT-L/14@336px” model performs better for real images. Also, fine-tuning pre-trained models for domain adaptation seems beneficial in most cases, while employing a different CLIP version can significantly affect the final performance.

Acknowledgements This work was supported by the EU’s Horizon Europe and Horizon
2020 research and innovation programmes under grant agreements 101070190 AI4Trust and
101021866 CRiTERIA, respectively.
References
[1] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval
2023, in: Proceedings of the MediaEval Benchmarking Initiative 2023, CEUR Workshop Proceedings,
2024. URL: [Link]
[2] D. Galanopoulos, V. Mezaris, Cross-modal Networks and Dual Softmax Operation for MediaEval
NewsImages 2022, in: Working Notes Proceedings of the MediaEval 2022 Workshop, volume 3583,
CEUR Workshop Proceedings, 2023.
[3] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, et al.,
Learning Transferable Visual Models From Natural Language Supervision, in: Proc. of the 38th Int.
Conf. on Machine Learning (ICML), 2021.
[4] D. Galanopoulos, V. Mezaris, Are all combinations equal? Combining textual and visual features
with multiple space learning for text-based video retrieval, in: European Conference on Computer
Vision Workshops (ECCVW), Springer, 2022.
[5] N. Borah, U. Baruah, Image retrieval using neural networks for word image spotting—a review, in:
H. K. Deva Sarma, V. Piuri, A. K. Pujari (Eds.), Machine Learning in Information and Communication
Technology, Springer Nature Singapore, Singapore, 2023, pp. 243–268.
[6] K. Ueki, Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval, in: 2021
20th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2021,
pp. 628–634.
[7] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual
representations in vision-language models, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
[8] T. Cao, N. Ngô, T. D. Le, T. Huynh, N. T. Nguyen, H. Nguyen, M. Tran, HCMUS at MediaEval
2021: Fine-tuning CLIP for Automatic News-Images Re-Matching, in: Working Notes Proceedings
of the MediaEval 2021 Workshop, Online, 13-15 December 2021, volume 3181 of CEUR Workshop
Proceedings, [Link], 2021.
[9] A. Tran, A. Mathews, L. Xie, Transform and tell: Entity-aware news image captioning, in: IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] W. Zhen, S. Xu, Z. Xiangxie, Y. Jie, N24News: A New Dataset for Multimodal News Classification,
in: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022,
pp. 6768–6775.
[11] A. Ramisa, F. Yan, F. Moreno-Noguer, K. Mikolajczyk, BreakingNews: Article Annotation by Image and Text Processing, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, pp. 1072–1085.
[12] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2022,
in: Working Notes Proceedings of the MediaEval 2022 Workshop, volume 3583, CEUR Workshop
Proceedings, 2023.
[13] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of Machine Learning Research, 2020, pp. 1–67.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language
Supervision, in: ICML, 2021.
