
Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual Softmax Operation for MediaEval NewsImages 2023
Antonios Leventakis1,*, Damianos Galanopoulos1,* and Vasileios Mezaris1
1 Information Technologies Institute / Centre for Research and Technology Hellas, Thessaloniki, Greece

Abstract
Matching images to articles is challenging and can be considered a special version of the cross-media
retrieval problem. This notebook paper presents our solution for the MediaEval NewsImages 2023
benchmarking task. We investigate the performance of pre-trained cross-modal networks; specifically, we examine two pre-trained CLIP model variants and fine-tune one of them for domain adaptation. Additionally,
we utilize a data augmentation technique and a method for revising the similarities produced by either
one of the networks, i.e., a dual softmax operation, to improve our solutions’ performance. We report
the official results for our submitted runs and additional experiments we conducted to evaluate our runs
internally. We conclude that fine-tuning benefits the performance, and it is important to consider the
data’s nature when selecting the appropriate pre-trained CLIP model.

1. Introduction
In this paper, we deal with the text-to-image retrieval task adapted for the needs of the MediaEval
NewsImages 2023 task [1]. Nowadays, news sites publish multimedia content alongside their online news articles to better convey the articles’ message to readers. Consequently, associating news articles with multimedia content is crucial for several research tasks, such as cross-modal retrieval and disinformation detection. Our participation [2] in the NewsImages
2022 task showed that cross-modal networks trained on large sets of data, such as CLIP [3],
perform optimally. Based on that outcome, to deal with image retrieval using textual articles,
this year’s approach is based on pre-trained versions of CLIP [3]. To further adapt them
to this specific task, we fine-tune them with extra news article-based datasets to improve
the performance. Moreover, similarly to our previous works [2, 4], we adopt a dual-softmax
operation (DS) to recalculate the initially computed title-image similarities, an approach that in
some cases leads to improved performance. Lastly, we apply a data augmentation technique to the textual part of the data, both to increase the amount of training data and to improve model robustness through the diversity that the augmentation introduces.

2. Related Work
Text-image association is a challenging task that has gained a lot of interest in recent years.
The task has been extensively examined in the multimedia research community, e.g. see [5, 6], and there is consensus that the evolution of deep learning methods has boosted performance. Indicative relevant methods include VinVL [7], where an object detector is pre-trained to encode images and the visual objects within them, and a cross-modal model is trained to associate visual and
textual features. Regarding the NewsImages 2021 participations, HCMUS [8] proposed a solution
based on the pre-trained model CLIP [3] along with sophisticated text preprocessing, which
achieved the best performance. In NewsImages 2022 the best-performing approach [2] explored
CLIP’s capabilities alongside a trainable cross-modal network; and concluded that using CLIP
was, by a small margin, better than training a custom cross-modal network. Therefore, utilizing
the power of CLIP models seems to be the most suitable approach for the task.

3. Approach
3.1. Data, pre-processing and augmentation
To adapt the CLIP model to the specific needs of the task, we explore the fine-tuning capabilities
of this model. We preprocess the training, evaluation, and official test textual data in order to fully exploit our approach's power. We gathered around 4.8 million image-title pairs from the news domain to fine-tune the pre-trained CLIP model. Specifically, we utilize
the NYTimes800k [9], N24News [10] and BreakingNews [11] datasets along with data publicly
available in [Link] from news websites including Al Jazeera1, CNN2, BBC3, HuffPost News4
and Bloomberg5 to fine-tune the model. To internally evaluate our approach, we merge last year's NewsImages training data [12] and use it as an internal evaluation set. For each of these datasets we utilize a data augmentation technique to double the
amount of data available. Specifically, we exploit the paraphrasing ability of the Text-to-Text
Transformer [13] to create diverse but semantically similar text titles for every image. This
approach not only enables us to have more training data but also lets us compute the image-title
similarities of the evaluation and test datasets from both the original and the generated text
titles for each image. The two similarity values obtained for each article-image pair are then mean-pooled to produce our final predictions.
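As an illustration of this augmentation step, the sketch below generates a title paraphrase with a T5-style sequence-to-sequence model via the transformers library; the checkpoint name and the generation settings are placeholders and not necessarily the exact configuration we used.

```python
# Illustrative sketch of the title-paraphrasing augmentation (Sec. 3.1).
# "t5-paraphrase-checkpoint" is a placeholder name for a T5-based
# paraphrasing model; the actual checkpoint and settings are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-paraphrase-checkpoint"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase_title(title: str) -> str:
    """Generate one semantically similar rewrite of a news title."""
    inputs = tokenizer("paraphrase: " + title, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=5, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Every (image, title) pair then yields a second pair
# (image, paraphrase_title(title)), doubling the available data.
```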

3.2. Pre-trained models


As pre-trained cross-modal networks, we utilize two different implementations of the CLIP [3]
model in order to examine their performance. More specifically, we utilize the “ViT-L/14@336px”,
the largest version of the CLIP model currently available to the public by OpenAI, and as a second
variation, we utilize the “ViT-H/14” model of openCLIP [14], the open-source implementation
of CLIP. We use these models to calculate text and image feature representations. For a given
article, in order to retrieve the most relevant images from the test set, we calculate the cosine
similarity between the article’s title CLIP embedding and the embeddings of all test images, and
the top-100 most relevant images are returned as a ranked list, ordered from most to least relevant.
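A minimal sketch of this retrieval step, assuming the OpenAI clip Python package and placeholder inputs (titles, test_images), is given below:

```python
# Minimal retrieval sketch (Sec. 3.2) with the OpenAI "clip" package.
# `titles` and `test_images` are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

titles = ["Example article headline"]        # placeholder article titles
test_images = [Image.open("example.jpg")]    # placeholder test-set images

with torch.no_grad():
    image_input = torch.stack([preprocess(im) for im in test_images]).to(device)
    image_feats = model.encode_image(image_input)
    text_feats = model.encode_text(clip.tokenize(titles, truncate=True).to(device))

# Cosine similarity = dot product of L2-normalised embeddings.
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
similarity = text_feats @ image_feats.T      # shape: (num_articles, num_images)

# Ranked list of the (up to) 100 most relevant images per article.
top_k = min(100, similarity.shape[1])
ranked = similarity.topk(k=top_k, dim=1).indices
```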

3.3. Fine-tuned model


We also examine fine-tuning the “ViT-L/14@336px” CLIP model using the aforementioned
training datasets to improve its performance. We choose to keep the image encoder of the
model frozen and only train the text encoder’s parameters for one epoch with a batch size of 480
(performing gradient accumulation to handle GPU memory limitations). The Adam optimizer is
employed while the learning rate is set to 3e-7.
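A sketch of this fine-tuning setup is given below, assuming the standard CLIP contrastive (symmetric cross-entropy) objective and a hypothetical train_loader that yields preprocessed image tensors and tokenized titles; the gradient-accumulation split shown is likewise an assumption.

```python
# Sketch of the fine-tuning setup (Sec. 3.3): image encoder frozen, text
# encoder trained with Adam at lr 3e-7 and gradient accumulation.
# `train_loader` is a hypothetical DataLoader of (image_tensor, token_ids)
# batches; the accumulation split and the contrastive loss are assumptions.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

# Freeze the image encoder so that only the text side is updated.
for p in model.visual.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-7)

ACCUM_STEPS = 8  # assumption: effective batch size of 480 reached via accumulation

model.train()
for step, (images, token_ids) in enumerate(train_loader):
    img_f = model.encode_image(images.to(device))
    txt_f = model.encode_text(token_ids.to(device))
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * txt_f @ img_f.T
    labels = torch.arange(logits.shape[0], device=device)
    # Symmetric InfoNCE loss over the in-batch image-title pairs.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    (loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```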

1 [Link]   2 [Link]   3 [Link]   4 [Link]   5 [Link]
3.4. Dual-softmax similarity revision
At the retrieval stage, we calculate the similarities between all images from the test set and
all testing articles, resulting in a similarity matrix Z ∈ ℝ^{C×D}, where C is the number of testing article queries and D the number of test images. Following [2, 4], to revise the calculated similarities, we apply two cross-dimension softmax operations, one over the query dimension (dim = 0) and one over the image dimension (dim = 1), as follows: Z* = Softmax(Z, dim = 0) ⊙ Softmax(Z, dim = 1), where ⊙ denotes the element-wise product.
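A short PyTorch sketch of this revision, assuming the similarity matrix is available as a tensor Z, is:

```python
# Dual softmax revision (Sec. 3.4) of a similarity matrix Z of shape
# (num_article_queries, num_test_images), as used in the "_ds" runs.
import torch

def dual_softmax(Z: torch.Tensor) -> torch.Tensor:
    """Element-wise product of softmaxes over the query and image dimensions."""
    return torch.softmax(Z, dim=0) * torch.softmax(Z, dim=1)

# Usage: Z_star = dual_softmax(similarity)
```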

3.5. Inference-stage scores aggregation


As mentioned before, we also augment the test data’s textual part, resulting in two article-image
pairs for each original pair contained in the dataset. So, in all our runs (i.e. regardless of whether we use a pre-trained CLIP model or a fine-tuned one), we end up with two article-image similarity scores.
To aggregate these scores, we experimented with different aggregation methods (not presented
here for brevity), and we chose to perform mean pooling to obtain our final prediction.
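Assuming two similarity matrices, one computed from the original titles and one from the paraphrased titles, the mean-pooling aggregation reduces to:

```python
# Mean pooling (Sec. 3.5) of the similarity matrices obtained from the
# original and the paraphrased article titles (illustrative variable names).
import torch

sim_original = torch.rand(3, 5)      # placeholder: similarities from original titles
sim_paraphrased = torch.rand(3, 5)   # placeholder: similarities from paraphrased titles
final_similarity = torch.stack([sim_original, sim_paraphrased]).mean(dim=0)
```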

4. Submitted Runs and Results


We submitted five runs for each testing dataset (GDELT-P1, GDELT-P2, RT), as detailed below:

• Run #1 (ViT-H/14_ds): This uses the text and image embeddings of the “ViT-H/14” pre-
trained openCLIP model and calculates the cosine similarity between the embedding of
an article and all images. Then, the dual-softmax revision method is used to recalculate
the similarities. Finally, for each article, the 100 most relevant images are selected.
• Run #2 (ViT-L/14@336px): This uses the text and image embeddings of the “ViT-
L/14@336px” pre-trained CLIP model and calculates the cosine similarity between the
embedding of an article and all images. Then for each article, the 100 most relevant
images are selected.
• Run #3 (ViT-L/14@336px_ds): Similar to Run #2, but additionally applying the dual softmax revision to the computed similarities.
• Run #4 (ViT-L/14@336px_ft): We fine-tune the “ViT-L/14@336px” pre-trained model using the original and the augmented data from the collected datasets, and then retrieve images as in Run #2 using the fine-tuned model.
• Run #5 (ViT-L/14@336px_ft_ds): Similar to Run #4, but additionally applying the dual softmax revision to the computed similarities.

We present the official results on the three testing datasets and results from the internal
experiments we conducted in order to evaluate our methods and select our final runs. Recall@K,
where K = 5, 10, 50, 100, and Mean Reciprocal Rank (MRR) are used as evaluation metrics.
Table 1 (A) presents the results on the three testing datasets evaluated officially by the task
organizers. Run #1 (ViT-H/14 + DS) performs the best on the GDELT-P2 dataset on all metrics.
Run #4 (ViT-L/14@336px_ft) and Run #5 (ViT-L/14@336px_ft_ds) perform the best in MRR
terms on GDELT-P1 and RT respectively, while in Recall@K terms the results are mixed. The
dual softmax operation is beneficial on the RT dataset but not on GDELT-P1 and GDELT-P2, while CLIP fine-tuning (comparing Run #2 with Run #4) is beneficial on all datasets for the majority of the metrics but achieves the best results only on GDELT-P1.
The above official results contrast with the findings of our internal experiments, conducted
prior to the release of the official results. Table 1 (B) presents our internal results on the dataset
we used for selecting our best models and examining our runs’ performance. From these
Table 1
Evaluation results for the five submitted runs.

A. Official evaluation results on the three testing datasets.

Test dataset   Run      R@5       R@10      R@50      R@100     MRR
GDELT-P1       Run #1   0.76733   0.84000   0.93533   0.96000   0.62368
               Run #2   0.77800   0.85133   0.94267   0.96867   0.62431
               Run #3   0.76933   0.84467   0.93933   0.97067   0.62380
               Run #4   0.77933   0.84867   0.94533   0.97067   0.62972
               Run #5   0.76933   0.84400   0.93733   0.96867   0.62716
GDELT-P2       Run #1   0.69067   0.77600   0.90133   0.93200   0.56156
               Run #2   0.64133   0.73533   0.86933   0.92267   0.52082
               Run #3   0.63867   0.72667   0.87067   0.91533   0.51986
               Run #4   0.64400   0.73267   0.87800   0.92867   0.52615
               Run #5   0.64267   0.73200   0.87333   0.91933   0.52025
RT             Run #1   0.34400   0.43800   0.63333   0.71300   0.26153
               Run #2   0.33467   0.41100   0.60033   0.68633   0.24712
               Run #3   0.34733   0.43267   0.63000   0.71300   0.26048
               Run #4   0.33967   0.41700   0.60900   0.69300   0.25292
               Run #5   0.35400   0.43633   0.63300   0.71933   0.26162

B. Results on our internal evaluation dataset (NewsImages 2022 training data).

Run      R@5       R@10      R@50      R@100     MRR
Run #1   0.43720   0.51466   0.6919    0.75926   0.343
Run #2   0.45129   0.53137   0.71286   0.77548   0.354
Run #3   0.45503   0.53711   0.71261   0.77959   0.356
Run #4   0.44917   0.53561   0.71373   0.78047   0.356
Run #5   0.45603   0.5401    0.71673   0.78358   0.357

preliminary experiments, we concluded that Run #5 consistently outperforms the rest of the runs, i.e. the use of the “ViT-L/14@336px” model, our fine-tuning and the dual softmax revision all seemed to be beneficial for performance.
The contrast between our findings and the official results on the GDELT-P2 dataset is probably explained by the significant proportion (80%) of generated images in that dataset. Our
results suggest that the “ViT-H/14” model is more capable of handling such synthetic data than
the “ViT-L/14@336px”, but the reasons for this need to be further investigated.

5. Conclusion
In this work we proposed a solution for the MediaEval NewsImages task using state-of-the-art
text and image representations calculated from a pre-trained cross-modal network, a fine-
tuned cross-modal network and a similarity revision approach. We concluded from the official
evaluation results that for generated images the “ViT-H/14” model is more suitable for the task, while the “ViT-L/14@336px” model performs better for real images. Also, fine-tuning pre-trained models for domain adaptation seems beneficial in most cases, while employing a different CLIP version can significantly affect the final performance.

Acknowledgements This work was supported by the EU’s Horizon Europe and Horizon
2020 research and innovation programmes under grant agreements 101070190 AI4Trust and
101021866 CRiTERIA, respectively.
References
[1] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval
2023, in: Proceedings of the MediaEval Benchmarking Initiative 2023, CEUR Workshop Proceedings,
2024. URL: [Link]
[2] D. Galanopoulos, V. Mezaris, Cross-modal Networks and Dual Softmax Operation for MediaEval
NewsImages 2022, in: Working Notes Proceedings of the MediaEval 2022 Workshop, volume 3583,
CEUR Workshop Proceedings, 2023.
[3] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, et al.,
Learning Transferable Visual Models From Natural Language Supervision, in: Proc. of the 38th Int.
Conf. on Machine Learning (ICML), 2021.
[4] D. Galanopoulos, V. Mezaris, Are all combinations equal? Combining textual and visual features
with multiple space learning for text-based video retrieval, in: European Conference on Computer
Vision Workshops (ECCVW), Springer, 2022.
[5] N. Borah, U. Baruah, Image retrieval using neural networks for word image spotting—a review, in:
H. K. Deva Sarma, V. Piuri, A. K. Pujari (Eds.), Machine Learning in Information and Communication
Technology, Springer Nature Singapore, Singapore, 2023, pp. 243–268.
[6] K. Ueki, Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval, in: 2021
20th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2021,
pp. 628–634.
[7] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual
representations in vision-language models, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
[8] T. Cao, N. Ngô, T. D. Le, T. Huynh, N. T. Nguyen, H. Nguyen, M. Tran, HCMUS at MediaEval
2021: Fine-tuning CLIP for Automatic News-Images Re-Matching, in: Working Notes Proceedings
of the MediaEval 2021 Workshop, Online, 13-15 December 2021, volume 3181 of CEUR Workshop
Proceedings, [Link], 2021.
[9] A. Tran, A. Mathews, L. Xie, Transform and tell: Entity-aware news image captioning, in: IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] W. Zhen, S. Xu, Z. Xiangxie, Y. Jie, N24News: A New Dataset for Multimodal News Classification,
in: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022,
pp. 6768–6775.
[11] A. Ramisa, F. Yan, F. Moreno-Noguer, K. Mikolajczyk, BreakingNews: Article Annotation by Image and Text Processing, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, pp. 1072–1085.
[12] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2022,
in: Working Notes Proceedings of the MediaEval 2022 Workshop, volume 3583, CEUR Workshop
Proceedings, 2023.
[13] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of Machine Learning Research, 2020, pp. 1–67.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language
Supervision, in: ICML, 2021.
