A Labeled Ophthalmic Ultrasound Dataset With Medical Report Generation Based On Cross-Modal Deep Learning (arXiv:2407.18667v1)
a School of Electrical and Control Engineering, North China University of Technology, Beijing, 100144, China
b School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang, 110159, China
c Department of Ophthalmology, The Fourth Affiliated Hospital of China Medical University, Shenyang, 110005, China
Abstract
Ultrasound imaging reveals eye morphology and aids in diagnosing and treating eye diseases. However, interpreting diagnostic reports requires specialized physicians. We present a labeled ophthalmic dataset for the precise analysis and automated exploration of medical images along with their associated reports. It comprises three modalities of data, namely ultrasound images, blood flow information and examination reports, collected from 2,417 patients at an ophthalmology hospital in Shenyang, China, during 2018; all patient information is de-identified for privacy protection. To the best of our knowledge, it is the only ophthalmic dataset that contains these three modalities simultaneously. It consists of 4,858 images with corresponding free-text reports, which describe 15 typical imaging findings of intraocular diseases and their anatomical locations. Each image is accompanied by three kinds of blood flow indices measured at three specific arteries, i.e., nine parameter values describing the spectral characteristics of the blood flow distribution. The reports were written by ophthalmologists during clinical care. The proposed dataset is applied to generate medical reports.
∗ Corresponding author
Email addresses: [email protected] (Jing Wang),
[email protected] (Junyan Fan), [email protected] (Meng Zhou),
[email protected] (Yanzhu Zhang), [email protected] (Mingyu Shi)
1. Introduction
Medical imaging plays a crucial role in clinical diagnosis and treatment, especially in the field of ophthalmology. Popular imaging techniques include fundus photography, optical coherence tomography (OCT), and fluorescein angiography of the retina. Adequate interpretation of an ophthalmic examination requires professional ophthalmologists or radiologists. Due to increasing workloads, ophthalmologists face significant time and effort constraints in analyzing medical images and generating diagnostic reports. Moreover, the subjective expertise of different ophthalmologists may lead to variations in the interpretation of the same image. Existing medical report generation research has mostly focused on radiographic images, particularly chest X-rays. Various medical report generation datasets have been released for different medical modalities, such as fundus fluorescein angiography (FFA) images [1], lung CT scans [2], and color fundus photography (CFP) [3]. However, there is a lack of research specifically on ophthalmic ultrasound images and their report generation. Moreover, most existing datasets are in English, leaving a significant research gap in Chinese medical report generation. Therefore, the development of annotated ophthalmic ultrasound image datasets with corresponding reports is necessary for artificial intelligence (AI) based ophthalmic diagnosis.
Inspired by the image captioning task, an increasing number of researchers have applied this approach to the generation of medical reports. Its primary objective is to provide annotations for the subsequent diagnosis and treatment of diseases. Many deep learning-based automatic generation methods have been proposed to lighten the workload of doctors [4, 5, 6, 7, 8]. These methods usually use a convolutional neural network (CNN) [9, 10] to extract visual features and a recurrent neural network (RNN) to predict reports [11]. They have provided a preliminary feasible scheme for medical report generation (MRG). Later, text-image attention mapping was explored to explain the automatic generation process, although its accuracy is not yet known [12]. There still exist several challenges that limit the practical application of deep learning methods: (1) the construction of large and specific medical datasets is time-consuming and labor-intensive; (2) due to the complexity of medical image interpretation, the accuracy achieved by deep learning models does not reach the level of specialized doctors.
In this paper, we propose a labeled ophthalmic dataset of ocular ultrasound images, text reports and blood flow information. To the best of our knowledge, it is the only ophthalmic dataset that contains all three modalities simultaneously. A comprehensive experiment is conducted on this dataset, and the result demonstrates that the proposed dataset is suitable for training various supervised learning models that concern cross-modal medical data. The main motivations for the dataset construction are as follows. From the perspective of scientific research, it is helpful for developing medical AI learning algorithms that fuse computer vision and natural language processing; applications include, but are not limited to, cross-modal report generation. From the perspective of clinical applications, this dataset provides ultrasound images that give insight into the morphology and structure of the eye. It covers 15 common ophthalmic diseases, such as retinal detachment, choroidal detachment, vitreous stellate degeneration, vitreous hemorrhage, endophthalmitis and vitreous opacity. The rich variety of diseases broadens the applicability of learning models for initial diagnosis and treatment.
The main contributions of this paper are summarized as follows:
• A large-scale medical dataset is constructed, comprising 4,858 eye ultrasound images and their corresponding Chinese reports. All the data were collected from real-world clinical practice, and the reports accurately represent the writing patterns of ophthalmologists. The dataset facilitates cross-modal learning and report generation whose text pattern is closely aligned with clinical practice.
• The prediction accuracy of the generated medical reports is evaluated based on NLG metrics. The results show that our dataset can be applied to medical report generation, which is helpful to drive AI-based ophthalmic medical diagnosis.
The rest of the paper is organized as follows: Section 2 reviews existing medical datasets and medical report generation (MRG) methods. Section 3 presents the construction methodology of the labeled ophthalmic ultrasound dataset. Section 4 gives a comprehensive report generation experiment based on the cross-modal memory network. Section 5 draws conclusions as well as future perspectives.
2. Related Work
Many kinds of medical images have been widely utilized to develop AI-aided diagnosis systems. Here we review the existing medical image datasets and MRG methods.
There are also five retinal datasets including retinal images and text. FFA-IR [1] provides interpretable annotations by labeling 46 kinds of foci in a total of 12,166 regions, as well as fundus fluorescein angiography (FFA) images and reports. It plays an important role in identifying diseases and generating reports. In contrast to FFA-IR, DEN [3] primarily consists of color fundus photography (CFP) images (13,898 CFP and 1,811 FFA). STARE [19] released a total of 397 images, including CFP and FFA, in 2004. However, the text provided with these images consists of short free-text diagnostic labels rather than observational reports of image findings; it is therefore not suitable for training medical report generation models. DIARETDB1 [20] has good annotations of lesion location and size, but the number of CFP images is limited. MESSIDOR [21] includes 1,200 CFP images and 600 fine-grained French reports. In summary, most current datasets target chest diseases, and little work has been done on datasets for eye diseases. The existing datasets for eyes consist basically of color fundus images, which are generally used for screening, diagnosis and monitoring of fundus diseases such as retinopathy and macular degeneration. However, they make it difficult to assess some intraocular conditions, such as vitreous disorders and retinal detachment. Unlike the existing medical report datasets, ours builds image-report pairs from clinically collected ocular ultrasound images, Chinese reports and additional blood flow parameter information. It will play an important role in the diagnosis of ocular diseases and in automated report generation studies.
Table 1: Summary of medical datasets
Name of Dataset        Image Modality      Number of Images   Report Cases        Report Language
RDIF [14]              Kidney biopsy       1,152              144                 English
COV-CHR [2]            Lung CT scans       728                728                 English/Chinese
Fetal Ultrasound [17]  Fetal ultrasound    2,800              2,800               English
PEIR Gross [12]        Gross lesions       7,442              7,442               English
Open-IU [13]           Chest X-ray         7,470              2,955               English
MIMIC-CXR [15]         Chest X-ray         377,110            276,778             English
PADCHEST [16]          Chest X-ray         160,868            22,710              Spanish
CX-CHR [18]            Chest X-ray         45,598             40,410              Chinese
TJU [22]               Chest X-ray         19,985             19,985              Chinese
STARE [19]             CFP+FFA             397                397                 English
DEN [3]                CFP+FFA             15,709             15,709 (keywords)   English
DIARETDB1 [20]         CFP                 89                 89                  English
MESSIDOR [21]          CFP                 1,200              587                 French
FFA-IR [1]             FFA                 1,048,584          10,790              English/Chinese
Ours                   Ocular ultrasound   4,858              4,858               Chinese
capacity during the decoding procedure for the retention and utilization of relevant information. Zhang et al. [26] improved the attention mechanism to guarantee that the model focuses on the correct region of the medical image, and it also provided interpretable analysis of the diagnostic process. Wang et al. [5] proposed a text-image embedding model that generates medical reports using the attention mechanism. The model incorporates an attention distribution graph and text embedding information to improve classification accuracy. Zeng et al. [27] generated ultrasound image reports using a target detection algorithm: it detects the lesion region and generates a medical diagnostic report by encoding and decoding the ultrasound image vector.
tion. During report generation, visual features are extracted from the ultrasound images by a traditional CNN [10] and injected into the image encoder. The encoded image features are then fed into a sentence decoder together with embedded text features to generate reports. Usually the sentence decoder is designed as an LSTM network [11] or a Transformer decoder [28], in which cross-modal attention is utilized to align image and text information.
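As a rough illustration of this pipeline, the sketch below (plain NumPy, with random stand-ins for the CNN patch features and token embeddings; all sizes are illustrative, not taken from the paper) shows how encoded image features and embedded text features interact through cross-modal attention:

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, T = 49, 512, 10   # image patches, feature size, report length (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1) Patch features from a CNN backbone (random stand-ins here).
visual = rng.normal(size=(S, d))

# 2) Image encoder (stand-in: a single linear projection).
W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
encoded = visual @ W_enc                          # (S, d)

# 3) Sentence-decoder step: embedded report tokens attend over the image
#    features (cross-modal attention), yielding image-conditioned states.
tokens = rng.normal(size=(T, d))                  # embedded text features
attn = softmax(tokens @ encoded.T / np.sqrt(d))   # (T, S) attention weights
context = attn @ encoded                          # (T, d) decoder inputs
```

In a real model the projection would be a trained encoder and the decoder would map each context vector to a token distribution; only the attention-based alignment step is shown here.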
(3) Some ultrasound images may have off-center eye information due to variability in eye positioning, so that an incomplete eyeball is obtained after cropping. These images should be excluded from the dataset. Additionally, images without corresponding textual reports are also omitted.
Figure 2: An example of complete image processing. (a) Screening (b) Regional selection
and manual cropping. (c) Storage.
3, in which the patient's personal identifiable information is removed. The following principles are considered during the text extraction of the diagnostic reports.
(1) The ultrasound description and its corresponding finding text in the report are integrated. This results in a comprehensive textual representation that includes detailed disease descriptions and diagnostic findings.
(2) Some diagnostic texts provide size parameters to describe the lesion, such as "measuring approximately 1.83×1.48×2.79 mm". Considering that the image does not provide specific size information, we delete this parameter and replace it with a simplified description, such as "a hypoechoic cystic-solid lesion within the nasal quadrant."
(3) Typically, many diagnostic reports include separate diagnoses for the left eye and the right one. Therefore, we process the reports independently for each eye, each forming an individual case.
(4) Some ultrasound findings did not specify diagnoses for the left and right eyes due to varied writing habits; a report may simply mention "posterior scleral staphyloma" instead of indicating "left (right) eye with posterior scleral staphyloma". After confirmation by ophthalmologists, this implies the presence of the disease in both eyes.
We use OCR text recognition technology to identify and extract text from the clinical reports. The ultrasound description and diagnostic section are manually integrated, and the relevant disease descriptions and diagnostic results are retained. Non-relevant text is removed, and the results are differentiated for the left and right eye. An example of the extracted result is shown in Figure 3.
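The per-eye separation described above can be sketched as follows; the English markers "left eye:"/"right eye:" are hypothetical stand-ins for the Chinese markers in the actual reports, and unmarked findings fall to both eyes, as confirmed by the ophthalmologists:

```python
def split_by_eye(findings):
    """Split OCR-extracted finding sentences into per-eye cases.

    `findings` is a list of sentences; the markers are hypothetical
    placeholders for the Chinese markers in the real reports.
    """
    cases = {"left": [], "right": []}
    for s in findings:
        if s.startswith("left eye:"):
            cases["left"].append(s[len("left eye:"):].strip())
        elif s.startswith("right eye:"):
            cases["right"].append(s[len("right eye:"):].strip())
        else:  # no eye specified -> disease present in both eyes
            cases["left"].append(s)
            cases["right"].append(s)
    return cases

cases = split_by_eye(["left eye: vitreous opacity",
                      "right eye: no abnormality",
                      "posterior scleral staphyloma"])
# "posterior scleral staphyloma" is assigned to both eyes
```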
Figure 3: Example of report preprocessing.
blood flow indices are extracted from three ultrasound images and gathered into a 3 × 3 matrix with nine parameter values. For cases where the parameter results are shown in the diagnostic report, we read them directly from the report.
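A minimal sketch of gathering the nine values into a 3 × 3 matrix; the artery names and index values below are illustrative assumptions, not the dataset's actual labels:

```python
import numpy as np

# Hypothetical per-artery readings: three blood flow indices measured at
# three arteries (names and values are illustrative only).
readings = {
    "artery_1": [31.2, 7.1, 0.77],
    "artery_2": [10.4, 3.2, 0.69],
    "artery_3": [12.8, 4.0, 0.68],
}

# Gather the nine parameter values into a 3 x 3 matrix:
# rows = arteries, columns = blood flow indices.
flow = np.array([readings[k] for k in readings])
```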
Figure 4: Example of blood flow information extraction.
Next, all the image names in the folder are batch-extracted and exported to a spreadsheet. This spreadsheet contains the ultrasound number and the corresponding eye designation. Each ultrasound report is matched with its image based on the ultrasound number, as well as the image path information. To make the data uniformly distributed, we randomly shuffle the data and divide them into training, testing, and validation sets. Finally, the data are converted to JSON format for easy retrieval in subsequent experiments. The aggregated blood flow parameter matrices are compiled in a table, which is also matched based on the ultrasound number.
In summary, a complete data record consists of the image ID (i.e., the image's ultrasound number), as well as its corresponding report, image path, split set and blood flow information. The report and image are connected by the image ID and its file path. An example of a complete data record with an image and its corresponding report is shown in Figure 5.
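The shuffling, splitting and JSON conversion can be sketched as below; the field names and record layout are assumptions based on the description above (not the dataset's exact schema), and the 75:10:15 ratio is taken from Section 4.2.1:

```python
import json
import random

# Illustrative record layout (field names are assumptions).
records = [
    {"id": f"US{i:05d}",
     "image_path": f"images/US{i:05d}.jpg",
     "report": "report text",
     "blood_flow": [[0.0] * 3] * 3}
    for i in range(100)
]

random.Random(42).shuffle(records)       # shuffle for a uniform distribution
n_train = int(len(records) * 0.75)       # 75:10:15 split, as in Section 4.2.1
n_val = int(len(records) * 0.10)
for j, r in enumerate(records):
    r["split"] = ("train" if j < n_train
                  else "val" if j < n_train + n_val else "test")

# Serialize to JSON for easy retrieval in later experiments.
annotation = json.dumps(records, ensure_ascii=False)
```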
Figure 5: An example of complete data. (a) cropped image. (b) JSON-formatted content
after extracting the text report. (c) extracted blood flow information.
not considered. The most common diseases are "vitreous opacity," "vitreous hemorrhage," "PVD (posterior vitreous detachment)" and "mild vitreous opacity". The number of each disease is calculated based on the frequency of the corresponding keyword in the dataset.
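The keyword-frequency counting can be sketched as follows, with English keywords standing in for the dataset's Chinese disease terms:

```python
from collections import Counter

# English stand-ins for the dataset's Chinese disease keywords.
keywords = ["vitreous opacity", "vitreous hemorrhage",
            "posterior vitreous detachment"]
reports = [
    "vitreous opacity in the right eye",
    "vitreous hemorrhage with mild vitreous opacity",
    "posterior vitreous detachment",
]

# Disease counts = frequency of each keyword across all reports.
counts = Counter()
for r in reports:
    for k in keywords:
        counts[k] += r.count(k)

print(counts.most_common(1))  # [('vitreous opacity', 2)]
```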
It is shown that 735 normal cases are found, accounting for 15.3% of the total dataset; no significant ophthalmic disease is found in these patients. The finding "vitreous opacity" appears in 3,476 cases, 71.6% of the total dataset, making it the most common eye condition. In contrast, "eye atrophy" is rare, with only 8 cases (about 0.16%). The numbers of left-eye and right-eye cases in the dataset are 2,394 and 2,465, respectively; right-eye samples are slightly more numerous than left-eye samples. The distribution of left and right eyes is unbalanced, since different patients underwent one or multiple ultrasound scans of different eyes, at different angles and times. In addition, other necessary indices of the dataset are analyzed, including the numbers of reports and patients and the length of sentences in the reports. The detailed results are shown in Table 2. Note that all sentences are extracted from diagnostic reports written by physicians, and each report consists of a series of sentences. There is a significant difference in sentence length across reports.
Figure 6: Disease categories and percentages.
For example, the maximum and minimum numbers of tokens in a single report are 114 and 9, respectively. Usually, the report of a normal case is described simply as "No abnormality was observed in the specific area". The report of a complex case combines descriptions of multiple diseases at specific locations, and its sentence length is far longer than that of a normal case.
Table 2: Descriptive statistics of report sentences.
Parameter                                  Value
Total tokens                               252,676
Reports                                    4,858
Patients                                   2,417
Average number of tokens per report        62
Maximum tokens in a single report          114
Minimum tokens in a single report          9
Right eyes                                 2,393
Left eyes                                  2,465
some medical terms about ophthalmic ultrasound images. This makes it more suitable for the Chinese medical report generation task.
A schematic framework illustrating the automatic generation of ophthalmic ultrasound reports based on the CMN learning model is shown in Figure 7. The whole process consists of four modules: visual feature extraction, word embedding, the cross-modal memory network, and report generation. The visual feature extraction module utilizes ResNet to extract patch features from the ultrasound image. The word embedding module transforms text into vector form. The cross-modal memory network projects the image features and text vectors into the same space and facilitates interaction using a memory matrix, in order to align information between different modalities, such as image and text. Finally, the interacted image and text information are fed into the encoder and decoder of a Transformer, respectively, and the report is generated.
We define the generation of ultrasound report as an image-to-text gener-
ation task. The goal is to predict the diagnostic report R for each ultrasound
image I. Then the image can be used as the source sequence and the text
report as the target sequence.
Visual feature extraction: First, a visual extractor is used to extract
the visual features from the ultrasound image I, which is denoted as X =
{x1 , x2 , . . . , xS } , xs ∈ Rd . Here xs are the patch features of the image and d
is the size of the feature vector. The extraction process is expressed as:

{x1 , x2 , . . . , xS } = fv (I)                (1)

where the visual extractor fv (·) is implemented by the ResNet network [9] in this experiment.

Figure 7: A framework diagram for cross-modal medical report generation.
Word embedding: The text sequence obtained by word embedding is
denoted as Y = {y1 , y2 , . . . , yt−1 }. The process can be expressed as:

Y = ft (R)                (2)

where ft (·) is the text feature extractor. Here we use the same embedding module as the report generation Transformer during model training and
learning. The generated output tokens are represented as {y1 , y2 , . . . , yt , . . . ,
yT } , yt ∈ V, where V is all the possible tokens and T is the length of the
report.
Cross-modal network: The CMN is used to solve the information alignment problem across different modalities. It introduces a memory matrix M = {m1 , m2 , . . . , mi , . . . , mN } to handle the information interaction between the two modalities, image and text. Here N denotes the number of memory vectors. Specifically, CMN consists of two subprocesses, querying and responding. In the querying subprocess, the image and text features are first projected into the same representation space, and the most relevant memory vectors for the image and the text are queried. The responding subprocess weights the queried memory vectors of image and text, respectively. Finally, the obtained memory responses are fed into the Transformer to generate the
corresponding reports. The CMN learning can be represented as:

{w1 , w2 , . . . , wS } = fe (rx1 , rx2 , . . . , rxS )                (5)

where fe (·) refers to the encoder and {rx1 , rx2 , . . . , rxS } are the memory responses of the image features. Then, along with the memory responses {ry1 , ry2 , . . . , ryt−1 } of the text features, the obtained intermediate states {w1 , w2 , . . . , wS } are sent to the decoder to generate the current output yt :

yt = fd (w1 , w2 , . . . , wS , ry1 , ry2 , . . . , ryt−1 )                (6)
The model is trained to maximize the log-likelihood of the target report:

θ∗ = arg maxθ ∑Tt=1 log p (yt | y1 , . . . , yt−1 , I; θ)                (7)

where θ are the model parameters. The goal is to find the best parameter value θ∗ that maximizes the probability of generating the target sequence Y for the given input I.
4.2. Experiments and Discussions
4.2.1. Dataset and Evaluation Metrics
We randomly separate the ultrasound dataset into three parts for train-
ing, validation and test according to the ratio of 75:10:15. The model is
trained and validated using the training and validation sets. The test set is
adopted for performance evaluation, which is not available during the train-
ing procedure. The detailed data split is shown in Table 3.
In this paper, four natural language generation (NLG) metrics are used to evaluate the quality of the generated ultrasound reports: BLEU [30], METEOR [31], CIDER [32] and ROUGE-L [33]. In particular, BLEU measures the similarity between the generated and ground-truth text by calculating the overlap of word n-grams. METEOR takes into account both surface and semantic similarity between the generated text and the ground truth, with built-in mechanisms for handling synonyms and paraphrases. ROUGE-L is an evaluation based on precision and recall over the longest common subsequence; the semantic content and coherence of the generated text are also considered. In contrast, CIDER is based on the cosine similarity between TF-IDF-weighted n-gram vectors, considering both single-word and multi-word phrases; it is a metric widely used to evaluate image description and text generation tasks. Here we particularly use CIDER to evaluate how well important information is captured during generation.
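For reference, a minimal BLEU sketch (geometric mean of clipped n-gram precisions with a brevity penalty, no smoothing); real evaluations typically use a standard implementation rather than this simplified version:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: clipped n-gram precision
    combined via a geometric mean, scaled by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "no abnormality was observed in the specific area".split()
```

An identical candidate and reference score 1.0; a candidate shorter than four tokens scores 0.0 under BLEU-4 since it contains no 4-grams.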
GeForce RTX 3090 GPU. For the encoder-decoder backbone in the report generation Transformer, the structure includes 3 layers and 8 attention heads, with 512-dimensional hidden states. All weights in the model are randomly initialized. The beam size is set to 3 to balance effectiveness and efficiency for all models during report generation. The model that achieves the best BLEU-4 score on the validation set is selected as the final trained model.
Cleaning the text data is necessary to improve the quality of report generation. The sequence length of a single report is basically less than 110 characters; therefore, the maximum sequence length is set to 115 during training. Sequences that are not long enough are padded with the special token [unk]. The cut-off threshold for words is 3, meaning that words occurring fewer times than the threshold are filtered out and replaced with [unk]. In addition, disease abbreviations in the original reports are replaced with their corresponding Chinese names; for example, "PVD" is replaced with "posterior vitreous detachment".
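The cleaning steps (frequency cut-off of 3, [unk] replacement, and padding short sequences to the maximum length, as described above) can be sketched as:

```python
from collections import Counter

def build_vocab(reports, threshold=3, max_len=115):
    """Sketch of the text-cleaning step.

    Words occurring fewer than `threshold` times are replaced with [unk];
    each tokenized report is truncated/padded to `max_len`, with short
    sequences filled by [unk] as described in the text.
    """
    freq = Counter(w for r in reports for w in r)
    vocab = {w for w, c in freq.items() if c >= threshold}
    cleaned = []
    for r in reports:
        toks = [w if w in vocab else "[unk]" for w in r][:max_len]
        toks += ["[unk]"] * (max_len - len(toks))  # fill short sequences
        cleaned.append(toks)
    return vocab, cleaned
```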
To validate the generalizability of the proposed dataset, we also evaluate another generation model, R2Gen [25], as the main baseline in our experiments. The R2Gen model uses a relational memory (RM) to record the previous generation process and combines memory-driven conditional layer normalization (MCLN) in the Transformer decoder. Ablation experiments for the two models (R2Gen and CMN) are designed on the proposed dataset, respectively.
BASE: An original Transformer with 3 layers, 8 heads and 512 hidden units, with no other extensions or modifications.
BASE+RM: The RM module is connected directly to the Transformer output before the softmax at each time step, but is not integrated into the Transformer decoder.
BASE+R2Gen: The MCLN and RM modules are combined and integrated into the Transformer decoder to facilitate the decoding process.
BASE+MEM: Two separate memory networks for image and text are introduced into the BASE Transformer, without cross-modal information interaction.
BASE+CMN: A shared cross-modal memory network (CMN) is introduced to facilitate information interaction between the two modalities, image and text.
Table 4: The NLG performance metrics on the test set.
NLG METRICS
MODEL BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDER
BASE 0.382 0.309 0.154 0.126 0.368 0.339 0.499
BASE+RM 0.470 0.364 0.219 0.165 0.405 0.371 0.550
BASE+R2Gen 0.568 0.457 0.344 0.248 0.475 0.495 0.834
BASE+MEM 0.487 0.390 0.294 0.184 0.457 0.365 0.637
BASE+CMN 0.589 0.459 0.396 0.270 0.493 0.511 0.951
Figure 8: Comparison of the report length.
focus attention on the corresponding lesion areas. For example, the model
accurately identifies the location of lesions such as ”vitreous hemorrhage”,
”post-vitrectomy” and ”vitreous opacities” in the images.
In Figure 9, red font indicates content that the model failed to predict or predicted incorrectly during report generation. Blue font indicates additional content, where the model gives more prediction information than the ground-truth report. In the first example, the model does not accurately distinguish between "vitreous hemorrhage" and "organized vitreous hemorrhage"; these two conditions are very easily confused, both in the lesion image and in the textual description. In the second and third examples, the CMN model accurately reproduces the ground-truth report. The region of interest in the visualization focuses on the lesion area, which implies that the model is able to align the image and text information well. In the fourth example, the model not only accurately predicts the disease information but also additionally generates the phrase "abnormal intraocular echoes", showing the strong learning ability of the CMN model. In addition to the successful predictions, there are also some sources of interference present in the dataset. For instance, in the fifth example, the "posterior scleral staphyloma" is located at the lower edge of the eye image, which is difficult for the CMN model to recognize. Moreover, due to the visual similarity between the surrounding non-ocular regions and the lesion region, additional interference arises that affects accurate identification and report generation. In general, the visualization results and the corresponding generated reports indicate that the proposed dataset is suitable for medical report generation.
5. Conclusion and Limitations
This paper presents a labeled ophthalmic dataset for the precise analysis and automated exploration of medical images along with their associated reports. The dataset contains 4,858 Chinese reports and the corresponding 4,858 eye ultrasound images, as well as blood flow parameter information measured in clinical practice. To the best of our knowledge, it is the only ophthalmic dataset that contains these three modalities simultaneously. The proposed dataset has also been used to evaluate cross-modal medical report generation models, including the R2Gen and CMN models. The accurately generated reports and their corresponding disease regions of interest are also visualized based on the CMN model. We hope that this dataset can contribute to the development of automated diagnostic learning algorithms for the ophthalmic domain and reduce the stress of ophthalmologists in their clinical work.
We also notice several limitations of this study. First, all the data are collected from only one medical center and may not be generalizable. Second, some rare disease variants are not collected in the dataset. Third, there is a data bias in the distribution of diseases because the data are collected in a real clinical process. In the future, we will continue to expand the volume of the dataset to minimize the data bias as much as possible.
Acknowledgments
This research is funded by the National Natural Science Foundation of
China (62373005, 62273007).
References
[1] M. Li, W. Cai, R. Liu, Y. Weng, X. Zhao, C. Wang, X. Chen, Z. Liu,
C. Pan, M. Li, et al., Ffa-ir: Towards an explainable and reliable medical
report generation benchmark, in: Thirty-fifth Conference on Neural In-
formation Processing Systems Datasets and Benchmarks Track (Round
2), 2021.
[3] J.-H. Huang, C.-H. H. Yang, F. Liu, M. Tian, Y.-C. Liu, T.-W. Wu,
I. Lin, K. Wang, H. Morikawa, H. Chang, et al., Deepopht: medical
report generation for retinal images via deep models and visual expla-
nation, in: Proceedings of the IEEE/CVF winter conference on appli-
cations of computer vision, 2021, pp. 2442–2452.
[4] P. Harzig, Y.-Y. Chen, F. Chen, R. Lienhart, Addressing data bias
problems for chest x-ray image report generation, arXiv preprint
arXiv:1908.02123 (2019).
[5] X. Wang, Y. Peng, L. Lu, Z. Lu, R. M. Summers, Tienet: Text-image
embedding network for common thorax disease classification and report-
ing in chest x-rays, in: Proceedings of the IEEE conference on computer
vision and pattern recognition, 2018, pp. 9049–9058.
[6] Y. Xue, T. Xu, L. Rodney Long, Z. Xue, S. Antani, G. R. Thoma,
X. Huang, Multimodal recurrent model with attention for automated
radiology report generation, in: Medical Image Computing and Com-
puter Assisted Intervention–MICCAI 2018: 21st International Confer-
ence, Granada, Spain, September 16-20, 2018, Proceedings, Part I,
Springer, 2018, pp. 457–466.
[7] C. Yin, B. Qian, J. Wei, X. Li, X. Zhang, Y. Li, Q. Zheng, Automatic
generation of medical imaging diagnostic report with hierarchical recur-
rent neural network, in: 2019 IEEE international conference on data
mining (ICDM), IEEE, 2019, pp. 728–737.
[8] Z. Wang, L. Zhou, L. Wang, X. Li, A self-boosting framework for
automated radiographic report generation, in: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2021, pp. 2433–2442.
[9] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[10] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-
scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[11] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[12] B. Jing, P. Xie, E. Xing, On the automatic generation of medical imaging
reports, in: Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2577–
2586.
[20] T. Kauppi, V. Kalesnykiene, J.-K. Kamarainen, L. Lensu, I. Sorri, A. Raninen, R. Voutilainen, H. Uusitalo, H. Kälviäinen, J. Pietilä, The diaretdb1 diabetic retinopathy database and evaluation protocol, in: BMVC, Vol. 1, Citeseer, 2007, p. 10.
[27] X.-H. Zeng, B.-G. Liu, M. Zhou, Understanding and generating ultra-
sound image description, Journal of Computer Science and Technology
33 (2018) 1086–1100.
[29] Z. Chen, Y. Shen, Y. Song, X. Wan, Cross-modal memory networks for radiology report generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5904–5914.
[30] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for au-
tomatic evaluation of machine translation, in: Proceedings of the 40th
annual meeting of the Association for Computational Linguistics, 2002,
pp. 311–318.
[31] M. Denkowski, A. Lavie, Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems, in: Proceedings of the sixth workshop on statistical machine translation, 2011, pp. 85–91.
[32] R. Vedantam, C. L. Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[33] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in:
Text summarization branches out, 2004, pp. 74–81.