A Labeled Ophthalmic Ultrasound Dataset With Medical Report Generation Based On Cross-Modal Deep Learning (arXiv:2407.18667v1)
a School of Electrical and Control Engineering, North China University of Technology, Beijing, 100144, China
b School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang, 110159, China
c Department of Ophthalmology, The Fourth Affiliated Hospital of China Medical University, Shenyang, 110005, China
Abstract
Ultrasound imaging reveals eye morphology and aids in diagnosing and treating eye diseases. However, interpreting diagnostic reports requires specialized physicians. We present a labeled ophthalmic dataset for the precise analysis and automated exploration of medical images along with their associated reports. It comprises three modalities of data, namely ultrasound images, blood flow information and examination reports, collected from 2,417 patients at an ophthalmology hospital in Shenyang, China, during 2018; all patient information is de-identified for privacy protection. To the best of our knowledge, it is the only ophthalmic dataset that contains these three modalities simultaneously. It consists of 4,858 images with corresponding free-text reports, which describe 15 typical imaging findings of intraocular diseases and their anatomical locations. Each image is accompanied by three kinds of blood flow indices measured at three specific arteries, i.e., nine parameter values describing the spectral characteristics of the blood flow distribution. The reports were written by ophthalmologists during clinical care. The proposed dataset is applied to generate medical reports.
∗ Corresponding author
Email addresses: [email protected] (Jing Wang),
[email protected] (Junyan Fan), [email protected] (Meng Zhou),
[email protected] (Yanzhu Zhang), [email protected] (Mingyu Shi)
1. Introduction
Medical imaging plays a crucial role in clinical diagnosis and treatment, especially in the field of ophthalmology. Popular imaging techniques include fundus photography, optical coherence tomography (OCT), and fluorescein angiography of the retina. Adequate interpretation of an ophthalmic examination requires professional ophthalmologists or radiologists. Due to increasing workloads, ophthalmologists face significant time and effort constraints in analyzing medical images and generating diagnostic reports. Moreover, the subjective expertise of different ophthalmologists may lead to variations in the interpretation of the same image. Existing medical report generation research has mostly focused on radiographic images, particularly chest X-rays. Various medical report generation datasets have been released for different medical modalities, such as fundus fluorescein angiography (FFA) images [1], lung CT scans [2], and color fundus photography (CFP) [3]. However, there is a lack of research specifically on ophthalmic ultrasound images and their report generation. Moreover, most existing datasets are in English, leaving a significant research gap in Chinese medical report generation. Therefore, the development of annotated ophthalmic ultrasound image datasets with corresponding reports is necessary for artificial intelligence (AI) based ophthalmic diagnosis.
Inspired by the image captioning task, an increasing number of researchers have applied this approach to the generation of medical reports. Its primary objective is to provide annotations for the subsequent diagnosis and treatment of diseases. Many deep learning-based automatic generation methods have been proposed to lighten the workload of doctors [4, 5, 6, 7, 8]. These methods usually use a convolutional neural network (CNN) [9, 10] to extract visual features and a recurrent neural network (RNN) to predict reports [11]. They have provided a preliminary feasible scheme for medical report generation (MRG). Later, text-image attention mapping was explored to explain the automatic generation process, although its accuracy is not yet known [12]. There still exist several challenges that limit the practical application of deep learning methods: (1) the construction of large and specific medical datasets is time-consuming and labor-intensive; (2) due to the complexity of medical image interpretation, the accuracy achieved by deep learning models does not reach the level of specialized doctors.
In this paper, we propose a labeled ophthalmic dataset of ocular ultrasound images, text reports and blood flow information. To the best of our knowledge, it is the only ophthalmic dataset that contains all three modalities simultaneously. A comprehensive experiment is conducted on this dataset, and the result demonstrates that the proposed dataset is suitable for training various supervised learning models that concern cross-modal medical data. The main motivations for the dataset construction are as follows. From the perspective of scientific research, it is helpful for developing medical AI learning algorithms that fuse computer vision and natural language processing; applications include, but are not limited to, cross-modal report generation. From the perspective of clinical applications, this dataset provides ultrasound images that give insight into the morphology and structure of the eye. It covers 15 common ophthalmic diseases, such as retinal detachment, choroidal detachment, vitreous stellate degeneration, vitreous hemorrhage, endophthalmitis and vitreous opacity. The rich variety of diseases broadens the applicability of learning models for initial diagnosis and treatment.
The main contributions of this paper are summarized as follows:
• A large-scale medical dataset is constructed, comprising 4,858 eye ultrasound images and their corresponding Chinese reports. All the data were collected from real-world clinical practice, and the reports accurately represent the writing patterns of ophthalmologists. The dataset facilitates cross-modal learning and report generation whose text pattern is closely aligned with clinical practice.
• The prediction accuracy of the generated medical reports is evaluated based on NLG metrics. The results show that our dataset can be applied to medical report generation, which is helpful to drive AI-based ophthalmic medical diagnosis.
The rest of the paper is organized as follows: Section 2 reviews existing medical datasets and medical report generation (MRG) methods. Section 3 presents the construction methodology of the labeled ophthalmic ultrasound dataset. Section 4 gives a comprehensive report generation experiment based on the cross-modal memory network. Section 5 draws conclusions as well as future perspectives.
2. Related Work
Many kinds of medical images have been widely utilized to develop AI-aided diagnosis systems. Here we review the existing medical image datasets and MRG methods.
There are also five retinal datasets including retinal images and text. FFA-IR [1] provides interpretable annotations by labeling 46 kinds of foci in a total of 12,166 regions, as well as fundus fluorescein angiography (FFA) images and reports. It plays an important role in identifying diseases and generating reports. In contrast to FFA-IR, DEN [3] primarily consists of color fundus photography (CFP) images (13,898 CFP and 1,811 FFA). STARE [19] released a total of 397 images, including CFP and FFA, in 2004. However, the text provided with these images consists of short free-text diagnostic labels rather than observational reports of image findings; it is therefore not suitable for training medical report generation models. DIARETDB1 [20] has good annotations of lesion location and size, but the number of CFP images is limited. MESSIDOR [21] includes 1,200 CFP images and 600 fine-grained French reports. In summary, most current datasets target chest diseases, and little work has been done on datasets for eye diseases. The existing datasets for eyes consist basically of color fundus images, which are generally used for screening, diagnosis and monitoring of fundus diseases such as retinopathy and macular degeneration. However, they make it difficult to assess some intraocular conditions, such as vitreous disorders and retinal detachment. Unlike the existing medical report datasets, ours builds image-report pairs from clinically collected ocular ultrasound images, Chinese reports and additional blood flow parameter information. It will play an important role in the diagnosis of ocular diseases and in automated report generation studies.
Table 1: Summary of medical datasets
Name of Dataset        Image Modality      Number of Images   Report Cases        Report Language
RDIF [14]              Kidney biopsy       1,152              144                 English
COV-CHR [2]            Lung CT scans       728                728                 English/Chinese
Fetal Ultrasound [17]  Fetal ultrasound    2,800              2,800               English
PEIR Gross [12]        Gross lesions       7,442              7,442               English
Open-IU [13]           Chest X-ray         7,470              2,955               English
MIMIC-CXR [15]         Chest X-ray         377,110            276,778             English
PADCHEST [16]          Chest X-ray         160,868            22,710              Spanish
CX-CHR [18]            Chest X-ray         45,598             40,410              Chinese
TJU [22]               Chest X-ray         19,985             19,985              Chinese
STARE [19]             CFP+FFA             397                397                 English
DEN [3]                CFP+FFA             15,709             15,709 (keywords)   English
DIARETDB1 [20]         CFP                 89                 89                  English
MESSIDOR [21]          CFP                 1,200              587                 French
FFA-IR [1]             FFA                 1,048,584          10,790              English/Chinese
Ours                   Ocular ultrasound   4,858              4,858               Chinese
capacity during the decoding procedure for the retention and utilization of relevant information. Zhang et al. [26] improved the attention mechanism to guarantee that the model focuses on the correct region of the medical image, and it also provided interpretable analysis of the diagnostic process. Wang et al. [5] proposed a text-image embedding model that generates medical reports using the attention mechanism. The model incorporates an attention distribution graph and text embedding information to improve classification accuracy. Zeng et al. [27] generated ultrasound image reports using a target detection algorithm: it detects the lesion region and generates a medical diagnostic report by encoding and decoding the ultrasound image vector.
tion. During report generation, visual features are extracted from the ultrasound images by a traditional CNN [10] and injected into the image encoder. The encoded image features are then fed into a sentence decoder together with embedded text features to generate reports. Usually the sentence decoder is designed as an LSTM network [11] or a Transformer decoder [28], in which cross-modal attention is utilized to align image and text information.
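As a rough illustration of this pipeline, the sketch below (plain NumPy, with random stand-ins for the CNN patch features and token embeddings; all sizes are illustrative, not taken from the paper) shows how encoded image features and embedded text features interact through cross-modal attention:

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, T = 49, 512, 10   # image patches, feature size, report length (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1) Patch features from a CNN backbone (random stand-ins here).
visual = rng.normal(size=(S, d))

# 2) Image encoder (stand-in: a single linear projection).
W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
encoded = visual @ W_enc                          # (S, d)

# 3) Sentence-decoder step: embedded report tokens attend over the image
#    features (cross-modal attention), yielding image-conditioned states.
tokens = rng.normal(size=(T, d))                  # embedded text features
attn = softmax(tokens @ encoded.T / np.sqrt(d))   # (T, S) attention weights
context = attn @ encoded                          # (T, d) decoder inputs
```

In a real model the projection would be a trained encoder and the decoder would map each context vector to a token distribution; only the attention-based alignment step is shown here.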
(3) Some ultrasound images may have off-center eye information due to variability in eye positioning, so that an incomplete eyeball is obtained after cropping. These images should be excluded from the dataset. Additionally, images without corresponding textual reports are also omitted.
Figure 2: An example of complete image processing. (a) Screening (b) Regional selection
and manual cropping. (c) Storage.
3, in which the patient's personal identifiable information is removed. The following principles are considered during the text extraction of the diagnostic reports.
(1) The ultrasound description and its corresponding finding text in the report are integrated. This results in a comprehensive textual representation that includes detailed disease descriptions and diagnostic findings.
(2) Some diagnostic texts provide size parameters to describe the lesion, such as "measuring approximately 1.83×1.48×2.79 mm". Considering that the image does not provide specific size information, we delete this parameter and replace it with a simplified description, such as "a hypoechoic cystic-solid lesion within the nasal quadrant."
(3) Typically, many diagnostic reports include separate diagnoses for the left eye and the right one. Therefore, we process the reports independently for each eye, each forming an individual case.
(4) Some ultrasound findings did not specify diagnoses for the left and right eyes due to varied writing habits; a report may simply mention "posterior scleral staphyloma" instead of indicating "left (right) eye with posterior scleral staphyloma". After confirmation by ophthalmologists, this implies the presence of the disease in both eyes.
We use OCR text recognition technology to identify and extract text from the clinical reports. The ultrasound description and diagnostic section are manually integrated, and the relevant disease descriptions and diagnostic results are retained. Non-relevant text is removed, and the results are differentiated for the left and right eye. An example of the extracted result is shown in Figure 3.
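The per-eye separation described above can be sketched as follows; the English markers "left eye:"/"right eye:" are hypothetical stand-ins for the Chinese markers in the actual reports, and unmarked findings fall to both eyes, as confirmed by the ophthalmologists:

```python
def split_by_eye(findings):
    """Split OCR-extracted finding sentences into per-eye cases.

    `findings` is a list of sentences; the markers are hypothetical
    placeholders for the Chinese markers in the real reports.
    """
    cases = {"left": [], "right": []}
    for s in findings:
        if s.startswith("left eye:"):
            cases["left"].append(s[len("left eye:"):].strip())
        elif s.startswith("right eye:"):
            cases["right"].append(s[len("right eye:"):].strip())
        else:  # no eye specified -> disease present in both eyes
            cases["left"].append(s)
            cases["right"].append(s)
    return cases

cases = split_by_eye(["left eye: vitreous opacity",
                      "right eye: no abnormality",
                      "posterior scleral staphyloma"])
# "posterior scleral staphyloma" is assigned to both eyes
```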
Figure 3: Example of report preprocessing.
blood flow indices are extracted from three ultrasound images and gathered into a 3 × 3 matrix with nine parameter values. For cases where the parameter results are shown in the diagnostic report, we read them directly from the report.
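A minimal sketch of gathering the nine values into a 3 × 3 matrix; the artery names and index values below are illustrative assumptions, not the dataset's actual labels:

```python
import numpy as np

# Hypothetical per-artery readings: three blood flow indices measured at
# three arteries (names and values are illustrative only).
readings = {
    "artery_1": [31.2, 7.1, 0.77],
    "artery_2": [10.4, 3.2, 0.69],
    "artery_3": [12.8, 4.0, 0.68],
}

# Gather the nine parameter values into a 3 x 3 matrix:
# rows = arteries, columns = blood flow indices.
flow = np.array([readings[k] for k in readings])
```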
Figure 4: Example of blood flow information extraction.
Next, all the image names in the folder are batch-extracted and exported to a spreadsheet. This spreadsheet contains the ultrasound number and the corresponding eye designation. Each ultrasound report is matched with its image based on the ultrasound number, as well as the image path information. To make the data uniformly distributed, we randomly shuffle the data and divide them into training, testing, and validation sets. Finally, the data are converted to JSON format for easy retrieval in subsequent experiments. The aggregated blood flow parameter matrices are compiled in a table, which is also matched based on the ultrasound number.
In summary, a complete data record consists of the image ID (i.e., the image's ultrasound number), as well as its corresponding report, image path, split set and blood flow information. The report and image are connected by the image ID and its file path. An example of a complete data record with an image and its corresponding report is shown in Figure 5.
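The shuffling, splitting and JSON conversion can be sketched as below; the field names and record layout are assumptions based on the description above (not the dataset's exact schema), and the 75:10:15 ratio is taken from Section 4.2.1:

```python
import json
import random

# Illustrative record layout (field names are assumptions).
records = [
    {"id": f"US{i:05d}",
     "image_path": f"images/US{i:05d}.jpg",
     "report": "report text",
     "blood_flow": [[0.0] * 3] * 3}
    for i in range(100)
]

random.Random(42).shuffle(records)       # shuffle for a uniform distribution
n_train = int(len(records) * 0.75)       # 75:10:15 split, as in Section 4.2.1
n_val = int(len(records) * 0.10)
for j, r in enumerate(records):
    r["split"] = ("train" if j < n_train
                  else "val" if j < n_train + n_val else "test")

# Serialize to JSON for easy retrieval in later experiments.
annotation = json.dumps(records, ensure_ascii=False)
```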
Figure 5: An example of complete data. (a) cropped image. (b) JSON-formatted content
after extracting the text report. (c) extracted blood flow information.
not considered. The most common diseases are "vitreous opacity," "vitreous hemorrhage," "PVD (posterior vitreous detachment)" and "mild vitreous opacity". The number of each disease is calculated based on the frequency of the corresponding keyword in the dataset.
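The keyword-frequency counting can be sketched as follows, with English keywords standing in for the dataset's Chinese disease terms:

```python
from collections import Counter

# English stand-ins for the dataset's Chinese disease keywords.
keywords = ["vitreous opacity", "vitreous hemorrhage",
            "posterior vitreous detachment"]
reports = [
    "vitreous opacity in the right eye",
    "vitreous hemorrhage with mild vitreous opacity",
    "posterior vitreous detachment",
]

# Disease counts = frequency of each keyword across all reports.
counts = Counter()
for r in reports:
    for k in keywords:
        counts[k] += r.count(k)

print(counts.most_common(1))  # [('vitreous opacity', 2)]
```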
It is shown that 735 normal cases are found, accounting for 15.3% of the total dataset; no significant ophthalmic disease is found in these patients. The finding "vitreous opacity" appears in 3,476 cases, 71.6% of the total dataset, making it the most common eye condition. In contrast, "eye atrophy" is rare, with only 8 cases (about 0.16%). The numbers of left-eye and right-eye cases in the dataset are 2,394 and 2,465, respectively; right-eye samples are slightly more numerous than left-eye samples. The distribution of left and right eyes is unbalanced, since different patients underwent one or multiple ultrasound scans of different eyes, at different angles and times. In addition, other necessary indices of the dataset are analyzed, including the numbers of reports and patients and the length of sentences in the reports. The detailed results are shown in Table 2. Note that all sentences are extracted from diagnostic reports written by physicians, and each report consists of a series of sentences. There is a significant difference in sentence length across reports.
Figure 6: Disease categories and percentages.
For example, the maximum and minimum numbers of tokens in a single report are 114 and 9, respectively. Usually, the report of a normal case is described simply as "No abnormality was observed in the specific area". The report of a complex case combines descriptions of multiple diseases at specific locations, and its sentence length is far longer than that of a normal case.
Table 2: Descriptive statistics of report sentences.
Parameter                                  Value
Total tokens                               252,676
Reports                                    4,858
Patients                                   2,417
Average number of tokens per report        62
Maximum tokens in a single report          114
Minimum tokens in a single report          9
Right eyes                                 2,393
Left eyes                                  2,465
some medical terms about ophthalmic ultrasound images. This makes it more suitable for the Chinese medical report generation task.
A schematic framework illustrating the automatic generation of ophthalmic ultrasound reports based on the CMN learning model is shown in Figure 7. The whole process consists of four modules: visual feature extraction, word embedding, the cross-modal memory network, and report generation. The visual feature extraction module utilizes ResNet to extract patch features from the ultrasound image. The word embedding module transforms text into vector form. The cross-modal memory network projects the image features and text vectors into the same space and facilitates interaction using a memory matrix, in order to align information between different modalities, such as image and text. Finally, the interacted image and text information are fed into the encoder and decoder of a Transformer, respectively, and the report is generated.
We define the generation of ultrasound report as an image-to-text gener-
ation task. The goal is to predict the diagnostic report R for each ultrasound
image I. Then the image can be used as the source sequence and the text
report as the target sequence.
Visual feature extraction: First, a visual extractor is used to extract
the visual features from the ultrasound image I, which is denoted as X =
{x1 , x2 , . . . , xS } , xs ∈ Rd . Here xs are the patch features of the image and d
is the size of the feature vector. The extraction process is expressed as:

{x1 , x2 , . . . , xS } = fv (I)                (1)

where the visual extractor fv (·) is implemented by the ResNet network [9] in this experiment.

Figure 7: A framework diagram for cross-modal medical report generation.
Word embedding: The text sequence obtained by word embedding is
denoted as Y = {y1 , y2 , . . . , yt−1 }. The process can be expressed as:

Y = ft (R)                (2)

where ft (·) is the text feature extractor. Here we use the same embedding module as the report generation Transformer during model training and
learning. The generated output tokens are represented as {y1 , y2 , . . . , yt , . . . ,
yT } , yt ∈ V, where V is all the possible tokens and T is the length of the
report.
Cross-modal network: The CMN is used to solve the information alignment problem across different modalities. It introduces a memory matrix M = {m1 , m2 , . . . , mi , . . . , mN } to handle the information interaction between the two modalities, image and text. Here N denotes the number of memory vectors. Specifically, CMN consists of two subprocesses, querying and responding. In the querying subprocess, the image and text features are first projected into the same representation space, and the most relevant memory vectors for the image and the text are queried. The responding subprocess weights the queried memory vectors of image and text, respectively. Finally, the obtained memory responses are fed into the Transformer to generate the
corresponding reports. The CMN learning can be represented as:

{w1 , w2 , . . . , wS } = fe (rx1 , rx2 , . . . , rxS )                (5)

where fe (·) refers to the encoder and {rx1 , rx2 , . . . , rxS } are the memory responses of the image features. Then, along with the memory responses {ry1 , ry2 , . . . , ryt−1 } of the text features, the obtained intermediate states {w1 , w2 , . . . , wS } are sent to the decoder to generate the current output yt :

yt = fd (w1 , w2 , . . . , wS , ry1 , ry2 , . . . , ryt−1 )                (6)
The model is trained to maximize the log-likelihood of the target report:

θ∗ = arg maxθ ∑Tt=1 log p (yt | y1 , . . . , yt−1 , I; θ)                (7)

where θ are the model parameters. The goal is to find the best parameter value θ∗ that maximizes the probability of generating the target sequence Y for the given input I.
4.2. Experiments and Discussions
4.2.1. Dataset and Evaluation Metrics
We randomly separate the ultrasound dataset into three parts for train-
ing, validation and test according to the ratio of 75:10:15. The model is
trained and validated using the training and validation sets. The test set is
adopted for performance evaluation, which is not available during the train-
ing procedure. The detailed data split is shown in Table 3.
In this paper, four natural language generation (NLG) metrics are used to evaluate the quality of the generated ultrasound reports: BLEU [30], METEOR [31], CIDER [32] and ROUGE-L [33]. In particular, BLEU measures the similarity between the generated and ground-truth text by calculating the overlap of word n-grams. METEOR takes into account both surface and semantic similarity between the generated text and the ground truth, with built-in mechanisms for handling synonyms and paraphrases. ROUGE-L is an evaluation based on precision and recall over the longest common subsequence; the semantic content and coherence of the generated text are also considered. In contrast, CIDER is based on the cosine similarity between TF-IDF-weighted n-gram vectors, considering both single-word and multi-word phrases; it is a metric widely used to evaluate image description and text generation tasks. Here we particularly use CIDER to evaluate how well important information is captured during generation.
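For reference, a minimal BLEU sketch (geometric mean of clipped n-gram precisions with a brevity penalty, no smoothing); real evaluations typically use a standard implementation rather than this simplified version:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: clipped n-gram precision
    combined via a geometric mean, scaled by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "no abnormality was observed in the specific area".split()
```

An identical candidate and reference score 1.0; a candidate shorter than four tokens scores 0.0 under BLEU-4 since it contains no 4-grams.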
GeForce RTX 3090 GPU. For the encoder-decoder backbone in the report generation Transformer, the structure includes 3 layers and 8 attention heads, with 512-dimensional hidden states. All weights in the model are randomly initialized. The beam size is set to 3 to balance effectiveness and efficiency for all models during report generation. The model that achieves the best BLEU-4 score on the validation set is selected as the final trained model.
Cleaning the text data is necessary to improve the quality of report generation. The sequence length of a single report is basically less than 110 characters; therefore, the maximum sequence length is set to 115 during training. Sequences that are not long enough are padded with the special token [unk]. The cut-off threshold for words is 3, meaning that words occurring fewer times than the threshold are filtered out and replaced with [unk]. In addition, disease abbreviations in the original reports are replaced with their corresponding Chinese names; for example, "PVD" is replaced with "posterior vitreous detachment".
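The cleaning steps (frequency cut-off of 3, [unk] replacement, and padding short sequences to the maximum length, as described above) can be sketched as:

```python
from collections import Counter

def build_vocab(reports, threshold=3, max_len=115):
    """Sketch of the text-cleaning step.

    Words occurring fewer than `threshold` times are replaced with [unk];
    each tokenized report is truncated/padded to `max_len`, with short
    sequences filled by [unk] as described in the text.
    """
    freq = Counter(w for r in reports for w in r)
    vocab = {w for w, c in freq.items() if c >= threshold}
    cleaned = []
    for r in reports:
        toks = [w if w in vocab else "[unk]" for w in r][:max_len]
        toks += ["[unk]"] * (max_len - len(toks))  # fill short sequences
        cleaned.append(toks)
    return vocab, cleaned
```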
To validate the generalizability of the proposed dataset, we also evaluate another generation model, R2Gen [25], as the main baseline in our experiments. The R2Gen model uses a relational memory (RM) to record the previous generation process and combines memory-driven conditional layer normalization (MCLN) in the Transformer decoder. Ablation experiments for the two models (R2Gen and CMN) are designed on the proposed dataset, respectively.
BASE: An original Transformer with 3 layers, 8 heads and 512 hidden units, with no other extensions or modifications.
BASE+RM: The RM module is connected directly to the Transformer output before the softmax at each time step, but is not integrated into the Transformer decoder.
BASE+R2Gen: The MCLN and RM modules are combined and integrated into the Transformer decoder to facilitate the decoding process.
BASE+MEM: Two separate memory networks for image and text are introduced into the BASE Transformer, without cross-modal information interaction.
BASE+CMN: A shared cross-modal memory network (CMN) is introduced to facilitate information interaction between the two modalities, image and text.
Table 4: The NLG performance metrics on the test set.
NLG METRICS
MODEL BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDER
BASE 0.382 0.309 0.154 0.126 0.368 0.339 0.499
BASE+RM 0.470 0.364 0.219 0.165 0.405 0.371 0.550
BASE+R2Gen 0.568 0.457 0.344 0.248 0.475 0.495 0.834
BASE+MEM 0.487 0.390 0.294 0.184 0.457 0.365 0.637
BASE+CMN 0.589 0.459 0.396 0.270 0.493 0.511 0.951
Figure 8: Comparison of the report length.
focus attention on the corresponding lesion areas. For example, the model
accurately identifies the location of lesions such as ”vitreous hemorrhage”,
”post-vitrectomy” and ”vitreous opacities” in the images.
In Figure 9, red font indicates content that the model failed to predict or predicted incorrectly during report generation. Blue font indicates additional content, where the model gives more prediction information than the ground-truth report. In the first example, the model does not accurately distinguish between "vitreous hemorrhage" and "organized vitreous hemorrhage"; these two conditions are very easily confused, both in the lesion image and in the textual description. In the second and third examples, the CMN model accurately reproduces the ground-truth report. The region of interest in the visualization focuses on the lesion area, which implies that the model is able to align the image and text information well. In the fourth example, the model not only accurately predicts the disease information but also additionally generates the phrase "abnormal intraocular echoes", showing the strong learning ability of the CMN model. In addition to the successful predictions, there are also some sources of interference present in the dataset. For instance, in the fifth example, the "posterior scleral staphyloma" is located at the lower edge of the eye image, which is difficult for the CMN model to recognize. Moreover, due to the visual similarity between the surrounding non-ocular regions and the lesion region, additional interference arises that affects accurate identification and report generation. In general, the visualization results and the corresponding generated reports indicate that the proposed dataset is suitable for medical report generation.
5. Conclusion and Limitations
This paper presents a labeled ophthalmic dataset for the precise analysis and automated exploration of medical images along with their associated reports. The dataset contains 4,858 Chinese reports and the corresponding 4,858 eye ultrasound images, as well as blood flow parameter information measured in clinical practice. To the best of our knowledge, it is the only ophthalmic dataset that contains these three modalities simultaneously. The proposed dataset has also been used to evaluate cross-modal medical report generation models, including the R2Gen and CMN models. The accurately generated reports and their corresponding disease regions of interest are also visualized based on the CMN model. We hope that this dataset can contribute to the development of automated diagnostic learning algorithms for the ophthalmic domain and reduce the stress of ophthalmologists in their clinical work.
We also notice several limitations of this study. First, all the data are collected from only one medical center and may not be generalizable. Second, some rare disease variants are not collected in the dataset. Third, there is a data bias in the distribution of diseases because the data are collected in a real clinical process. In the future, we will continue to expand the volume of the dataset to minimize the data bias as much as possible.
Acknowledgments
This research is funded by the National Natural Science Foundation of
China (62373005, 62273007).
References
[1] M. Li, W. Cai, R. Liu, Y. Weng, X. Zhao, C. Wang, X. Chen, Z. Liu,
C. Pan, M. Li, et al., Ffa-ir: Towards an explainable and reliable medical
report generation benchmark, in: Thirty-fifth Conference on Neural In-
formation Processing Systems Datasets and Benchmarks Track (Round
2), 2021.
[3] J.-H. Huang, C.-H. H. Yang, F. Liu, M. Tian, Y.-C. Liu, T.-W. Wu,
I. Lin, K. Wang, H. Morikawa, H. Chang, et al., Deepopht: medical
report generation for retinal images via deep models and visual expla-
nation, in: Proceedings of the IEEE/CVF winter conference on appli-
cations of computer vision, 2021, pp. 2442–2452.
[4] P. Harzig, Y.-Y. Chen, F. Chen, R. Lienhart, Addressing data bias
problems for chest x-ray image report generation, arXiv preprint
arXiv:1908.02123 (2019).
[5] X. Wang, Y. Peng, L. Lu, Z. Lu, R. M. Summers, Tienet: Text-image
embedding network for common thorax disease classification and report-
ing in chest x-rays, in: Proceedings of the IEEE conference on computer
vision and pattern recognition, 2018, pp. 9049–9058.
[6] Y. Xue, T. Xu, L. Rodney Long, Z. Xue, S. Antani, G. R. Thoma,
X. Huang, Multimodal recurrent model with attention for automated
radiology report generation, in: Medical Image Computing and Com-
puter Assisted Intervention–MICCAI 2018: 21st International Confer-
ence, Granada, Spain, September 16-20, 2018, Proceedings, Part I,
Springer, 2018, pp. 457–466.
[7] C. Yin, B. Qian, J. Wei, X. Li, X. Zhang, Y. Li, Q. Zheng, Automatic
generation of medical imaging diagnostic report with hierarchical recur-
rent neural network, in: 2019 IEEE international conference on data
mining (ICDM), IEEE, 2019, pp. 728–737.
[8] Z. Wang, L. Zhou, L. Wang, X. Li, A self-boosting framework for
automated radiographic report generation, in: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2021, pp. 2433–2442.
[9] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[10] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-
scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[11] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[12] B. Jing, P. Xie, E. Xing, On the automatic generation of medical imaging
reports, in: Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2577–
2586.
[20] T. Kauppi, V. Kalesnykiene, J.-K. Kamarainen, L. Lensu, I. Sorri, A. Raninen, R. Voutilainen, H. Uusitalo, H. Kälviäinen, J. Pietilä, The diaretdb1 diabetic retinopathy database and evaluation protocol, in: BMVC, Vol. 1, Citeseer, 2007, p. 10.
[27] X.-H. Zeng, B.-G. Liu, M. Zhou, Understanding and generating ultra-
sound image description, Journal of Computer Science and Technology
33 (2018) 1086–1100.
[29] Z. Chen, Y. Shen, Y. Song, X. Wan, Cross-modal memory networks for radiology report generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5904–5914.
[30] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for au-
tomatic evaluation of machine translation, in: Proceedings of the 40th
annual meeting of the Association for Computational Linguistics, 2002,
pp. 311–318.
[31] M. Denkowski, A. Lavie, Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems, in: Proceedings of the sixth workshop on statistical machine translation, 2011, pp. 85–91.
[32] R. Vedantam, C. L. Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[33] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in:
Text summarization branches out, 2004, pp. 74–81.