AI for Automated Ocular Measurements
Abstract
Purpose: To develop and validate a deep learning facial landmark detection network to automate the assessment of
periocular anthropometric measurements.
Methods: Patients presenting to the ophthalmology clinic were prospectively enrolled and had their images taken using a
standardised protocol. Facial landmarks were segmented on the images to enable calculation of marginal reflex distance
(MRD) 1 and 2, palpebral fissure height (PFH), inner intercanthal distance (IICD), outer intercanthal distance (OICD),
interpupillary distance (IPD) and horizontal palpebral aperture (HPA). These manual segmentations were used to train
a machine learning algorithm to automatically detect facial landmarks and calculate these measurements. The main
outcomes were the mean absolute error and the intraclass correlation coefficient.
Results: A total of 958 eyes from 479 participants were included. The testing set consisted of 290 eyes from 145
patients. The AI algorithm demonstrated close agreement with human measurements, with mean absolute errors
ranging from 0.22 mm for IPD to 0.88 mm for IICD. The intraclass correlation coefficients indicated excellent reliability (ICC >
0.90) for MRD1, MRD2, PFH, OICD, IICD, and IPD, while HPA showed good reliability (ICC 0.84). The landmark detection
model was highly accurate, achieving a mean error rate of 0.51% and a 0% failure rate at the 0.1 threshold.
Conclusion: The automated facial landmark detection network provided accurate and reliable periocular measurements. This
may help increase the objectivity of periocular measurements in the clinic and may facilitate remote assessment of patients via
telehealth.
Keywords
Periocular, machine learning, marginal reflex distance
We sought to develop a deep learning model for facial landmark detection to automatically detect periocular landmarks and conduct accurate periorbital and eyelid measurements.

Methods
We prospectively enrolled participants presenting to the Royal Adelaide Hospital ophthalmology clinic who were 18 years of age or older and gave written informed consent. Patients with ocular misalignment, pupil abnormalities, or corneal pathology affecting the light reflex were excluded from the study. The institutional human research ethics committee approved the study. Study procedures adhered to the principles of the Declaration of Helsinki.
Image collection
In a well-lit room, seated participants were placed 1 metre from a Nikon D90 camera equipped with a 60 mm lens and positioned on a stand at eye level. Participants were asked to look straight ahead, and the photographs were taken head-on. To enable accurate calibration, a circular green adhesive dot sticker with a diameter of 24 mm was placed on the subject's forehead, allowing for the conversion of pixels to millimetres.
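As an illustration of this calibration step, the sketch below estimates the millimetre-per-pixel scale from the sticker using OpenCV's Hough circle transform. This is a minimal sketch under our own assumptions: the study does not describe how the sticker was localised, and the function name and detection parameters here are hypothetical.

```python
import cv2
import numpy as np

STICKER_DIAMETER_MM = 24.0  # known physical diameter of the forehead sticker


def mm_per_pixel(image_bgr: np.ndarray) -> float:
    """Estimate the mm-per-pixel scale from the circular calibration sticker.

    Illustrative only: assumes the sticker is the most prominent circle in
    the frame; the original study's detection method is not specified.
    """
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    grey = cv2.medianBlur(grey, 5)  # suppress noise before circle detection
    circles = cv2.HoughCircles(
        grey, cv2.HOUGH_GRADIENT, dp=1.2, minDist=200,
        param1=100, param2=40, minRadius=20, maxRadius=200,
    )
    if circles is None:
        raise ValueError("calibration sticker not found")
    _, _, radius_px = circles[0][0]  # strongest detection: (x, y, radius)
    return STICKER_DIAMETER_MM / (2.0 * radius_px)
```

Since the sticker is green, a colour mask could equally be used to isolate it before fitting a circle.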
Image analysis
The images were uploaded onto Labelbox, a popular web-based annotation tool for segmentation and classification systems.1 Ten periocular landmarks, including the pupillary centre, the midline of the upper eyelid margin, the midline of the lower eyelid margin, and the medial and lateral canthi for each eye, were manually annotated (Figure 1). The distances between the periocular landmarks were computed using the open-source OpenCV library. The calculated dimensions included MRD1, the vertical distance from the pupillary centre to the centre of the upper eyelid margin; MRD2, the vertical distance from the pupillary centre to the centre of the lower eyelid margin; palpebral fissure height (PFH), the vertical height between the upper and lower eyelids, derived by summing MRD1 and MRD2; inner intercanthal distance (IICD), the horizontal distance between the medial canthi; outer intercanthal distance (OICD), the horizontal distance between the lateral canthi; interpupillary distance (IPD), the horizontal distance between the centres of the two pupils; and horizontal palpebral aperture (HPA), the horizontal distance between the medial and lateral canthi within one eye (Figure 2).
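The study states only that the distances were computed with OpenCV; the sketch below shows one way the seven dimensions could be derived from the ten annotated landmarks, following the vertical/horizontal definitions above. The landmark names are hypothetical.

```python
def periocular_measurements(lm: dict, mm_per_px: float) -> dict:
    """Derive the periocular dimensions (in mm) from 2-D landmark coordinates.

    `lm` maps hypothetical landmark names to (x, y) pixel positions:
    'pupil', 'upper_lid', 'lower_lid', 'medial_canthus', 'lateral_canthus',
    suffixed with '_r' / '_l' for the right and left eye.
    """
    def vdist(a, b):  # vertical distance, per the definitions above
        return abs(lm[a][1] - lm[b][1]) * mm_per_px

    def hdist(a, b):  # horizontal distance
        return abs(lm[a][0] - lm[b][0]) * mm_per_px

    mrd1 = vdist("pupil_r", "upper_lid_r")  # pupil centre to upper lid margin
    mrd2 = vdist("pupil_r", "lower_lid_r")  # pupil centre to lower lid margin
    return {
        "MRD1": mrd1,
        "MRD2": mrd2,
        "PFH": mrd1 + mrd2,                                       # palpebral fissure height
        "IICD": hdist("medial_canthus_r", "medial_canthus_l"),    # inner intercanthal
        "OICD": hdist("lateral_canthus_r", "lateral_canthus_l"),  # outer intercanthal
        "IPD": hdist("pupil_r", "pupil_l"),                       # interpupillary
        "HPA": hdist("medial_canthus_r", "lateral_canthus_r"),    # right-eye aperture
    }
```

HPA is computed within a single eye, so in practice it would be evaluated once per eye rather than once per face.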
Deep learning model development
To make our framework reproducible, we adopted the widely used backbone network HRNet-v2 as our landmark detection model to predict the designed facial landmarks.2 HRNet-v2 combines the representations from all of the high-to-low resolution parallel streams. Specifically, the input of the landmark detection model is a facial image of size $w \times h$, and its output is a set of likelihood heatmaps $H = \{H_l\}_{l=1}^{L}$ for the $L$ pre-defined facial landmarks. In the design of HRNet-v2, the output heatmaps are a quarter of the input resolution. As described in the Image analysis section, $L$ is equal to ten for our landmark detection model. To optimise the landmark detection model, we employ the Mean Squared Error (MSE) loss between the predicted and ground-truth heatmaps.
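Landmark coordinates are then recovered from the heatmaps by locating each heatmap's peak and scaling back to the input resolution. The sketch below shows this standard decoding step under the quarter-resolution design described above; the paper does not detail its exact decoding, and sub-pixel refinement of the peak is also common.

```python
import numpy as np


def decode_heatmaps(heatmaps: np.ndarray, stride: int = 4) -> np.ndarray:
    """Recover (x, y) pixel coordinates from L likelihood heatmaps.

    `heatmaps` has shape (L, h/stride, w/stride); HRNet-v2 predicts
    heatmaps at 1/4 of the input resolution, so stride=4 maps the peak
    of each heatmap back to input-image coordinates.
    """
    L, hh, hw = heatmaps.shape
    coords = np.zeros((L, 2), dtype=np.float32)
    for l in range(L):
        idx = np.argmax(heatmaps[l])          # flat index of the peak
        y, x = np.unravel_index(idx, (hh, hw))
        coords[l] = (x * stride, y * stride)  # (x, y) in input pixels
    return coords
```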
Figure 2. The periocular landmarks were used to calculate the periocular dimensions.
Figure 3. Bland-Altman plots demonstrating the bias and limits of agreement for periocular measurements. Bias (mean of differences)
is the dashed dark grey line. Upper and lower confidence intervals of bias are depicted by the dotted grey lines and grey shading. Upper
and lower limits of agreement are depicted by the dashed black lines. Their associated confidence intervals are depicted by the dotted
black lines and red shading.
Agreement between human measurements and AI-predicted measurements was assessed using Bland-Altman plots with 95% confidence intervals for the average difference between the human and AI-predicted measurements. The left and right measurements were pooled for the bilateral measures. The mean absolute error between paired observations was calculated as the mean of the absolute values of the paired differences between human and AI-predicted measurements for each metric. The interrater reliability of the measurements was calculated using the intraclass correlation coefficient (ICC). The ICC estimates and 95% confidence intervals were calculated using the R package irr v0.84.1 based on a single-measures, absolute-agreement, two-way mixed-effects model. ICC estimates were interpreted as poor reliability (ICC < 0.5), moderate reliability (0.5 < ICC < 0.75), good reliability (0.75 < ICC < 0.9), and excellent reliability (ICC > 0.90). Statistical analysis was performed using R v4.1.2. A p-value < 0.05 was considered statistically significant.
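The analysis itself was run in R, so the following numpy sketch is illustrative only; it reproduces the reported quantities (mean absolute error, Bland-Altman bias and limits of agreement, and the McGraw and Wong single-measure, absolute-agreement two-way ICC) from paired human and AI measurements.

```python
import numpy as np


def agreement_stats(human: np.ndarray, ai: np.ndarray) -> dict:
    """MAE, Bland-Altman bias/limits of agreement, and single-measure
    absolute-agreement ICC for paired human vs. AI measurements.

    Sketch of the quantities reported in the paper; the study itself
    used the R package irr v0.84.1 under R v4.1.2.
    """
    diff = ai - human
    mae = np.mean(np.abs(diff))
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)  # 95% limits of agreement

    # Two-way ICC via ANOVA mean squares (McGraw & Wong ICC(A,1)).
    ratings = np.column_stack([human, ai])  # n subjects x k raters
    n, k = ratings.shape
    grand = ratings.mean()
    msr = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)
    msc = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)
    sse = np.sum((ratings - grand) ** 2) - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    return {"MAE": mae, "bias": bias,
            "LoA": (bias - half_width, bias + half_width), "ICC": icc}
```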
Results
A total of 958 eyes were included from 479 participants. The mean age of participants was 59 ± 17.9 years, and 257 (54%) were female. Most participants were Caucasian (407, 85%), with the other groups being East Asian (34, 7.1%), South Asian (28, 6%), and African (10, 2.1%). The testing set consisted of 290 eyes from 145 patients. A summary and comparison of the human and AI-predicted periorbital measurements are detailed in Table 1. On average, the AI-predicted measurements were < 1 mm away from the human measurements for all metrics (Table 2). The Bland-Altman plots are shown in Figure 3 with the bias and limits of agreement. The magnitude of the difference between human and AI measurements was smaller for MRD1, MRD2, IPD and PFH (Figure 3). IICD showed a greater difference between measurements and less agreement for larger measurements (Figure 3). The intraclass correlation coefficients demonstrated excellent reliability for all measurements except HPA, which showed good reliability (Table 3). The landmark detection model was highly accurate, achieving a mean error rate of 0.51% and a 0% failure rate at the 0.1 threshold.
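The mean error rate and the failure rate at 0.1 are standard facial landmark evaluation metrics; a sketch is given below, assuming the per-image error is normalised by the interpupillary distance, a common convention that the paper does not explicitly restate.

```python
import numpy as np


def landmark_error_metrics(pred: np.ndarray, gt: np.ndarray,
                           norm: float, threshold: float = 0.1) -> tuple:
    """Normalised mean error and failure flag for one image.

    pred, gt: (L, 2) arrays of predicted / ground-truth landmarks.
    norm: normalisation distance (assumed here to be the interpupillary
    distance; a common convention, not necessarily the paper's choice).
    An image 'fails' when its normalised mean error exceeds `threshold`.
    """
    errors = np.linalg.norm(pred - gt, axis=1) / norm  # per-landmark error
    nme = errors.mean()
    failed = nme > threshold  # contributes to the failure rate at 0.1
    return nme, failed
```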
Discussion
Our automated approach may also facilitate remote patient assessment, making it particularly relevant in the context of increasing telehealth use.
Previous studies have developed methods to calculate MRD1 and MRD2 with less human input. Bodnar, Neimkin3 utilised edge detection techniques, including the Canny edge detection method, to identify facial features and estimate MRD1 and MRD2, and Lou, Yang4 employed a facial landmark detection program in combination with edge detection to recognise the pupillary centre and estimate MRD1 and MRD2. Thomas, Gunasekera5 used OpenFace, an open-source AI-driven facial analysis tool, to measure the vertical palpebral aperture. However, this study did not calculate the MRD1 and MRD2 measurements specifically. Our methodology uses deep learning algorithms to detect periocular landmarks and calculate periocular dimensions, including but not limited to MRD1 and MRD2. Machine learning models provide increased robustness to the variations found in real-world images, such as lighting conditions, angles, facial size, and expressions. Additionally, deep learning techniques automatically learn features from the raw imaging data, enabling accurate localisation of key periocular landmarks. In our study, we adopted the widely used HRNet-v2 as our backbone network to learn high-resolution representations throughout the whole process of facial landmark detection. Traditional computer vision techniques can have difficulty detecting facial landmarks in the presence of large head pose variation and heavy occlusion. By training a convolutional neural network (CNN) on a dataset of images containing labelled facial landmarks, the algorithm can identify facial features in new images and achieve high detection performance in a variety of conditions.
The accurate conversion of pixels to millimetres in images is required, and different studies have adopted different techniques for this purpose. In the Van Brummen study, the AI algorithm's pixel-to-mm conversion relied on a corneal width of 11.71 mm, which differed from the corneal width of 11.77 mm measured by human graders.6 Moreover, measuring the corneal width can pose challenges, particularly in cases of ptosis, where the eyelid's position may cover part of the cornea.7,8 In our study, we affixed a sticker dot with a known diameter on the forehead, providing a fixed reference for the pixel-to-millimetre conversion.