Fast Animal Pose Estimation with LEAP
The need for automated and efficient systems for tracking full animal pose has increased with the complexity of behavioral
data and analyses. Here we introduce LEAP (LEAP estimates animal pose), a deep-learning-based method for predicting the
positions of animal body parts. This framework consists of a graphical interface for labeling of body parts and training the net-
work. LEAP offers fast prediction on new data, and training with as few as 100 frames results in 95% of peak performance. We
validated LEAP using videos of freely behaving fruit flies and tracked 32 distinct points to describe the pose of the head, body,
wings and legs, with an error rate of <3% of body length. We recapitulated reported findings on insect gait dynamics and dem-
onstrated LEAP’s applicability for unsupervised behavioral classification. Finally, we extended the method to more challenging
imaging situations and videos of freely moving mice.
Connecting neural activity with behavior requires methods to parse what an animal does into its constituent components (movements of its body parts), which can then be connected with the electrical activity that generates each action. This is particularly challenging for natural behavior, which is dynamic, complex and noisy. Human classification of behavior is slow and subject to bias1,2, but speed can be increased through automation1, including methods to track and analyze animal centroids and shapes over time3–5, machine learning techniques for identifying user-defined behaviors such as fighting and courting6,7, and software to segment the acoustic signals produced by an animal8–10. However, one may not know a priori which behaviors to analyze; this is particularly true when screening mutant animals or investigating the results of neural perturbations that can alter behavior in unexpected ways.
Developments in the unsupervised clustering of postural dynamics have enabled researchers to overcome many of these challenges by analyzing the raw frames of videos in a reduced dimensional space (for example, generated via principal component analysis (PCA)). By comparing frequency spectra or fitting auto-regressive models from low-dimensional projections11,12, these methods can both define and record the occurrence of tens to hundreds of unique, stereotyped behaviors in animals such as fruit flies and mice. Such methods have been used to uncover structures in behavioral data, thereby facilitating the investigation of temporal sequences13, social interactions14, genetic mutants12,15 and the results of neural perturbation16,17.
A major drawback of the aforementioned techniques is their reliance on PCA to reduce the dimensionality of the image time series. While this produces a more manageable substrate for machine learning, it would be advantageous to directly analyze the position of each actuatable body part, as this is what is ultimately under the control of the motor nervous system. However, measuring all of the body-part positions from raw images is a challenging computer vision problem18. Previous attempts at automated body-part tracking in insects and mammals relied on physically constraining the animal and having it walk on a spherical treadmill19 or linear track20; applying physical markers to the animal19,21; or using specialized equipment such as depth cameras22–24, frustrated total internal reflection imaging19,21,25,26 or multiple cameras27. However, these techniques are all designed to work within a narrow range of experimental conditions and are not easy to adapt to disparate datasets.
To design a general algorithm capable of tracking body parts from many different kinds of experiments, we turned to deep-learning-based methods for pose estimation that have proved successful on images of humans28–34. Breakthroughs in the field have come from the adoption of fully convolutional neural network architectures for efficient training and evaluation of images35,36 and the production of a probabilistic estimate of the position of each tracked body part29,31. However, the problems of pose estimation in the typical human setting and that for laboratory animals are subtly different. Algorithms built for human images can deal with large amounts of heterogeneity in body shape, environment and image quality, but use very large labeled training sets of images37–39. In contrast, behavioral laboratory experiments are often more controlled, but the imaging conditions may be highly specific to the experimental paradigm, and labeled data, not readily available, must be generated for every experimental apparatus and animal type. One recent attempt to apply these techniques to images of behaving animals successfully used transfer learning, whereby networks initially trained for a more general object-classification task are refined by further training with relatively few samples from animal images40.
Our approach combines a GUI-driven workflow for labeling images with a simple network architecture that is easy to train and requires few computations to generate predictions. This method can automatically predict the positions of animal body parts via iterative training of deep convolutional neural networks with as few as ten frames of labeled data for initial prediction and training (training on ten frames results in 74% of estimates within a 2.5-pixel (px) error). After initial de novo training, incrementally refined predictions can be used to guide labeling in new frames, drastically reducing the time
1Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA. 2Department of Molecular Biology, Princeton University, Princeton, NJ, USA. 3Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA. 4Department of Physics, Princeton University, Princeton, NJ, USA. 5Present address: Program in Neuroscience, Harvard University, Cambridge, MA, USA. 6These authors contributed equally: Talmo D. Pereira, Diego E. Aldarondo. *e-mail: mmurthy@[Link]; shaevitz@[Link]
[Fig. 1 panels: a, tracking workflow with (1) egocentric alignment and (2) selection of a subset of images for labeling (0 to 30 min), followed by full training (1 h); b, labeling GUI ("click or drag to fix label"); d, walking bout; e, head grooming bout; blue/red traces, left/right leg tips; scale bars, 0.5 mm and 1 mm.]
Fig. 1 | Body-part tracking via LEAP, a deep learning framework for animal pose estimation. a, Overview of the tracking workflow. b, GUI for labeling
images. Interactive markers denote the default or best estimate for each body part (top left). Users click or drag the markers to the correct location (top
right). Colors indicate labeling progress and denote whether the marker is at the default or estimated position (yellow) or has been updated by the user
(green). Progress indicators mark which frames and body parts have been labeled thus far, while shortcut buttons enable the user to export the labels to
use a trained network to initialize unlabeled body parts with automated estimates. c, Data flow through the LEAP pipeline. For each raw input image (left),
the network outputs a stack of confidence maps (middle). Colors in the confidence maps represent the probability distribution for each individual body
part. Insets overlay individual confidence maps on the image to reveal how confidence density is centered on each body part, with the peak indicated
by a circle. The peak value in each confidence map predicts the coordinate for each body part (right). d, Quantification of walking behavior using leg tip
trajectories. The distance of each of the six leg tips from its own mean position during a walking bout as a function of time (left). Poses at the indicated
time points (right). Blue and red traces correspond to left and right leg tips, respectively. e, Quantitative description of head grooming behavior described
by leg tip trajectories. Position estimates are not confounded by occlusions when the legs pass under the head (right, inset).
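The egocentric alignment step of the workflow in Fig. 1a amounts to translating coordinates to the body centroid and rotating them by the heading angle. A minimal sketch of this idea (our own illustration, not the authors' code; the function name is hypothetical):

```python
import numpy as np

def to_egocentric(points, centroid, heading_rad):
    """Rotate 2D points into an egocentric frame.

    points: (n, 2) array of image coordinates.
    centroid: (2,) body centroid in image coordinates.
    heading_rad: body heading angle in radians.
    Returns the points translated to the centroid and rotated so
    that the heading maps onto the +x axis.
    """
    c, s = np.cos(-heading_rad), np.sin(-heading_rad)
    R = np.array([[c, -s], [s, c]])
    return (np.asarray(points) - centroid) @ R.T

# A point one body-length ahead of the centroid along the heading
# lands on the +x axis after alignment.
pts = to_egocentric([[1.0, 1.0]], np.zeros(2), np.pi / 4)
```

After this transformation, all frames share a common orientation, which is what lets a single network be trained on pose-normalized images.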
required to label sufficient examples (50 frames) to achieve a median accuracy of less than 3 px (86 μm; distance from ground truth). Training on a workstation with a modern graphics processing unit (GPU) is efficient (<1 h) and prediction on new data is fast (up to 185 Hz after alignment). We validated the results of our method using a previously published dataset of high-quality videos of freely behaving adult fruit flies (Drosophila melanogaster11) and recapitulated a number of reported findings on insect gait dynamics as a test of its experimental validity. We then used an unsupervised behavioral classification algorithm to describe stereotyped behaviors
Fig. 2 | LEAP is accurate and requires little training or labeled data. a, Part-wise accuracy distribution after full training. Circles are plotted on
a reference image to indicate the fraction of held-out testing data (168 images from seven held-out flies) for which estimated positions of the
particular body part are closer to the ground truth than the radii. Scale bars indicate image and physical size; 35 px is equivalent to 1 mm at this resolution.
b, Accuracy summary on held-out test set after full training. PDF, probability density function. c, Accuracy as a function of training time. In the ‘fast training’ regime, n = 1,215 labeled frames were used for training. Lines and shaded area (smaller than line width) indicate the mean and s.e.m. for all held-out test images pooled over five runs. Run time estimates based on high-end consumer or enterprise GPUs. d, Accuracy as a function of the number of training examples. Distributions indicate estimation errors in a held-out test set (n = 168 frames) with varying numbers of labeled images used for training, pooled over five ‘fast training’ runs. CDF, cumulative distribution function. Inset: median overall r.m.s. error over these five replicates at each sample size.
in terms of the dynamics of individual body parts. Finally, we showed generalizability by using more challenging imaging conditions and videos from freely moving rodents.

Results
LEAP consists of three phases (see Fig. 1a and Supplementary Results for a full description). The first step is registration and alignment, in which raw video of a behaving animal is preprocessed into egocentric coordinates with an average error of 2.0°. This step increases pose estimation accuracy but can be omitted at the cost of prediction accuracy (Supplementary Fig. 1). The second step is labeling and training, in which the user provides ground truth labels to train the neural network to find body-part positions on a subset of the total images. We used cluster sampling to identify a subset of images that were representative of the complete set of poses found in a dataset (Supplementary Fig. 2). A GUI with draggable body-part markers facilitated the labeling of each training image (Fig. 1b). LEAP uses a 15-layer, fully convolutional neural network that produces a set of probability distributions for the location of each body part in an image (Fig. 1c and Supplementary Fig. 3). This simple network performs equivalently to, or better than, more complicated architectures that have been used in the past (Supplementary Fig. 3b). For the fly, we tracked 32 points that define the Drosophila body joints (Supplementary Fig. 4). Labeling and training occur in an iterative procedure. Labels from the first ten images are used to train the neural network and generate body-part estimates for the rest of the training set images. Using these estimates as the initial guesses in the GUI increases the speed of labeling. This is repeated periodically, and the time to label an image drops from 2 min per frame for the first 10 frames to 6 s per frame for the last 500 frames (Supplementary Fig. 5). The third step is pose estimation, in which the network can be applied to new and unlabeled data (Fig. 1c). With minimal training, LEAP faithfully tracks all the body parts, even during challenging bouts of locomotion and in the presence of occlusion (Fig. 1d,e and Supplementary Videos 1–3). In the following sections, we demonstrate the power of this tool, using a previously published dataset of 59 male fruit flies, each recorded for 1 h at 100 Hz, for a total of >21 million images11. All code and utilities are available at [Link] and as Supplementary Software.

Performance of LEAP: accuracy, speed, and training sample size. We evaluated the accuracy of LEAP after full training with 1,500 labeled images by measuring error as the Euclidean distance between estimated and ground truth coordinates of each body part on a held-out test set of 168 frames (from seven held-out flies) without augmentation. We found that the accuracy level depended on the body part being tracked, with parts that were more often occluded (for example, hind legs) resulting in slightly higher error rates (Fig. 2a). Overall, we found that error distances for all body parts were well below 3 px for the vast majority of tested images (Fig. 2b). This error was achieved rather quickly during training, with as few as 15 epochs (15–20 min of training time) required to achieve approximately 1.97 px (56 μm) overall accuracy, and less than 50 epochs (50–75 min) required for convergence to 1.63 px (47 μm) accuracy with the full training set (Fig. 2c). To measure the ground truth accuracy during the alternating labeling-training phase, we also measured the errors on the full test set as a function of the number of labeled images used for
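The coordinate for each body part is read out as the peak of its confidence map (Fig. 1c). A minimal decoding sketch (our own illustration, assuming the maps arrive as a height × width × parts array; LEAP's actual implementation may differ):

```python
import numpy as np

def peaks_from_confmaps(confmaps):
    """Take the argmax of each confidence-map channel as the
    (x, y) estimate for the corresponding body part.

    confmaps: (height, width, n_parts) array of confidences.
    Returns an (n_parts, 2) integer array of x, y coordinates.
    """
    h, w, n_parts = confmaps.shape
    # Flatten each channel, find its peak, and convert back to 2D.
    flat_idx = confmaps.reshape(-1, n_parts).argmax(axis=0)
    rows, cols = np.unravel_index(flat_idx, (h, w))
    return np.stack([cols, rows], axis=1)  # x = column, y = row

maps = np.zeros((64, 64, 2))
maps[10, 20, 0] = 1.0   # part 0 peak at x=20, y=10
maps[30, 5, 1] = 1.0    # part 1 peak at x=5, y=30
coords = peaks_from_confmaps(maps)
```

Because the maps are probabilistic, the peak value itself can also serve as a per-part confidence score for downstream filtering.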
Fig. 3 | LEAP recapitulates known gait patterning in flies. a, Schematic of swing and stance encoding. Stance is defined by a negative horizontal
velocity in egocentric coordinates. b, Duration of swing and stance as a function of average body speed. These data comprise approximately 7.2 h in which
the fly was moving forward (2.6 million frames). Shaded regions indicate 1 s.d. c, Swing velocity as a function of time from swing onset, and binned by body
speed (n = 1,868,732 swing bouts across all legs). Shaded regions indicate 1 s.d. d, Emission probabilities of numbers of legs in stance for each hidden
state in the HMM (Methods). Hidden state emissions resemble tripod, tetrapod and noncanonical gaits. e, Distributions of velocities for each hidden state.
f,g, Examples of tripod (f) and tetrapod (g) gaits identified by the HMM. RH, right hind leg tip; RM, right mid; RF, right fore; LH, left hind; LM, left mid; LF,
left fore.
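The swing/stance encoding of Fig. 3a can be sketched as follows (an illustrative reconstruction under the caption's definition; any smoothing or thresholds in the actual analysis may differ):

```python
import numpy as np

def swing_stance(leg_x, dt=0.01):
    """Binary swing/stance encoding of one leg tip (as in Fig. 3a).

    leg_x: forward (egocentric x) position of a leg tip over time.
    dt: frame interval in seconds (100 Hz here → 0.01 s).
    A leg moving backward relative to the body (negative horizontal
    velocity) is in stance; forward motion is swing.
    Returns a boolean array, True = stance.
    """
    vx = np.gradient(np.asarray(leg_x, dtype=float), dt)
    return vx < 0

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])  # leg swings forward, then stances
state = swing_stance(x)
```

Counting how many legs are in stance per frame then yields the discrete observable fed to the hidden Markov model described in the text.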
training under the fast training regime (15 epochs). We found that with as few as ten labeled images, the network was able to achieve <2.5 px error (2–3% of body length) in 74% of the test set, while 1,000 labeled images yielded an accuracy of <2.5 px in 87% of the test set (Fig. 2d). When examining the root-mean-square error (r.m.s. error), we found that the performance of the network plateaued at approximately 100 training frames, and labeling of only ten frames corresponded to 65% of peak performance (Fig. 2d, inset). This level of accuracy when training for few epochs with few samples contributes to the drastic reduction in time spent hand-labeling after fast training (Supplementary Fig. 5). For reference, labeling of 100 fly images with the 32-point skeleton took a total of 2 h with the LEAP GUI (with fast training performed after labeling of 10 and 50 frames), training the network took 1 h, and pose estimation on new images occurred at a rate of 185 Hz.

Leg tracking with LEAP recapitulates previously described gait structure. To evaluate the usefulness of our pose estimator for producing experimentally valid measurements, we used it to analyze the gait dynamics of freely moving flies. Previous work on Drosophila gait relied on imaging systems that use a combination of optical touch sensors and high-speed video recording to follow fly legs as they walk25. Such systems cannot track the limbs when they are not in contact with the surface (during swing). Other methods to investigate gait dynamics use a semi-automated approach to label fly limbs18,41 and require manual correction of automatically generated predictions; these semi-automated approaches therefore typically utilize smaller datasets.
We evaluated our network on a dataset of 59 adult male fruit flies11 and extracted the predicted positions of each leg tip in each of 21 million frames. For every frame in which the fly was moving forward (7.2 h, 2.6 million frames total), we encoded each leg as either in swing or in stance, depending on whether the leg was moving forward or backward relative to the fly’s direction of motion (Fig. 3a). Using this encoding, we measured the relationship between
[Fig. 4 panels: a, behavior space density (PDF ×10−3, 0.6–2.6); b, numbered cluster boundaries (1–21); c–h, per-body-part spectrograms (head, neck, abdomen, wings, fore-, mid- and hindlegs, left/right) for clusters labeled 2. Hind grooming (right), 4. Wing grooming (left) and 16. Anterior grooming; frequency axes, 1–32 Hz; color scale, normalized amplitude (×10−3), 0–1.5.]
Fig. 4 | Unsupervised embedding of body position dynamics. a, Density of freely moving fly body-part trajectories, after projection of their spectrograms
into two dimensions via unsupervised nonlinear manifold embedding11. The distribution shown was generated from 21.1 million frames. Regions in the
space with higher density correspond to stereotyped movement patterns, whereas low-density regions form natural divisions between distinct dynamics.
A watershed algorithm was used to separate the peaks in the probability distribution (Methods). b, Cluster boundaries from a with cluster numbers
indicated. c–h, Average spectrograms for the indicated body parts from time points that fall within the dominant grooming clusters; cluster numbers are
indicated in b. Qualitative labels for each cluster based on visual inspection are provided for convenience. Color map corresponds to normalized power for
each body part.
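The per-body-part spectrograms underlying Fig. 4 come from a continuous wavelet transform of each position time series. A simplified complex-Morlet sketch (parameters and normalization are ours, not the paper's):

```python
import numpy as np

def morlet_spectrogram(x, freqs, fs, w=5.0):
    """Wavelet amplitude spectrogram of a body-part trajectory,
    an illustrative stand-in for the CWT features used for
    behavioral embedding.

    x: 1D position time series; freqs: frequencies in Hz;
    fs: sampling rate in Hz; w: width parameter of the wavelet.
    Returns a (len(freqs), len(x)) array of amplitudes.
    """
    x = np.asarray(x, dtype=float)
    out = np.empty((len(freqs), len(x)))
    t = np.arange(-fs // 2, fs // 2 + 1) / fs  # 1-s wavelet support
    for i, f in enumerate(freqs):
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-(t * f / w) ** 2)
        wavelet /= np.abs(wavelet).sum()
        out[i] = np.abs(np.convolve(x, wavelet, mode="same"))
    return out

fs = 100.0
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 8 * t)                 # 8-Hz leg oscillation
spec = morlet_spectrogram(x, [2.0, 8.0, 16.0], fs)
```

Concatenating such spectrograms across all body parts gives the feature vector that is embedded into the 2D behavior space.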
the fly’s speed and the duration of stance and swing (Fig. 3b). Similar to previous work, we found that swing duration was relatively constant across walking speeds, whereas stance duration decreased with walking speed25. Because our methods allowed us to estimate animal pose during both stance and swing (versus only during stance25), we had the opportunity to investigate the dynamics of leg motion during the swing phase. We found that swing velocity increased with body speed, in agreement with previous results25 (Fig. 3c). We also found that fly leg velocities followed a parabolic trajectory parameterized by body speed (Fig. 3c).
We then trained a three-state hidden Markov model (HMM) to capture the different gait modes exhibited by Drosophila41. The emission probabilities from the model of the resulting hidden states were indicative of tripod, tetrapod and noncanonical/wave gaits (Fig. 3d). As expected, we observed tripod gait at high body velocities and tetrapod or noncanonical gaits at intermediate and low velocities, in accordance with previous work25,41,42 (Fig. 3e–g). These results demonstrate that our pose estimator is able to effectively capture the dynamics of known complex behaviors, such as locomotion.

Body dynamics reveal structure in the fly behavioral repertoire. We next used the output of LEAP as the first step in an unsupervised analysis of the fly behavioral repertoire11. We calculated the position of each body part relative to the center of the fly thorax for each point in time and then computed a spectrogram for each of these time series via the continuous wavelet transform (CWT). We then concatenated these spectrograms and embedded the resulting feature vectors for each time point into a two-dimensional (2D) manifold we term a behavior space (Fig. 4a). The feature vectors represent the dynamics of each body part across different time scales, and as has been shown previously, the distribution of embedded time points in this space is concentrated into a number of strong peaks that represent stereotyped behaviors seen across time and in multiple individuals11.
We identified clusters in the behavior space distribution by grouping together regions of high occupancy and stereotypy (Fig. 4b). This distribution was qualitatively similar to what we found previously by using a PCA-based compression of the images (Supplementary Fig. 6). A major advantage to using pose estimation over PCA-based image compression is the ability to describe stereotyped behaviors by the dynamics of each body part. We calculated the average concatenated spectrogram for each cluster and found that specific behaviors were recapitulated in the motion power spectrum for each body part (Fig. 4c–h).
This method can be used to accurately describe grooming, a class of behaviors that is highly represented in our dataset. Posterior grooming behaviors exhibited a distinctly symmetric topology (Fig. 4b–g), revealing both bilateral (Fig. 4e) and unilateral grooming of the wings (Fig. 4c,f) and the rear of the abdomen (Fig. 4d,g). These behaviors involve unilateral, broadband (1–8 Hz) motion of the hind legs on one side of the body and a slower (~1.5 Hz) folding of the wing on the same side of the body. In contrast, anterior grooming is characterized by broadband motions of both front legs with a peak at ~9 Hz, representing the legs rubbing against each other (Fig. 4h).
[Fig. 5 panels: a, locomotion density (PDF ×10−3, 0.6–2.6); b, locomotion clusters 10–15; c, per-body-part spectrograms for clusters labeled 10. Locomotion (slowest), 11. Locomotion (slow), 13. Locomotion (medium-fast) and 14. Locomotion (fast); d, cluster leg dynamics (power vs frequency, 1–32 Hz); e, cluster velocity distributions (forward velocity, 0–40 mm/s; inset, peak frequency 4–12 Hz); color scale, normalized amplitude (×10−3), 0–1.5.]
Fig. 5 | Locomotor clusters in behavior space separate distinct gait modes. a,b, Density (a) and cluster (b) labels of locomotion clusters (from the same
behavioral space shown in Fig. 4a). c, Average spectrograms (similar to Fig. 4c–h) quantifying the dynamics in each cluster. d, Average power spectra
calculated from the leg joint positions for each cluster in c. Colors correspond to the cluster numbers in b. e, The distribution of forward locomotion
velocity as a function of cluster number. Colors correspond to cluster numbers in b. Inset, forward locomotion velocity as a function of peak leg frequency.
f, Gait modes identified by HMM from swing/stance state correspond to distinct clusters.
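The per-cluster peak leg frequencies discussed below can be extracted from the power spectrum of a leg-position trace. A minimal sketch (our own simplification of the spectral analysis in Fig. 5d):

```python
import numpy as np

def peak_frequency(x, fs):
    """Dominant oscillation frequency of a leg-position trace,
    taken as the peak of its power spectrum.

    x: 1D position time series; fs: sampling rate in Hz.
    Returns the frequency (Hz) of maximal power.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # drop the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[power.argmax()]

fs = 100.0
t = np.arange(0, 2, 1 / fs)
f_peak = peak_frequency(np.sin(2 * np.pi * 10 * t), fs)  # → 10.0
```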
We also discovered a number of unique clusters related to locomotion (Fig. 5a,b). The slowest state (cluster 10) involved several frequencies with a broad peak centered at 5.1 Hz (Fig. 5c–e). This can be seen in both the concatenated spectrograms (Fig. 5c) and the power spectrum averaged over all leg positions (Fig. 5d). The fly center-of-mass velocity distribution for this behavior is shown in Fig. 5e. As the fly speeds up (clusters 10–15, Fig. 5e), the peak frequency for the legs increases monotonically to 11.5 Hz (cluster 15).
We next asked whether the tripod and tetrapod gaits we found in our previous analysis (Fig. 3) were represented by distinct regions in the behavior space. We found that tripod gait was used predominantly in the three fastest locomotion behaviors, whereas the tetrapod (and to a lesser extent the noncanonical) gait was used for the three slower locomotion behaviors (Fig. 5f).

LEAP generalizes to images with complex backgrounds or of other animals. To test the robustness and generalizability of our approach under more varied imaging conditions, we evaluated the performance of LEAP on a dataset in which pairs of flies were imaged against a nonuniform and low-contrast background of porous mesh (~4.2 million frames, ~11.7 h of video) (Fig. 6a). We first labeled only the male flies from these images, and, using the same workflow as in the first dataset, we found that the pose estimator was able to reliably recover body-part positions with high accuracy despite poorer illumination and a complex background that was at times indistinguishable from the fly (Fig. 6a and Supplementary Video 4). We then evaluated the performance of the network when the background was masked out14 (Fig. 6b). Even with substantial errors in the masking (for example, leg or wing segmentation artifacts), we found that the accuracy improved slightly when the background pixels were excluded from the images compared with that achieved with the raw images (Fig. 6b and Supplementary Video 4). We also tested whether a single network trained on both male and female images performed better or worse than the network trained on
[Fig. 6 panels: error-distance PDFs (0–10 px) for the mesh-background flies (raw and masked) and for mice, broken down by part type (all, body, legs, wings; paws for mice) with 25th, 50th, 75th and 90th percentile markers; scale bars, 1 mm (flies) and 2 cm (mice).]
Fig. 6 | LEAP generalizes to images with complex backgrounds or of other animals. a, LEAP estimates on a separate dataset of 42 freely moving male
flies, each imaged against a heterogeneous background of mesh and microphones, with side illumination (~4.2 million frames, ~11.7 h). 32 body parts
(Supplementary Fig. 4) were tracked, and 1,530 labeled frames were used for training. Error rates for position estimates were calculated on a held-out test
set of 400 frames (center) and were comparable to those achieved for images with higher signal to noise (compare with Fig. 2b). Part-wise error distances
(right). b, LEAP estimates on masked images from the dataset described in a. Background was subtracted using standard image processing algorithms
(Methods) to reduce the effect of background artifacts. c, LEAP estimates on a dataset of freely moving mice imaged from below (~3 million frames,
~4.8 h). Three points are tracked per leg, in addition to the tip of the snout, the neck, and the base and tip of the tail (left); 1,000 labeled frames were used for training. Accuracy rates on a held-out test set of 242 frames (center).
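One standard way to implement the background masking evaluated in Fig. 6b is a per-pixel median background model (the paper's exact procedure is described in its Methods; this sketch and its threshold are ours):

```python
import numpy as np

def mask_background(frames, thresh=10):
    """Suppress static background with a median background model.

    frames: (n_frames, height, width) uint8 grayscale stack.
    thresh: minimum absolute deviation from the background model
    for a pixel to count as foreground.
    Returns a masked copy with background pixels set to 0.
    """
    frames = np.asarray(frames, dtype=np.int16)
    bg = np.median(frames, axis=0)        # per-pixel background model
    fg = np.abs(frames - bg) > thresh     # foreground = moving pixels
    return np.where(fg, frames, 0).astype(np.uint8)

frames = np.full((5, 4, 4), 100, dtype=np.uint8)
frames[2, 1, 1] = 200          # a transient bright "fly" pixel
masked = mask_background(frames)
```

A median (rather than a mean) model is robust to the animal occasionally occupying a pixel, which is why it tolerates the segmentation artifacts mentioned in the text.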
only male images. We found that the overall performance was similar (Supplementary Fig. 7) but that the network trained on only male images performed slightly better. This discrepancy is due largely to body parts that are used in very different ways by males and females (for example, the wings, which generate song in males but never in females), and can be overcome with additional training. Finally, we tested the applicability of our framework to animals with different morphology by tracking videos of freely behaving mice (Mus musculus) imaged from below in an open arena (Fig. 6c). We observed comparable accuracy in these mice despite considerable occlusion during behaviors such as rearing (Fig. 6c and Supplementary Video 5).

Discussion
Here we present a pipeline (LEAP) that uses a deep neural network to track the body parts of a behaving animal in all frames of a movie via labeling of a small number of images from across the dataset. LEAP does not use a single trained ‘generalist’ network to analyze pose across datasets, as is done in the case of human pose estimation. Rather, we present a framework that uses an active GUI and simple network architecture that can be quickly trained on any new image dataset for which pre-existing labels are not available.
Tracking only the centroid of an animal and its change in position or heading over time is probably an insufficient level of description for determining how the nervous system controls most behaviors. Previous studies have addressed the issue of pose estimation through centroid tracking3, pixel-wise correlations11,12 or specialized apparatus for tracking body parts19,22,25,41,43. For the last, applying markers to an animal can limit natural behavior, and systems that track particular body parts are not in general scalable to all body parts or animals with a very different body plan.
We demonstrate the value of LEAP by showing how it can be applied to the study of locomotor gait dynamics and unsupervised behavioral mapping in Drosophila. Previous studies of gait dynamics have been limited to short stretches of locomotor bouts that were captured with a specialized imaging system25 or to the number of behavioral frames that could be hand-labeled41. We show that LEAP not only recapitulates previous findings on locomotor gait, but also discovers new aspects of the behavior. Body-part tracking provides a solution to a major shortcoming in existing approaches, namely, that researchers have to interpret identified behaviors simply by watching videos11,12. When LEAP is used as the first step in such unsupervised algorithms, each discovered behavior can be interpreted through analysis of the dynamics of each body part.
There are a number of applications for this pipeline beyond those demonstrated here. Because the network learns body positions from a small number of labeled frames, the network can probably be trained to track a wide variety of animal species and classes of behavior. Further, LEAP could be extended to tracking of body parts in three dimensions with the use of either multiple cameras or depth-sensitive devices. This will probably be useful for tracking body parts of head-fixed animals moving on an air-supported treadmill with simultaneous neural recording44,45. Such experiments would be particularly suited to our approach, as the videos from head-fixed animals are inherently recorded in egocentric coordinates. Body-part positions could then be used to decode neural activity, with mapping onto a substrate that approximates muscle coordinates. Additionally, we note that the fast prediction performance of our method might make it compatible with closed-loop experimentation, where joint positions may be computed in real time to control experimental parameters such as stimuli presented to the animal or optogenetic modulation. Lastly, through the addition of a segmentation step for analyzing videos of multiple animals3,14,46, LEAP can

6. Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J. & Perona, P. Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 6, 297–303 (2009).
7. Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S. & Branson, K. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64–67 (2013).
8. Arthur, B. J., Sunayama-Morita, T., Coen, P., Murthy, M. & Stern, D. L. Multi-channel acoustic recording and automated analysis of Drosophila courtship songs. BMC Biol. 11, 11 (2013).
9. Anderson, S. E., Dave, A. S. & Margoliash, D. Template-based automatic recognition of birdsong syllables from continuous recordings. J. Acoust. Soc. Am. 100, 1209–1219 (1996).
10. Tachibana, R. O., Oosugi, N. & Okanoya, K. Semi-automatic classification of birdsong elements using a linear support vector machine. PLoS ONE 9, e92584 (2014).
11. Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11, 20140672 (2014).
12. Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron 88, 1121–1135 (2015).
13. Berman, G. J., Bialek, W. & Shaevitz, J. W. Predictability and hierarchy in Drosophila behavior. Proc. Natl Acad. Sci. USA 113, 11943–11948 (2016).
14. Klibaite, U., Berman, G. J., Cande, J., Stern, D. L. & Shaevitz, J. W. An unsupervised method for quantifying the behavior of paired animals. Phys. Biol. 14, 015006 (2017).
15. Wang, Q. et al. The PSI-U1 snRNP interaction regulates male mating behavior in Drosophila. Proc. Natl Acad. Sci. USA 113, 5269–5274 (2016).
16. Vogelstein, J. T. et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344, 386–392 (2014).
17. Cande, J. et al. Optogenetic dissection of descending behavioral control in Drosophila. eLife 7, e34275 (2018).
potentially estimate poses for multiple interacting individuals. 18. Uhlmann, V., Ramdya, P., Delgado-Gonzalo, R., Benton, R. & Unser, M.
An important aspect of LEAP is the active training framework FlyLimbTracker: an active contour based approach for leg segment tracking
that identifies useful images for labeling and provides a GUI for in unmarked, freely behaving Drosophila. PLoS ONE 12, e0173433 (2017).
iterative labeling, training and evaluation of network performance. 19. Kain, J. et al. Leg-tracking and automated behavioural classification in
Drosophila. Nat. Commun. 4, 1910 (2013).
We highlight that this framework can be used with any network 20. Machado, A. S., Darmohray, D. M., Fayad, J., Marques, H. G. & Carey, M. R.
architecture. Although we use a relatively simple network that trains A quantitative framework for whole-body coordination reveals specific
quickly, other networks, such as those that utilize transfer learning40 deficits in freely walking ataxic mice. eLife 4, e07892 (2015).
or stacked hourglasses with skip connections and intermediate 21. Nashaat, M. A. et al. Pixying behavior: a versatile real-time and post hoc
automated optical tracking method for freely moving and head fixed animals.
supervision47, can also be implemented within the LEAP framework
eNeuro 4, e34275 (2017).
and may increase performance for other kinds of data. 22. Nanjappa, A. et al. Mouse pose estimation from depth images. arXiv Preprint
In summary, we present a method for tracking body-part posi- at [Link] (2015).
tions of freely moving animals with little manual effort and without 23. Nakamura, A. et al. Low-cost three-dimensional gait analysis system for mice
the use of physical markers. We anticipate that this tool will reduce with an infrared depth sensor. Neurosci. Res. 100, 55–62 (2015).
24. Wang, Z., Mirbozorgi, S. A. & Ghovanloo, M. An automated behavior
the technical barriers to addressing a broad range of previously analysis system for freely moving rodents using depth image.
intractable questions in ethology and neuroscience through quanti- Med. Biol. Eng. Comput. 56, 1807–1821 (2018).
tative analysis of the dynamic changes in the full pose of an animal 25. Mendes, C. S., Bartos, I., Akay, T., Márka, S. & Mann, R. S. Quantification
over time. of gait parameters in freely walking wild type and sensory deprived
Drosophila melanogaster. eLife 2, e00231 (2013).
26. Mendes, C. S. et al. Quantification of gait parameters in freely walking rodents.
Online content BMC Biol. 13, 50 (2015).
Any methods, additional references, Nature Research reporting 27. Petrou, G. & Webb, B. Detailed tracking of body and leg movements of a
summaries, source data, statements of data availability and asso- freely walking female cricket during phonotaxis. J. Neurosci. Methods 203,
ciated accession codes are available at [Link] 56–68 (2012).
s41592-018-0234-5. 28. Toshev, A. & Szegedy, C. DeepPose: human pose estimation via deep neural
networks. arXiv Preprint at [Link] (2013).
29. Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a
Received: 25 May 2018; Accepted: 31 October 2018; convolutional network and a graphical model for human pose estimation. In
Published online: 20 December 2018 Advances in Neural Information Processing Systems Vol. 27 (eds Ghahramani,
Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 1799–1807
References (Curran Associates, Inc., Red Hook, 2014).
1. Anderson, D. J. & Perona, P. Toward a science of computational ethology. 30. Carreira, J., Agrawal, P., Fragkiadaki, K. & Malik, J. Human pose estimation
Neuron 84, 18–31 (2014). with iterative error feedback. arXiv Preprint at [Link]
2. Szigeti, B., Stone, T. & Webb, B. Inconsistencies in C. elegans behavioural abs/1507.06550 (2015).
annotation. Preprint at bioRxiv [Link] 31. Wei, S.-E., Ramakrishna, V., Kanade, T. & Sheikh, Y. Convolutional pose
early/2016/07/29/066787 (2016). machines. arXiv Preprint at [Link] (2016).
3. Branson, K., Robie, A. A., Bender, J., Perona, P. & Dickinson, M. H. 32. Bulat, A. & Tzimiropoulos, G. Human pose estimation via convolutional part
High-throughput ethomics in large groups of Drosophila. Nat. Methods 6, heatmap regression. arXiv Preprint at [Link] (2016).
451–457 (2009). 33. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2D pose
4. Swierczek, N. A., Giles, A. C., Rankin, C. H. & Kerr, R. A. High-throughput estimation using part affinity fields. arXiv Preprint at [Link]
behavioral analysis in C. elegans. Nat. Methods 8, 592–598 (2011). abs/1611.08050 (2016).
5. Deng, Y., Coen, P., Sun, M. & Shaevitz, J. W. Efficient multiple object 34. Tome, D., Russell, C. & Agapito, L. Lifting from the deep: convolutional 3D
tracking using mutually repulsive active membranes. PLoS ONE 8, pose estimation from a single image. arXiv Preprint at [Link]
e65769 (2013). abs/1701.00295 (2017).
Methods
Datasets. Details on the dataset of 59 adult male Drosophila can be found in refs. 11,13. Animals were allowed to move freely in a backlit 100-mm-diameter circular arena covered by a 2-mm-tall clear polyethylene terephthalate glycol dome. Videos were captured from the top with a Point Grey Gazelle camera at a resolution of ~35 px per mm at 100 frames per second (FPS) for 1 h for each fly, totaling ~21 million frames for the dataset. To calculate the spatial resolution for these videos, we assumed a mean male fly length of 2.82 mm (ref. 48).
The second fly dataset reported here (Fig. 5) consists of 42 videos of freely moving pairs of virgin male and female fruit flies (NM91 strain) 3–5 d post-eclosion. Only males from these videos were analyzed in this study. Flies moved freely within a 30-mm-diameter circular arena with a 2-mm-tall clear polyethylene terephthalate glycol dome against a white mesh floor covering an array of microphones, resulting in an inhomogeneous image background. Videos were captured from above with a Point Grey Flea3 camera at a resolution of ~25 px per mm at 100 FPS, totaling ~4.2 million frames.
The mouse dataset for Fig. 5 consisted of 29 videos of C57BL/6 strain mice (Mus musculus), 15 weeks (108 d) old. Animals moved freely in a 45.7 × 45.7 cm open field arena with a clear acrylic floor for 10 min each. Videos were captured from below with infrared illumination using a Point Grey Blackfly S camera at a resolution of 1.95 px per mm at 170 FPS, totaling ~3 million frames. Experimental procedures were approved by the Princeton University Institutional Animal Care and Use Committee and conducted in accordance with the National Institutes of Health guidelines for the humane care and use of laboratory animals. Mice used in this study were ordered through The Jackson Laboratory and had at least 1 week of acclimation to the Princeton Neuroscience Institute vivarium before experimental procedures were performed. Mice were kept in group cages with food and water ad libitum under a reversed 12/12-h dark-light cycle (light, 19:30–07:30).

Preprocessing and alignment to generate egocentric images for labeling and training in LEAP. For the main fly dataset (59 males), we used the alignment algorithm from ref. 11. The raw videos consisted of unoriented bounding boxes around the flies from a closed-loop camera tracking system. This technique relies on videos in which the animal remains visible and in [Link]. We then aligned individual frames to a template image of an oriented fly by matching the peak of the radon-transformed fly image to recover the orientation and then computing the cross-correlation to center the fly. The centroid and orientation parameters were used to crop a 192 × 192-px oriented bounding box in each frame. Code for alignment is available in the repository accompanying the original paper: https://[Link]/gordonberman/MotionMapper.
For the second fly dataset (42 males), we adapted a previously published method for tracking and segmentation of videos of courting fruit flies14. We first modeled the mesh background of the images by fitting a normal distribution to each pixel in the frame across time with a constant variance to account for camera shot noise. The posterior was evaluated at each pixel of each frame and then thresholded to segment the foreground pixels. Because of the inhomogeneity of the arena floor mesh, substantial segmentation artifacts were introduced, particularly when translucent or very thin body parts (that is, wings and legs) could not be disambiguated from the dark background mesh holes. The subsequent steps of histogram thresholding, morphological filtering and ellipse fitting were performed as described previously in ref. 14. We developed a simple GUI for proofreading the automated ellipse tracking before extracting 192 × 192-px oriented bounding boxes. We extracted bounding boxes for both animals in each frame and saved both the raw pixels containing the background mesh and the foreground-only images that contain segmentation artifacts. This pipeline was implemented in MATLAB, and the code is available in the code repository accompanying this paper.
For the mouse videos, a separate preprocessing pipeline was developed. Raw videos were processed in three stages: (1) animal tracking, (2) segmentation from background and (3) alignment to the body centroid and tail–body interface. In stage (1), we tracked the mouse's torso centroid by subtracting a background image (the median calculated at each pixel across that video), retrieving pixels with a brightness above a chosen threshold from background (mice were brighter than background) and using morphological opening to eliminate noise and the mouse's appendages. The largest contiguous region reliably captured the mouse's torso (referred to below as the torso mask) and was used to fit an ellipse whose center was used to approximate the center of the animal. In stage (2), a procedure similar to that in stage (1) was employed to retrieve a full body mask. In this stage, a more permissive threshold and a smaller morphological opening radius were used than in stage (1) to capture the mouse's body edges, limbs and tail while still eliminating noise. The pixels outside of this body mask were set to zero. In stage (3), each segmented video frame was translated and rotated such that the frame's center coincided with the center of the animal and the x-axis lay on the line connecting the center and the tail–body attachment point. The tail–body attachment point was defined as the center of the region of overlap between the torso mask and a dilated tail mask. The tail mask was defined as the largest region remaining after subtraction of the torso mask from the full body mask and application of a morphological opening. After applying these masks to segment the raw images, we extracted bounding boxes by using the ellipse center and orientation.
Oriented bounding boxes were cropped to 192 × 192 px for all datasets to ensure consistency in output image size after the repeated pooling and upsampling steps in the neural network. These data were stored in self-describing HDF5 files.

Sampling diverse images for labeling and training in LEAP. To ensure diversity in image and pose space when operating at low sample sizes, we used a multistage cluster sampling technique. First, we sampled n0 images uniformly from each dataset by using a fixed stride over time to minimize correlations between temporally adjacent samples. We then used PCA to reduce their dimensionality, projecting the images down to the first D principal components. After dimensionality reduction, the images were grouped via k-means clustering into k subgroups, from each of which n images were randomly sampled. To minimize the time necessary for the network to generalize to images from all groups, we sorted the dataset such that consecutive samples cycled through the groups. This way, uniform sampling was maintained even at the early phases of user labeling, ensuring that even a network trained on only the first few images would be optimized to estimate body-part positions for a diversity of poses. We used n0 = 500, yielding 29,500 initial samples; D = 50, which is sufficient to explain 80% of the variance in the data (Supplementary Fig. 2); and k = 10 and n = 150 to produce a final dataset of 1,500 frames for labeling and training.

LEAP neural network design and implementation. We based our network architecture on previous designs of neural networks for human pose estimation29,31,47. We adopted a fully convolutional architecture that learns a mapping from raw images to a set of confidence maps. These maps are images that can be interpreted as the 2D probability distribution (that is, a heat map) centered at the spatial coordinates of each body part within the image. We trained the network to output one confidence map per body part, stacked along the channel axis.
Our network consists of 15 layers of repeated convolutions and pooling (Supplementary Fig. 3a). The convolution block consists of ×3 convolution layers (64 filters, 3 × 3 kernel size, 1 × 1 stride, ReLU activation). The full network consists of ×1 convolution block, ×1 max pooling across channels (2 × 2 pooling size, 2 × 2 stride), ×1 convolution block (128 filters), ×1 max pooling (2 × 2 pooling size, 2 × 2 stride), ×1 convolution block (256 filters), ×1 transposed convolution (128 filters, 3 × 3 kernel size, 2 × 2 stride, ReLU activation, Glorot normal initialization), ×2 convolution (128 filters, 3 × 3 kernel size, 1 × 1 stride, ReLU activation), and ×1 transposed convolution (128 filters, 3 × 3 kernel size, 2 × 2 stride, linear activation, Glorot normal initialization).
We based our choice of these hyperparameters on the idea that repeated convolutions and strided max pooling enable the network to learn feature detectors across spatial scales. This allows the network to learn how to estimate confidence maps using global image structure that provides contextual information, which can be used to improve estimates even for occluded parts29,31. Despite the loss of resolution from pooling, the upsampling learned through transposed convolutions is sufficient to recover the spatial precision in the confidence maps. We do not use skip connections, residual modules, stacked networks, regression networks or affinity fields in our architecture, as used in other approaches to human pose estimation29,31–33,47.
For comparison, we also implemented the stacked hourglass network47. We tested both the single hourglass version and the ×2 stacked hourglass with intermediate supervision. The hourglass network consisted of ×4 residual bottleneck modules (64 output filters) with max pooling (2 × 2 pool, 2 × 2 stride), followed by their symmetric upsampling blocks and respective skip connections. The stacked version adds intermediate supervision in the form of a loss term on the output of the first network in addition to the final output.
We implemented all versions of the neural networks in Python via Keras and TensorFlow, popular deep learning packages that allow transparent GPU acceleration and easy portability across operating systems and platforms. All Python code was written for Python v.3.6.4. Required libraries were installed via the pip package manager: numpy (v.1.14.1), h5py (v.2.7.1), TensorFlow-gpu (v.1.6.0), keras (v.2.1.4). We tested our code on machines running either Windows 10 (v.1709) or a RedHat-based Linux distribution (Springdale 7.4), with no additional steps required to port the software other than installing the required libraries. All networks were compared using the same aligned dataset so as to remove complications due to differences in preprocessing.
Code for all network implementations is available in the main repository accompanying this paper ([Link]) and in the Supplementary Software.

LEAP training procedure. Prior to training, we generated an augmented dataset from the user-provided labels and corresponding images. We first doubled the number of images by mirroring the images along the body's axis of symmetry (defined from the preprocessing) and adjusting the body-part coordinates accordingly, including swapping left/right body-part labels (for example, legs). Then, we generated confidence maps for each body part in each image by rendering the 2D Gaussian probability distribution centered at the ground-truth body-part coordinates, μ = (x, y), with fixed covariance, Σ = diag(σ), and a constant σ = 5 px. These were pre-generated and cached to disk to minimize the necessary processing time during training.
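The per-pixel background model used to segment the second fly dataset can be sketched as follows. This is an illustrative reconstruction, not the paper's MATLAB code: the median-based background estimate, the function name, and the values of sigma and p_thresh are our own placeholder choices.

```python
import numpy as np

def foreground_mask(frames, sigma=10.0, p_thresh=1e-4):
    """Per-pixel Gaussian background model with constant variance
    (accounting for camera shot noise). A pixel is labeled foreground
    when its likelihood under the background model drops below p_thresh.

    frames: (T, H, W) array of grayscale video frames."""
    mu = np.median(frames, axis=0)                       # per-pixel background
    norm = sigma * np.sqrt(2.0 * np.pi)
    lik = np.exp(-((frames - mu) ** 2) / (2.0 * sigma ** 2)) / norm
    return lik < p_thresh                                # (T, H, W) boolean mask
```

In practice the thresholded mask would then pass through the histogram thresholding, morphological filtering and ellipse fitting steps described above.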
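Stage (1) of the mouse pipeline (background subtraction, thresholding, morphological opening, largest connected component) can be sketched with scipy.ndimage. The function names and the threshold/opening values are illustrative, and the ellipse fit is reduced here to a simple center-of-mass estimate rather than a full ellipse:

```python
import numpy as np
from scipy import ndimage

def torso_mask(frame, background, thresh=40, open_size=5):
    """Stage (1): subtract the background image, keep pixels brighter than
    the background by more than `thresh` (mice are brighter than background),
    apply a morphological opening to remove noise and appendages, and
    return the largest contiguous region as the torso mask."""
    diff = frame.astype(np.int32) - background.astype(np.int32)
    mask = diff > thresh
    mask = ndimage.binary_opening(mask, structure=np.ones((open_size, open_size)))
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask                                   # nothing segmented
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return labels == (1 + int(np.argmax(sizes)))      # largest component

def center(mask):
    """Center of mass of a binary mask, approximating the animal center."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()
```

Stage (2) would rerun the same steps with a lower threshold and smaller opening radius to obtain the full body mask.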
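The multistage cluster sampling scheme (temporal stride, PCA projection, k-means grouping, per-group sampling with the groups interleaved) can be sketched in plain NumPy. This is a minimal reconstruction under stated assumptions: the SVD-based PCA and the Lloyd-style k-means below stand in for whatever library routines were actually used, and the function name is our own.

```python
import numpy as np

def cluster_sample(images, n0=500, D=50, k=10, n=150, iters=20, seed=0):
    """Multistage cluster sampling over an image stack (N x H x W).
    Returns indices into the strided pool, interleaved across clusters
    so that early labels already cover a diversity of poses."""
    rng = np.random.default_rng(seed)
    # Stage 1: fixed temporal stride to reduce correlation between
    # temporally adjacent frames.
    stride = max(1, len(images) // n0)
    pool = images[::stride][:n0]
    flat = pool.reshape(len(pool), -1).astype(np.float64)
    # Stage 2: project onto the first D principal components (PCA via SVD).
    centered = flat - flat.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    Z = centered @ Vt[: min(D, Vt.shape[0])].T
    # Stage 3: k-means into k subgroups (simple Lloyd iterations).
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        d = ((Z[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for g in range(k):
            if (labels == g).any():
                centers[g] = Z[labels == g].mean(0)
    # Sample up to n frames per cluster, then interleave the clusters so
    # that consecutive samples cycle through the groups.
    picks = [rng.permutation(np.where(labels == g)[0])[:n] for g in range(k)]
    return np.array([p[i] for i in range(max(map(len, picks)))
                     for p in picks if i < len(p)])
```

With n0 = 500 per video (29,500 initial samples across the 59 flies), D = 50 and k = 10, drawing n = 150 per cluster reproduces the 1,500-frame labeling set described above.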
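A quick sanity check on the architecture's spatial arithmetic: the 3 × 3 same-padded convolutions leave the spatial size unchanged, each 2 × 2 pooling halves it, and the two stride-2 transposed convolutions undo the two pooling steps, so a 192 × 192 crop yields 192 × 192 confidence maps. The helper below is our own illustration of that bookkeeping, not LEAP code:

```python
def out_size(size, layers):
    """Track the spatial size of a square input through a layer list.
    'conv'  = 3x3 same-padded convolution (size unchanged),
    'pool'  = max pooling (divides size by the stride),
    'tconv' = transposed convolution (multiplies size by the stride)."""
    for kind, stride in layers:
        if kind == "pool":
            size //= stride
        elif kind == "tconv":
            size *= stride
    return size

# 15 layers: 3 convs, pool, 3 convs, pool, 3 convs, tconv, 2 convs, tconv.
leap_layers = ([("conv", 1)] * 3 + [("pool", 2)]
               + [("conv", 1)] * 3 + [("pool", 2)]
               + [("conv", 1)] * 3 + [("tconv", 2)]
               + [("conv", 1)] * 2 + [("tconv", 2)])
```

The deepest feature maps sit at one quarter of the input resolution (48 × 48 for a 192 × 192 crop), which is where the 256-filter convolution block operates.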
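The training targets described above (one Gaussian confidence map per body part, plus mirroring with left/right label swaps) are easy to express in NumPy. The function names, the isotropic-Gaussian formula, and the assumption of a vertical symmetry axis are our own illustrative choices; the actual axis depends on the egocentric alignment used in preprocessing.

```python
import numpy as np

def confidence_maps(coords, shape=(192, 192), sigma=5.0):
    """Render one 2D Gaussian confidence map per body part.

    coords: (n_parts, 2) ground-truth (x, y) positions.
    Returns an (H, W, n_parts) stack, one channel per body part."""
    yy, xx = np.mgrid[: shape[0], : shape[1]]
    maps = [np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2.0 * sigma ** 2))
            for x, y in coords]
    return np.stack(maps, axis=-1)

def mirror(image, coords, swap_pairs):
    """Mirror an image/label pair across the (here, vertical) symmetry
    axis and swap left/right part labels, for example leg indices."""
    flipped = image[:, ::-1]
    c = coords.copy()
    c[:, 0] = image.shape[1] - 1 - c[:, 0]   # reflect x coordinates
    for i, j in swap_pairs:                  # left <-> right labels
        c[[i, j]] = c[[j, i]]
    return flipped, c
```

Rendering the maps once and caching them, as in the Methods, avoids recomputing the Gaussians every epoch.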
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Data analysis  Custom code was used for all components of the framework and is provided in the accompanying open-source code repository ([Link]). Additional commercial or third-party software used: MathWorks MATLAB R2018a, Python 3.6.4, numpy (1.14.1), h5py (2.7.1), tensorflow-gpu (1.6.0), keras (2.1.4).
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers
upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
Data
Additional fly and mouse datasets used for Fig. 6 can be made available upon reasonable request.
Replication No major experimental findings are reported for this method description. Application results reproduce previously described findings.
Randomization  Randomization was not relevant to this study. We observed natural behavior in a freely moving context with no grouping of the animals.
Blinding Blinding was not relevant to this study. We observed natural behavior in a freely moving context with no grouping of the animals.
Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research
Laboratory animals  Fruit flies (Drosophila melanogaster), all males, 3–8 days old, NM91 or Oregon-R strains. Mice (Mus musculus), all males, 15 weeks (108 days) old, C57BL/6 strain.