Fast Animal Pose Estimation with LEAP

The document introduces LEAP, a deep-learning-based method for fast and accurate animal pose estimation, capable of predicting body part positions with minimal training data. LEAP was validated using videos of fruit flies and mice, demonstrating high accuracy (error rate <3% of body length) and applicability for behavioral classification. The framework includes a user-friendly graphical interface for labeling and training, enabling efficient pose tracking in various experimental conditions.

Articles


Fast animal pose estimation using deep neural networks

Talmo D. Pereira1,6, Diego E. Aldarondo1,5,6, Lindsay Willmore1, Mikhail Kislin1, Samuel S.-H. Wang1,2, Mala Murthy1,2* and Joshua W. Shaevitz1,3,4*

The need for automated and efficient systems for tracking full animal pose has increased with the complexity of behavioral data and analyses. Here we introduce LEAP (LEAP estimates animal pose), a deep-learning-based method for predicting the positions of animal body parts. This framework consists of a graphical interface for labeling of body parts and training the network. LEAP offers fast prediction on new data, and training with as few as 100 frames results in 95% of peak performance. We validated LEAP using videos of freely behaving fruit flies and tracked 32 distinct points to describe the pose of the head, body, wings and legs, with an error rate of <3% of body length. We recapitulated reported findings on insect gait dynamics and demonstrated LEAP's applicability for unsupervised behavioral classification. Finally, we extended the method to more challenging imaging situations and videos of freely moving mice.

Connecting neural activity with behavior requires methods to parse what an animal does into its constituent components (movements of its body parts), which can then be connected with the electrical activity that generates each action. This is particularly challenging for natural behavior, which is dynamic, complex and noisy. Human classification of behavior is slow and subject to bias1,2, but speed can be increased through automation1, including methods to track and analyze animal centroids and shapes over time3–5, machine learning techniques for identifying user-defined behaviors such as fighting and courting6,7, and software to segment the acoustic signals produced by an animal8–10. However, one may not know a priori which behaviors to analyze; this is particularly true when screening mutant animals or investigating the results of neural perturbations that can alter behavior in unexpected ways.

Developments in the unsupervised clustering of postural dynamics have enabled researchers to overcome many of these challenges by analyzing the raw frames of videos in a reduced dimensional space (for example, generated via principal component analysis (PCA)). By comparing frequency spectra or fitting auto-regressive models from low-dimensional projections11,12, these methods can both define and record the occurrence of tens to hundreds of unique, stereotyped behaviors in animals such as fruit flies and mice. Such methods have been used to uncover structures in behavioral data, thereby facilitating the investigation of temporal sequences13, social interactions14, genetic mutants12,15 and the results of neural perturbation16,17.

A major drawback to the aforementioned techniques is their reliance on PCA to reduce the dimensionality of the image time series. While this produces a more manageable substrate for machine learning, it would be advantageous to directly analyze the position of each actuatable body part, as this is what is ultimately under the control of the motor nervous system. However, measuring all of the body-part positions from raw images is a challenging computer vision problem18. Previous attempts at automated body-part tracking in insects and mammals relied on physically constraining the animal and having it walk on a spherical treadmill19 or linear track20; applying physical markers to the animal19,21; or using specialized equipment such as depth cameras22–24, frustrated total internal reflection imaging19,21,25,26 or multiple cameras27. However, these techniques are all designed to work within a narrow range of experimental conditions and are not easy to adapt to disparate datasets.

To design a general algorithm capable of tracking body parts from many different kinds of experiments, we turned to deep-learning-based methods for pose estimation that have proved successful on images of humans28–34. Breakthroughs in the field have come from the adoption of fully convolutional neural network architectures for efficient training and evaluation of images35,36 and the production of a probabilistic estimate of the position of each tracked body part29,31. However, the problems of pose estimation in the typical human setting and that for laboratory animals are subtly different. Algorithms built for human images can deal with large amounts of heterogeneity in body shape, environment and image quality, but use very large labeled training sets of images37–39. In contrast, behavioral laboratory experiments are often more controlled, but the imaging conditions may be highly specific to the experimental paradigm, and labeled data, not readily available, must be generated for every experimental apparatus and animal type. One recent attempt to apply these techniques to images of behaving animals successfully used transfer learning, whereby networks initially trained for a more general object-classification task are refined by further training with relatively few samples from animal images40.

Our approach combines a GUI-driven workflow for labeling images with a simple network architecture that is easy to train and requires few computations to generate predictions. This method can automatically predict the positions of animal body parts via iterative training of deep convolutional neural networks with as few as ten frames of labeled data for initial prediction and training (training on ten frames results in 74% of estimates within a 2.5-pixel (px) error). After initial de novo training, incrementally refined predictions can be used to guide labeling in new frames, drastically reducing the time

1Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA. 2Department of Molecular Biology, Princeton University, Princeton, NJ, USA. 3Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA. 4Department of Physics, Princeton University, Princeton, NJ, USA. 5Present address: Program in Neuroscience, Harvard University, Cambridge, MA, USA. 6These authors contributed equally: Talmo D. Pereira, Diego E. Aldarondo. *e-mail: mmurthy@[Link]; shaevitz@[Link]

Nature Methods | VOL 16 | JANUARY 2019 | 117–125 | [Link]/naturemethods 117



[Fig. 1 graphics. Panel a: tracking workflow: (1) egocentric alignment; (2) training (once per dataset): select and label a subset of frames, train for 15 epochs (15–20 min), estimate on unlabeled frames (<1 min), correct label estimates on 50 frames, then full training (~1 h); (3) estimate positions on new data (~185 FPS). Panel b: labeling GUI with progress indicators, usability options and shortcuts. Panel c: raw image, confidence maps and tracked output. Panels d,e: leg-tip trajectories and poses during walking and head grooming.]

Fig. 1 | Body-part tracking via LEAP, a deep learning framework for animal pose estimation. a, Overview of the tracking workflow. b, GUI for labeling
images. Interactive markers denote the default or best estimate for each body part (top left). Users click or drag the markers to the correct location (top
right). Colors indicate labeling progress and denote whether the marker is at the default or estimated position (yellow) or has been updated by the user
(green). Progress indicators mark which frames and body parts have been labeled thus far, while shortcut buttons enable the user to export the labels to
use a trained network to initialize unlabeled body parts with automated estimates. c, Data flow through the LEAP pipeline. For each raw input image (left),
the network outputs a stack of confidence maps (middle). Colors in the confidence maps represent the probability distribution for each individual body
part. Insets overlay individual confidence maps on the image to reveal how confidence density is centered on each body part, with the peak indicated
by a circle. The peak value in each confidence map predicts the coordinate for each body part (right). d, Quantification of walking behavior using leg tip
trajectories. The distance of each of the six leg tips from its own mean position during a walking bout as a function of time (left). Poses at the indicated
time points (right). Blue and red traces correspond to left and right leg tips, respectively. e, Quantitative description of head grooming behavior described
by leg tip trajectories. Position estimates are not confounded by occlusions when the legs pass under the head (right, inset).
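The decoding step in panel c, where the peak of each confidence map predicts the coordinate of each body part, is a per-channel peak search. A minimal NumPy sketch; the (height, width, n_parts) array layout is an illustrative assumption, not LEAP's actual data format:

```python
import numpy as np

def decode_confidence_maps(maps):
    """Return an (n_parts, 2) array of (x, y) peak coordinates.

    maps: float array of shape (height, width, n_parts), one
    confidence map per tracked body part, as output by the network.
    """
    h, w, n_parts = maps.shape
    # Flatten each map, take the per-part argmax, then unravel to (row, col).
    flat_idx = maps.reshape(-1, n_parts).argmax(axis=0)
    rows, cols = np.unravel_index(flat_idx, (h, w))
    # Return (x, y) = (col, row) pairs, one per body part.
    return np.stack([cols, rows], axis=1)

# Toy example: two 5x5 maps with known peaks.
maps = np.zeros((5, 5, 2))
maps[1, 3, 0] = 1.0  # part 0 peaks at x=3, y=1
maps[4, 2, 1] = 1.0  # part 1 peaks at x=2, y=4
print(decode_confidence_maps(maps))  # part 0 -> (3, 1), part 1 -> (2, 4)
```

In practice a subpixel refinement around the argmax (for example, a local centroid of the confidence density) can sharpen the estimate, but the hard argmax above captures the idea.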

required to label sufficient examples (50 frames) to achieve a median accuracy of less than 3 px (86 μm distance from ground truth). Training on a workstation with a modern graphics processing unit (GPU) is efficient (<1 h) and prediction on new data is fast (up to 185 Hz after alignment). We validated the results of our method using a previously published dataset of high-quality videos of freely behaving adult fruit flies (Drosophila melanogaster11) and recapitulated a number of reported findings on insect gait dynamics as a test of its experimental validity. We then used an unsupervised behavioral classification algorithm to describe stereotyped behaviors

[Fig. 2 graphics. Panel a: part-wise error percentiles (25th, 50th, 75th, 90th) overlaid on a reference fly image; 35 px corresponds to 1 mm. Panel b: error-distance PDF by part type (all, body, legs, wings). Panel c: r.m.s. error versus epochs trained, marking fast training (15–20 min, 1.97 px) and full training (50–75 min, 1.63 px). Panel d: error-distance CDFs for 10 to 1,000 labeled frames, with median r.m.s. error versus number of labeled frames inset.]

Fig. 2 | LEAP is accurate and requires little training or labeled data. a, Part-wise accuracy distribution after full training. Circles are plotted on a reference image to indicate the fraction of held-out testing data (168 images from seven held-out flies) for which estimated positions of the particular body part are closer to the ground truth than the radii. Scale bars indicate image and physical size; 35 px is equivalent to 1 mm at this resolution. b, Accuracy summary on held-out test set after full training. PDF, probability density function. c, Accuracy as a function of training time. In the 'fast training' regime, n = 1,215 labeled frames were used for training. Lines and shaded area (smaller than line width) indicate the mean and s.e.m. for all held-out test images pooled over five runs. Run time estimates are based on high-end consumer or enterprise GPUs. d, Accuracy as a function of the number of training examples. Distributions indicate estimation errors in a held-out test set (n = 168 frames) with varying numbers of labeled images used for training, pooled over five 'fast training' runs. CDF, cumulative distribution function. Inset: median overall r.m.s. error over these five replicates at each sample size.
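The accuracy statistics in this figure come down to Euclidean distances between estimated and ground-truth coordinates, summarized as the fraction of estimates falling within a given pixel radius (the circles in panel a, the CDFs in panel d). A small sketch with synthetic stand-in data:

```python
import numpy as np

def error_distances(pred, truth):
    """Euclidean error per estimate; pred and truth have shape (n, 2)."""
    return np.linalg.norm(pred - truth, axis=1)

def fraction_within(pred, truth, radius_px):
    """Fraction of estimates closer to ground truth than radius_px."""
    return float(np.mean(error_distances(pred, truth) < radius_px))

# Synthetic example: three estimates with errors of 1, 2 and 5 px.
truth = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
pred = truth + np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
print(error_distances(pred, truth))       # [1. 2. 5.]
print(fraction_within(pred, truth, 2.5))  # 2 of 3 estimates within 2.5 px
```

Sweeping `radius_px` over a grid of thresholds and plotting the resulting fractions reproduces the CDF-style accuracy curves shown in panel d.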

in terms of the dynamics of individual body parts. Finally, we showed generalizability by using more challenging imaging conditions and videos from freely moving rodents.

Results
LEAP consists of three phases (see Fig. 1a and Supplementary Results for a full description). The first step is registration and alignment, in which raw video of a behaving animal is preprocessed into egocentric coordinates with an average error of 2.0°. This step increases pose estimation accuracy but can be omitted at the cost of prediction accuracy (Supplementary Fig. 1). The second step is labeling and training, in which the user provides ground truth labels to train the neural network to find body-part positions on a subset of the total images. We used cluster sampling to identify a subset of images that were representative of the complete set of poses found in a dataset (Supplementary Fig. 2). A GUI with draggable body part markers facilitated the labeling of each training image (Fig. 1b). LEAP uses a 15-layer, fully convolutional neural network that produces a set of probability distributions for the location of each body part in an image (Fig. 1c and Supplementary Fig. 3). This simple network performs equivalently to, or better than, more complicated architectures that have been used in the past (Supplementary Fig. 3b). For the fly, we tracked 32 points that define the Drosophila body joints (Supplementary Fig. 4). Labeling and training occur in an iterative procedure. Labels from the first ten images are used to train the neural network and generate body-part estimates for the rest of the training set images. Using these estimates as the initial guesses in the GUI increases the speed of labeling. This is repeated periodically, and the time to label an image drops from 2 min per frame for the first 10 frames to 6 s per frame for the last 500 frames (Supplementary Fig. 5). The third step is pose estimation, in which the network can be applied to new and unlabeled data (Fig. 1c). With minimal training, LEAP faithfully tracks all the body parts, even during challenging bouts of locomotion and in the presence of occlusion (Fig. 1d,e and Supplementary Videos 1–3). In the following sections, we demonstrate the power of this tool, using a previously published dataset of 59 male fruit flies, each recorded for 1 h at 100 Hz, for a total of >21 million images11. All code and utilities are available at [Link] and as Supplementary Software.

Performance of LEAP: accuracy, speed, and training sample size. We evaluated the accuracy of LEAP after full training with 1,500 labeled images by measuring error as the Euclidean distance between estimated and ground truth coordinates of each body part on a held-out test set of 168 frames (from seven held-out flies) without augmentation. We found that the accuracy level depended on the body part being tracked, with parts that were more often occluded (for example, hind legs) resulting in slightly higher error rates (Fig. 2a). Overall, we found that error distances for all body parts were well below 3 px for the vast majority of tested images (Fig. 2b). This error was achieved rather quickly during training, with as few as 15 epochs (15–20 min of training time) required to achieve approximately 1.97 px (56 μm) overall accuracy, and less than 50 epochs (50–75 min) required for convergence to 1.63 px (47 μm) accuracy with the full training set (Fig. 2c). To measure the ground truth accuracy during the alternating labeling-training phase, we also measured the errors on the full test set as a function of the number of labeled images used for


[Fig. 3 graphics. Panel a: swing/stance schematic. Panel b: swing and stance durations versus average body speed. Panel c: swing velocity versus time from swing onset, binned by body speed (5–45 mm/s bins). Panel d: emission probabilities of the number of legs in stance for each hidden state (tripod, tetrapod, noncanonical). Panel e: forward-velocity distributions per hidden state. Panels f,g: example tripod and tetrapod bouts showing forward velocity and per-leg swing/stance traces.]

Fig. 3 | LEAP recapitulates known gait patterning in flies. a, Schematic of swing and stance encoding. Stance is defined by a negative horizontal velocity in egocentric coordinates. b, Duration of swing and stance as a function of average body speed. These data comprise approximately 7.2 h in which the fly was moving forward (2.6 million frames). Shaded regions indicate 1 s.d. c, Swing velocity as a function of time from swing onset, binned by body speed (n = 1,868,732 swing bouts across all legs). Shaded regions indicate 1 s.d. d, Emission probabilities of numbers of legs in stance for each hidden state in the HMM (Methods). Hidden state emissions resemble tripod, tetrapod and noncanonical gaits. e, Distributions of velocities for each hidden state. f,g, Examples of tripod (f) and tetrapod (g) gaits identified by the HMM. RH, right hind leg tip; RM, right mid; RF, right fore; LH, left hind; LM, left mid; LF, left fore.
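The swing/stance encoding in panel a reduces to a sign test on each leg tip's velocity along the body axis: in egocentric coordinates a planted leg moves backward while the body advances. A minimal sketch, assuming leg-tip forward positions already in egocentric coordinates (smoothing and thresholding details are simplified relative to the paper's Methods):

```python
import numpy as np

def swing_stance(leg_x):
    """Encode each frame of each leg as swing (True) or stance (False).

    leg_x: (n_frames, n_legs) egocentric forward positions of leg tips.
    In egocentric coordinates a leg planted on the ground moves
    backward (negative velocity) while the body moves forward.
    """
    vel = np.gradient(leg_x, axis=0)
    return vel > 0  # True = swing (moving forward), False = stance

# One leg: moving forward for 3 frames (swing), then backward (stance).
x = np.array([[0.0], [1.0], [2.0], [1.5], [1.0]])
print(swing_stance(x).ravel())  # swing, swing, swing, stance, stance
```

With the six leg tips stacked as columns, this yields the per-frame, per-leg binary code used downstream (for example, as HMM observations via the number of legs in stance).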

training under the fast training regime (15 epochs). We found that with as few as ten labeled images, the network was able to achieve <2.5 px error (2–3% of body length) in 74% of the test set, while 1,000 labeled images yielded an accuracy of <2.5 px in 87% of the test set (Fig. 2d). When examining the root-mean-square error (r.m.s. error), we found that the performance of the network plateaued at approximately 100 training frames, and labeling of only ten frames corresponded to 65% of peak performance (Fig. 2d, inset). This level of accuracy when training for few epochs with few samples contributes to the drastic reduction in time spent hand-labeling after fast training (Supplementary Fig. 5). For reference, labeling of 100 fly images with the 32-point skeleton took a total of 2 h with the LEAP GUI (with fast training performed after labeling of 10 and 50 frames), training the network took 1 h, and pose estimation on new images occurred at a rate of 185 Hz.

Leg tracking with LEAP recapitulates previously described gait structure. To evaluate the usefulness of our pose estimator for producing experimentally valid measurements, we used it to analyze the gait dynamics of freely moving flies. Previous work on Drosophila gait relied on imaging systems that use a combination of optical touch sensors and high-speed video recording to follow fly legs as they walk25. Such systems cannot track the limbs when they are not in contact with the surface (during swing). Other methods to investigate gait dynamics use a semi-automated approach to label fly limbs18,41 and require manual correction of automatically generated predictions; these semi-automated approaches therefore typically utilize smaller datasets.

We evaluated our network on a dataset of 59 adult male fruit flies11 and extracted the predicted positions of each leg tip in each of 21 million frames. For every frame in which the fly was moving forward (7.2 h, 2.6 million frames total), we encoded each leg as either in swing or in stance, depending on whether the leg was moving forward or backward relative to the fly's direction of motion (Fig. 3a). Using this encoding, we measured the relationship between

[Fig. 4 graphics. Panel a: density of the 2D body movement space. Panel b: numbered behavioral cluster boundaries (clusters 1–21). Panels c–h: average per-body-part spectrograms (1–32 Hz) for grooming clusters: wing grooming (right, left), hind grooming (bilateral, right, left) and anterior grooming.]

Fig. 4 | Unsupervised embedding of body position dynamics. a, Density of freely moving fly body-part trajectories, after projection of their spectrograms
into two dimensions via unsupervised nonlinear manifold embedding11. The distribution shown was generated from 21.1 million frames. Regions in the
space with higher density correspond to stereotyped movement patterns, whereas low-density regions form natural divisions between distinct dynamics.
A watershed algorithm was used to separate the peaks in the probability distribution (Methods). b, Cluster boundaries from a with cluster numbers
indicated. c–h, Average spectrograms for the indicated body parts from time points that fall within the dominant grooming clusters; cluster numbers are
indicated in b. Qualitative labels for each cluster based on visual inspection are provided for convenience. Color map corresponds to normalized power for
each body part.
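The spectrograms averaged in panels c–h come from a continuous wavelet transform of each body-part time series. A minimal Morlet CWT sketch in NumPy, simplified relative to the Methods (frequency grid, wavelet parameter `w` and normalization are illustrative choices, not the paper's):

```python
import numpy as np

def morlet_cwt(x, freqs, fs, w=5.0):
    """Morlet continuous wavelet transform magnitude.

    x: 1D body-part position time series (egocentric coordinates).
    Returns a (len(freqs), len(x)) time-frequency amplitude array,
    a simplified stand-in for the per-part spectrograms in the paper.
    """
    out = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        s = w / (2 * np.pi * f)                # Gaussian envelope scale
        tt = np.arange(-4 * s, 4 * s, 1 / fs)  # wavelet support
        wav = np.exp(2j * np.pi * f * tt) * np.exp(-tt**2 / (2 * s**2))
        wav /= np.abs(wav).sum()               # amplitude normalization
        out[i] = np.abs(np.convolve(x, wav, mode="same"))
    return out

# A 10 Hz oscillation sampled at 100 Hz lights up the 10 Hz row.
fs, freqs = 100.0, np.array([2.0, 5.0, 10.0, 20.0])
t = np.arange(500) / fs
spec = morlet_cwt(np.sin(2 * np.pi * 10 * t), freqs, fs)
print(freqs[spec[:, 250].argmax()])  # 10.0
```

Concatenating such spectrograms across all tracked body parts gives the per-frame feature vector that is then embedded into the 2D behavior space.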

the fly's speed and the duration of stance and swing (Fig. 3b). Similar to previous work, we found that swing duration was relatively constant across walking speeds, whereas stance duration decreased with walking speed25. Because our methods allowed us to estimate animal pose during both stance and swing (versus only during stance25), we had the opportunity to investigate the dynamics of leg motion during the swing phase. We found that swing velocity increased with body speed, in agreement with previous results25 (Fig. 3c). We also found that fly leg velocities followed a parabolic trajectory parameterized by body speed (Fig. 3c).

We then trained a three-state hidden Markov model (HMM) to capture the different gait modes exhibited by Drosophila41. The emission probabilities from the model of the resulting hidden states were indicative of tripod, tetrapod and noncanonical/wave gaits (Fig. 3d). As expected, we observed tripod gait at high body velocities and tetrapod or noncanonical gaits at intermediate and low velocities, in accordance with previous work25,41,42 (Fig. 3e–g). These results demonstrate that our pose estimator is able to effectively capture the dynamics of known complex behaviors, such as locomotion.

Body dynamics reveal structure in the fly behavioral repertoire. We next used the output of LEAP as the first step in an unsupervised analysis of the fly behavioral repertoire11. We calculated the position of each body part relative to the center of the fly thorax for each point in time and then computed a spectrogram for each of these time series via the continuous wavelet transform (CWT). We then concatenated these spectrograms and embedded the resulting feature vectors for each time point into a two-dimensional (2D) manifold we term a behavior space (Fig. 4a). The feature vectors represent the dynamics of each body part across different time scales, and as has been shown previously, the distribution of embedded time points in this space is concentrated into a number of strong peaks that represent stereotyped behaviors seen across time and in multiple individuals11.

We identified clusters in the behavior space distribution by grouping together regions of high occupancy and stereotypy (Fig. 4b). This distribution was qualitatively similar to what we found previously by using a PCA-based compression of the images (Supplementary Fig. 6). A major advantage to using pose estimation over PCA-based image compression is the ability to describe stereotyped behaviors by the dynamics of each body part. We calculated the average concatenated spectrogram for each cluster and found that specific behaviors were recapitulated in the motion power spectrum for each body part (Fig. 4c–h).

This method can be used to accurately describe grooming, a class of behaviors that is highly represented in our dataset. Posterior grooming behaviors exhibited a distinctly symmetric topology (Fig. 4b–g), revealing both bilateral (Fig. 4e) and unilateral grooming of the wings (Fig. 4c,f) and the rear of the abdomen (Fig. 4d,g). These behaviors involve unilateral, broadband (1–8 Hz) motion of the hind legs on one side of the body and a slower (~1.5 Hz) folding of the wing on the same side of the body. In contrast, anterior grooming is characterized by broadband motions of both front legs with a peak at ~9 Hz, representing the legs rubbing against each other (Fig. 4h).


[Fig. 5 graphics. Panels a,b: density and cluster labels (clusters 10–15) of the locomotion region of the behavior space. Panel c: average per-body-part spectrograms for locomotion clusters from slowest to fastest. Panel d: average leg power spectra per cluster. Panel e: forward-velocity distributions per cluster, with mean velocity versus peak leg frequency inset. Panel f: distribution of tripod, tetrapod and noncanonical gait states across the behavior space.]

Fig. 5 | Locomotor clusters in behavior space separate distinct gait modes. a,b, Density (a) and cluster (b) labels of locomotion clusters (from the same
behavioral space shown in Fig. 4a). c, Average spectrograms (similar to Fig. 4c–h) quantifying the dynamics in each cluster. d, Average power spectra
calculated from the leg joint positions for each cluster in c. Colors correspond to the cluster numbers in b. e, The distribution of forward locomotion
velocity as a function of cluster number. Colors correspond to cluster numbers in b. Inset, forward locomotion velocity as a function of peak leg frequency.
f, Gait modes identified by HMM from swing/stance state correspond to distinct clusters.
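Assigning each frame to a gait state from the swing/stance observations, as in panel f, is a standard HMM decoding problem solved by the Viterbi algorithm. A minimal log-space sketch; the two-state toy parameters below are made up for illustration, whereas the paper's three-state model and its probabilities are fit to data (Methods):

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely hidden-state path for a categorical HMM.

    obs:    sequence of observation indices (e.g. number of legs in stance)
    log_pi: (n_states,) log initial probabilities
    log_A:  (n_states, n_states) log transition matrix
    log_B:  (n_states, n_obs) log emission matrix
    """
    n, n_states = len(obs), log_pi.shape[0]
    delta = np.zeros((n, n_states))        # best log score ending in each state
    back = np.zeros((n, n_states), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_A   # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.zeros(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):           # trace back the best path
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy model: state 0 mostly emits observation index 1, state 1 mostly
# emits index 2 (stand-ins for different leg-in-stance counts).
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.1, 0.8, 0.1],
                [0.1, 0.1, 0.8]])
print(viterbi([1, 1, 2, 2], log_pi, log_A, log_B))  # [0 0 1 1]
```

The sticky diagonal of the transition matrix is what smooths frame-by-frame observation noise into sustained gait bouts.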

We also discovered a number of unique clusters related to locomotion (Fig. 5a,b). The slowest state (cluster 10) involved several frequencies with a broad peak centered at 5.1 Hz (Fig. 5c–e). This can be seen in both the concatenated spectrograms (Fig. 5c) and the power spectrum averaged over all leg positions (Fig. 5d). The fly center-of-mass velocity distribution for this behavior is shown in Fig. 5e. As the fly speeds up (clusters 10–15, Fig. 5e), the peak frequency for the legs increases monotonically to 11.5 Hz (cluster 15).

We next asked whether the tripod and tetrapod gaits we found in our previous analysis (Fig. 3) were represented by distinct regions in the behavior space. We found that tripod gait was used predominantly in the three fastest locomotion behaviors, whereas the tetrapod (and to a lesser extent the noncanonical) gait was used for the three slower locomotion behaviors (Fig. 5f).

LEAP generalizes to images with complex backgrounds or of other animals. To test the robustness and generalizability of our approach under more varied imaging conditions, we evaluated the performance of LEAP on a dataset in which pairs of flies were imaged against a nonuniform and low-contrast background of porous mesh (~4.2 million frames, ~11.7 h of video) (Fig. 6a). We first labeled only the male flies from these images, and, using the same workflow as in the first dataset, we found that the pose estimator was able to reliably recover body-part positions with high accuracy despite poorer illumination and a complex background that was at times indistinguishable from the fly (Fig. 6a and Supplementary Video 4). We then evaluated the performance of the network when the background was masked out14 (Fig. 6b). Even with substantial errors in the masking (for example, leg or wing segmentation artifacts), we found that the accuracy improved slightly when the background pixels were excluded from the images compared with that achieved with the raw images (Fig. 6b and Supplementary Video 4). We also tested whether a single network trained on both male and female images performed better or worse than the network trained on

[Fig. 6 graphics. Panels a–c: part-wise error percentiles (25th, 50th, 75th, 90th) overlaid on reference images, with error-distance PDFs by part type, for flies on a mesh background (a), the same images with background masked (b) and mice imaged from below (c).]

Fig. 6 | LEAP generalizes to images with complex backgrounds or of other animals. a, LEAP estimates on a separate dataset of 42 freely moving male
flies, each imaged against a heterogeneous background of mesh and microphones, with side illumination (~4.2 million frames, ~11.7 h). 32 body parts
(Supplementary Fig. 4) were tracked, and 1,530 labeled frames were used for training. Error rates for position estimates were calculated on a held-out test
set of 400 frames (center) and were comparable to those achieved for images with higher signal to noise (compare with Fig. 2b). Part-wise error distances
(right). b, LEAP estimates on masked images from the dataset described in a. Background was subtracted using standard image processing algorithms
(Methods) to reduce the effect of background artifacts. c, LEAP estimates on a dataset of freely moving mice imaged from below (~3 million frames,
~4.8 h). Three points are tracked per leg, in addition to the tip of the snout, neck, and base and tip of the tail (left)—1,000 labeled frames were used for
training. Accuracy on a held-out test set of 242 frames (center).
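The error metric summarized in these panels is the Euclidean distance between estimated and ground-truth body-part coordinates, reported as percentiles over a held-out test set. A minimal numpy sketch of that computation (array shapes are illustrative, not taken from the paper's code):

```python
import numpy as np

def error_distances(pred, gt):
    """Euclidean error per body part.

    pred, gt: arrays of shape (n_frames, n_parts, 2) holding (x, y)
    coordinates in pixels. Returns an (n_frames, n_parts) array.
    """
    return np.linalg.norm(pred - gt, axis=-1)

def summarize(dists, percentiles=(25, 50, 75, 90)):
    """Percentiles of error distance pooled over frames and parts."""
    return {p: float(np.percentile(dists, p)) for p in percentiles}

# toy example: 4 frames, 3 parts, every estimate off by (3, 4) px
gt = np.zeros((4, 3, 2))
pred = gt + np.array([3.0, 4.0])
dists = error_distances(pred, gt)
print(summarize(dists))  # all percentiles equal 5.0
```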

only male images. We found that the overall performance was similar (Supplementary Fig. 7) but that the network trained on only male images performed slightly better. This discrepancy is due largely to body parts that are used in very different ways by males and females (for example, the wings, which generate song in males but never in females), and can be overcome with additional training. Finally, we tested the applicability of our framework to animals with different morphology by tracking videos of freely behaving mice (Mus musculus) imaged from below in an open arena (Fig. 6c). We observed comparable accuracy in these mice despite considerable occlusion during behaviors such as rearing (Fig. 6c and Supplementary Video 5).

Discussion
Here we present a pipeline (LEAP) that uses a deep neural network to track the body parts of a behaving animal in all frames of a movie via labeling of a small number of images from across the dataset. LEAP does not use a single trained ‘generalist’ network to analyze pose across datasets, as is done in the case of human pose estimation. Rather, we present a framework that uses an active GUI and simple network architecture that can be quickly trained on any new image dataset for which pre-existing labels are not available.

Tracking only the centroid of an animal and its change in position or heading over time is probably an insufficient level of description for determining how the nervous system controls most behaviors. Previous studies have addressed the issue of pose estimation through centroid tracking3, pixel-wise correlations11,12 or specialized apparatus for tracking body parts19,22,25,41,43. For the last, applying markers to an animal can limit natural behavior, and systems that track particular body parts are not in general scalable to all body parts or animals with a very different body plan.

We demonstrate the value of LEAP by showing how it can be applied to the study of locomotor gait dynamics and unsupervised behavioral mapping in Drosophila. Previous studies of gait dynamics have been limited to short stretches of locomotor bouts that were captured with a specialized imaging system25 or to the number of




behavioral frames that could be hand-labeled41. We show that LEAP not only recapitulates previous findings on locomotor gait, but also discovers new aspects of the behavior. Body-part tracking provides a solution to a major shortcoming in existing approaches, namely, that researchers have to interpret identified behaviors simply by watching videos11,12. When LEAP is used as the first step in such unsupervised algorithms, each discovered behavior can be interpreted through analysis of the dynamics of each body part.

There are a number of applications for this pipeline beyond those demonstrated here. Because the network learns body positions from a small number of labeled frames, the network can probably be trained to track a wide variety of animal species and classes of behavior. Further, LEAP could be extended to tracking of body parts in three dimensions with the use of either multiple cameras or depth-sensitive devices. This will probably be useful for tracking body parts of head-fixed animals moving on an air-supported treadmill with simultaneous neural recording44,45. Such experiments would be particularly suited to our approach, as the videos from head-fixed animals are inherently recorded in egocentric coordinates. Body-part positions could then be used to decode neural activity, with mapping onto a substrate that approximates muscle coordinates. Additionally, we note that the fast prediction performance of our method might make it compatible with closed-loop experimentation, where joint positions may be computed in real time to control experimental parameters such as stimuli presented to the animal or optogenetic modulation. Lastly, through the addition of a segmentation step for analyzing videos of multiple animals3,14,46, LEAP can potentially estimate poses for multiple interacting individuals.

An important aspect of LEAP is the active training framework that identifies useful images for labeling and provides a GUI for iterative labeling, training and evaluation of network performance. We highlight that this framework can be used with any network architecture. Although we use a relatively simple network that trains quickly, other networks, such as those that utilize transfer learning40 or stacked hourglasses with skip connections and intermediate supervision47, can also be implemented within the LEAP framework and may increase performance for other kinds of data.

In summary, we present a method for tracking body-part positions of freely moving animals with little manual effort and without the use of physical markers. We anticipate that this tool will reduce the technical barriers to addressing a broad range of previously intractable questions in ethology and neuroscience through quantitative analysis of the dynamic changes in the full pose of an animal over time.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at [Link]s41592-018-0234-5.

Received: 25 May 2018; Accepted: 31 October 2018; Published online: 20 December 2018

References
1. Anderson, D. J. & Perona, P. Toward a science of computational ethology. Neuron 84, 18–31 (2014).
2. Szigeti, B., Stone, T. & Webb, B. Inconsistencies in C. elegans behavioural annotation. Preprint at bioRxiv [Link]early/2016/07/29/066787 (2016).
3. Branson, K., Robie, A. A., Bender, J., Perona, P. & Dickinson, M. H. High-throughput ethomics in large groups of Drosophila. Nat. Methods 6, 451–457 (2009).
4. Swierczek, N. A., Giles, A. C., Rankin, C. H. & Kerr, R. A. High-throughput behavioral analysis in C. elegans. Nat. Methods 8, 592–598 (2011).
5. Deng, Y., Coen, P., Sun, M. & Shaevitz, J. W. Efficient multiple object tracking using mutually repulsive active membranes. PLoS ONE 8, e65769 (2013).
6. Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J. & Perona, P. Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 6, 297–303 (2009).
7. Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S. & Branson, K. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64–67 (2013).
8. Arthur, B. J., Sunayama-Morita, T., Coen, P., Murthy, M. & Stern, D. L. Multi-channel acoustic recording and automated analysis of Drosophila courtship songs. BMC Biol. 11, 11 (2013).
9. Anderson, S. E., Dave, A. S. & Margoliash, D. Template-based automatic recognition of birdsong syllables from continuous recordings. J. Acoust. Soc. Am. 100, 1209–1219 (1996).
10. Tachibana, R. O., Oosugi, N. & Okanoya, K. Semi-automatic classification of birdsong elements using a linear support vector machine. PLoS ONE 9, e92584 (2014).
11. Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11, 20140672 (2014).
12. Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron 88, 1121–1135 (2015).
13. Berman, G. J., Bialek, W. & Shaevitz, J. W. Predictability and hierarchy in Drosophila behavior. Proc. Natl Acad. Sci. USA 113, 11943–11948 (2016).
14. Klibaite, U., Berman, G. J., Cande, J., Stern, D. L. & Shaevitz, J. W. An unsupervised method for quantifying the behavior of paired animals. Phys. Biol. 14, 015006 (2017).
15. Wang, Q. et al. The PSI-U1 snRNP interaction regulates male mating behavior in Drosophila. Proc. Natl Acad. Sci. USA 113, 5269–5274 (2016).
16. Vogelstein, J. T. et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344, 386–392 (2014).
17. Cande, J. et al. Optogenetic dissection of descending behavioral control in Drosophila. eLife 7, e34275 (2018).
18. Uhlmann, V., Ramdya, P., Delgado-Gonzalo, R., Benton, R. & Unser, M. FlyLimbTracker: an active contour based approach for leg segment tracking in unmarked, freely behaving Drosophila. PLoS ONE 12, e0173433 (2017).
19. Kain, J. et al. Leg-tracking and automated behavioural classification in Drosophila. Nat. Commun. 4, 1910 (2013).
20. Machado, A. S., Darmohray, D. M., Fayad, J., Marques, H. G. & Carey, M. R. A quantitative framework for whole-body coordination reveals specific deficits in freely walking ataxic mice. eLife 4, e07892 (2015).
21. Nashaat, M. A. et al. Pixying behavior: a versatile real-time and post hoc automated optical tracking method for freely moving and head fixed animals. eNeuro 4, e34275 (2017).
22. Nanjappa, A. et al. Mouse pose estimation from depth images. arXiv Preprint at [Link] (2015).
23. Nakamura, A. et al. Low-cost three-dimensional gait analysis system for mice with an infrared depth sensor. Neurosci. Res. 100, 55–62 (2015).
24. Wang, Z., Mirbozorgi, S. A. & Ghovanloo, M. An automated behavior analysis system for freely moving rodents using depth image. Med. Biol. Eng. Comput. 56, 1807–1821 (2018).
25. Mendes, C. S., Bartos, I., Akay, T., Márka, S. & Mann, R. S. Quantification of gait parameters in freely walking wild type and sensory deprived Drosophila melanogaster. eLife 2, e00231 (2013).
26. Mendes, C. S. et al. Quantification of gait parameters in freely walking rodents. BMC Biol. 13, 50 (2015).
27. Petrou, G. & Webb, B. Detailed tracking of body and leg movements of a freely walking female cricket during phonotaxis. J. Neurosci. Methods 203, 56–68 (2012).
28. Toshev, A. & Szegedy, C. DeepPose: human pose estimation via deep neural networks. arXiv Preprint at [Link] (2013).
29. Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems Vol. 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 1799–1807 (Curran Associates, Inc., Red Hook, 2014).
30. Carreira, J., Agrawal, P., Fragkiadaki, K. & Malik, J. Human pose estimation with iterative error feedback. arXiv Preprint at [Link]abs/1507.06550 (2015).
31. Wei, S.-E., Ramakrishna, V., Kanade, T. & Sheikh, Y. Convolutional pose machines. arXiv Preprint at [Link] (2016).
32. Bulat, A. & Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. arXiv Preprint at [Link] (2016).
33. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. arXiv Preprint at [Link]abs/1611.08050 (2016).
34. Tome, D., Russell, C. & Agapito, L. Lifting from the deep: convolutional 3D pose estimation from a single image. arXiv Preprint at [Link]abs/1701.00295 (2017).



35. Shelhamer, E., Long, J. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2017).
36. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 234–241 (Springer International Publishing, Cham, Switzerland, 2015).
37. Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 740–755 (Springer International Publishing, Cham, Switzerland, 2014).
38. Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3686–3693 (IEEE Computer Society, 2014).
39. Güler, R. A., Neverova, N. & Kokkinos, I. DensePose: dense human pose estimation in the wild. arXiv Preprint at [Link] (2018).
40. Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).
41. Isakov, A. et al. Recovery of locomotion after injury in Drosophila melanogaster depends on proprioception. J. Exp. Biol. 219, 1760–1771 (2016).
42. Wosnitza, A., Bockemühl, T., Dübbert, M., Scholz, H. & Büschges, A. Inter-leg coordination in the control of walking speed in Drosophila. J. Exp. Biol. 216, 480–491 (2013).
43. Qiao, B., Li, C., Allen, V. W., Shirasu-Hiza, M. & Syed, S. Automated analysis of long-term grooming behavior in Drosophila using a k-nearest neighbors classifier. eLife 7, e34497 (2018).
44. Dombeck, D. A., Khabbaz, A. N., Collman, F., Adelman, T. L. & Tank, D. W. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron 56, 43–57 (2007).
45. Seelig, J. D. & Jayaraman, V. Neural dynamics for landmark orientation and angular path integration. Nature 521, 186–191 (2015).
46. Pérez-Escudero, A., Vicente-Page, J., Hinz, R. C., Arganda, S. & de Polavieja, G. G. idTracker: tracking individuals in a group by automatic identification of unmarked animals. Nat. Methods 11, 743–748 (2014).
47. Newell, A., Yang, K. & Deng, J. Stacked hourglass networks for human pose estimation. arXiv Preprint at [Link] (2016).

Acknowledgements
The authors acknowledge J. Pillow for discussions; B.C. Cho for contributions to the acquisition and preprocessing pipeline for mouse experiments; P. Chen for a previous version of a neural network for pose estimation that was useful in designing our method; H. Jang, M. Murugan, and I. Witten for feedback on the GUI and other discussions; G. Guan for assistance maintaining flies; and the Murthy, Shaevitz and Wang labs for general feedback. This work was supported by the NIH R01 NS104899-01 BRAIN Initiative Award and an NSF BRAIN Initiative EAGER Award (to M.M. and J.W.S.), NIH R01 MH115750 BRAIN Initiative Award (to S.S.-H.W. and J.W.S.), the Nancy Lurie Marks Family Foundation and NIH R01 NS045193 (to S.S.-H.W.), an HHMI Faculty Scholar Award (to M.M.), NSF GRFP DGE-1148900 (to T.D.P.), and the Center for the Physics of Biological Function sponsored by the National Science Foundation (NSF PHY-1734030).

Author contributions
T.D.P., D.E.A., S.S.-H.W., J.W.S. and M.M. designed the study. T.D.P., D.E.A., L.W. and M.K. conducted experiments. T.D.P. and D.E.A. developed the GUI and analyzed data. T.D.P., D.E.A., J.W.S. and M.M. wrote the manuscript.

Competing interests
T.D.P., D.E.A., J.W.S. and M.M. are named as inventors on US provisional patent no. 62/741,643 filed by Princeton University.

Additional information
Supplementary information is available for this paper at [Link]s41592-018-0234-5.
Reprints and permissions information is available at [Link]/reprints.
Correspondence and requests for materials should be addressed to M.M. or J.W.S.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2018




Methods
Datasets. Details on the dataset of 59 adult male Drosophila can be found in refs. 11,13. Animals were allowed to move freely in a backlit 100-mm-diameter circular arena covered by a 2-mm-tall clear polyethylene terephthalate glycol dome. Videos were captured from the top with a Point Grey Gazelle camera at a resolution of ~35 px per mm at 100 frames per second (FPS) for 1 h for each fly, totaling ~21 million frames for the dataset. To calculate the spatial resolution for these videos, we assumed a mean male fly length of 2.82 mm (ref. 48).

The second fly dataset reported here (Fig. 6) consists of 42 videos of freely moving pairs of virgin male and female fruit flies (NM91 strain) 3–5 d post-eclosion. Only males from these videos were analyzed in this study. Flies moved freely within a 30-mm-diameter circular arena with a 2-mm-tall clear polyethylene terephthalate glycol dome against a white mesh floor covering an array of microphones, resulting in an inhomogeneous image background. Videos were captured from above with a Point Grey Flea3 camera at a resolution of ~25 px per mm at 100 FPS, totaling ~4.2 million frames.

The mouse dataset (Fig. 6) consisted of 29 videos of C57BL/6 strain mice (Mus musculus), 15 weeks (108 d) old. Animals moved freely in a 45.7 × 45.7 cm open field arena with a clear acrylic floor for 10 min each. Videos were captured from below with infrared illumination using a Point Grey Blackfly S camera at a resolution of 1.95 px per mm at 170 FPS, totaling ~3 million frames. Experimental procedures were approved by the Princeton University Institutional Animal Care and Use Committee and conducted in accordance with the National Institutes of Health guidelines for the humane care and use of laboratory animals. Mice used in this study were ordered through The Jackson Laboratory and had at least 1 week of acclimation to the Princeton Neuroscience Institute vivarium before experimental procedures were performed. Mice were kept in group cages with food and water ad libitum under a reversed 12/12-h dark-light cycle (light, 19:30–07:30).

Preprocessing and alignment to generate egocentric images for labeling and training in LEAP. For the main fly dataset (59 males), we used the alignment algorithm from ref. 11. The raw videos consisted of unoriented bounding boxes around the flies from a closed-loop camera tracking system. This technique relies on videos in which the animal remains visible and in focus. We then aligned individual frames to a template image of an oriented fly by matching the peak of the radon-transformed fly image to recover the orientation and then computing the cross-correlation to center the fly. The centroid and orientation parameters were used to crop a 192 × 192-px oriented bounding box in each frame. Code for alignment is available in the repository accompanying the original paper: https://[Link]/gordonberman/MotionMapper.

For the second fly dataset (42 males), we adapted a previously published method for tracking and segmentation of videos of courting fruit flies14. We first modeled the mesh background of the images by fitting a normal distribution to each pixel in the frame across time with a constant variance to account for camera shot noise. The posterior was evaluated at each pixel of each frame and then thresholded to segment the foreground pixels. Because of the inhomogeneity of the arena floor mesh, substantial segmentation artifacts were introduced, particularly when translucent or very thin body parts (that is, wings and legs) could not be disambiguated from the dark background mesh holes. The subsequent steps of histogram thresholding, morphological filtering and ellipse fitting were performed as described previously in ref. 14. We developed a simple GUI for proofreading the automated ellipse tracking before extracting 192 × 192-px oriented bounding boxes. We extracted bounding boxes for both animals in each frame and saved both the raw pixels containing the background mesh and the foreground-only images that contain segmentation artifacts. This pipeline was implemented in MATLAB, and the code is available in the code repository accompanying this paper.

For the mouse videos, a separate preprocessing pipeline was developed. Raw videos were processed in three stages: (1) animal tracking, (2) segmentation from background and (3) alignment to the body centroid and tail–body interface. In stage (1), we tracked the mouse's torso centroid by subtracting a background image (median calculated at each pixel value across that video), retrieving pixels with a brightness above a chosen threshold from background (mice were brighter than background) and using morphological opening to eliminate noise and the mouse's appendages. The largest contiguous region reliably captured the mouse's torso (referred to below as the torso mask) and was used to fit an ellipse whose center was used to approximate the center of the animal. In stage (2), a similar procedure as in stage (1) was employed to retrieve a full body mask. In this stage, a more permissive threshold and smaller morphological opening radius were used than in stage (1) to capture the mouse's body edges, limbs and tail while still eliminating noise. The pixels outside of this body mask were set to zero. In stage (3), each segmented video frame was translated and rotated such that the frame's center coincided with the center of the animal and the x-axis lay on the line connecting the center and tail–body attachment point. The tail–body attachment point was defined as the center of the region of overlap between the torso mask and a dilated tail mask. The tail mask was defined as the largest region remaining after subtraction of the torso mask from the full body mask and application of a morphological opening. After applying these masks to segment the raw images, we extracted bounding boxes by using the ellipse center and orientation.

Oriented bounding boxes were cropped to 192 × 192 px for all datasets to ensure consistency in output image size after repeated pooling and upsampling steps in the neural network. These data were stored in self-describing HDF5 files.

Sampling diverse images for labeling and training in LEAP. To ensure diversity in image and pose space when operating at low sample sizes, we used a multistage cluster sampling technique. First, we sampled n0 images uniformly from each dataset by using a fixed stride over time to minimize correlations between temporally adjacent samples. We then used PCA to reduce their dimensionality and projected the images down to the first D principal components. After dimensionality reduction, the images were grouped via k-means clustering into k subgroups, from which n images were randomly sampled from each group. To minimize the time necessary for the network to generalize to images from all groups, we sorted the dataset such that consecutive samples cycled through the groups. This way, uniform sampling was maintained even at the early phases of user labeling, ensuring that even a network trained on only the first few images would be optimized to estimate body-part positions for a diversity of poses. We used n0 = 500, yielding 29,500 initial samples; D = 50, which is sufficient to explain 80% of the variance in the data (Supplementary Fig. 2); and k = 10 and n = 150 to produce a final dataset of 1,500 frames for labeling and training.

LEAP neural network design and implementation. We based our network architecture on previous designs of neural networks for human pose estimation29,31,47. We adopted a fully convolutional architecture that learns a mapping from raw images to a set of confidence maps. These maps are images that can be interpreted as the 2D probability distribution (that is, a heat map) centered at the spatial coordinates of each body part within the image. We trained the network to output one confidence map per body part, stacked along the channel axis.

Our network consists of 15 layers of repeated convolutions and pooling (Supplementary Fig. 3a). The convolution block consists of 3× convolution layers (64 filters, 3 × 3 kernel size, 1 × 1 stride, ReLU activation). The full network consists of 1× convolution block, 1× max pooling across channels (2 × 2 pooling size, 2 × 2 stride), 1× convolution block (128 filters), 1× max pooling (2 × 2 pooling size, 2 × 2 stride), 1× convolution block (256 filters), 1× transposed convolution (128 filters, 3 × 3 kernel size, 2 × 2 stride, ReLU activation, Glorot normal initialization), 2× convolution (128 filters, 3 × 3 kernel size, 1 × 1 stride, ReLU activation), and 1× transposed convolution (128 filters, 3 × 3 kernel size, 2 × 2 stride, linear activation, Glorot normal initialization).

We based our choice of these hyperparameters on the idea that repeated convolutions and strided max pooling enable the network to learn feature detectors across spatial scales. This allows the network to learn how to estimate confidence maps using global image structure that provides contextual information, which can be used to improve estimates even for occluded parts29,31. Despite the loss of resolution from pooling, the upsampling learned through transposed convolutions is sufficient to recover the spatial precision in the confidence maps. We do not use skip connections, residual modules, stacked networks, regression networks or affinity fields in our architecture, as used in other approaches to human pose estimation29,31–33,47.

For comparison, we also implemented the stacked hourglass network47. We tested both the single-hourglass version and a 2× stacked hourglass with intermediate supervision. The hourglass network consisted of 4× residual bottleneck modules (64 output filters) with max pooling (2 × 2 pool, 2 × 2 stride), followed by their symmetric upsampling blocks and respective skip connections. The stacked version adds intermediate supervision in the form of a loss term on the output of the first network in addition to the final output.

We implemented all versions of the neural networks in Python via Keras and TensorFlow, popular deep learning packages that allow transparent GPU acceleration and easy portability across operating systems and platforms. All Python code was written for Python v.3.6.4. Required libraries were installed via the pip package manager: numpy (v.1.14.1), h5py (v.2.7.1), TensorFlow-gpu (v.1.6.0), keras (v.2.1.4). We tested our code on machines running either Windows 10 (v.1709) or a RedHat-based Linux distribution (Springdale 7.4), with no additional steps required to port the software other than installing the required libraries. All networks were compared using the same aligned dataset so as to remove complications due to differences in preprocessing.

Code for all network implementations is available in the main repository accompanying this paper ([Link]) and Supplementary Software.

LEAP training procedure. Prior to training, we generated an augmented dataset from the user-provided labels and corresponding images. We first doubled the number of images by mirroring the images along the body's symmetry axis (defined from the preprocessing) and adjusting the body-part coordinates accordingly, including swapping left/right body-part labels (for example, legs). Then, we generated confidence maps for each body part in each image by rendering the 2D Gaussian probability distribution centered at the ground truth body-part coordinates, μ = (x, y), and fixed covariance, Σ = diag(σ), with a constant σ = 5 px. These were pre-generated and cached to disk to minimize the necessary processing time during training.
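The confidence-map targets just described (a 2D Gaussian with fixed σ = 5 px rendered at each labeled coordinate, one map per body part stacked along the channel axis) can be sketched in numpy. The unnormalized form (peak value 1.0) is an assumption here, since the text does not specify whether maps are normalized:

```python
import numpy as np

def render_confidence_maps(coords, size=192, sigma=5.0):
    """Render one 2D Gaussian confidence map per body part.

    coords: (n_parts, 2) array of (x, y) ground-truth positions in px.
    Returns a (size, size, n_parts) array with a peak of 1.0 at each
    labeled coordinate (channel-last, as in Keras/TensorFlow).
    """
    yy, xx = np.mgrid[0:size, 0:size].astype(np.float64)
    maps = np.empty((size, size, len(coords)))
    for k, (x, y) in enumerate(coords):
        d2 = (xx - x) ** 2 + (yy - y) ** 2
        maps[..., k] = np.exp(-d2 / (2.0 * sigma ** 2))
    return maps

# two hypothetical body parts at (96, 96) and (10, 20)
maps = render_confidence_maps(np.array([[96.0, 96.0], [10.0, 20.0]]))
print(maps.shape)  # (192, 192, 2)
```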

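The mirroring augmentation described in the training procedure (flip each egocentrically aligned image across the body's symmetry axis, reflect the labeled coordinates, and swap left/right part labels) can be sketched as follows; the symmetry axis is assumed horizontal after alignment, and the part ordering in `swap_idx` is illustrative rather than the paper's actual skeleton definition:

```python
import numpy as np

def mirror_augment(image, coords, swap_idx):
    """Mirror an aligned image and its body-part labels.

    image: (h, w) array; the body's symmetry axis is assumed to run
    horizontally through the image (true after egocentric alignment).
    coords: (n_parts, 2) array of (x, y) labels in px.
    swap_idx: permutation exchanging left/right part labels.
    Returns the flipped image and the relabeled, reflected coordinates.
    """
    h = image.shape[0]
    flipped = image[::-1, :]                   # flip rows (reflect y)
    new_coords = coords.copy()
    new_coords[:, 1] = (h - 1) - coords[:, 1]  # reflect y coordinates
    return flipped, new_coords[swap_idx]       # swap left/right labels

# toy example: parts [head, left leg, right leg] on a 4 x 4 image
img = np.arange(16.0).reshape(4, 4)
coords = np.array([[2.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
swap = np.array([0, 2, 1])  # head stays, legs exchange labels
flipped, new = mirror_augment(img, coords, swap)
```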
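The multistage cluster sampling procedure ('Sampling diverse images', above) — uniform temporal stride, PCA to D components, k-means into k groups, n samples per group, interleaved so that consecutive samples cycle through the groups — can be sketched with numpy alone. The tiny k-means below stands in for a library implementation, and the array sizes are toy values rather than the paper's n0 = 500, D = 50, k = 10, n = 150:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_project(X, d):
    """Project rows of X onto the first d principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def kmeans(X, k, iters=20):
    """Minimal k-means; returns a cluster label per row."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_sample(images, d=3, k=4, n=5):
    """Pick n frames per cluster, interleaved so consecutive samples
    cycle through the clusters (mirroring LEAP's labeling order)."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    labels = kmeans(pca_project(flat, d), k)
    picks = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:          # guard against an empty cluster
            members = np.arange(len(labels))
        picks.append(rng.choice(members, n, replace=True))
    # interleave: sample 0 of each cluster, then sample 1, and so on
    return np.stack(picks, axis=1).ravel()

idx = cluster_sample(rng.normal(size=(200, 16, 16)))
print(len(idx))  # 20 frame indices, cycling through 4 clusters
```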


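Predicted confidence maps are converted back to coordinates with a channel-wise global maximum ('Pose estimation from confidence maps', below). The paper implements this as a TensorFlow op to keep large arrays on the GPU; the numpy equivalent below is only for illustration:

```python
import numpy as np

def maps_to_coords(maps):
    """Channel-wise global argmax over (height, width, n_parts) maps.

    Returns an (n_parts, 2) array of (x, y) peak coordinates, one per
    confidence map channel.
    """
    h, w, n_parts = maps.shape
    flat_idx = maps.reshape(-1, n_parts).argmax(axis=0)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)

# toy maps with peaks at (x=5, y=7) and (x=1, y=2)
maps = np.zeros((16, 16, 2))
maps[7, 5, 0] = 1.0
maps[2, 1, 1] = 1.0
coords = maps_to_coords(maps)  # peaks recovered as (5, 7) and (1, 2)
```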
Once confidence maps were computed for each image, we split the dataset into training, validation and test sets. The training set was used for backpropagation of the loss for updating network weights, the validation set was used to estimate performance and adjust the learning rate over epochs, and the test set was held out for analysis. For the fast training, the dataset was split into only training (90%) and validation (10%) sets to make the best use of data when training with very few labels. For full training, the dataset was split into training (76.5%), validation (13.5%) and testing (10%) sets. All analyses reported here share the same held-out test set to ensure it is never trained against for any replicate.

All training was done using the Adam optimizer with default parameters as described in the original paper49. We started with a learning rate of 1e-3 but used a scheduler to reduce it by a factor of 0.1 when the validation loss failed to improve by a minimum threshold of 1e-5 for three epochs. The loss function optimized against is simply the mean squared error between estimated and ground truth confidence maps.

During training, we considered an epoch to be a set of 50 batches of 32 images, which were drawn sequentially from the training set, cycling back to the first image if there were fewer than 50 × 32 = 1,600 images. Images were then augmented by application of a small random rotation (−15° to 15°) to the input image and the corresponding ground truth confidence maps (Supplementary Fig. 1a). At the end of 50 batches of training, 10 batches were sampled from the separate validation set, augmented and evaluated, and the loss was used for the learning rate scheduling described above. Training and validation sets were shuffled at the end of each epoch. An epoch was evaluated in 60–90 s, including all augmentation, forward and reverse passes, and the validation forward pass, when running on a modern GPU (NVIDIA GeForce GTX 1080 Ti or P100). We ran this entire procedure for 15 epochs during the fast training stage and for 50 epochs during the full training stage. For analyses, a minimum of five replicates were fully trained on each dataset to estimate the stability of optimization convergence. We evaluated the performance of the network on a held-out test set of images without augmentation.

Pose estimation from confidence maps. Predictions of body-part positions were computed directly on the GPU. We implement a channel-wise global maximum operation to convert the confidence maps into image coordinates as a TensorFlow function, further improving runtime prediction performance by avoiding the costly transfer of large confidence map arrays. All prediction functions, including normalization and saving, were implemented as a self-contained Python script with a command-line interface for ease of batch processing.

Computing hardware. All performance tests were conducted on a high-end consumer-grade workstation equipped with an Intel Core i7-5960X CPU, 128 GB DDR4 RAM, NVMe solid state drives and a single NVIDIA GeForce GTX 1080 Ti (12 GB) GPU. We also used Princeton University's High Performance Computing cluster, with nodes equipped with NVIDIA P100 GPUs, for batch processing. These higher-end cards afford a speed-up of ~1.5× in processing runtime during the training phase.

Accuracy analysis. For all analyses of accuracy (Figs. 2 and 6 and Supplementary Figs. 3 and 5), we trained at least five replicates of the network with the same training/validation/testing datasets. All analyses were performed in MATLAB

[…] states from which the observed stance vectors for the entire dataset would emerge52.

Unsupervised embedding of body-part dynamics. In order to create a map of motor behaviors described by body-part movements, we used a previously described method for discovering stereotypy in postural dynamics11. First, body-part positions were predicted for each frame in our dataset to yield a set of 32 time series of egocentric trajectories in image coordinates for each video. We recentered these time series by subtracting the thorax coordinate at each time point and rescaled them to comparable ranges by z-scoring each time series. The time series were then expanded into spectrograms by application of the CWT, parametrized by the Morlet wavelet as the mother wavelet and 25 scales chosen to match dyadically spaced center frequencies spanning 1–50 Hz. This time-frequency representation augments the instantaneous representation of pose at each time point to one that captures oscillations across many time scales. The instantaneous spectral amplitudes of each body part were then concatenated into a single vector of length 2(J − 1)F, where J is the number of body parts before subtraction of the body part used as a reference (that is, the thorax) and doubled to account for both x and y coordinates, and F is the number of frequencies being measured via CWT. In our data, this resulted in a 1,550-dimensional representation at each time point (frame).

Finally, we performed nonlinear dimensionality reduction on these high-dimensional vectors by using a nonlinear manifold embedding algorithm53. We first selected representative time points via importance sampling, wherein a random sampling of time points in each video is embedded into a 2D manifold via t-distributed stochastic neighbor embedding (t-SNE) and clustered via the watershed transform. This allowed us to choose a set of time points from each video that were representative of their local clusters—that is, spanning the space of postural dynamics. We then computed a final behavior space distribution by embedding the selected representative time points using t-SNE to produce the full manifold of postural dynamics in two dimensions.

After projecting all remaining time points in the dataset into this manifold, we computed their 2D distribution and smoothed it with a Gaussian kernel with σ = 0.65 to approximate the probability density function of this space. We clipped this density map to the range 0.5 × 10⁻³ to 2.75 × 10⁻³ to exclude low-density regions and merge very high-density regions. We then clustered similar points by segmenting the space into regions of similar body-part dynamics by applying the watershed transform to the density. Although the manifold-coordinate representation of each time point is not immediately meaningful in itself, we were able to derive an intuitive interpretation of each cluster by referring to the high-dimensional representation of their constituent time points. To do this, we sampled time points from each cluster and averaged their corresponding high-dimensional feature vector, which we could then visualize by reshaping it into a body-part-frequency matrix (Fig. 4).

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability. The code for running LEAP, as well as all accompanying GUIs, trained networks, labeled data and analysis code for figure reproduction, can be
R2018a (MathWorks). We used the gramm toolbox for figure plotting50. found in the Supplementary Software and in the following repository: https://
[Link]/talmo/leap.
Gait analysis. We translated the body position coordinates to egocentric
coordinates by subtracting the predicted location of the intersection between the
thorax and abdomen from all other body-position predictions for each frame. We
Data availability
The entire primary dataset of 59 aligned, high-resolution behavioral videos is made
then calculated the instantaneous velocity along the rostrocaudal axis of each leg
available online for reproducibility or further studies based off of this method, as
tip within these truly egocentric reference coordinates. The speed of each body
well as labeled data to train and ground-truth the networks, pre-trained networks
part was smoothed using a Gaussian filter with a five-frame moving window. For
used for all analyses, and estimated body-part positions for all 21 million frames.
each leg tip, instances in which the smoothed velocity was greater than zero were
This dataset (~170 GiB) is freely available at [Link]
defined as swing, while those with velocity less than zero were defined as stance.
dsp01pz50gz79z. Data from additional fly and mouse datasets used in Fig. 6 can be
Information from this egocentric axis was combined with allocentric tracking data
made available upon reasonable request.
to incorporate speed and orientation information. The centroids and orientations
of the flies were smoothed using a moving mean filter with a five-frame window
to find the instantaneous speed and forward velocity. To remove idle bouts and References
instances of backward walking, all gait analyses were limited to times when 48. Chyb, S. & Gompel, N. Atlas of Drosophila Morphology: Wild-type
the fly was moving in the forward direction at a velocity greater than 2 mm s−1 and Classical Mutants (Academic Press, London, Waltham and
(approximately one body length per second) unless otherwise noted. The analyses San Diego, 2013).
relating stance and swing duration to body velocity were limited to forward 49. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization.
velocities greater than 7.2 mm s−1, to remain in line with previous work25. arXiv Preprint at [Link] (2014).
To measure gait modes, we trained an HMM to model gait as described 50. Morel, P. Gramm: grammar of graphics plotting in MATLAB. J. Open Source
previously41. The training data consisted of a vector denoting the number of legs Softw. 3, 568 (2018).
in stance for bouts in which the fly was moving forward at a velocity greater than 51. Baum, L. E., Petrie, T., Soules, G. & Weiss, N. A maximization technique
2 mm s−1 lasting longer than 0.5 s. Training data were sampled such that up to occurring in the statistical analysis of probabilistic functions of markov chains.
3,000 frames were taken from each video, resulting in a total of 159,270 frames. Ann. Math. Stat. 41, 164–171 (1970).
We trained a three-state HMM using the Baum–Welch algorithm and randomly 52. Viterbi, A. Error bounds for convolutional codes and an asymptotically
initialized transition and emission probabilities51. We designated each hidden state optimum decoding algorithm. IEEE Trans. Inf. Theory 13,
as tripod, tetrapod or noncanonical in accordance with the estimated emission 260–269 (1967).
probabilities. We then used the Viterbi algorithm along with our estimated 53. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE.
transition and emission matrices to predict the most probable sequence of hidden J. [Link]. Res. 9, 2579–2605 (2008).
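As a concrete illustration of the pose-estimation step described above, the channel-wise global maximum over confidence maps can be sketched in NumPy (the paper implements it as a TensorFlow function running on the GPU; the array shape conventions and the toy map below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def confmaps_to_coords(confmaps):
    """Channel-wise global maximum: confidence maps -> peak coordinates.

    confmaps: (height, width, n_body_parts) array, one map per body part.
    Returns (coords, scores): coords is (n_body_parts, 2) holding the
    (x, y) peak of each map, scores holds the peak confidence per part.
    """
    h, w, n_parts = confmaps.shape
    flat = confmaps.reshape(h * w, n_parts)
    idx = np.argmax(flat, axis=0)             # global maximum per channel
    y, x = np.unravel_index(idx, (h, w))      # row/column of each peak
    coords = np.stack([x, y], axis=1)
    scores = flat[idx, np.arange(n_parts)]
    return coords, scores

# Toy example: a single 5x5 map with its peak at (x=3, y=1)
cm = np.zeros((5, 5, 1))
cm[1, 3, 0] = 1.0
coords, scores = confmaps_to_coords(cm)
print(coords)  # [[3 1]]
```

Doing the reduction on the GPU, as the authors do, avoids transferring the full (height × width × parts) confidence arrays back to host memory; only the tiny coordinate arrays need to leave the device.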

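The swing/stance rule from the gait analysis (smooth the rostrocaudal leg-tip velocity, then threshold at zero) might be prototyped as follows; mapping the five-frame moving window to a Gaussian sigma of one frame, and the 100-fps frame rate, are assumptions made for illustration:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def swing_stance(leg_tip_x, fps=100.0, sigma_frames=1.0):
    """Label each frame of one leg tip as swing (True) or stance (False).

    leg_tip_x: rostrocaudal position of one leg tip in egocentric
    coordinates, one sample per frame. The frame rate and the Gaussian
    sigma standing in for the five-frame window are assumptions.
    """
    velocity = np.gradient(leg_tip_x) * fps           # instantaneous velocity
    velocity = gaussian_filter1d(velocity, sigma=sigma_frames)
    return velocity > 0                               # True = swing, False = stance

# Toy usage: a sinusoidal leg-tip trajectory (~5 Hz stepping at 100 fps),
# which should spend roughly half its time in swing and half in stance
t = np.arange(200) / 100.0
labels = swing_stance(np.sin(2.0 * np.pi * 5.0 * t))
```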
Nature Methods | [Link]/naturemethods
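The gait-mode decoding step (a stance-count HMM decoded with the Viterbi algorithm) can be sketched with a minimal log-space Viterbi. A two-state toy is used here for brevity, and the transition and emission matrices are illustrative, not the fitted Baum–Welch estimates from the paper:

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most probable hidden-state sequence for a discrete-emission HMM.

    obs: per-frame observation symbols (here, number of legs in stance).
    log_pi: (S,) log initial state probabilities.
    log_A: (S, S) log transition probabilities, rows = from-state.
    log_B: (S, K) log emission probabilities over K symbols.
    """
    n, S = len(obs), log_pi.shape[0]
    delta = np.zeros((n, S))           # best log-score ending in each state
    psi = np.zeros((n, S), dtype=int)  # backpointers to best predecessor
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_A        # scores[i, j]: i -> j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + log_B[:, obs[t]]
    path = np.zeros(n, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(n - 2, -1, -1):     # trace the backpointers
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Illustrative parameters: state 0 "tripod" emits ~3 legs in stance,
# state 1 "tetrapod" emits ~4; symbols are the stance counts 0-6
B = np.full((2, 7), 1e-6)
B[0, 3] = 1.0
B[1, 4] = 1.0
B /= B.sum(axis=1, keepdims=True)
A = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
path = viterbi([3, 3, 3, 4, 4, 3], np.log(pi), np.log(A), np.log(B))
print(path)  # [0 0 0 1 1 0]
```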


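The behavior-space pipeline (t-SNE embedding, Gaussian-smoothed density, watershed segmentation) could be prototyped as below. The grid size, the random toy features, and the use of scikit-learn/scikit-image in place of the authors' MATLAB implementation are all assumptions; the density clipping step from the text is omitted for brevity:

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.ndimage import gaussian_filter
from skimage.segmentation import watershed

def behavior_space(features, grid=200, sigma=0.65):
    """Embed high-dimensional postural feature vectors into a 2D
    behavior space and segment it into clusters.

    features: (n_timepoints, n_features) array of postural features.
    Returns the 2D embedding, the smoothed density map and the
    watershed cluster labels over the density grid.
    """
    xy = TSNE(n_components=2).fit_transform(features)
    # 2D histogram of embedded points -> smoothed density estimate
    density, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=grid)
    density = gaussian_filter(density, sigma=sigma)
    # Watershed on the inverted density: each basin is one behavior cluster
    labels = watershed(-density)
    return xy, density, labels

# Toy usage on random features (real inputs would be postural dynamics)
rng = np.random.RandomState(0)
xy, density, labels = behavior_space(rng.randn(150, 20))
```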
nature research | reporting summary
Corresponding author(s): Mala Murthy, Joshua Shaevitz

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main text, or Methods section). For each item, indicate confirmed or n/a:
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
- Clearly defined error bars. State explicitly what error bars represent (e.g. SD, SE, CI)

Our web collection on statistics for biologists may be useful.

Software and code


Policy information about availability of computer code
Data collection: Custom code was used for all components of the framework and is provided alongside the accompanying open-source code repository ([Link]).

Data analysis: Custom code was used for all components of the framework and is provided alongside the accompanying open-source code repository ([Link]). Additional commercial or third-party software used: MathWorks MATLAB R2018a, Python 3.6.4, numpy (1.14.1), h5py (2.7.1), tensorflow-gpu (1.6.0), keras (2.1.4).
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers
upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
April 2018

Data



Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
In addition to example datasets available through the code repository ([Link]), the primary dataset of high-resolution videos we used for all analyses is made available through our institution's data repository service: [Link]
This constitutes 170 GB of raw data, labeled data, trained networks and network predictions for all 21 million images, sufficient to exactly reproduce all findings in the paper.

Additional fly and mouse datasets used for Fig. 6 can be made available upon reasonable request.

Field-specific reporting
Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection:
- Life sciences
- Behavioural & social sciences
- Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see [Link]/authors/policies/[Link]

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size We used previously collected data for our tests. As part of the method, we find that very few images (samples) are required (on the order of 1,000s), whereas the datasets are all substantially larger (on the order of 1,000,000s).

Data exclusions No data were excluded.

Replication No major experimental findings are reported for this method description. Application results reproduce previously described findings.

Randomization Randomization was not relevant to this study. We observed natural behavior in a freely moving context with no grouping of the animals.

Blinding Blinding was not relevant to this study. We observed natural behavior in a freely moving context with no grouping of the animals.

Reporting for specific materials, systems and methods

Materials & experimental systems (n/a or involved in the study):
- Unique biological materials
- Antibodies
- Eukaryotic cell lines
- Palaeontology
- Animals and other organisms
- Human research participants

Methods (n/a or involved in the study):
- ChIP-seq
- Flow cytometry
- MRI-based neuroimaging
Animals and other organisms



Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research
Laboratory animals Fruit flies (Drosophila melanogaster), all males, 3–8 days old, NM91 or Oregon-R strains. Mice (Mus musculus), all males, 15 weeks (108 days) old, C57BL/6 strain.

Wild animals No wild animals were used for this study.

Field-collected samples No field samples were collected for this study.
