10.5061/DRYAD.D7WM37Q2Z
Moore, Dalton
0000-0003-0187-2364
University of Chicago
Walker, Jeffrey
University of Chicago
MacLean, Jason
University of Chicago
Hatsopoulos, Nicholas
University of Chicago
Validating marker-less pose estimation with 3D x-ray radiography
Dryad
dataset
2022
FOS: Biological sciences
deeplabcut
Markerless tracking
Marmoset
anipose
XROMM
Pose estimation
National Institute of Neurological Disorders and Stroke
https://ror.org/01s5ya894
R01NS104898
National Institute of Neurological Disorders and Stroke
https://ror.org/01s5ya894
1F31NS118950-01
National Science Foundation
https://ror.org/021nxhr62
MRI1338036
National Science Foundation
https://ror.org/021nxhr62
MRI1626552
2022-05-12T00:00:00Z
2022-05-12T00:00:00Z
en
https://doi.org/10.1242/jeb.243998
https://doi.org/10.5281/zenodo.5847410
45316128546 bytes
3
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
These data were generated to evaluate the accuracy of DeepLabCut (DLC), a
deep learning marker-less motion capture approach, by comparing it to a 3D
x-ray video radiography system that tracks markers placed under the skin
(XROMM). We recorded behavioral data simultaneously with XROMM and RGB
video as marmosets foraged and reconstructed three-dimensional kinematics
in a common coordinate system. We used XMALab to track 11 XROMM markers,
and we used the toolkit Anipose to filter and triangulate DLC trajectories
of 11 corresponding markers on the forelimb and torso. We performed a
parameter sweep of relevant Anipose and post-processing parameters to
characterize their effect on tracking quality. We compared the median
error of DLC+Anipose to human labeling performance and placed this error
in the context of the animal's range of motion.
Subjects
These experiments were conducted with two common marmosets
(Callithrix jacchus) (an 8-year-old, 356 g male and a 7-year-old, 418 g
female). All methods were approved by the Institutional Animal Care and
Use Committee of the University of Chicago.

Data Collection
The two marmosets were placed together in a 1 m x 1 m x 1 m cage with a modular
foraging apparatus attached to the top of the cage, as previously
described by Walker et al. (2020). The marmosets were allowed to forage
voluntarily throughout recording sessions that lasted 1-2 hours.
Recordings of individual trials were triggered manually with a foot pedal
by the experimenters when the marmosets appeared ready to initiate a
reach. The manual trigger initiated synchronized video collection by the
XROMM system (Brainerd et al., 2010) and two visible light cameras, each
described in further detail below. We retained all trials that captured
right-handed reaches. Marmoset TY produced four useful reaching events
containing 5 total reaches and marmoset PT produced 13 reaching events
containing 17 reaches.

XROMM
Bi-planar X-ray sources and image
intensifiers (90 kV, 25 mA at 200 fps) were used to track the 3D position of
radiopaque tantalum beads (0.5-1 mm, Bal-tec) placed subcutaneously in the
arm, hand, and torso. Details of bead implants can be found in Walker et
al. (2020), in which the authors also report estimating XROMM marker
tracking precision of 0.06 mm based on the standard deviation of
inter-marker distances during a recording of a calibration specimen.
Marker locations were chosen to approximate the recommendations given by
the International Society of Biomechanics for defining coordinate systems
of the upper limb and torso in humans (Wu et al., 2005). These
recommendations were adapted to the marmoset and constrained by surgical
considerations. Positions of 13 beads were tracked using a semi-automated
process in XMALab (Knorlein et al., 2016) following the procedure
described there and in the XMALab User Guide
(https://bitbucket.org/xromm/xmalab/wiki/Home). Two beads implanted in the
anterior torso were ignored for comparison with DLC because corresponding
positions on the skin were occluded in nearly every frame captured by
visible light cameras.

DeepLabCut
Two high-speed cameras (FLIR Blackfly S,
200 fps, 1440x1080 resolution) were used to record video for analysis by
DLC. The cameras were positioned to optimize visibility of the right upper
limb during reaching behavior in the foraging apparatus and to minimize
occlusions, while avoiding the path between the X-ray sources and image
intensifiers (Fig. 1A). The cameras were triggered to record continuous
images between the onset and offset of the manual XROMM trigger, with
series of images later converted to video for DLC processing. All videos
were brightened using the OpenCV algorithm for contrast limited adaptive
histogram equalization (CLAHE) prior to labeling. We labeled 11 body parts
in DLC – two labels on the torso and three on each of the upper arm,
forearm, and hand (Fig. 1B). Locations of each label were chosen to be as
close as possible to the approximate location of XROMM beads, although
concessions had to be made to ensure the location was not occluded
consistently in the recordings. We used DLC 2.2 with in-house
modifications to produce epipolar lines in image frames that were matched
between the two cameras (Fig. 1C), which significantly improved human
labeling accuracy by correcting gross errors and fine-tuning minor errors.
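As a geometric illustration of the epipolar constraint (not the in-house DLC modification itself), the following Python sketch computes the line in camera-2 on which a label placed in camera-1 must fall, and how far a candidate camera-2 label lies from that line. It assumes a 3x3 fundamental matrix F relating the two views under the convention x2^T F x1 = 0, as would be obtained from a stereo calibration. The function names and the example matrix are hypothetical.

```python
import math

def epipolar_line(F, point_cam1):
    """Return (a, b, c) with a*u + b*v + c = 0 describing the epipolar line in camera-2.

    F is a 3x3 fundamental matrix (nested lists) under the assumed
    convention x2^T F x1 = 0; point_cam1 is a (u, v) pixel in camera-1.
    """
    u, v = point_cam1
    x1 = (u, v, 1.0)  # homogeneous pixel coordinates
    return tuple(sum(F[r][k] * x1[k] for k in range(3)) for r in range(3))

def distance_to_line(line, point_cam2):
    """Perpendicular pixel distance from a camera-2 point to the line (a, b, c)."""
    a, b, c = line
    u, v = point_cam2
    return abs(a * u + b * v + c) / math.hypot(a, b)

# Hypothetical F for an ideal rectified pair (pure horizontal offset),
# for which the epipolar line of (u, v) is simply the image row v.
F_example = [[0.0, 0.0, 0.0],
             [0.0, 0.0, -1.0],
             [0.0, 1.0, 0.0]]

line = epipolar_line(F_example, (100.0, 50.0))
print(distance_to_line(line, (300.0, 50.0)))   # label on the epipolar line
print(distance_to_line(line, (300.0, 110.0)))  # label off the epipolar line
```

A labeler can use such a distance to spot a gross error: a camera-2 label far from the epipolar line of its camera-1 counterpart cannot correspond to the same 3D point.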
We did not train a network on labels produced without the aid of epipolar
lines and therefore cannot evaluate 3D error reduction using epipolar
lines. However, we note that labels applied without epipolar lines on the
torso were grossly inaccurate – these labels were adjusted by an average
of 63 pixels and 57 pixels in camera-1 and camera-2, respectively, after
implementation. The other nine labels were adjusted by an average of
<1 pixel in camera-1 and 11 pixels in camera-2. This modification
has been added as a command line feature in the DLC package (a guide for
using epipolar lines can be found at
https://deeplabcut.github.io/DeepLabCut/docs/HelperFunctions.html). Aside
from this and related changes to the standard DLC process, we followed the
steps outlined in Nath et al. (2019). In the first labeling iteration we
extracted 100 total frames (50/camera) across the four events for marmoset
TY and 254 frames (127/camera) across seven of the 13 events for marmoset
PT, which produced a labeled dataset of 354 frames. These were chosen
manually to avoid wasting time labeling frames before and after reaching
bouts during which much of the marmoset forelimb was entirely occluded in
the second camera angle. An additional 202 frames (101/camera) were
extracted using the DLC toolbox with outliers identified by the ‘jump’
algorithm and frame selection by k-means clustering. We chose the number
of frames to extract for each video based on visual inspection of labeling
quality and chose the start and stop parameters to extract useful frames
that captured reaching bouts. In all cases, frame numbers of extracted
frames were matched between cameras to enable the use of epipolar lines.
This refinement step reduced error by 0.046 cm and increased the
percentage of frames tracked by 14.7% after analysis with the chosen
Anipose parameters. The final dataset consisted of 278 human-labeled
timepoints from 15 of the 17 events and 10,253 timepoints from all 17
events labeled by the network only. We used the default resnet-50
architecture for our networks with default image augmentation. We trained
3 shuffles of the first labeling iteration with a 0.95 training set
fraction and used the first shuffle for the label refinement discussed
above. We trained 15 total networks after one round of label refinement –
three shuffles each with training fractions of 0.3, 0.5, 0.7, 0.85, and
0.95. Each network was trained for 300,000 iterations starting from the
default initial weights. We evaluated each network every 10,000 iterations
and selected the snapshot that produced the minimum test error across all
labels for further analysis. We chose the network to use in subsequent
analyses by finding the smallest training set size that reached the
threshold of human labeling error (discussed next). We then chose the
median-performing network of the three shuffles at this training set size
for all further analysis.

Human Labeling Error
We selected 134 frames
(67/camera) across three events from the same marmoset and session to be
relabeled by the original, experienced human labeler and by a second, less
experienced labeler. We used the error between the new and original labels
to evaluate whether the networks reached asymptotic performance, defined
by the experienced human labeling error.

Calibration
A custom calibration
device was built to allow for calibration in both recording domains
(Knorlein et al., 2016; the instruction manual for the small Lego cube is located
in the XMALab BitBucket). The device was constructed to contain a
three-dimensional grid of steel beads within the structure and a
two-dimensional grid of white circles on one face of the cube. Calibration
of x-ray images was computed in XMALab and calibration of visible light
images was computed with custom code using OpenCV. This integrated
calibration device, along with the PCA-based alignment procedure described
below, ensures that DLC and XROMM tracked trajectories in a common 3D
coordinate system. DLC videos were accurately calibrated, with 0.42 pixels
and 0.40 pixels of intrinsic calibration error for camera-1 and camera-2,
respectively, and 0.63 pixels of stereo reprojection error. XROMM
calibration was similarly accurate, with average intrinsic calibration
error equal to 0.81 pixels and 1.38 pixels for the two cameras.

Trajectory processing with Anipose
We used Anipose to analyze videos,
filter in 2D, triangulate 3D position from 2D trajectories, and apply 3D
filters (see Karashchuk et al., 2021 for details). For 2D-filtering, we
chose to apply a Viterbi filter followed by an autoencoder filter because
the authors demonstrate this to be the most accurate combination of 2D
filters. For triangulation and 3D filtering, we enabled optimization
during triangulation and enabled spatial constraints for each set of three
points on the hand, forearm, and upper arm, and for the pair of points on
the torso. We identified six Anipose parameters and one post-processing
parameter that may affect the final accuracy of DLC+Anipose tracking and
ran a parameter sweep to find the optimal combination. In 2D filtering, we
varied the number of bad points that could be back-filled into the Viterbi
filter (“n-back”) and the offset threshold beyond which a label was
considered to have jumped from the filter. We varied four parameters in 3D
processing, including the weight applied to spatial constraints
(“scale_length”) and a smoothing factor (“scale_smooth”), the reprojection
error threshold used during triangulation optimization, and the score
threshold used as a cutoff for 2D points prior to triangulation. We also
varied our own post-processing reprojection error threshold that filtered
the outputs of DLC+Anipose. We tested 3,456 parameter combinations in
total, the details of which will be discussed below. We generally chose
parameter values centered around those described in Anipose documentation
and in Karashchuk et al. (2021).

Post-processing of DLC+Anipose trajectories
To process the 3D pose outputs from Anipose,
we first used the reprojection error between cameras provided by Anipose
to filter out obviously bad frames. We tested two thresholds, 10 and 20
pixels, for 15 of 17 events. We tested much higher thresholds, 25 and 35
pixels, for the final two events of 2019-04-14 because the calibration was
poor in these events – we suspect one of the cameras was bumped prior to
these events. Next, we deleted brief segments of five or fewer frames and
stitched together longer segments separated by fewer than 30 frames.
Importantly, we did not have to do any further interpolation to stitch
segments together, as Anipose produces a continuous 3D trajectory.
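The thresholding, short-segment removal, and stitching steps above can be sketched in Python as follows. This is a minimal illustration rather than the code used for this dataset: the function name, the half-open (start, stop) segment representation, and the strict "less than" comparison at the error threshold are assumptions.

```python
import math

def clean_segments(reproj_error, err_thresh=20.0, max_short=5, max_gap=30):
    """Return usable (start, stop) frame ranges (stop exclusive) for one label.

    reproj_error: per-frame reprojection error in pixels (NaN = untracked).
    err_thresh:   frames at or above this error are treated as bad (assumed strict).
    max_short:    segments of this many frames or fewer are deleted.
    max_gap:      remaining segments separated by fewer than this many frames are stitched.
    """
    good = [not math.isnan(e) and e < err_thresh for e in reproj_error]

    # Collect runs of consecutive good frames; the trailing False closes a final run.
    runs, start = [], None
    for i, is_good in enumerate(good + [False]):
        if is_good and start is None:
            start = i
        elif not is_good and start is not None:
            runs.append((start, i))
            start = None

    # Delete brief segments.
    runs = [(a, b) for a, b in runs if b - a > max_short]

    # Stitch segments separated by short gaps. The gap frames are kept as-is,
    # since the 3D trajectory is already continuous and needs no interpolation.
    merged = []
    for a, b in runs:
        if merged and a - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged

# Hypothetical error trace: 10 good, 3 bad, 8 good, 40 bad, 3 good, 40 bad, 7 good frames.
errors = [2.0] * 10 + [50.0] * 3 + [2.0] * 8 + [50.0] * 40 + [2.0] * 3 + [50.0] * 40 + [2.0] * 7
print(clean_segments(errors))
```

In this example the 3-frame segment is deleted, the first two segments are stitched across the 3-frame gap, and the last segment stays separate because it follows a gap of more than 30 frames.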
Together, these steps removed portions of trajectories captured when the
marmoset was chewing or otherwise disengaged from the foraging task and
outside of the usable region of interest in camera-2 and combined segments
during foraging bouts that were separated only by brief occlusions or
minor tracking errors. All steps were performed independently for each
label and event. DLC labels could not be applied to the upper limb and
torso in spots corresponding exactly to XROMM bead locations because those
locations would often be obstructed from view by the marmoset’s own body
in one of the camera angles. We therefore applied labels as close as
possible to the correct spots and subtracted the average position from
each label and bead during post-processing. This removes a constant offset
that should not be included in the DLC error calculations. Despite our
best efforts to place DLC and XROMM in the same 3D coordinate system
through the calibration process described above, we found the two systems
to be slightly misaligned. To fix this, we computed the three principal
components across good frames for all DLC+Anipose labels and separately
for all XROMM markers, then projected the mean-subtracted DLC+Anipose and
XROMM trajectories onto their respective principal components. We found
that this brought the coordinate systems into close alignment, such that
we could no longer identify any systematic error that could be attributed
to misalignment. Finally, we found that there was a brief delay ranging
from 0 to 10 frames between pedal-triggered onset of the XROMM event and
the corresponding pedal-triggered TTL pulse initiating the start of the
event for the FLIR cameras (and for the pulse ending the event). To adjust
for the timing difference, we iterated over a range of possible sample
shifts separately for each event to find the shift that minimized the mean
absolute error between the DLC+Anipose and XROMM trajectory. We visually
inspected each trajectory after the adjustment to ensure the shift was
qualitatively accurate.

Evaluation of DLC Performance
We computed the
median and mean absolute error between matched trajectories from
DLC+Anipose and XROMM for all body parts across all reaching events. We
also computed the percent of motion tracked across all labels and all
active segments of reaching events. To define active segments, we manually
inspected the videos for the first and last frames in each event for which
the marmoset was engaged in the task; as mentioned before, the position of
camera-2 prevented accurate human labeling when the marmoset was
positioned well behind the partition and the vast majority of these frames
are discarded by Anipose and in post-processing.

Statistical Tests
Since the error distributions are right-skewed with long tails of large
errors, we use the median error to describe the center of each
distribution and the Mann-Whitney U-Test to assess statistical
significance. The P-values computed with this method are artificially low
due to the large sample size (e.g. 27,630 samples for the three upper arm
markers and 11,480 samples for the two torso markers), so we report the
correlation effect size defined by the rank-biserial correlation to
describe statistical differences between distributions. According to
convention, we consider r < 0.20 to be a negligible effect (Cohen,
1992). In order to determine which of the Anipose and post-processing
parameters from the parameter sweep significantly affected either the
median error or percent of frames tracked, we created two linear
regression models using the seven parameters and a constant as independent
variables and either error or percent tracked as the dependent variable.
We tested the effect of individual parameters by calculating the log
likelihood ratio Chi-squared test statistic (LR) between the full model
and each nested model created by leaving one parameter out at a time (such
that each nested model had a constant term and six parameter terms). We
computed the p-value of each comparison using a Chi-squared test with two
degrees of freedom. We also created a full interaction model with the
seven individual parameter terms and all possible first-order interaction
terms. We tested the significance of each term by the same method.
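The two statistical quantities described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code for this dataset. The rank-biserial correlation is written as r = 2U/(n_x n_y) - 1, a sign convention that is assumed here. The likelihood ratio statistic assumes Gaussian linear regression models summarized by their residual sums of squares (RSS), for which LR = n ln(RSS_nested / RSS_full), and it uses the closed-form Chi-squared survival function exp(-LR/2) that holds only for two degrees of freedom. Function names are hypothetical.

```python
import math

def rank_biserial(x, y):
    """Rank-biserial correlation between two samples, in [-1, 1].

    U counts pairs (x_i, y_j) with x_i > y_j, with ties counted as one half.
    The pairwise loop is quadratic; a rank-based U is preferable for large samples.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return 2.0 * u / (len(x) * len(y)) - 1.0

def lr_test_two_dof(rss_full, rss_nested, n_obs):
    """Likelihood ratio statistic and p-value for nested Gaussian linear models.

    Valid as written only for a Chi-squared reference distribution with
    two degrees of freedom, whose survival function is exp(-x / 2).
    """
    lr = n_obs * math.log(rss_nested / rss_full)
    p_value = math.exp(-lr / 2.0)
    return lr, p_value

# Hypothetical values for illustration only.
print(rank_biserial([1, 2, 3], [4, 5, 6]))
print(lr_test_two_dof(rss_full=1.0, rss_nested=math.e, n_obs=10))
```

With the convention above, the magnitude of r is what would be compared against the 0.20 threshold for a negligible effect.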
Please see README.txt for usage notes.