10.5061/DRYAD.MGQNK98X5
Ruegg, Kristen
0000-0001-5579-941X
Colorado State University
Brinkmeyer, Michaela
Boise State University
Bossu, Christen M
University of California Los Angeles
Bay, Rachael
University of California, Davis
Anderson, Eric C
Southwest Fisheries Science Center, National Marine Fisheries Service
Heath, Julie
Boise State University
Data from: The American Kestrel (Falco sparverius) genoscape: implications
for monitoring, management, and subspecies boundaries
Dryad
dataset
2020
2021-07-27T00:00:00Z
2021-05-18T00:00:00Z
en
https://doi.org/10.1093/auk/ukaa051
362172456 bytes
3
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Identifying population genetic structure is useful for inferring
evolutionary process as well as defining subspecies boundaries and/or
conservation units that can aid in species management. The American
kestrel (Falco sparverius) is a widespread species with two described
North American subspecies, (F. s. sparverius and F. s. paulus), the
latter in the southeastern United States and the former across the
remainder of its distribution. In many parts of their range, American
kestrels have been declining, but it has been difficult to interpret
demographic trends without a clearer understanding of gene flow among
populations. Here we sequence the first American kestrel genome and scan
the genome of 197 individuals from 12 sampling locations across the range
of the two North American subspecies to identify population structure. To
validate signatures of population structure and fill in sampling gaps
across the breeding range we screen 192 outlier loci in an additional 376
samples from 34 sampling locations. Overall, our analyses support the
existence of 5 genetically distinct lineages within American kestrels -
eastern, western, Texas, Florida, and Alaska. Interestingly, we find that
while our genome-wide genetic data support the existence of previously
described subspecies boundaries, genetic differences across the species’
range correlate more with putative migratory phenotypes (resident,
long-distance, and short-distance migrants) rather than a priori
described subspecies boundaries per se. Based on our results, we suggest
the resulting five genetic lineages serve as the foundation for American
kestrel conservation and management in the face of future threats.
SNP discovery and SNP filtering High-density RADseq was carried out on 287
individuals from 12 sampling locations following a modified version of the
bestRAD library preparation protocol (Table SI 1; Ali et al. 2016). The
program Stacks (Catchen et al. 2013) was used to demultiplex, filter and
trim adapters from the data with the process_radtags function and remove
duplicate read pairs using the clone_filter function. Bowtie2 was used to
map reads to the genome (Langmead and Salzberg 2012), and the
HaplotypeCaller in the Genome Analysis Toolkit was used to identify SNPs
(McKenna et al. 2010, Auwera et al. 2013). Vcftools (Danecek et al. 2011)
was used to remove indels, non-biallelic SNPs, and low quality and rare
variants (genotype quality 20; coverage depth 10; minor allele frequency
0.05). The final number of SNPs and individuals to be retained for further
analyses was assessed by visualizing the tradeoff between discarding
low-coverage SNPs and discarding individuals with missing genotypes using
custom scripts within the R-package genoscapeRtools (Anderson 2019).
Because preliminary analyses revealed outliers in the principal components
analysis and we were concerned about sample contamination between
individuals during the library preparation stage, we filtered out
individuals with >30% heterozygosity as heterozygosity is expected
to be high in cases where multiple individuals are combined into a single
well. Identification of outlier SNPs for Population Assignment Population
genomic analyses were conducted on all SNPs that passed our filters to
assess genome-wide patterns of genetic divergence and identify SNPs for
population assignment and assay design. Population genetic structure was
assessed by calculating pairwise population level FST (with different
sampling sites representing different populations) with bootstrapped
confidence intervals using the R package assigner (Gosselin et al. 2019),
and principal components analysis (PCA) using SNPRelate (Zheng et al.
2012). To test for isolation by distance we compared linearized FST with
pairwise geographic distance calculated from the central longitudinal and
latitudinal coordinates of each location using the Vincenty ellipsoid
method in the R package Geoshpere (Hijmans 2019). Because the PCA of all
SNPs from the genome-wide analysis revealed five major groups, including
Alaska, Texas, the west, the east and Florida (see results, Figure 1),
subsequent analyses focused on developing SNPs for population assignment
within and among these groups. To identify SNPs useful for population
assignment between the 5 genetically distinct populations, we used
vcftools (Danecek et al. 2011) to calculate site-wise FST between
populations and identify individual SNPs with the most power for
discriminating between populations (SNPs with the biggest allele frequency
differences). It is important to note that population genetic summary
statistics were based on the full RADseq data set (see above) rather than
downstream SNP dataset in order to avoid potential biases associated with
selecting SNPs with the highest discriminatory power for population
assignment. Custom R-scripts were used to evaluate which of our
top-ranking SNPs would generate designable assays based on the following
parameters: 1) Guanine – Cytosine content was less than 0.65, 2) there
were no insertions or deletions (indels) within 30bp of the variable site,
or 3) there were no ambiguous codes within 20bp of the variable site.
Additionally, we used bwa-mem (Burrows-Wheeler Aligner; Li and Durbin
2009) to determine which of our designable SNPs mapped uniquely to the
reference genome. Fluidigm SNPtype assays (Fluidigm Inc.) were then
developed in the 216 top ranking SNPs that passed our filters. Genetic
screening and building the genoscape Ninety-three samples and three
non-template controls were screened on the Fluidigm Corporation EP1
Genotyping System (Fluidigm Inc.) and assays were ranked by variability
and call rate to identify the most reliable 192 SNP assays of 216 that
were designed. The 192 variable SNP assays with the highest call rate
were used to screen an additional 396 American Kestrel feather samples
from 34 breeding locations in the United States and Canada in order to
fill in sampling gaps and refine the resulting map of population genetic
structure (Table 1). Following the methods described in Ruegg et al.
(2014), we amplified PCR products using fluorescently labelled
allele-specific primers and then used the EP1 Array Reader and Fluidigm’s
automated Genotyping Analysis Software (Fluidigm Inc.) to call alleles
with a confidence threshold of 90%. Each genotype was also visually
inspected for potential irregularities and uncertain genotype calls were
removed from the analysis. Samples with missing genotypes at >25%
of SNP assays were removed from the analyses and SNP loci with >25%
missing genotypes were removed, resulting in a total of 376 additional
individuals at 186 SNP loci that could be used to identify genetic
structure across the range (Table 1).
This dataset includes the filtered RADseq dataset. It is in the vcf
format, which includes 72,263 filtered genotypes for all 197 American
Kestrel individuals. Missing data is coded as "./.". The dataset
also includes the structure file for the SNPtype assays for all breeding
individuals. This includes 186 loci, population information and genetic
group information. Given the dataset does not include SNP information,
we've also included the plink map and ped files used to create the
structure file. The map file includes genomic location information in 4
columns. By default, each line of the MAP file describes a single marker
with scaffold information, snp identifier, physical base-pair position (0
if unknown) and base-pair posiiton. The ped file includes genotype
information for all positions in the .map file for each individual, along
with location information (FID). Distance matrices are also included:
pairwise linearized FST values for the 10 population that had greater than
4 individuals and pairsie geographic distance matrices calculated from the
Latitude and Longitude of the 10 populations (this information can be
found in the published tables).