10.5061/DRYAD.N5TB2RBV2
Bouzid, Nassima
0000-0001-5565-8676
Museum of Vertebrate Zoology
Archie, James
California State University, Long Beach
Anderson, Roger
Western Washington University
Grummer, Jared
University of British Columbia
Leaché, Adam
University of Washington
Data from: Evidence for ephemeral ring species formation during the
diversification history of Western Fence Lizards (Sceloporus occidentalis)
Dryad
dataset
2021
FOS: Biological sciences
National Science Foundation
https://ror.org/021nxhr62
DEB-1144630
2021-02-08T00:00:00Z
2021-02-08T00:00:00Z
en
4246406 bytes
4
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Divergence is often ephemeral, and populations that diverge in response to
regional topographic and climatic factors may not remain reproductively
isolated when they come into secondary contact. We investigated the
geographic structure and evolutionary history of population divergence
within Sceloporus occidentalis (Western Fence Lizards), a habitat
generalist with a broad distribution that spans the major biogeographic
regions of Western North America. We used double digest RAD sequencing to
infer population structure, phylogeny, and demography. Population genetic
structure is hierarchical and geographically structured with evidence for
gene flow between biogeographic regions. Consistent with the
isolation-expansion model of divergence during Quaternary
glacial-interglacial cycles, gene flow and secondary contact are supported
as important processes explaining the demographic histories of
populations. Although populations may have diverged as they spread
northward in a ring-like manner around the Sierra Nevada and southern
Cascade Ranges, there is strong evidence for gene flow among populations
at the northern terminus of the ring. We propose the concept of an
“ephemeral ring species” and contrast S. occidentalis with the classic
North American ring species, Ensatina eschscholtzii. Contrary to
expectations of lower genetic diversity at northern latitudes following
post-Quaternary-glaciation expansion, the ephemeral nature of divergence
in S. occidentalis has produced centers of high genetic diversity for
different reasons in the south (long-term stability) versus the north
(secondary contact).
Sampling and RAD sequencing A total of 108 Sceloporus occidentalis were
sampled from 87 sites throughout their range in western North America
(Fig. 1; Table S1). Double digest RAD sequencing data (ddRADseq) were
collected using standard protocols (Peterson et al. 2012). Genomic DNA
(500 ng per sample) was double-digested with 20 units each of a rare
cutter SbfI (restriction site 5'-CCTGCAGG-3') and a common
cutter MspI (restriction site 5'-CCGG-3') in a single reaction
with the manufacturer recommended buffer (New England Biolabs) for 8 hours
at 37°C. Fragments were purified with SeraPure SpeedBeads before ligation
of barcoded Illumina adaptors onto the fragments. The libraries were
size-selected (between 415 and 515 bp after accounting for adapter length)
on a Blue Pippin Prep size fractionator (Sage Science). The final library
amplification used proofreading Taq and Illumina's indexed primers.
The fragment size distribution and concentration of each pool was
determined on an Agilent 2200 TapeStation, and qPCR was performed to
determine library concentrations before multiplexing equimolar amounts of
each pool for sequencing on a single Illumina HiSeq 2500 lane (50-bp,
single-end reads) at the Vincent J. Coates Genomics Sequencing Laboratory
at UC Berkeley. Bioinformatics The raw Illumina reads were filtered and
demultiplexed using PyRAD v.3.0.3 (Eaton 2014) and assembled using Stacks
v.2.54 (Catchen et al. 2013; Rochette et al. 2019) and STACKS_PIPELINE
v.2.4 (Portik et al. 2017). Reads potentially arising from PCR duplicates
during sequencing were not explicitly accounted for because they occur at
low frequency and presumably do not significantly impact most population
genetic parameter estimates (Schweyen et al. 2014). During filtering,
sites with < 99% base call accuracy (Phred score = 20) were
converted to missing data and reads with ≥ 10% missing sites were
discarded. No barcode mismatches were allowed during demultiplexing. Reads
were aligned into stacks with a minimum depth of coverage of 5 and a
maximum of 2 nucleotide differences between stacks. The minor allele count
was set to 2 to eliminate singletons, which reduces errors in model-based
clustering methods (Linck & Battey 2019). Loci that were
invariant, non-biallelic, or absent from > 20% of samples were
removed. Samples with > 70% missing data were also removed. One
random variable site per locus was sampled to minimize the chance of
retaining physically linked SNPs. The resulting unlinked SNP dataset was
used to generate input files for downstream analyses. Population structure
Two methods were used to estimate population structure. The maximum
likelihood method ADMIXTURE v1.3.0 (Alexander et al. 2009) was used to
estimate (1) the number of populations (K) and (2) admixture proportions
of samples to identify putative "hybrids" of mixed population
ancestry. Samples with admixture proportions < 0.95 were considered
admixed. The cross-validation errors for analyses from K = 1 to K = 10
were compared to determine which K minimized group assignment error; e.g.,
the K with the lowest cross-validation (CV) error is the model best
supported by the data. However, parametric methods for estimating
population structure are sensitive to violations of model assumptions
prevalent in natural populations (Lawson et al. 2018). Therefore,
Discriminant Analysis of Principle Components (DAPC), a non-parametric
method, was also used to estimate population structure as implemented in
the R package adegenet (Jombart et al. 2010; Jombart & Ahmed
2011). First, principal components of genetic variation were estimated
without any assumptions about grouping information or the underlying
population genetic model. Second, the find.clusters function was used to
evaluate successive values of K from 1 to 10 with the Bayesian information
criterion (BIC) to determine the most likely number of populations. The K
value with the lowest BIC score was considered optimal, although it is
important to consider that other K values can also provide biologically
realistic models. When multiple SNPs are present across loci, the random
selection of one SNP per locus represents one of many possible dataset
combinations. To assess the stability and sensitivity of K to this SNP
selection procedure, we performed random selection of one SNP per locus
100 times to create 100 replicates of unlinked SNP datasets. These dataset
permutations were created using STACKS_PIPELINE. The optimal K values for
all 100 replicates were then evaluated with ADMIXTURE and DAPC, and
summarized to determine variation in support for K values. Network and
population tree analysis Three methods were used to visualize
intraspecific genetic relationships and estimate relationships among
populations. Network methods can depict relationships that are not
necessarily bifurcating and can identify admixed samples (Posada and
Crandall 2001). A genetic network was constructed from the concatenated
SNP data (uncorrected “p” distances) using the NeighborNet algorithm
(Bryant and Moulton 2004) in SplitsTree v4.6 (Huson and Bryant 2006).
Phylogenetic analyses were conducted using SNAPP v1.5.0 (Bryant et al.
2012) and TreeMix v1.13 (Pickrell & Pritchard 2012). One key
difference between these methods is that SNAPP is migration-free and
produces strictly bifurcating trees, while TreeMix models migration events
as non-bifurcating, internal branches among populations. The population
assignments required for the phylogenetic methods were obtained from the
ADMIXTURE model for K=5 (see Results). Admixed samples (admixture
proportion < 0.95) were excluded prior to using SNAPP to avoid
biasing topologies and parameter estimates (Leaché et al. 2014). SNAPP was
implemented in BEAST v2.6.3 via CIPRES (Bouckaert et al., 2019; Miller et
al. 2010). The mutation rate priors (u and v) were set to 1, and the Yule
birth rate prior (λ) was set to 10. A diffuse prior on the expected
divergence between populations (θ) was assigned with a gamma distribution
(1, 250) corresponding to a mean of 0.004. Three independent analyses were
run for 1 million generations each, sampling every 1000 generations. The
posterior distribution of trees was summarized using DensiTree
(implemented in BEAST) to visualize uncertainty in the topology and branch
lengths. A maximum clade credibility (MCC) tree was summarized using
TreeAnnotator (BEAST) after discarding the first 20% of samples as
burn-in. Branch lengths were converted to time using a human mutation rate
of 1.0 × 10-8 substitutions per site per generation (Lynch, 2010; Scally
and Durbin, 2012) and a generation time 2 years for S. occidentalis
(Jameson & Allison 1976). TreeMix was used to estimate a maximum
likelihood population tree and infer gene flow between diverged
populations. The analysis included all samples that passed missing data
thresholds (see Results), including admixed samples. SNPs with missing
allele counts in ≥1 population were removed from the input dataset prior
to analysis. Block resampling across groups of 100 unlinked SNPs was
implemented to model uncertainty in the sample covariance matrix and to
effectively explore parameter space. A root position was selected for the
population tree according to the topology of the SNAPP tree (see Results).
The analysis added 1 – 6 migration events (m) to the tree, and each
analysis included 500 bootstrap replicates. Replicates were summarized
using the R package OptM (https://CRAN.R-project.org/package=OptM). The
value of m that explained most of the variation in the data was determined
by comparing the average increase in explained variance with each added
migration event and the rate of change in the likelihood with the Evanno
method (Evanno et al. 2005). We further tested for admixed ancestry with
the threepop and fourpop tests in TreeMix (Keinan et al., 2007; Reich et
al., 2009). The threepop analysis tests for admixture within all possible
triplets by evaluating the null hypothesis that the relationship within
the triplet can be explained by a simple bifurcating tree with a common
ancestor. A significantly negative value of the f3 statistic suggests the
presence of admixture within the triplet, and that relationship cannot be
captured by a simple bifurcating tree. The fourpop analysis tests for
treeness across all possible quartets by evaluating the null hypothesis
that two pairs of populations can only be connected by one internal
branch. An f4 statistic that significantly deviates from zero suggests the
presence of admixture within the quartet, meaning more than one internal
branch can connect two pairs of populations. Demographic models
Demographic models were compared to examine potential gene flow and
secondary contact between populations. Divergence scenarios were modeled
using joint site frequency spectra (JSFS) in dadi v2.1.0 (Gutenkunst et
al. 2009), and model optimization routines were performed using
DADI_PIPELINE V3.1.5 (Portik et al. 2017). Parameters of interest included
divergence times (T), population sizes (nu), and the amounts and
directions of gene flow (m). Primary analyses were conducted on
geographically contiguous populations that reflect phylogenetic
relationships among populations (see Results). Population abbreviations
used were: Southern California (SCA), West Sierra Nevada (WSN), East
Sierra Nevada (ESN), Pacific Northwest (PNW), Great Basin (GB). Population
sets included WSN/PNW, ESN/GB; and ESN/GB/SCA. Secondary analyses were
conducted on geographically contiguous populations that do not reflect
direct phylogenetic relationships – these were used to validate whether
the presence of admixed individuals at population boundaries can be
explained by contemporary gene flow. These population comparisons involved
WSN/ESN and PNW/ESN. There are three higher-level divergence scenarios
represented by the models including divergence with isolation, divergence
with isolation followed by secondary contact, and divergence with
continuous gene flow. Eighteen 2D population models were used, which
represent variations of the three main model scenarios. These also allow
for symmetric or asymmetric gene flow and the possibility of size changes,
and are described in greater detail elsewhere (Portik et al. 2017). Four
new 3D models were created to test divergence scenarios in ESN/GB/SCA. The
results of the 2D ESN/GB modeling were used to fix this aspect of the 3D
model. The four models therefore vary in aspects of migration and
isolation between SCA and the ancestor of ESN/GB, and between SCA and ESN.
The SNP datasets for each population comparison were obtained by
re-running Stacks on the most inclusive subset of samples (e.g., only
individuals belonging to populations in the model) to maximize the number
of SNPs for modeling. Input files for dadi were created using the
Convert_tsv_to_dadi.py module in STACKS_PIPELINE and sample sizes were
projected down to account for missing data and maximize the number of
segregating sites (Table S3). The optimization routine in DADI_PIPELINE
was used to run four consecutive rounds using the following settings:
replicates = 10, 10, 15, 25; maxiter = 3, 5, 10, 15; fold = 3, 2, 2, 1,
Nelder-Mead optimizer. Replicates are considered independent within an
optimization routine, and the top-ranked replicate should represent the
global optimum. Although unlikely, it is possible for the analysis to
become trapped on a local optima. This can be checked by running the
optimization routine more than once. If the log-likelihood values of the
top-ranked replicates are highly similar across analyses, this provides
evidence for convergence on the global optima. For each model, we
performed two independent runs of the optimization routine and ensured
that the top-ranked replicates had similar log-likelihood scores. By using
unlinked SNPS, log-likelihood values returned by dadi represent true
likelihood values, rather than composite likelihood values. Models were
therefore compared using the Akaike information criterion (AIC) including
AIC scores, ΔAIC scores and Akaike weights (ωi) (Burnham &
Anderson, 2002). The top-ranked model for each population comparison was
then subjected to a goodness-of-fit test to verify it represented a
reasonable explanation of the JSFS. Goodness-of-fit tests were performed
as described in Barratt et al. (2018), using 100 parametric bootstraps for
the top-ranked model. The test was considered passed if the empirical
log-likelihood was contained within distribution of log-likelihood values
obtained from the bootstraps. Effective migration and landscape
connectivity We used EEMS (Estimating Effective Migration Surfaces;
Petkova et al. 2016) to identify corridors of spatial connectivity and
geographic barriers to migration and to estimate relative genetic
diversity across the species range. EEMS compares empirical
dissimilarities in genetic and geographic structure to those expected
under an isolation-by-distance (IBD) model to estimate an effective
migration surface. Two key parameters are estimated by the model:
effective diversity rates (q) quantify the genetic similarity of two
individuals from the same location, and effective migration rates (m)
quantify how genetic similarity decays across space. Georeferenced samples
are assigned to one of a specified number of demes, and dissimilarities
are calculated between adjacent demes, with the assumption that an
infinite number of demes would approach a continuous estimate of gene flow
across the landscape. Three replicate EEMS runs (1 million iterations,
50,000 burn-in, sampling every 1,000 iterations) assuming 700 demes were
compared to check for convergence before combining the analyses to
summarize the final results. To visualize the spatial structure of genetic
diversity (q) and migration (m), contour maps were produced using the R
package rEEMSplots (https://github.com/dipetkov).