10.5061/DRYAD.M905QFV26
Gompert, Zachariah
0000-0003-2248-2488
Utah State University
Feder, Jeff
Notre Dame University
Nosil, Patrik
Centre d'Ecologie Fonctionnelle et Evolutive
Natural selection drives genome-wide evolution via chance genetic associations
Dryad
dataset
2021
Contemporary Evolution
Population Genetics - Empirical
Natural Selection and Contemporary Evolution
National Science Foundation
https://ror.org/021nxhr62
DEB 1844941
European Research Council
https://ror.org/0472cxd90
EE-Dynamics 770826
2021-10-29T00:00:00Z
2021-10-29T00:00:00Z
en
3041156525 bytes
2
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Understanding selection's impact on the genome is a major theme in
biology. Functionally-neutral genetic regions can be affected indirectly
by natural selection, via their statistical association with genes under
direct selection. The genomic extent of such indirect selection,
particularly across loci not physically linked to those under direct
selection, remains poorly understood, as does the time scale at which
indirect selection occurs. Here we use field experiments and genomic data
in stick insects, deer mice and stickleback fish to show that widespread
statistical associations with genes known to affect fitness cause many
genetic loci across the genome to be impacted indirectly by selection.
This includes regions physically distant from those directly under
selection. Then, focusing on the stick insect system, we show that
statistical associations between SNPs and other unknown, causal variants
result in additional indirect selection in general and specifically within
genomic regions of physically linked loci. This widespread indirect
selection necessarily makes aspects of evolution more predictable. Thus,
natural selection combines with chance genetic associations to affect
genome-wide evolution across linked and unlinked loci and even in
modest-sized populations. This process has implications for the
application of evolutionary principles in basic and applied science.
Whole genome DNA sequence data were previously generated from 491 Timema
cristinae stick insects that were part of a release-recapture selection
experiment (available from the NCBI SRA PRJNA356801). For the current
study, we aligned the whole genome DNA sequence data from each of these
491 T. cristinae to the T. cristinae reference genome (version 1.3) using
the bwa (version 0.7.10-r789) mem algorithm with a band width of 100, a 20
bp seed length and a minimum score for output of 30. We then used samtools
(version 1.5) to compress, sort and index the alignments, and to remove
PCR duplicates. We then used the GATK HaplotypeCaller and GenotypeGVCFs
modules (version 3.5) to call variants and calculate genotype likelihoods.
We required a minimum base quality of 30, set the prior probability of
heterozygosity to 0.001, and only called variants with a minimum
phred-scaled confidence of 50. The following filters were then applied
using custom Perl scripts: minimum coverage of 1X per individual (i.e.,
491X coverage across all individuals), a minimum ratio of variant
confidence to non-reference read depth of 2, a minimum mapping quality of
40, a maximum phred-scaled P-value of Fisher's exact test for strand
bias of 60, and a minimum minor allele frequency of 0.01. Further, we only
retained SNPs mapped to one of the 13 T. cristinae linkage groups. This
resulted in 7,243,463 SNPs, which were used in subsequent analyses. Next,
we obtained maximum likelihood estimates of allele frequencies for all
experimental samples using an expectation-maximization (EM) algorithm as
implemented in estpEM (version 0.1). For this, we used a convergence
tolerance of 0.001 and allowed for a maximum of 30 EM iterations. We then
used these allele frequency estimates and the genotype likelihoods from
GATK to calculate empirical Bayesian genotype estimates. These point
estimates range from zero to two, and are not constrained to be integer
values. Thus, this data set includes the genotype estimates for the 491
individuals as well as the data on survival, i.e., whether or not they
were re-captured at the end of the experiment.
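
The last two steps above (EM allele frequency estimation, then empirical Bayesian genotype estimates) can be illustrated with a minimal sketch. This is not the estpEM code or the authors' scripts. It assumes a Hardy-Weinberg genotype prior given the allele frequency, and it assumes genotype likelihoods are supplied as phred-scaled values for 0, 1 and 2 non-reference alleles; both are assumptions made for illustration. The function names and the toy values are hypothetical. Only the convergence tolerance (0.001) and iteration cap (30) come from the description above.

```python
import numpy as np

def hwe_prior(p):
    """Genotype probabilities for 0, 1, 2 non-reference alleles.
    Assumes Hardy-Weinberg proportions (an assumption of this sketch)."""
    return np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])

def posterior(lik, p):
    """Posterior genotype probabilities, one row per individual."""
    w = lik * hwe_prior(p)
    return w / w.sum(axis=1, keepdims=True)

def em_allele_freq(lik, tol=0.001, max_iter=30):
    """EM estimate of the non-reference allele frequency at one SNP."""
    p = 0.5  # arbitrary starting value
    for _ in range(max_iter):
        # E-step: expected allele count per individual
        expected = posterior(lik, p) @ np.arange(3)
        # M-step: update the allele frequency (two alleles per diploid)
        p_new = expected.mean() / 2
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

def genotype_point_estimates(lik, p):
    """Posterior mean genotype: ranges from 0 to 2, not forced to integers."""
    return posterior(lik, p) @ np.arange(3)

# Toy phred-scaled likelihoods for four individuals at one SNP (made up)
pl = np.array([[0, 30, 60], [20, 0, 25], [3, 0, 3], [50, 20, 0]], dtype=float)
lik = 10 ** (-pl / 10)

p_hat = em_allele_freq(lik)
g_hat = genotype_point_estimates(lik, p_hat)
```

Under this prior, an individual with no informative reads (equal likelihoods for all three genotypes) receives an estimate of 2p, which is why low-coverage data yield non-integer genotype values.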
pntest_LG_*_mod_filtered1X_tcrExperimentVariants.txt.gz (* denotes 1, 2,
..., 13)
These text files contain the genotype estimates (Bayesian point
estimates of the number of non-reference alleles). There is one file per
chromosome (linkage group, numbered 1 to 13). Each contains one row per
SNP locus and one column per individual.

LG*_SNPs.txt.gz
These text files provide information about the genomic location of each
SNP in the genotype files described above. There is one file per
chromosome (linkage group, numbered 1 to 13). Each contains one row per
SNP locus with the scaffold number (1st column), linkage group number
(2nd column), 3 estimates of map position (in cM, columns 3-5), and the
position in base pairs.

Survival.txt
This text file contains one row per individual with 1 or 0 denoting
whether the stick insect survived (1) or died (0).

TrtmntHost.txt
This text file contains one row per individual stick insect with the
first column denoting the block number and the second column denoting
the host plant (A = Adenostoma, C = Ceanothus).
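
As a rough guide to working with these files, the sketch below reads a genotype matrix and a survival vector and compares allele frequencies between survivors and all released individuals. It assumes the files are whitespace-delimited with no header row, which is not stated in the file descriptions. The helper names and the tiny synthetic files are hypothetical stand-ins so the example runs without the real data. The comparison is only an illustration, not the analysis used in the study.

```python
import gzip
import os
import tempfile

import numpy as np

def load_genotypes(path):
    """Rows are SNPs, columns are individuals.
    Assumes whitespace-delimited text with no header (unverified)."""
    return np.loadtxt(path, ndmin=2)  # np.loadtxt reads .gz files directly

def load_survival(path):
    """One value per individual: 1 = survived, 0 = died."""
    return np.loadtxt(path, dtype=int, ndmin=1)

def allele_freq_change(geno, survived):
    """Per-SNP non-reference allele frequency among survivors
    minus the frequency among all released individuals."""
    alive = survived == 1
    return geno[:, alive].mean(axis=1) / 2 - geno.mean(axis=1) / 2

# Tiny synthetic stand-ins for one genotype file and Survival.txt
tmp = tempfile.mkdtemp()
geno_path = os.path.join(tmp, "toy_genotypes.txt.gz")
surv_path = os.path.join(tmp, "toy_survival.txt")
with gzip.open(geno_path, "wt") as fh:
    fh.write("0.0 1.0 2.0 1.0\n2.0 2.0 0.0 0.0\n")
with open(surv_path, "w") as fh:
    fh.write("1\n1\n0\n0\n")

geno = load_genotypes(geno_path)
survived = load_survival(surv_path)

# Columns of the genotype matrix should line up with rows of the survival file
assert geno.shape[1] == survived.shape[0]

delta = allele_freq_change(geno, survived)
```

With the real data, each genotype file would be expected to have 491 columns, and its row order should match the corresponding LG*_SNPs.txt.gz file; checking both before any analysis is a cheap safeguard.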