10.5061/DRYAD.NS47Q
Gompert, Zachariah
Utah State University
Egan, Scott P.
Rice University
Barrett, Rowan D. H.
McGill University
Feder, Jeffrey L.
Notre Dame University
University of Notre Dame
Nosil, Patrik
University of Sheffield
Data from: Multilocus approaches for the measurement of selection on
correlated genetic loci
Dryad
dataset
2016
Natural Selection and Contemporary Evolution
National Science Foundation
https://ror.org/021nxhr62
DEB-1638768
2016-09-29T05:59:12Z
2016-09-29T05:59:12Z
en
https://doi.org/10.1111/mec.13867
914440640 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
The study of ecological speciation is inherently linked to the study of
selection. Methods for estimating phenotypic selection within a generation
based on associations between trait values and fitness (e.g. survival) of
individuals are established. These methods attempt to disentangle
selection acting directly on a trait from indirect selection caused by
correlations with other traits via multivariate statistical approaches
(i.e. inference of selection gradients). The estimation of selection on
genotypic or genomic variation could also benefit from disentangling
direct and indirect selection on genetic loci. However, achieving this
goal is difficult with genomic data because the number of potentially
correlated genetic loci (p) is very large relative to the number of
individuals sampled (n). In other words, the number of model parameters
exceeds the number of observations (p ≫ n). We present simulations
examining the utility of whole-genome regression approaches (i.e. Bayesian
sparse linear mixed models) for quantifying direct selection in cases
where p ≫ n. Such models have been used for genome-wide association
mapping and are common in artificial breeding. Our results show they hold
promise for studies of natural selection in the wild and thus of
ecological speciation. But we also demonstrate important limitations to
the approach and discuss study designs required for more robust
inferences.
Null model simulationsThis R script was used to simulate random mortality
and then infer total selection on genotypes based on allele frequency
changes while ignoring the effect of genetic drift.nullSim.RSimulated
phenotypes based on Timema GBS dataThis compressed folder contains the
simulated data sets where Timema cristinae GBS data were used as the
starting point. It contains three gemma formatted genotype files
(geno_sims.txt, geno2500_sims.txt, and geno592r_sims.txt) that contain the
underlying genotypic data for the base simulations, those with 2500
individuals and those with replicated genotypes, respectively. The
remaining files are named based on the simulations condition (simsN): 1,
h2=0.3, L = 100; 2, h2=0.3, L=10; 3, h2=0.05, L=100; 4, h2=0.05, L=10; 5,
h2=0.3, L=1000; and 6, h2=0.05, L=1000. Files with 'bin' in
their name correspond to results with a binary measure of fitness (e.g.,
alive or dead). The general file types are as follows: pheno* = gemma
formatted phenotype file with one column per simulated phenotypic data
set, bv* = breeding values (expected phenotypic values) for each
individual based on the genotypic data, effect* = phenotypic effect sizes
for causal variants, functvar* = a list of the causal variants (by SNP
number, and in the same order as in
effect*).simdata_timema.tar.gzSimulated phenotypes based on the Rhagoletis
GBS dataThis compressed folder contains the simulated data sets where
Rhagoletis pomonella GBS data were used as the starting point. It contains
two gemma formatted genotype files (geno_sims.txt and geno592r_sims.txt)
that contain the underlying genotypic data for the base simulations and
those with replicated genotypes, respectively. The remaining files are
named based on the simulations condition (simsN): 1, h2=0.3, L = 100; 2,
h2=0.3, L=10; 3, h2=0.05, L=100; 4, h2=0.05, L=10; 5, h2=0.3, L=1000; and
6, h2=0.05, L=1000. The general file types are as follows: pheno* = gemma
formatted phenotype file with one column per simulated phenotypic data
set, bv* = breeding values (expected phenotypic values) for each
individual based on the genotypic data, effect* = phenotypic effect sizes
for causal variants, functvar* = a list of the causal variants (by SNP
number, and in the same order as in effect*).simdata_rhag.tar.gzTimema
simulation scriptsThis compressed directory contains four R scripts. The
first three sims.R, sims2500.R and sims592r.R were used to simulate the
quantitative phenotypes (fitness values) from the Timema GBS data for the
original genotypic data, data with 2500 individuals, and with replicated
genotypes, respectively. The last script, liability2binary.R, was used to
convert quantitative fitness values to binary
data.simscripts_timema.tar.gzRhagoletis simulation scriptsThis compressed
directory contains two R scripts. These scripts (sims.R and sims592r.R)
were used to simulate the quantitative phenotypes (fitness values) from
the Rhagoletis GBS data for the original genotypic data and data with
replicated genotypes, respectively.simscripts_rhag.tar.gzSummary
scriptsThis compressed directory contains the perl scripts used to
summarize (1) the posterior distribution of genome-level genetic
architecture parameters (calcpost*), and (2) SNP specific selection
coefficients (summarizeSNP*pl).analysis_scripts.tar.gz