10.5061/DRYAD.C2FQZ614W
Brock, Marcus
0000-0002-0330-0426
University of Wyoming
Data from: A nested association mapping panel in Arabidopsis thaliana for
mapping and characterizing genetic architecture
Dryad
dataset
2020
National Science Foundation
https://ror.org/021nxhr62
IOS-0923752
National Science Foundation
https://ror.org/021nxhr62
IOS-1444571
2020-10-09T00:00:00Z
2020-10-09T00:00:00Z
en
https://doi.org/10.1534/g3.120.401239
47010564 bytes
3
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Linkage and association mapping populations are crucial public resources
that facilitate the characterization of trait genetic architecture in
natural and agricultural systems. We define a large nested association
mapping (NAM) panel from 14 publicly available recombinant inbred line
(RIL) populations of Arabidopsis thaliana, which share a common recurrent
parent (Col-0). Using a genotyping-by-sequencing (GBS) approach, we
identified single nucleotide polymorphisms (SNPs; range 563-1525 per
population) and subsequently built updated linkage maps in each of the 14
RIL sets. Simulations in individual RIL populations indicate that our GBS
markers have improved power to detect small effect QTL and enhanced
resolution of QTL support intervals in comparison to original linkage
maps. Using these robust linkage maps, we imputed a common set of
publicly available parental SNPs into each RIL linkage map, generating
overlapping markers across all populations. Though ultimately depending
on allele frequencies at causal loci, simulations of the NAM panel suggest
that surveying between 4 and 7 of the 14 RIL populations provides high
resolution of the genetic architecture of complex traits, relative to a
single mapping population.
SNP discovery and curation: We selected 14 Arabidopsis thaliana RIL
populations from the Institut National de la Recherche Agronomique (INRA;
Versailles, France) that use Col-0 as a common recurrent parent. From each
population, the 150 most informative RILs were selected, for a total of
2100 unique F8 RILs. To build high-density linkage maps across all 14
Arabidopsis thaliana RIL populations, we took a genotyping-by-sequencing
(GBS) approach to SNP discovery. We
digested each DNA sample with the restriction
endonucleases EcoRI and HindIII and then ligated customized adapters to
each fragment containing the Illumina adaptor sequences and 8-10 bp
barcode sequences. Ligated fragments were PCR amplified using two
separate reactions and resulting products were pooled to limit stochastic
effects on relative abundance of fragments. PCR products were then pooled
across individuals and libraries were size selected for fragments of
250-700 bp using a BluePippin (Beverly, MA, USA). Initial GBS libraries
were sent to the RTSF Genomics Core (Michigan State University, East
Lansing, MI, USA) and follow-up runs were sent to the Genomic Sequencing
and Analysis Facility (University of Texas, Austin, TX, USA). At both
facilities, libraries were sequenced on the Illumina HiSeq 2500 platform
(1 × 100 bp) and over 1 billion reads were assigned to barcoded samples.
Reads were mapped onto the Arabidopsis thaliana reference genome (TAIR10)
using two separate approaches and the resulting SNP calls were
merged. First, we used SOAP (SOAPaligner ver. 2.21 and SOAPsnp ver. 1.03)
in order to set priors on genotype calls based on the probability of
expected homozygosity in an F8 RIL population. Second, we utilized
BWA’s aln and samse algorithms (ver. 0.7) to map reads to the TAIR10
reference. We then called SNPs using SAMtools mpileup and
BCFtools view (ver. 0.1.19) algorithms. In both approaches, we retained
only uniquely mapping reads and only SNP genotype calls with a read depth
of eight or more. We used custom Perl scripts to combine SNP calling
approaches, merging the novel SNPs from BWA/SAMtools into the SOAPsnp
results. Finally, we merged SNPs originally genotyped from each RIL
population (INRA; Versailles, France) into our GBS approach after
converting INRA SNPs to the TAIR10 coordinate system.
RIL linkage map construction and SNP imputation: For each population, we
combined our new GBS SNP markers with existing INRA markers and imported
these data into the R/qtl package (Broman et al. 2003) with SNP order
based on physical
location. In each RIL population, we estimated marker map locations
(est.map; R/qtl) for each chromosome using a Kosambi mapping function. We
then imputed missing data across markers in each RIL set using
R/qtl’s fill.geno function to “fill in” missing genotypes between markers
with identical genotypes (ignoring chromosome ends and recombination
breakpoint regions). We then removed any imputed genotypes for which
genotype probabilities estimated from multipoint marker data
(calc.genoprob; R/qtl) were less than 99%.
NAM population SNP imputation and joint-linkage map construction: Because
our GBS markers rarely
overlapped across populations, we used the robust linkage maps of each
individual population to impute a common set of SNP markers across all 14
RIL sets. We utilized the publicly available 250K SNP Arabidopsis dataset
for imputation (Horton et al. 2012; Atwell et al. 2010) because it
contained 211,786 overlapping SNPs from 13 of the 14 alternate
parents. For the remaining parent, Ita-0, we interrogated publicly
available bam files (Durvasula et al. 2017) at each SNP location in the
250K dataset to determine Ita-0 marker states. In each RIL population, we
interpolated map positions of the 250K SNPs and again used fill.geno to
impute the 250K marker states from each parent between GBS markers with
identical marker states, e.g., we filled alternate parent SNP states from
the 250K dataset into intervals anchored at both ends by alternate GBS
marker states. Given that the intervals in the GBS-derived linkage maps
average 0.4 cM (see results), this fill-in approach has an average
imputation error rate of 0.0016% (i.e., the probability of a double
crossover in intervals anchored by like parental marker states) and a
maximal error rate of 1.93% (in the largest interval across all
populations, 13.9 cM). All 14 RIL populations were merged based on
imputed, overlapping SNPs, and neighboring markers in perfect linkage
with respect to both marker state and missing data were reduced to a
single entry. These SNPs could be used to map trait genetic architecture
via GWA-style analyses that control for population
structure. Alternatively, the genetic architecture of complex traits in
NAM populations can be resolved via extensions of traditional linkage
mapping approaches in concert with a joint-linkage map. Using the final
imputed SNP files of our merged NAM population, we also generated a
joint-linkage map for all 14 populations. Because recombination events
can only be detected between polymorphic SNPs, we selected SNPs for which
at least 11 of the 14 alternate parents shared a SNP state and differed
from the recurrent parent. SNPs in populations that were not polymorphic
for a specific marker were encoded as missing data. Markers were imported
into R/qtl with ordering based on physical location and genetic map
locations were estimated using the Kosambi mapping function (est.map;
R/qtl).
LITERATURE CITED
Atwell, S., Y.S. Huang, B.J. Vilhjalmsson, G. Willems, M. Horton et al.,
2010 Genome-wide association study of 107 phenotypes in Arabidopsis
thaliana inbred lines. Nature 465 (7298):627-631.
Broman, K.W., H. Wu, Ś. Sen, and G.A. Churchill, 2003 R/qtl: QTL mapping
in experimental crosses. Bioinformatics 19:889-890.
Durvasula, A., A. Fulgione, R.M. Gutaker, S.I. Alacakaptan, P.J. Flood et
al., 2017 African genomes illuminate the early history and transition to
selfing in Arabidopsis thaliana. Proceedings of the National Academy of
Sciences of the United States of America 114 (20):5213-5218.
Horton, M.W., A.M. Hancock, Y.S. Huang, C. Toomajian, S. Atwell et al.,
2012 Genome-wide patterns of genetic variation in worldwide Arabidopsis
thaliana accessions from the RegMap panel. Nature Genetics 44 (2):212-216.
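The imputation error rates quoted in the methods (0.0016% for the average
0.4 cM interval and 1.93% for the largest 13.9 cM interval) are consistent
with a simple double-crossover approximation. The sketch below reproduces
that arithmetic under two assumptions that are not stated in the source:
the recombination fraction is taken as the map distance in cM divided by
100, and the two crossovers are treated as independent (no interference).
The function name is illustrative only.

```python
def double_crossover_probability(interval_cm: float) -> float:
    """Approximate probability of two crossovers inside one interval.

    Assumption: recombination fraction ~ cM / 100, crossovers independent.
    """
    r = interval_cm / 100.0
    return r * r


if __name__ == "__main__":
    # Interval sizes are the two values reported in the methods text.
    for label, cm in (("average interval", 0.4), ("largest interval", 13.9)):
        pct = 100.0 * double_crossover_probability(cm)
        print(f"{label}: {cm} cM -> {pct:.4f}% expected imputation error")
```

Because the rate scales with the square of the interval length, a few wide
intervals dominate the worst-case error while the typical interval
contributes almost nothing.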
USAGE NOTES FOR INRA RIL FILES:
INRA_2RV_bla1_col0_imputed.csv to INRA_29RV_ita0_col0_imputed.csv
GBS SNPs for each of 14 INRA Arabidopsis thaliana RIL populations with
Col-0 as a common recurrent parent. GBS markers and original INRA markers
were merged and imported into R/qtl based on physical positions, and
linkage maps were estimated.
HEADER legend:
marker: marker names composed of chromosome and bp position in TAIR10,
e.g., 1_36150 is chr1 at 36150 bp
chr: chromosome
pos: centimorgan position
X2RV4..X23RV499: list of INRA RILs, e.g., X2RV4 is population 2RV, RIL 4
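The naming conventions in the header legend can be unpacked
programmatically. The following is a minimal sketch, assuming marker names
always follow the chromosome_position pattern and RIL columns always follow
the X + population + RIL number pattern shown in the examples; the function
names and regular expressions are illustrative and not part of the dataset.

```python
import re

# Patterns inferred from the examples in the legend (an assumption).
MARKER_RE = re.compile(r"^(\d+)_(\d+)$")   # e.g. 1_36150
RIL_RE = re.compile(r"^X(\d+RV)(\d+)$")    # e.g. X2RV4


def parse_marker(name: str) -> tuple[int, int]:
    """Split a marker name into (chromosome, TAIR10 bp position)."""
    match = MARKER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected marker name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def parse_ril(column: str) -> tuple[str, int]:
    """Split a RIL column name into (population, RIL number)."""
    match = RIL_RE.match(column)
    if match is None:
        raise ValueError(f"unexpected RIL column name: {column!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    print(parse_marker("1_36150"))  # chromosome 1, position 36150
    print(parse_ril("X2RV4"))       # population 2RV, RIL 4
```

Raising on names that do not match keeps silent mis-parsing out of any
downstream grouping by population or chromosome.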
USAGE NOTES FOR INRA NAM FILES:
INRA_NAM_imputed_SNPs_chr1.csv to INRA_NAM_imputed_SNPs_chr5.csv
Imputed SNPs in 14 Arabidopsis thaliana RIL populations that share Col-0
as a common recurrent parent. Neighboring SNP markers in perfect linkage
with respect to both marker state and missing data were reduced to a
single entry. However, for subsets of the 14 populations, many of these
SNPs could be collapsed again to reduce file sizes and SNP numbers.
HEADER legend:
marker: marker names composed of chromosome and bp position in TAIR10,
e.g., 1_36150 is chr1 at 36150 bp
chr: chromosome
pos: position in bp
col0_base: SNP call in Col-0
col0_allele: defined as REF
ALT_base: SNP call in alternate parents that differ from Col-0
ALT_allele: defined as ALT
X2RV4..X23RV499: list of INRA RILs and their allele states (REF vs. ALT
for each marker), e.g., X2RV4 is population 2RV, RIL 4
USAGE NOTES FOR INRA JOINT-LINKAGE FILES:
INRA_NAM_joint_linkage_map.csv
Using INRA_NAM_imputed_SNPs, we identified markers where at least 11 of
the 14 alternate parents shared a SNP state and differed from the
recurrent parent. SNPs in populations that were not polymorphic for a
specific marker were encoded as missing data. Markers were imported into
R/qtl with ordering based on physical location and genetic map locations
were estimated using the Kosambi mapping function (est.map; R/qtl).
HEADER legend:
marker: marker names composed of chromosome and bp position in TAIR10,
e.g., 1_48181 is chr1 at 48181 bp
chr: chromosome
cM: centimorgan position
X2RV4..X23RV499: list of INRA RILs and their genotypes at each marker,
e.g., X2RV4 is population 2RV, RIL 4
AA: homozygous for the reference allele (Col-0)
BB: homozygous for the alternate allele
"-": missing data
USAGE NOTES FOR INRA JOINT-LINKAGE FILES:
INRA_NAM_joint_linkage_map_SNP_states.csv
SNP states at markers in INRA_NAM_joint_linkage_map.csv
HEADER legend:
marker: marker names composed of chromosome and bp position in TAIR10,
e.g., 1_48181 is chr1 at 48181 bp
REF_SNP_state: base call in RILs carrying the REF allele (i.e., Col-0
allele)
ALT_SNP_state: base call in RILs carrying the ALT allele
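The two joint-linkage files can be combined to turn AA/BB genotype codes
into base calls. The sketch below illustrates one way to do this with the
Python standard library. The two inline CSV snippets are made-up stand-ins
that follow the column names in the header legends; the base calls and the
second marker (1_60000) are hypothetical, and the real files may differ in
details such as quoting or column order.

```python
import csv
import io

# Hypothetical excerpts shaped like the two joint-linkage files.
LINKAGE_CSV = """marker,chr,cM,X2RV4,X2RV5
1_48181,1,0.0,AA,BB
1_60000,1,0.5,-,AA
"""
STATES_CSV = """marker,REF_SNP_state,ALT_SNP_state
1_48181,A,G
1_60000,C,T
"""

NON_RIL_COLUMNS = ("marker", "chr", "cM")


def load_states(text: str) -> dict[str, tuple[str, str]]:
    """Map each marker to its (REF base, ALT base) pair."""
    reader = csv.DictReader(io.StringIO(text))
    return {r["marker"]: (r["REF_SNP_state"], r["ALT_SNP_state"]) for r in reader}


def decode_genotypes(linkage_text: str, states: dict) -> dict:
    """Translate AA -> REF base, BB -> ALT base, "-" -> None per RIL."""
    decoded = {}
    for row in csv.DictReader(io.StringIO(linkage_text)):
        ref, alt = states[row["marker"]]
        lookup = {"AA": ref, "BB": alt, "-": None}
        decoded[row["marker"]] = {
            col: lookup[val]  # KeyError here flags an unexpected code
            for col, val in row.items()
            if col not in NON_RIL_COLUMNS
        }
    return decoded


if __name__ == "__main__":
    result = decode_genotypes(LINKAGE_CSV, load_states(STATES_CSV))
    for marker, calls in result.items():
        print(marker, calls)
```

For the real files, the same functions could be fed the file contents read
from disk instead of the inline strings.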