10.5061/DRYAD.2G3S4
Kofler, Robert
University of Veterinary Medicine Vienna
Langmüller, Anna Maria
University of Veterinary Medicine Vienna
Nouhaud, Pierre
University of Veterinary Medicine Vienna
Otte, Kathrin Anna
University of Veterinary Medicine Vienna
Schlötterer, Christian
University of Veterinary Medicine Vienna
Data from: Suitability of different mapping algorithms for genome-wide
polymorphism scans with Pool-Seq data.
Dryad
dataset
2016
Next Generation Sequencing
mapping algorithm
Pool-Seq
2016-09-27T14:19:37Z
2016-09-27T14:19:37Z
en
https://doi.org/10.1534/g3.116.034488
675304853 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
The cost-effectiveness of sequencing pools of individuals (Pool-Seq)
provides the basis for the popularity and widespread use of this method
for many research questions, ranging from unravelling the genetic basis of
complex traits to the clonal evolution of cancer cells. Because the
accuracy of Pool-Seq could be affected by many potential sources of error,
several studies determined, for example, the influence of the sequencing
technology, the library preparation protocol, and mapping parameters.
Nevertheless, the impact of the mapping tools has not yet been evaluated.
Using simulated and real Pool-Seq data, we demonstrate a substantial
impact of the mapping tools leading to characteristic false positives in
genome-wide scans. The problem of false positives was particularly
pronounced when data with different read lengths and insert sizes were
compared. Out of 14 evaluated algorithms, novoalign, bwa mem, and clc4 are
most suitable for mapping Pool-Seq data. Nevertheless, no single algorithm
is sufficient for avoiding all false positives. We show that the
intersection of the results of two mapping algorithms provides a simple
yet effective strategy to eliminate false positives. We propose that the
implementation of a consistent Pool-Seq bioinformatics pipeline building
on the recommendations of this study can substantially increase the
reliability of Pool-Seq results, in particular when libraries generated
with different protocols are being compared.
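
The intersection strategy described above can be illustrated with a minimal sketch. It assumes that each mapper's downstream analysis has already produced a list of outlier SNPs identified by chromosome and position. The function name, variable names, and example coordinates are hypothetical and are not taken from the scripts in this dataset.

```python
def intersect_outliers(hits_a, hits_b):
    """Return the SNPs flagged as outliers under both mapping algorithms.

    Each argument is an iterable of (chromosome, position) tuples.
    """
    return sorted(set(hits_a) & set(hits_b))


# Hypothetical outlier SNPs obtained after mapping the same reads
# with two different algorithms (coordinates are made up).
novoalign_hits = [("2L", 1050), ("2L", 20411), ("3R", 77)]
bwa_mem_hits = [("2L", 20411), ("3R", 77), ("X", 5120)]

# Only SNPs supported by both mappers are retained; candidates seen
# with a single mapper are treated as likely mapper-specific artifacts.
shared = intersect_outliers(novoalign_hits, bwa_mem_hits)
print(shared)
```

Under this sketch, a candidate reported by only one mapper is discarded, which trades some sensitivity for fewer mapper-specific false positives.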
Simulated data: comparing allele frequencies with FST
fst.zip
Simulated data: with SNPs (no indels)
ref-snps.zip
Simulated data: with SNPs and indels
ref-snpsrandposindel.zip
Scripts used for the analysis
scripts.zip