10.5061/DRYAD.GV1Q5
Wang, Huai-Chun
Dalhousie University
Minh, Bui Quang
Medical University of Vienna
University of Vienna
Susko, Edward
Dalhousie University
Roger, Andrew J.
Dalhousie University
Data from: Modeling site heterogeneity with posterior mean site frequency
profiles accelerates accurate phylogenomic estimation
Dryad
dataset
2017
long-branch repulsion
long-branch repel
site heterogeneity
long-branch attraction
mixture model
posterior mean site frequency
2017-08-04T01:10:47Z
2017-08-04T01:10:47Z
en
https://doi.org/10.1093/sysbio/syx068
1853915193 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Proteins have distinct structural and functional constraints at different
sites that lead to site-specific preferences for particular amino acid
residues as the sequences evolve. Heterogeneity in the amino acid
substitution process between sites is not modeled by commonly used
empirical amino acid exchange matrices. Such model misspecification can
lead to artefacts in phylogenetic estimation such as long-branch
attraction. Although sophisticated site-heterogeneous mixture models have
been developed to address this problem in both Bayesian and maximum
likelihood (ML) frameworks, their formidable computational time and memory
usage severely limit their use in large phylogenomic analyses. Here we
propose a posterior mean site frequency (PMSF) method as a rapid and
efficient approximation to full empirical profile mixture models for ML
analysis. The PMSF approach assigns a conditional mean amino acid
frequency profile to each site calculated based on a mixture model fitted
to the data using a preliminary guide tree. These PMSF profiles can then
be used for in-depth tree-searching in place of the full mixture model.
Compared with widely used empirical mixture models with k classes, our
implementation of PMSF in IQ-TREE (http://www.iqtree.org) speeds up the
computation by approximately k/1.5-fold and requires a small fraction of
the RAM. Furthermore, this speedup allows, for the first time, full
nonparametric bootstrap analyses to be conducted under complex
site-heterogeneous models on large concatenated data matrices. Our
simulations and empirical data analyses demonstrate that PMSF can
effectively ameliorate long-branch attraction artefacts. In some empirical
and simulation settings PMSF provided more accurate estimates of
phylogenies than the mixture models from which they derive.
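The conditional mean profile described above can be illustrated with a
minimal sketch. It assumes that the fitted mixture model supplies class
weights, per-class frequency vectors, and the likelihood of each site under
each class on the guide tree. The function name, argument names, and toy
numbers are hypothetical and are not taken from IQ-TREE or from the
deposited files; real profiles have 20 amino acid states rather than 3.

```python
def pmsf_profile(weights, class_freqs, site_class_liks):
    """Return the posterior mean frequency profile for a single site.

    weights         -- mixture weight of each class (sums to 1)
    class_freqs     -- one frequency vector per class
    site_class_liks -- likelihood of this site under each class,
                       computed on the guide tree (assumed given)
    """
    # Posterior probability of each class at this site (Bayes' rule).
    joint = [w * lik for w, lik in zip(weights, site_class_liks)]
    total = sum(joint)
    posterior = [j / total for j in joint]

    # Average the class frequency vectors, weighted by the posterior.
    n_states = len(class_freqs[0])
    return [
        sum(p * freqs[a] for p, freqs in zip(posterior, class_freqs))
        for a in range(n_states)
    ]


# Toy example (hypothetical values): 2 classes, 3 states.
weights = [0.5, 0.5]
class_freqs = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
site_class_liks = [0.03, 0.01]
profile = pmsf_profile(weights, class_freqs, site_class_liks)
```

Because each site ends up with a single frequency vector instead of k
classes, the likelihood for tree searching no longer has to be summed over
all mixture classes, which is consistent with the speedup and memory
savings reported in the abstract.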
simuLBA.C20F.fourtaxa.tar: 4 taxa 20K sites
The sequence data (20K sites) were simulated under LG+C20+F+G for
four-taxon trees under the LBA-inducing conditions. The tree files are
also included.
simuLBA.C20F.fourtaxa.tar.gz

simuLBA.C60F.fourtaxa.tar: 4 taxa 20K sites
The sequence data (20K sites) were simulated under LG+C60+F+G for
four-taxon trees under the LBA-inducing conditions. The tree files are
also included.
simuLBA.C60F.fourtaxa.tar.gz

simuLBR.C20F.fourtaxa.tar: 4 taxa 20K sites
The sequence data were simulated under LG+C20+F+G for four-taxon trees
under the LBR-inducing conditions. The tree files are also included.
simuLBR.C20F.fourtaxa.tar.gz

simuLBR.C60F.fourtaxa.tar: 4 taxa 20K sites
The sequence data were simulated under LG+C60+F+G for four-taxon trees
under the LBR-inducing conditions. The tree files are also included.
simuLBR.C60F.fourtaxa.tar.gz

simuLBR.8taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for an 8-taxon tree
under an LBR-inducing condition. The tree file is also included.

simuLBR.12taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for a 12-taxon tree
under an LBR-inducing condition. The tree file is also included.

simuLBR.16taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for a 16-taxon tree
under an LBR-inducing condition. The tree file is also included.

simuLBR.20taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for a 20-taxon tree
under an LBR-inducing condition. The tree file is also included.

Supplementary Materials: main file
PMSF.Sup.Materials.pdf

Supplementary Materials: file 2
PMSF.Sup.Materials.2.pdf

simuLBA.LGFG.tar
Simulation under LG+F+G for 4 taxa, 1000 sites, under the LBA setting.

simuLBR.LGFG.tar
Simulation under LG+F+G for 4 taxa, 1000 sites, under the LBR setting.

simuLBR.EXEHO.tar
Simulation under EX_EHO for 4 taxa, 6000 sites, under the LBR setting.

simuLBA.EXEHO.tar
Simulation under EX_EHO for 4 taxa, 6000 sites, under the LBA setting.

simu.amborella.JTT.tar
Simulation under JTT+F+G based on an Amborella/Angiosperm tree for 12549
sites; used for Fig. S25.

simu.Ord0245.tar
Bootstrap alignment files for fitting PMSF based on the ML tree estimated
from Ord0245, one of the 300 proteins in the HSSP test datasets. These
data were used for producing Fig. S1.

simuLBA.C20F.1050sites.tar
Simulation under LG+C20+F+G for 4 taxa, 1050 sites, under the LBA setting.

simuLBR.C20F.1050sites.tar
Simulation under LG+C20+F+G for 4 taxa, 1050 sites, under the LBR setting.