10.5061/DRYAD.GQ0PB
Whelan, Simon
Uppsala University
University of Manchester
Allen, James E.
Uppsala University
University of Manchester
Blackburne, Benjamin P.
Uppsala University
University of Manchester
Talavera, David
Uppsala University
University of Manchester
Data from: ModelOMatic: fast and automated comparison between RY,
nucleotide, amino acid, and codon substitution models
Dryad
dataset
2014
model selection
substitution models
AIC
2014-09-09T15:34:18Z
2014-09-09T15:34:18Z
en
https://doi.org/10.1093/sysbio/syu062
5247575 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Molecular phylogenetics is a powerful tool for inferring both the process
and pattern of evolution from genomic sequence data. Statistical
approaches, such as maximum likelihood and Bayesian inference, are now
established as the preferred methods of inference. The choice of models
that a researcher uses for inference is of critical importance, and there
are established methods for model selection conditioned on a particular
type of data, such as nucleotides, amino acids, or codons. A major
limitation of existing model selection approaches is that they can only
compare models acting upon a single type of data. Here we extend model
selection to allow comparisons between models describing different types
of data by introducing the idea of adapter functions, which project
aggregated models onto the originally observed sequence data. These
projections are implemented in the program ModelOMatic and used to perform
model selection on 3,722 families from the PANDIT database, 68 genes from
an arthropod phylogenomic data set, and 248 genes from a vertebrate
phylogenomic data set. For the PANDIT and arthropod data, we find that
amino acid models are selected for the overwhelming majority of
alignments, with progressively smaller numbers of alignments selecting
codon and nucleotide models, and no families selecting RY-based models. In
contrast, nearly all alignments from the vertebrate data set select
codon-based models. The sequence divergence, the number of sequences, and
the degree of selection acting upon the protein sequences may contribute
to explaining this variation in model selection. Our ModelOMatic program
is fast, with most families from PANDIT taking fewer than 150 seconds to
complete, and should therefore be easily incorporated into existing
phylogenetic pipelines.
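
The AIC-based comparison described above can be illustrated with a minimal
Python sketch. It is not the ModelOMatic implementation. It assumes that the
log-likelihoods of all candidate models have already been made comparable,
that is, projected onto the originally observed sequence data as the adapter
functions are described as doing. The function names, log-likelihood values,
and parameter counts are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2.0 * num_params - 2.0 * log_likelihood


def select_model(fits: Dict[str, Tuple[float, int]]) -> Tuple[str, Dict[str, float]]:
    """Return the name of the lowest-AIC model and the AIC of every model.

    `fits` maps a model name to (log-likelihood, number of free parameters).
    The log-likelihoods are assumed to be on a common scale already.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits for one alignment; these numbers are made up.
fits = {
    "RY": (-5400.0, 1),
    "nucleotide": (-5200.0, 9),
    "amino_acid": (-5000.0, 20),
    "codon": (-5050.0, 12),
}

best, scores = select_model(fits)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("selected:", best)
```

With these made-up inputs the amino acid model has the lowest AIC. The
selection itself is simple; the comparison is only meaningful once the
likelihoods refer to the same observed data.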
Whelan_ModelOMaticAppendix
Appendix for ModelOMatic