10.5061/DRYAD.5VP21B10
Bérard, Jean
University of Lyon System
Guéguen, Laurent
University of Lyon System
Data from: Accurate estimation of substitution rates with
neighbour-dependent models in a phylogenetic context
Dryad
dataset
2012
neighbour-dependent substitution
CpG hypermutability
CpG islands
maximum likelihood phylogeny
2012-01-27T19:18:53Z
2012-01-27T19:18:53Z
en
https://doi.org/10.1093/sysbio/sys024
25346592 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Most models and algorithms developed to perform statistical inference from
DNA data make the assumption that substitution processes affecting
distinct nucleotide sites are stochastically independent. This assumption
ensures both mathematical and computational tractability, but is in
disagreement with observed data in many situations -- one well-known
example being CpG dinucleotide hypermutability in mammalian genomes. In
this paper, we consider the class of RN95+YpR substitution models, which
allows neighbour-dependent effects -- including CpG hypermutability -- to
be taken into account, through transitions between pyrimidine-purine
dinucleotides. We show that it is possible to adapt inference methods
originally developed under the assumption of independence between sites to
RN95+YpR models, using a mathematically rigorous framework provided by
specific structural properties of this class of models. We assess how
efficient this approach is at inferring the CpG hypermutability rate from
aligned DNA sequences. The method is tested on simulated data and compared
against several alternatives; the results suggest that it delivers a high
degree of accuracy at a low computational cost. We then apply our method
to an alignment of ten DNA sequences from primate species. Model
comparisons within the RN95+YpR class show the importance of taking into
account neighbour-dependent effects. An application of the method to the
detection of hypomethylated islands is discussed.
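As a small illustration of the pyrimidine-purine (YpR) dinucleotides mentioned above, the sketch below lists the positions in a sequence where a pyrimidine (C or T) is followed by a purine (A or G), and counts CpG occurrences. It is not the inference method of the paper; the function names `ypr_positions` and `count_dinucleotide` are hypothetical and chosen only for this example.

```python
# Minimal sketch: locate YpR dinucleotides (CA, CG, TA, TG) in a DNA string.
# Assumption: the sequence is a plain string over A, C, G, T; gaps and
# ambiguity codes are simply never matched.

PYRIMIDINES = frozenset("CT")
PURINES = frozenset("AG")


def ypr_positions(seq):
    """Return the 0-based start index of every pyrimidine-purine dinucleotide."""
    s = seq.upper()
    return [
        i
        for i in range(len(s) - 1)
        if s[i] in PYRIMIDINES and s[i + 1] in PURINES
    ]


def count_dinucleotide(seq, dinuc="CG"):
    """Count (possibly overlapping) occurrences of a given dinucleotide."""
    s = seq.upper()
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == dinuc)


if __name__ == "__main__":
    example = "ACGTTGCA"
    print(ypr_positions(example))       # [1, 4, 6]  (CG, TG, CA)
    print(count_dinucleotide(example))  # 1
```

Only these four dinucleotides are affected by the neighbour-dependent part of a YpR-type model; all other sites evolve as under a site-independent model.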
appendix
ENm001
Sequence of the ENm001 region of the ENCODE Pilot Project
(position 115,810,521 to position 117,687,946 on chromosome 7) from the
hg19 version of the human genome, together with aligned sequences from
nine other primate species (Chimpanzee, Gorilla, Orang-utan, Macaque,
Baboon, Marmoset, Tarsier, Gray mouse lemur, Galago) as available from the
Galaxy web tool (http://galaxy.psu.edu/).
ENm001_AR
Portion of ENm001 made
from "Ancestral" repeated elements, according to the
RepeatMasker annotations on the human sequence from the UCSC Table Browser
(http://genome.ucsc.edu/cgi-bin/hgTables). Pieces of the alignment
corresponding to simple repeats, low complexity regions, members of the
Alu family, and RNA elements that diverged less than 25% and L1 elements
that diverged less than 20% from the reference RepBase sequence were
removed.
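The removal rule described above can be sketched as a simple predicate over repeat annotations. The sentence is ambiguous about which categories the 25% threshold applies to; this sketch assumes one reading, in which simple repeats, low complexity regions, and Alu elements are always removed, while RNA elements are removed below 25% divergence and L1 elements below 20%. The `RepeatAnnotation` type and its category labels are hypothetical and do not correspond to actual RepeatMasker or UCSC field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatAnnotation:
    category: str          # hypothetical label, e.g. "simple_repeat", "Alu", "L1"
    divergence_pct: float  # divergence from the reference RepBase sequence, in percent


# Assumed reading of the rule: these categories are removed regardless of divergence.
ALWAYS_REMOVED = {"simple_repeat", "low_complexity", "Alu"}

# Assumed reading of the rule: these are removed only when they diverged
# less than the given percentage from the reference sequence.
DIVERGENCE_THRESHOLDS = {"RNA": 25.0, "L1": 20.0}


def is_removed(rep):
    """Return True if the annotated piece would be dropped from the alignment."""
    if rep.category in ALWAYS_REMOVED:
        return True
    threshold = DIVERGENCE_THRESHOLDS.get(rep.category)
    return threshold is not None and rep.divergence_pct < threshold


def keep_ancestral(repeats):
    """Keep only the annotations that survive the removal rule."""
    return [r for r in repeats if not is_removed(r)]


if __name__ == "__main__":
    annotations = [
        RepeatAnnotation("Alu", 12.0),
        RepeatAnnotation("L1", 15.0),
        RepeatAnnotation("L1", 27.5),
        RepeatAnnotation("RNA", 30.0),
        RepeatAnnotation("DNA_transposon", 22.0),
    ]
    for rep in keep_ancestral(annotations):
        print(rep.category, rep.divergence_pct)
```

Under the alternative reading, where the 25% threshold also applies to the first three categories, those labels would move from `ALWAYS_REMOVED` into `DIVERGENCE_THRESHOLDS` with a value of 25.0.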