10.5061/DRYAD.RM907
Dececchi, Thomas Alexander
Balhoff, James P.
National Evolutionary Synthesis Center
Lapp, Hilmar
National Evolutionary Synthesis Center
Mabee, Paula M.
University of South Dakota
Data from: Toward synthesizing our knowledge of morphology: using
ontologies and machine reasoning to extract presence/absence evolutionary
phenotypes across studies
Dryad
dataset
2015
missing data
ontology
Morphological Characters
Character Conflict
Sarcopterygii
Supermatrix
Evolutionary Mapping
2015-06-09T14:26:43Z
2015-06-09T14:26:43Z
en
https://doi.org/10.1093/sysbio/syv031
50757692 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
The reality of larger and larger molecular databases and the need to
integrate data scalably have presented a major challenge for the use of
phenotypic data. Morphology is currently primarily described in discrete
publications, entrenched in non-computer-readable text, and requires
enormous investments of time and resources to integrate across large
numbers of taxa and studies. Here we present a new methodology, using
ontology-based reasoning systems working with the Phenoscape Knowledgebase
(KB; kb.phenoscape.org), to automatically integrate large amounts of
evolutionary character state descriptions into a synthetic character
matrix of neomorphic (presence/absence) data. Using the KB, which includes
more than 55 studies of sarcopterygian taxa, we generated a synthetic
supermatrix of 639 variable characters scored for 1051 taxa, resulting in
over 145,000 populated cells. Of these characters, over 76% were made
variable through the addition of inferred presence/absence states derived
by machine reasoning over the formal semantics of the source ontologies.
Inferred data reduced the missing data in the variable character subset
from 98.5% to 78.2%. Machine reasoning also enables the isolation of
conflicts in the data, that is, cells where both presence and absence are
indicated; reports regarding conflicting data provenance can be generated
automatically. Further, reasoning enables quantification and new
visualizations of the data, here for example, allowing identification of
character space that has been undersampled across the fin-to-limb
transition. The approach and methods demonstrated here to compute
synthetic presence/absence supermatrices are applicable to any taxonomic
and phenotypic slice across the tree of life, providing the data are
semantically annotated. Because such data can also be linked to model
organism genetics through computational scoring of phenotypic similarity,
they open a rich set of future research questions into phenotype-to-genome
relationships.
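The matrix figures quoted above can be cross-checked with a little arithmetic, and the notion of a conflicting cell (both presence and absence indicated for the same taxon and character) can be illustrated on toy data. The following is a minimal sketch only: the function names `missing_fraction` and `find_conflicts` and the placeholder taxon and character labels are assumptions made for illustration, not part of the Phenoscape or Ontotrace code, and the conflict check is a plain dictionary lookup rather than the ontology-based reasoning described in the abstract.

```python
# Illustrative sketch; names and toy data are hypothetical, not from the dataset.

def missing_fraction(n_characters, n_taxa, populated_cells):
    """Fraction of matrix cells that carry no state."""
    total_cells = n_characters * n_taxa
    return 1 - populated_cells / total_cells


def find_conflicts(assertions):
    """Return (taxon, character) cells scored as both present and absent.

    `assertions` is an iterable of (taxon, character, state) tuples,
    where state is either "present" or "absent".
    """
    states_by_cell = {}
    for taxon, character, state in assertions:
        states_by_cell.setdefault((taxon, character), set()).add(state)
    return sorted(
        cell
        for cell, states in states_by_cell.items()
        if states == {"present", "absent"}
    )


# 639 variable characters x 1051 taxa, using 145,000 as a lower bound
# on the number of populated cells reported in the abstract.
upper_bound_missing = missing_fraction(639, 1051, 145_000)

# Toy assertions with placeholder labels (not real data).
toy_assertions = [
    ("taxon_A", "character_1", "present"),
    ("taxon_A", "character_1", "absent"),   # conflicting cell
    ("taxon_A", "character_2", "present"),
    ("taxon_B", "character_1", "absent"),
]
toy_conflicts = find_conflicts(toy_assertions)
```

With exactly 145,000 populated cells out of 639 x 1051 = 671,589, about 78.4% of cells would be missing; since the abstract reports "over 145,000" populated cells, this is consistent with the stated 78.2%.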
README: README file.
supplementary_table_1: Supplementary Materials Table 1. List of
  publications used in constructing the synthetic supermatrix. Focal
  group, number of taxa, and number of fin, limb, and girdle characters,
  states and phenotype annotations. Studies focused explicitly on the
  fin-to-limb transition are denoted by an asterisk.
supplementary_table_2: Supplementary Materials Table 2. Taxa (136)
  present in the variable-only synthetic supermatrix based on inferred
  data alone.
supplementary_table_3: Supplementary Materials Table 3. Conflicting
  characters. Characters with conflicting states in the variable-only
  supermatrix, listed by taxon. Conflict type (between direct assertions,
  direct vs. inferred, and inferred vs. inferred) indicated in right-most
  column.
supplementary_table_4: Supplementary Materials Table 4. Isomorphic
  characters. Clusters (93) of fully isomorphic characters across the
  variable-only synthetic supermatrix, arranged from high (10) to low (2).
supplementary_table_5: Supplementary Materials Table 5. Taxon sampling.
  The number of source matrices (right-most column) from which taxa
  (Vertebrate Taxonomy Ontology (VTO) identifier number, left-most
  column), at various taxonomic ranks, were sampled.
supplementary_table_6: Supplementary Materials Table 6. The number of
  published character states that entail the presence or absence for
  selected sets of anatomical entities and taxa.
uberon_presences: OWL ontology containing a "presence class"
  corresponding to each anatomical structure from the Uberon anatomy
  ontology. Ontotrace code in `GeneratePresenceClasses.scala`.
sarcop-presence-absence-variable: NEXUS-formatted translation of
  `sarcop-presence-absence-variable.xml`.
sarcop-presence-absence-variable.xml: NeXML character matrix retaining
  only variable columns from `sarcop-presence-absence-all.xml`.
sarcop-presence-absence-all.xml: NeXML character matrix, generated using
  these input expressions to Ontotrace.