{"data":[{"id":"10.6076/d1gw20","type":"dois","attributes":{"doi":"10.6076/d1gw20","identifiers":[],"creators":[{"name":"Neu, Alexander","nameType":"Personal","givenName":"Alexander","familyName":"Neu","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0003-0833-1704","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Mapfile and ASV table of whole-body and shell-surface samples from geminate species of gastropods separated by the Isthmus of Panama"}],"publisher":"Dryad","container":{},"publicationYear":2021,"subjects":[],"contributors":[],"dates":[{"date":"2021-07-13T16:52:02Z","dateType":"Submitted"},{"date":"2021-07-15T00:00:00Z","dateType":"Issued"},{"date":"2021-07-15T00:00:00Z","dateType":"Available"},{"date":"2021-07-16T00:00:00Z","dateType":"Updated"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2021.07.08.451645v1","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1128/aem.01003-24","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["21076366 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"The rise of the Isthmus of Panama ~3.5 mya separated populations of many\n marine organisms, which then diverged into new geminate sister species\n currently living in the Eastern Pacific Ocean and Caribbean Sea. However,\n we know very little about how such evolutionary divergences of host\n species have shaped their microbiomes. 
Here, we compared the microbiomes\n of whole-body and shell-surface samples of geminate species of marine\n gastropods in the\n genera Cerithium and Cerithideopsis to those of\n congeneric outgroups. Our results show that the effects of the Isthmus on\n microbiome composition varied among host genera and between sample types\n within the same hosts. In the whole-body samples, microbiome compositions\n of geminate species pairs in the focal genera tended to be similar, likely\n due to host filtering, although the strength of this relationship varied\n among the two groups and across similarity metrics. Shell-surface\n communities showed contrasting patterns, with co-divergence between the\n host taxa and a small number of microbial clades evident\n in Cerithideopsis, but not Cerithium. These results\n suggest that (i) the rise of the Isthmus of Panama affected microbiomes of\n geminate hosts in a complex and clade-specific manner and (ii)\n host-associated microbial taxa respond differently to vicariance events\n than the hosts themselves. ","descriptionType":"Abstract"},{"description":"These data were collected by sequencing the V4 region of the 16S\n rRNA gene from seven species of intertidal gastropods collected from four\n sites across the Isthmus of Panama, as well as environmental samples.\n Three of these hosts are from the genus \u003cem\u003eCerithium\u003c/em\u003e,\n the others are from the genus \u003cem\u003eCerithideopsis\u003c/em\u003e.\n Whole-body tissues and shell-surface swabs from each sample were processed\n using the Qiagen DNeasy Blood and Tissue kit, the 16S rRNA gene region was\n amplified using the 515f-806rb primer pair and amplicons were barcoded\n using the NexTeraXT barcode kit. Sequencing was conducted on an Illumina\n MiSeq (2x250bp, paired end). 
Sequences were processed via DADA2 and\n assigned taxonomy using the Silva 138 database.","descriptionType":"Methods"},{"description":"The associated mapfile should provide all necessary metadata for\n the usage of the ASV table. Raw sequences are available from the NCBI SRA\n under BioProject accession #PRJNA74415.","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[],"url":"https://datadryad.org/dataset/doi:10.6076/D1GW20","contentUrl":null,"metadataVersion":13,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":124,"downloadCount":4,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2021-07-16T02:43:36Z","registered":"2021-07-16T02:43:37Z","published":null,"updated":"2026-03-25T19:03:40Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d10w2n","type":"dois","attributes":{"doi":"10.6076/d10w2n","identifiers":[],"creators":[{"name":"Contijoch, Francisco","nameType":"Personal","givenName":"Francisco","familyName":"Contijoch","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-9616-3274","nameIdentifierScheme":"ORCID"}]},{"name":"Colvert, Brendan","nameType":"Personal","givenName":"Brendan","familyName":"Colvert","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-4812-8228","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"U-net for automated thoracic CT semantic segmentation"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Medical engineering","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Medical 
engineering","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"convolutional neural networks"},{"subject":"computed tomography (CT)"},{"subject":"segmentation"}],"contributors":[],"dates":[{"date":"2022-06-04T21:31:06Z","dateType":"Submitted"},{"date":"2023-05-09T00:00:00Z","dateType":"Issued"},{"date":"2023-05-09T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1002/mp.15106","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["69341995 bytes"],"formats":[],"version":"8","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Cardiac computed tomography has a clear clinical role in the\n evaluation of coronary artery disease and assessment of coronary\n artery calcium (CAC) but the use of ionizing radiation limits the\n clinical use.\n Beam-shaping “bow-tie” filters determine the\n radiation dose and the effective scan field-of-view diameter\n (SFOV) by delivering higher X-ray fluence to a region\n centered at the isocenter. A method for positioning the heart\n near the isocenter could enable reduced SFOV imaging and reduce\n dose in cardiac scans. We developed a predictive approach to center the\n heart and reduce the SFOV. As part of this effort, we used a UNet to\n segment noncontrast thoracic CT scans to estimate the associated dose\n reductions. Here we publish the UNet network. 
Specifically, this dataset\n contains a trained U-net (convolutional neural network) which was trained\n for the purpose of segmenting noncontrast thoracic computed tomography\n images. ","descriptionType":"Abstract"},{"description":"We collected noncontrast thoracic CT images from our institution\n and manually segmented them. We then trained a U-Net (with the Pytorch\n framework) to perform semantic segmentation. The final state of the\n trained network is contained in this dataset.","descriptionType":"Methods"},{"description":"This repository contains a .pth file which is the complete set of\n trained weights for the neural network. A repository of Python code\n contained at https://github.com/ucsd-fcrl/unet_deploy may be a useful\n starting point for using this U-Net.","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Institutes of Health","awardNumber":"75N92020D00001 ,","funderIdentifier":"https://ror.org/01cwqze88","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"75N92020D00002","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"75N92020D00003","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"75N92020D00004","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"75N92020D00005","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood 
Institute","awardNumber":"75N92020D00006","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"75N92020D00007","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"HHSN268201500003I","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"K01HL143113","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"R01HL116395","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95159","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95160","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95161","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95162","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95163","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95164","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National 
Cancer Institute","awardNumber":"N01-HC-95165","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95166","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95167","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC- 95168","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Cancer Institute","awardNumber":"N01-HC-95169","funderIdentifier":"https://ror.org/040gcmg81","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Center for Advancing Translational Sciences","awardNumber":"UL1-TR-000040","funderIdentifier":"https://ror.org/04pw6fb54","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Center for Advancing Translational Sciences","awardNumber":"UL1-TR-001079","funderIdentifier":"https://ror.org/04pw6fb54","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Center for Advancing Translational Sciences","awardNumber":"UL1-TR-001420","funderIdentifier":"https://ror.org/04pw6fb54","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"Nvidia (United 
States)","awardNumber":"N/A","funderIdentifier":"https://ror.org/03jdj4y14","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D10W2N","contentUrl":null,"metadataVersion":8,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":229,"downloadCount":24,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-05-09T17:57:46Z","registered":"2023-05-09T17:57:47Z","published":null,"updated":"2026-03-24T19:08:57Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1f300","type":"dois","attributes":{"doi":"10.6076/d1f300","identifiers":[],"creators":[{"name":"Edmonds, Emily","nameType":"Personal","givenName":"Emily","familyName":"Edmonds","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-5130-0500","nameIdentifierScheme":"ORCID"}]},{"name":"Smirnov, Denis","nameType":"Personal","givenName":"Denis","familyName":"Smirnov","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Thomas, Kelsey","nameType":"Personal","givenName":"Kelsey","familyName":"Thomas","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[]},{"name":"Graves, Lisa","nameType":"Personal","givenName":"Lisa","familyName":"Graves","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[]},{"name":"Bangen, Katherine","nameType":"Personal","givenName":"Katherine","familyName":"Bangen","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[]},{"name":"Delano-Wood, Lisa","nameType":"Personal","givenName":"Lisa","familyName":"Delano-Wood","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5529-8703","nameIdentifierScheme":"ORCID"}]},{"name":"Galasko, 
Douglas","nameType":"Personal","givenName":"Douglas","familyName":"Galasko","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[]},{"name":"Salmon, David","nameType":"Personal","givenName":"David","familyName":"Salmon","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Bondi, Mark","nameType":"Personal","givenName":"Mark","familyName":"Bondi","affiliation":["VA San Diego Healthcare System"],"nameIdentifiers":[]}],"titles":[{"title":"Data-driven versus consensus diagnosis of MCI: enhanced sensitivity for detection of dementia progression, biomarker status, and neuropathological outcomes"}],"publisher":"Dryad","container":{},"publicationYear":2021,"subjects":[{"subject":"Alzheimer's disease","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Assessment of cognitive disorders/dementia"},{"subject":"MCI (mild cognitive impairment)"},{"subject":"All Neuropsychology/Behavior"}],"contributors":[],"dates":[{"date":"2021-05-20T18:29:02Z","dateType":"Submitted"},{"date":"2021-06-03T00:00:00Z","dateType":"Issued"},{"date":"2021-06-03T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1212/wnl.0000000000012600","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["403777 bytes"],"formats":[],"version":"2","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Objective: Given prior work demonstrating that mild cognitive impairment\n (MCI) can be empirically differentiated into meaningful cognitive\n subtypes, we applied actuarial methods 
to comprehensive neuropsychological\n data from the University of California San Diego (UCSD) Alzheimer’s\n Disease Research Center (ADRC) in order to identify cognitive subgroups\n within nondemented ADRC participants, and to examine cognitive, biomarker,\n and neuropathological trajectories. Methods: Cluster analysis was\n performed on baseline neuropsychological data (n=738; mean age=71.8).\n Survival analysis examined progression to dementia (mean follow-up=5.9\n years). CSF AD biomarker status and neuropathological findings at\n follow-up were examined in a subset with available data. Results: Five\n clusters were identified: \"optimal\" cognitively normal\n (CN; n=130) with above-average cognition, \"typical\" CN\n (n=204) with average cognition, non-amnestic MCI (naMCI; n=104), amnestic\n MCI (aMCI; n=216), and mixed MCI (mMCI; n=84). Progression to dementia\n differed across MCI subtypes (mMCI\u0026gt;aMCI\u0026gt;naMCI), with the\n mMCI group demonstrating the highest rate of CSF biomarker positivity and\n AD pathology at autopsy. Actuarial methods classified 29.5% more of the\n sample with MCI and outperformed consensus diagnoses in capturing those\n who had abnormal biomarkers, progressed to dementia, or had AD pathology\n at autopsy. Conclusions: We identified subtypes of MCI and CN with\n differing cognitive profiles, clinical outcomes, CSF AD biomarkers, and\n neuropathological findings over more than 10 years of follow-up. Results\n demonstrate that actuarial methods produce reliable cognitive phenotypes,\n with data from a subset suggesting unique biological and neuropathological\n signatures. Findings indicate that data-driven algorithms enhance\n diagnostic sensitivity relative to consensus diagnosis for identifying\n older adults at risk for cognitive decline.","descriptionType":"Abstract"}],"geoLocations":[],"fundingReferences":[{"funderName":"\n        U.S. 
Department of Veterans Affairs Clinical Sciences Research and\n        Development Service*\n      ","awardNumber":"1IK2CX001415"},{"funderName":"\n        U.S. Department of Veterans Affairs Clinical Sciences Research and\n        Development Service*\n      ","awardNumber":"1IK2CX001865"},{"funderName":"\n        U.S. Department of Veterans Affairs Clinical Sciences Research and\n        Development Service*\n      ","awardNumber":"1I01CX001842"},{"schemeUri":"https://ror.org","funderName":"National Institute on Aging","awardNumber":"P30 AG062429","funderIdentifier":"https://ror.org/049v75w11","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Institute on Aging","awardNumber":"R01 AG049810","funderIdentifier":"https://ror.org/049v75w11","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Institute on Aging","awardNumber":"R01 AG063782","funderIdentifier":"https://ror.org/049v75w11","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Institute on Aging","awardNumber":"R03 AG070435","funderIdentifier":"https://ror.org/049v75w11","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"Alzheimer's 
Association","awardNumber":"AARF-17-528918","funderIdentifier":"https://ror.org/0375f4d26","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1F300","contentUrl":null,"metadataVersion":13,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":265,"downloadCount":37,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2021-06-03T22:26:30Z","registered":"2021-06-03T22:26:32Z","published":null,"updated":"2026-03-17T15:21:44Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1pw2v","type":"dois","attributes":{"doi":"10.6076/d1pw2v","identifiers":[],"creators":[{"name":"Franks, Peter","nameType":"Personal","givenName":"Peter","familyName":"Franks","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0003-1862-0171","nameIdentifierScheme":"ORCID"}]},{"name":"Inman, Bryce","nameType":"Personal","givenName":"Bryce","familyName":"Inman","affiliation":["University of California San Diego"],"nameIdentifiers":[]}],"titles":[{"title":"Turbulent velocity profiles for dissipation rate of 1e-6 W/kg"}],"publisher":"Dryad","container":{},"publicationYear":2024,"subjects":[{"subject":"FOS: Earth and related environmental sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Earth and related environmental sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Turbulence","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Plankton","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Dissipation 
rate"},{"subject":"Velocity Probability"}],"contributors":[],"dates":[{"date":"2023-07-18T23:07:05Z","dateType":"Created"},{"date":"2024-02-08T20:41:11Z","dateType":"Submitted"},{"date":"2024-02-23T00:00:00Z","dateType":"Issued"},{"date":"2024-02-23T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1002/lno.12501","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["145258572 bytes"],"formats":[],"version":"1","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Fundamental marine ecosystem dynamics such as mating, predation, and\n infection require individual plankton to move relative to one another.\n Ambient turbulence is often invoked as a mechanism to facilitate such\n interactions. The local intensity of turbulence is quantified as the\n dissipation rate of turbulent kinetic energy. While the dissipation rate\n is central to understanding large-scale fluxes of heat, salt, and\n nutrients in the ocean, we show that it can be a poor descriptor of the\n turbulent environment experienced by individual plankton. A dissipation\n rate is a single integrated quantity representing all the complex motions\n in a turbulent region; the instantaneous turbulent environment of plankton\n may bear little resemblance to that predicted by the dissipation rate or\n quantities derived from it. 
Most importantly, the statistics\n (probabilities) of the relative motions of plankton in turbulence cannot\n be recovered from the dissipation rate or its spectrum: the probabilities\n of the plankton experiencing any given turbulent shear are lost in the\n calculation. This presents a fundamental barrier to our understanding of\n the effects of ambient turbulence on planktonic ecosystem dynamics in the\n ocean. Rather than relying on dissipation rates, we show that quantifying\n the probability distributions of the microscale turbulent motions can\n provide much richer insights into the turbulent environment of individual\n plankton. Expanding such statistical analyses, and improving our\n understanding of the Lagrangian properties of ocean turbulence as\n experienced by plankton in the ocean will lead to significant increases in\n our ability to understand and quantify the effects of turbulence on\n plankton.","descriptionType":"Abstract"},{"description":"Data were acquired with the Modular\n Microstructure Profiler (MMP, MacKinnon and Gregg, 2003; Mickett et al.,\n 2004) deployed during the FLEAT (FLow Encountering Abrupt Topography)\n experiment around Palau (Johnston et al., 2019; MacKinnon et al., 2019).\n The 440 vertical velocity profiles gave data from ~20 m to as deep as 500\n m, with ~1.7 mm spatial resolution; turbulent velocities were low-pass\n filtered to remove noise created by the instrument\n vibrations.","descriptionType":"Methods"},{"description":"Data are in ASCII format, readable by most 
software.","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[],"url":"https://datadryad.org/dataset/doi:10.6076/D1PW2V","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":23,"downloadCount":1,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-10-04T22:46:26Z","registered":"2023-10-04T22:46:27Z","published":null,"updated":"2026-03-17T12:55:10Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1h01z","type":"dois","attributes":{"doi":"10.6076/d1h01z","identifiers":[],"creators":[{"name":"Ponganis, Paul","nameType":"Personal","givenName":"Paul","familyName":"Ponganis","affiliation":["Scripps Institution of Oceanography"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-1556-770X","nameIdentifierScheme":"ORCID"}]},{"name":"Williams, Cassondra","nameType":"Personal","givenName":"Cassondra","familyName":"Williams","affiliation":["National Marine Mammal Foundation"],"nameIdentifiers":[]},{"name":"Czapanskiy, Max","nameType":"Personal","givenName":"Max","familyName":"Czapanskiy","affiliation":["Hopkins Marine Station"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-6302-905X","nameIdentifierScheme":"ORCID"}]},{"name":"John, Jason","nameType":"Personal","givenName":"Jason","familyName":"John","affiliation":["University of California Santa Cruz"],"nameIdentifiers":[]},{"name":"St. Leger, Judy","nameType":"Personal","givenName":"Judy","familyName":"St. 
Leger","affiliation":["Scripps Institution of Oceanography"],"nameIdentifiers":[]},{"name":"Scadeng, Miriam","nameType":"Personal","givenName":"Miriam","familyName":"Scadeng","affiliation":["University of Auckland"],"nameIdentifiers":[]}],"titles":[{"title":"Emperor penguin air sac oxygen"}],"publisher":"Dryad","container":{},"publicationYear":2020,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"}],"contributors":[],"dates":[{"date":"2020-12-23T17:40:03Z","dateType":"Submitted"},{"date":"2020-12-26T00:00:00Z","dateType":"Issued"},{"date":"2020-12-26T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1242/jeb.230219","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["33382398 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Some marine birds and mammals can perform dives of extraordinary duration\n and depth. Such dive performance is dependent on many factors, including\n total body oxygen (O2) stores. For diving penguins, the respiratory system\n (air sacs and lungs) constitutes 30-50% of the total body O2 store. 
To\n better understand the role and mechanism of parabronchial ventilation and\n O2 utilization in penguins both on the surface and during the dive, we\n examined air sac partial pressures of O2 (PO2) in emperor penguins\n (Aptenodytes forsteri) equipped with backpack PO2 recorders. Cervical air\n sac PO2s at rest were lower than in other birds, while the cervical air\n sac to posterior thoracic air sac PO2 difference was larger. Pre-dive\n cervical air sac PO2s were often greater than those at rest, but had a\n wide range and were not significantly different from those at rest. The\n maximum respiratory O2 store and total body O2 stores calculated with\n representative anterior and posterior air sac PO2 data did not differ from\n prior estimates. The mean calculated anterior air sac O2 depletion rate\n for dives up to 11 min was approximately one-tenth that of the posterior\n air sacs.  Low cervical air sac PO2s at rest may be secondary to\n a low ratio of parabronchial ventilation to parabronchial blood O2\n extraction. During dives, overlap of simultaneously recorded cervical and\n posterior thoracic air sac PO2 profiles supported the concept of\n maintenance of parabronchial ventilation during a dive by air movement\n through the lungs.","descriptionType":"Abstract"},{"description":"Data set was collected at a research camp in Antarctica at which\n emperor penguins voluntarily dove beneath the sea ice to forage.  While\n under anesthesia, birds were equipped with microprocessor-based backpack\n recorders to collect data on depth, wing stroke rate (via accelerometry),\n and air sac oxygen levels (with an indwelling oxygen electrode).  Birds\n were allowed to dive over a one to two-day period after overnight recovery\n from anesthesia.  Devices were removed under anesthesia.  The birds were\n released back to sea at the ice edge. Data were\n downloaded into computers and analyzed with Excel software to assemble the\n csv files in this repository.  
These data files were then analyzed with\n custom programs for the analyses and graphs detailed in the manuscript. \n These files are the basis for all the analyses.","descriptionType":"Methods"},{"description":"These files do not have any missing values.  The data sets are\n ready for export or copy into any analysis program.","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"1643532","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1H01Z","contentUrl":null,"metadataVersion":13,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":280,"downloadCount":35,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2020-12-26T22:30:15Z","registered":"2020-12-26T22:30:17Z","published":null,"updated":"2026-03-16T23:18:22Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1r30s","type":"dois","attributes":{"doi":"10.6076/d1r30s","identifiers":[],"creators":[{"name":"Meyer, Justin","nameType":"Personal","givenName":"Justin","familyName":"Meyer","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5566-8452","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Data: Canonical host-pathogen tradeoffs subverted by mutations with dual benefits"}],"publisher":"Dryad","container":{},"publicationYear":2022,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Escherichia 
coli","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Bacteriophage Lambda"},{"subject":"microbial experimental evolution"},{"subject":"tradeoffs"},{"subject":"Protein structure","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"coevolutionary arms race"}],"contributors":[],"dates":[{"date":"2022-09-22T22:14:04Z","dateType":"Submitted"},{"date":"2022-11-02T00:00:00Z","dateType":"Issued"},{"date":"2022-11-02T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/818492","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1086/723413","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["73969014 bytes"],"formats":[],"version":"5","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Host-parasite coevolution is expected to drive the evolution of genetic\n diversity because the traits used in arms races, namely host range and\n parasite resistance, are hypothesized to trade off with traits used in\n resource competition. We therefore tested data for several tradeoffs among\n 93 isolates of bacteriophage and 51 Escherichia coli genotypes that\n coevolved during a laboratory experiment. Surprisingly, we found multiple\n tradeups (positive trait correlations) but little evidence of several\n canonical tradeoffs. For example, some bacterial genotypes evaded a\n tradeoff between phage resistance and absolute fitness, instead evolving\n simultaneous improvements in both these traits. 
This was surprising\n because our experimental design was predicted to expose resistance-fitness\n tradeoffs by culturing E. coli in a medium where the phage\n receptor, LamB, is also used for nutrient acquisition. On reflection, LamB\n mediates not one but many tradeoffs, allowing for more complex trait\n interactions than just pairwise tradeoffs. Here, we report that\n mathematical reasoning and laboratory data highlight how tradeups should\n exist whenever an evolutionary system exhibits multiple interacting\n tradeoffs. Does this mean that coevolution should not promote genetic\n diversity? No, quite the contrary: we deduce that whenever positive trait\n correlations are observed in multi-dimensional traits, other traits may\n trade off and so provide the right circumstances for diversity maintenance.\n Overall, this study reveals there are predictive limits when data only\n account for pairwise trait correlations and it argues that a wider range\n of circumstances than previously anticipated can promote genetic and\n species diversity.","descriptionType":"Abstract"},{"description":"Many different sources of data including genome sequences, images\n of infection assays, OD-based growth curves, protein structural models,\n and bacterial competition experiments. 
","descriptionType":"Methods"},{"description":"See the README file.","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","awardTitle":"\n        Experimental tests of the role genetic architecture, resource\n        competition, and gene flow play during speciation\n      ","funderName":"Division of Environmental Biology","awardNumber":"1934515","funderIdentifier":"https://ror.org/03g87he71","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1R30S","contentUrl":null,"metadataVersion":9,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":114,"downloadCount":5,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2022-11-02T17:58:37Z","registered":"2022-11-02T17:58:37Z","published":null,"updated":"2026-03-05T23:08:56Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d12s3m","type":"dois","attributes":{"doi":"10.6076/d12s3m","identifiers":[],"creators":[{"name":"Wertheim, Joel","nameType":"Personal","givenName":"Joel","familyName":"Wertheim","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0003-4882-5856","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Accuracy in near-perfect virus 
phylogenies"}],"publisher":"Dryad","container":{},"publicationYear":2021,"subjects":[],"contributors":[],"dates":[{"date":"2021-08-05T22:10:03Z","dateType":"Submitted"},{"date":"2021-08-05T00:00:00Z","dateType":"Issued"},{"date":"2021-08-05T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsDerivedFrom","relatedIdentifier":"10.5281/zenodo.5165269","relatedIdentifierType":"DOI"},{"relationType":"IsSourceOf","relatedIdentifier":"10.5281/zenodo.5165271","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2021.05.06.442951","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/sysbio/syab069","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["2002952 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Phylogenetic trees from real-world data often include short edges with\n very few substitutions per site, which can lead to partially resolved\n trees and poor accuracy. Theory indicates that the number of sites needed\n to accurately reconstruct a fully resolved tree grows at a rate\n proportional to the inverse square of the length of the shortest edge.\n However, when inferred trees are partially resolved due to short edges,\n \"accuracy\" should be defined as the rate of discovering\n false splits (clades on a rooted tree) relative to the actual number\n found. 
Thus, accuracy can be high even if short edges are common.\n Specifically, in a \"near-perfect\" parameter space in which trees\n are large, the tree length ξ (the sum of all edge lengths) is small, and\n rate variation is minimal, the expected false positive rate is less than\n ξ/3; the exact value depends on tree shape and sequence length. This\n expected false positive rate is far below the false negative rate for\n small ξ and often well below 5% even when some assumptions are\n relaxed. We show this result analytically for maximum parsimony and\n explore its extension to maximum likelihood using theory and simulations.\n For hypothesis testing, we show that measures of split\n \"support\" that rely on bootstrap resampling\n consistently imply weaker support than that implied by the false positive\n rates in near-perfect trees. The near-perfect parameter space closely fits\n several empirical studies of human virus diversification during outbreaks\n and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2,\n reflecting low substitution rates relative to high transmission/sampling\n rates in these viruses.","descriptionType":"Abstract"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Institute of Allergy and Infectious 
Diseases","awardNumber":"AI135992","funderIdentifier":"https://ror.org/043z4tv69","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D12S3M","contentUrl":null,"metadataVersion":12,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":252,"downloadCount":53,"referenceCount":1,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2021-08-06T03:50:49Z","registered":"2021-08-06T03:50:50Z","published":null,"updated":"2026-03-05T22:01:15Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1js3z","type":"dois","attributes":{"doi":"10.6076/d1js3z","identifiers":[],"creators":[{"name":"Jiang, Yueyu","nameType":"Personal","givenName":"Yueyu","familyName":"Jiang","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-8425-7556","nameIdentifierScheme":"ORCID"}]},{"name":"Balaban, Metin","nameType":"Personal","givenName":"Metin","familyName":"Balaban","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Zhu, Qiyun","nameType":"Personal","givenName":"Qiyun","familyName":"Zhu","affiliation":["Arizona State University"],"nameIdentifiers":[]},{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"DEPP: Deep learning enables extending species trees using single genes"}],"publisher":"Dryad","container":{},"publicationYear":2022,"subjects":[{"subject":"phylogenetic placement"},{"subject":"Neural networks","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"gene tree 
discordance"},{"subject":"Microbiome analyses"},{"subject":"Metagenomics","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"}],"contributors":[],"dates":[{"date":"2021-05-20T22:01:03Z","dateType":"Submitted"},{"date":"2021-06-04T00:00:00Z","dateType":"Issued"},{"date":"2021-06-04T00:00:00Z","dateType":"Available"},{"date":"2022-06-06T00:00:00Z","dateType":"Updated"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsSupplementedBy","relatedIdentifier":"10.6076/d14g68","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2021.01.22.427808","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/sysbio/syac031","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["163947493782 bytes"],"formats":[],"version":"8","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Placing new sequences onto reference phylogenies is increasingly used for\n analyzing environmental samples, especially microbiomes. However, existing\n placement methods have a fundamental limitation: they assume that query\n sequences have evolved using specific models directly on the reference\n phylogeny. Thus, they can place single-gene data (e.g., 16S rRNA\n amplicons) onto their own gene tree. This practice is a proxy for a more\n ambitious goal: extending a (genome-wide) species tree given data from\n individual genes. No algorithm currently addresses this challenging\n problem. 
Here, we introduce Deep-learning Enabled Phylogenetic Placement\n (DEPP), an algorithm that learns to extend species trees using single\n genes without pre-specified models. We show that DEPP updates the\n multi-locus microbial tree-of-life with single genes with high accuracy.\n We further demonstrate that DEPP can achieve the long-standing goal of\n combining 16S and metagenomic data onto a single tree, enabling community\n structure analyses that were previously impossible and producing robust\n patterns.","descriptionType":"Abstract"},{"description":"Visit this link to view the most recent version of this dataset:\n https://doi.org/10.6076/D14G68","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[],"url":"https://datadryad.org/dataset/doi:10.6076/D1JS3Z","contentUrl":null,"metadataVersion":12,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":184,"downloadCount":13,"referenceCount":3,"citationCount":3,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2021-06-04T17:48:49Z","registered":"2021-06-04T17:48:50Z","published":null,"updated":"2026-03-04T22:08:37Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d14g68","type":"dois","attributes":{"doi":"10.6076/d14g68","identifiers":[],"creators":[{"name":"Jiang, Yueyu","nameType":"Personal","givenName":"Yueyu","familyName":"Jiang","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-8425-7556","nameIdentifierScheme":"ORCID"}]},{"name":"Balaban, Metin","nameType":"Personal","givenName":"Metin","familyName":"Balaban","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Zhu, Qiyun","nameType":"Personal","givenName":"Qiyun","familyName":"Zhu","affiliation":["Arizona State University"],"nameIdentifiers":[]},{"name":"Mirarab, 
Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"DEPP: Deep learning enables extending species trees using single genes"}],"publisher":"Dryad","container":{},"publicationYear":2022,"subjects":[{"subject":"FOS: Engineering and technology","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Engineering and technology","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"phylogenetic placement"},{"subject":"Deep convolutional neural network"},{"subject":"gene tree discordance"},{"subject":"Microbiome analyses"},{"subject":"Metagenomics","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"}],"contributors":[],"dates":[{"date":"2022-04-22T13:25:14Z","dateType":"Submitted"},{"date":"2022-05-10T00:00:00Z","dateType":"Issued"},{"date":"2022-05-10T00:00:00Z","dateType":"Available"},{"date":"2022-06-06T00:00:00Z","dateType":"Updated"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsSupplementedBy","relatedIdentifier":"10.6076/d1js3z","relatedIdentifierType":"DOI"},{"relationType":"IsSourceOf","relatedIdentifier":"10.5281/zenodo.6582382","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2021.01.22.427808","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/sysbio/syac031","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["256341469757 bytes"],"formats":[],"version":"11","rightsList":[{"rights":"Creative Commons Zero v1.0 
Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Placing new sequences onto reference phylogenies is increasingly used for\n analyzing environmental samples, especially microbiomes. However, existing\n placement methods have a fundamental limitation: they assume that query\n sequences have evolved using specific models directly on the reference\n phylogeny. Thus, they can place single-gene data (e.g., 16S rRNA\n amplicons) onto their own gene tree. This practice is a proxy for a more\n ambitious goal: extending a (genome-wide) species tree given data from\n individual genes. No algorithm currently addresses this challenging\n problem. Here, we introduce Deep-learning Enabled Phylogenetic Placement\n (DEPP), an algorithm that learns to extend species trees using single\n genes without pre-specified models. We show that DEPP updates the\n multi-locus microbial tree-of-life with single genes with high accuracy.\n We further demonstrate that DEPP can achieve the long-standing goal of\n combining 16S and metagenomic data onto a single tree, enabling community\n structure analyses that were previously impossible and producing robust\n patterns.","descriptionType":"Abstract"},{"description":"Please note, this dataset is the most recent version of a\n duplicate dataset available via this link: https://doi.org/10.6076/D1JS3Z (published February 4, 
2022).","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[],"url":"https://datadryad.org/dataset/doi:10.6076/D14G68","contentUrl":null,"metadataVersion":13,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":203,"downloadCount":31,"referenceCount":1,"citationCount":3,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2022-05-11T01:51:55Z","registered":"2022-05-11T01:51:57Z","published":null,"updated":"2026-03-04T22:08:36Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1159c","type":"dois","attributes":{"doi":"10.6076/d1159c","identifiers":[],"creators":[{"name":"Contijoch, Francisco","nameType":"Personal","givenName":"Francisco","familyName":"Contijoch","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-9616-3274","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Closed-loop control of k-space sampling via physiologic feedback for cine MRI"}],"publisher":"Dryad","container":{},"publicationYear":2020,"subjects":[{"subject":"FOS: Medical and health sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Medical and health sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology 
(FOS)"}],"contributors":[],"dates":[{"date":"2020-12-09T13:01:04Z","dateType":"Submitted"},{"date":"2020-12-15T00:00:00Z","dateType":"Issued"},{"date":"2020-12-15T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2020.06.22.20137638","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1371/journal.pone.0244286","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["16975896 bytes"],"formats":[],"version":"7","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"This dataset accompanies the manuscript outlining a method for closed-loop\n sampling of k-space in response to physiologic changes. The closed-loop\n approach enables near-uniform radial sampling in a segmented acquisition\n approach which was higher than predetermined golden-angle radial sampling.\n This can be utilized to increase the sampling or decrease the temporal\n footprint of an acquisition and the closed-loop framework has the\n potential to be applied to patients with complex heart rhythms. Briefly,\n Segmented cine cardiac MRI combines data from multiple heartbeats to\n achieve high spatiotemporal resolution cardiac images, yet predefined\n k-space segmentation trajectories can lead to suboptimal k-space sampling.\n In this work, we developed and evaluated an autonomous and closed-loop\n control system for radial k-space sampling to increase sampling\n uniformity. The dataset includes both the algorithm and the data used in\n our manuscript. 
Our closed-loop system autonomously selects radial k-space\n sampling trajectory during live segmented cine MRI and attempts to\n optimize angular sampling uniformity by selecting views in regions of\n k-space that were not previously well-sampled. Sampling uniformity and\n robustness to arrhythmias was assessed using ECG data acquired from 10\n normal subjects in an MRI scanner. The approach was then implemented with\n a fast gradient echo sequence on a whole-body clinical MRI scanner and\n imaging was performed in 4 healthy volunteers. The closed-loop k-space\n trajectory was compared to random, uniformly distributed and golden angle\n view trajectories via measurement of k-space uniformity and the point\n spread function. Lastly, an arrhythmic dataset was used to evaluate a\n potential application of the approach. The autonomous trajectory increased\n k-space sampling uniformity by 15±7%, main lobe point spread function\n (PSF) signal intensity by 6±4%, and reduced ringing relative to golden\n angle sampling. When implemented, the autonomous pulse sequence prescribed\n radial view angles faster than the scan TR (0.98 ± 0.01 ms, maximum = 1.38\n ms) and increased k-space sampling mean uniformity by 10±11%, decreased\n uniformity variability by 44±12%, and increased PSF signal ratio by 6±6%\n relative to golden angle sampling. This data is shared with Creative\n Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication","descriptionType":"Abstract"},{"description":"ECGs in 10 normal subjects were recorded while in an MRI scanner\n for evaluation of the approach. Imaging data using the\n method was acquired in 4 normal subjects. ","descriptionType":"Methods"},{"description":"The code utilizes Matlab to do the analysis of ECG data. 
Image\n reconstruction utilizes other publicly available tools.\n   Links to publications that cite or use the\n data:\u003cbr clear=\"none\"\u003e \u003cbr clear=\"none\"\u003e\n The manuscript for this data is:\u003cbr clear=\"none\"\u003e\n Contijoch et al. Closed-loop control of k-space sampling via physiologic\n feedback for cine MRI. https://www.medrxiv.org/content/10.1101/2020.06.22.20137638v1\u003cbr clear=\"none\"\u003e \u003cbr clear=\"none\"\u003e Please cite the dataset as:\u003cbr clear=\"none\"\u003e Contijoch, Francisco (2020), Closed-loop control of k-space sampling via physiologic feedback for cine MRI, Dryad, Dataset, https://doi.org/10.6076/D1159C\u003cbr clear=\"none\"\u003e \u003cbr clear=\"none\"\u003e DATA \u0026amp; FILE OVERVIEW\u003cbr clear=\"none\"\u003e \u003cbr clear=\"none\"\u003e File List: The set consists of 9 zip files.\u003cbr clear=\"none\"\u003e code.zip - Matlab files for closed-loop ECG sampling based control of MRI k-space sampling\u003cbr clear=\"none\"\u003e Simulation_Data.zip  - ECG recordings of 10 humans for simulation of MRI sampling with ARKS\u003cbr clear=\"none\"\u003e Simulation_Analysis.zip - Code which performs simulation experiments\u003cbr clear=\"none\"\u003e Simulation_Results.zip - Results of simulation used for publication\u003cbr clear=\"none\"\u003e \u003cbr clear=\"none\"\u003e Human_Data.zip  - Data from acquisitions with ARKS on clinical MRI scanner\u003cbr clear=\"none\"\u003e Human_Analysis.zip - Code which analyzes in-vivo experiments\u003cbr clear=\"none\"\u003e Human_Results.zip - Results of human imaging used for publication","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"HL120580","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood 
Institute","awardNumber":"HL108157","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Heart Lung and Blood Institute","awardNumber":"HL963954","funderIdentifier":"https://ror.org/012pb6c26","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1159C","contentUrl":null,"metadataVersion":12,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":244,"downloadCount":44,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2020-12-15T19:21:22Z","registered":"2020-12-15T19:21:24Z","published":null,"updated":"2026-03-04T22:08:24Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d11p4n","type":"dois","attributes":{"doi":"10.6076/d11p4n","identifiers":[],"creators":[{"name":"Fishbein, Adam","nameType":"Personal","givenName":"Adam","familyName":"Fishbein","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5918-9256","nameIdentifierScheme":"ORCID"}]},{"name":"Jovanovic, Vladimir","nameType":"Personal","givenName":"Vladimir","familyName":"Jovanovic","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"de la Mothe, Lisa","nameType":"Personal","givenName":"Lisa","familyName":"de la Mothe","affiliation":["Tennessee State University"],"nameIdentifiers":[]},{"name":"Lee, Kuo-Fen","nameType":"Personal","givenName":"Kuo-Fen","familyName":"Lee","affiliation":["Salk Institute for Biological Studies"],"nameIdentifiers":[]},{"name":"Miller, Cory","nameType":"Personal","givenName":"Cory","familyName":"Miller","affiliation":["University of California San Diego"],"nameIdentifiers":[]}],"titles":[{"title":"Behavioral context affects social signal representations within single 
primate prefrontal cortex neurons"}],"publisher":"Dryad","container":{},"publicationYear":2022,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"}],"contributors":[{"name":"University of California, San Diego","nameType":"Personal","givenName":"San Diego","familyName":"University of California","affiliation":[],"contributorType":"Sponsor","nameIdentifiers":[]}],"dates":[{"date":"2022-01-26T18:24:04Z","dateType":"Submitted"},{"date":"2022-03-08T00:00:00Z","dateType":"Issued"},{"date":"2022-03-08T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2021.11.01.466818","relatedIdentifierType":"DOI"},{"relationType":"IsDerivedFrom","relatedIdentifier":"10.5281/zenodo.5809642","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1016/j.neuron.2022.01.020","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["77012344 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"We tested whether social signal processing in more traditional,\n head-restrained contexts is representative of the putative natural analog\n – social communication – by comparing responses to vocalizations within\n individual neurons in marmoset prefrontal cortex (PFC) across a series of\n behavioral contexts ranging from traditional to naturalistic. 
Although\n vocalization-responsive neurons were evident in all contexts,\n cross-context consistency was notably limited. A response to these social\n signals when subjects were head-restrained was not predictive of a\n comparable neural response to the identical vocalizations during natural\n communication. This pattern was evident both within individual neurons and\n at a population level, as PFC activity could be reliably decoded for the\n behavioral context in which vocalizations were heard. These results\n suggest that neural representations of social signals in primate PFC are\n not static, but highly flexible and likely reflect how nuances of the\n dynamic behavioral contexts affect the perception of these signals and\n what they communicate.","descriptionType":"Abstract"}],"geoLocations":[],"fundingReferences":[],"url":"https://datadryad.org/dataset/doi:10.6076/D11P4N","contentUrl":null,"metadataVersion":10,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":173,"downloadCount":11,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2022-03-08T09:04:39Z","registered":"2022-03-08T09:04:40Z","published":null,"updated":"2026-03-04T21:43:56Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/j7wd3xhs","type":"dois","attributes":{"doi":"10.6076/j7wd3xhs","identifiers":[],"creators":[{"name":"Finley, Jr., Russell L.; Cohen, Barak; Brent, Roger","affiliation":[],"nameIdentifiers":[]}],"titles":[{"title":"Drosophila Cdi4 is a p21/p27/p57-like cyclin-dependent kinase inhibitor with specificity for cyclin E complexes."}],"publisher":"Fred Hutchinson Cancer Research 
Center","container":{},"publicationYear":2003,"subjects":[],"contributors":[],"dates":[{"date":"2003","dateType":"Issued"}],"language":null,"types":{"ris":"GEN","bibtex":"article","citeproc":"","schemaOrg":"Article","resourceTypeGeneral":"DataPaper"},"relatedIdentifiers":[],"relatedItems":[],"sizes":[],"formats":[],"version":null,"rightsList":[],"descriptions":[],"geoLocations":[],"fundingReferences":[],"url":"https://ezid.cdlib.org/tombstone/id/doi:10.6076/J7WD3XHS","contentUrl":null,"metadataVersion":4,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":0,"downloadCount":0,"referenceCount":0,"citationCount":0,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2016-09-14T15:56:37Z","registered":"2016-09-14T15:56:39Z","published":null,"updated":"2026-02-24T22:14:40Z"},"relationships":{"client":{"data":{"id":"cdl.cdl","type":"clients"}}}},{"id":"10.6076/j77d2s8q","type":"dois","attributes":{"doi":"10.6076/j77d2s8q","identifiers":[],"creators":[{"name":"Pesce CG, Zdraljevic S","nameType":"Personal","givenName":"Zdraljevic S","familyName":"Pesce CG","affiliation":[],"nameIdentifiers":[]}],"titles":[{"title":"Supplementary Materials for Single-cell profiling screen identifies microtubule-dependent reduction of variation in cell signaling"}],"publisher":"Molecular and Systems 
Biology","container":{},"publicationYear":2017,"subjects":[],"contributors":[],"dates":[{"date":"2017","dateType":"Issued"}],"language":null,"types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[],"relatedItems":[],"sizes":[],"formats":[],"version":null,"rightsList":[],"descriptions":[],"geoLocations":[],"fundingReferences":[],"url":"https://ezid.cdlib.org/tombstone/id/doi:10.6076/J77D2S8Q","contentUrl":null,"metadataVersion":1,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":0,"downloadCount":0,"referenceCount":0,"citationCount":0,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2017-08-14T23:07:06Z","registered":"2017-08-14T23:07:07Z","published":null,"updated":"2026-02-24T22:14:13Z"},"relationships":{"client":{"data":{"id":"cdl.cdl","type":"clients"}}}},{"id":"10.6076/j7rn35sf","type":"dois","attributes":{"doi":"10.6076/j7rn35sf","identifiers":[],"creators":[{"name":"Ptashne, Mark; Gill, Grace; Brent, Roger","affiliation":[],"nameIdentifiers":[]}],"titles":[{"title":"Modularity of eukaryotic transcription 
activators"}],"publisher":"www.ergito.com","container":{},"publicationYear":2003,"subjects":[],"contributors":[],"dates":[{"date":"2003","dateType":"Issued"}],"language":null,"types":{"ris":"GEN","bibtex":"article","citeproc":"","schemaOrg":"Article","resourceTypeGeneral":"DataPaper"},"relatedIdentifiers":[],"relatedItems":[],"sizes":[],"formats":[],"version":null,"rightsList":[],"descriptions":[],"geoLocations":[],"fundingReferences":[],"url":"https://ezid.cdlib.org/tombstone/id/doi:10.6076/J7RN35SF","contentUrl":null,"metadataVersion":3,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":0,"downloadCount":0,"referenceCount":0,"citationCount":0,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2016-09-17T01:26:05Z","registered":"2016-09-17T01:26:07Z","published":null,"updated":"2026-02-24T22:13:20Z"},"relationships":{"client":{"data":{"id":"cdl.cdl","type":"clients"}}}},{"id":"10.6076/d19p44","type":"dois","attributes":{"doi":"10.6076/d19p44","identifiers":[],"creators":[{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]},{"name":"Rachtman, Eleonora","nameType":"Personal","givenName":"Eleonora","familyName":"Rachtman","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-6104-5750","nameIdentifierScheme":"ORCID"}]},{"name":"Bafna, Vineet","nameType":"Personal","givenName":"Vineet","familyName":"Bafna","affiliation":["University of California San Diego"],"nameIdentifiers":[]}],"titles":[{"title":"CONSULT: accurate contamination removal using locality-sensitive 
hashing"}],"publisher":"Dryad","container":{},"publicationYear":2024,"subjects":[{"subject":"Applied mathematics","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Computer Science Applications"},{"subject":"Genetics","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Molecular biology","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"structural biology"},{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"}],"contributors":[],"dates":[{"date":"2024-02-27T23:44:43Z","dateType":"Created"},{"date":"2024-02-28T00:29:58Z","dateType":"Submitted"},{"date":"2024-03-29T00:00:00Z","dateType":"Issued"},{"date":"2024-03-29T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/nargab/lqab071","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["231116935649 bytes"],"formats":[],"version":"4","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"A fundamental question appears in many bioinformatics applications: Does a\n sequencing read belong to a large dataset of genomes from some broad\n taxonomic group, even when the closest match in the set is evolutionarily\n divergent from the query? 
For example, low-coverage genome sequencing\n (skimming) projects either assemble the organelle genome or compute\n genomic distances directly from unassembled reads. Using unassembled reads\n needs contamination detection because samples often include reads from\n unintended groups of species. Similarly, assembling the organelle genome\n needs distinguishing organelle and nuclear reads. While k-mer-based\n methods have shown promise in read-matching, prior studies have shown that\n existing methods are insufficiently sensitive for contamination detection.\n Here, we introduce a new read-matching tool called CONSULT that tests\n whether k-mers from a query fall within a user-specified distance of the\n reference dataset using locality-sensitive hashing. Taking advantage of\n large memory machines available nowadays, CONSULT libraries accommodate\n tens of thousands of microbial species. Our results show that CONSULT has\n higher true-positive and lower false-positive rates of contamination\n detection than leading methods such as Kraken-II and improves distance\n calculation from genome skims. We also demonstrate that CONSULT can\n distinguish organelle reads from nuclear reads, leading to dramatic\n improvements in skim-based mitochondrial assemblies.","descriptionType":"Abstract"},{"description":"# Access to the data used for CONSULT benchmarking Data belonging to the\n following paper: * Rachtman, E., Bafna, V., \u0026 Mirarab, S. (2021).\n CONSULT: accurate contamination removal using locality-sensitive hashing.\n NAR Genomics and Bioinformatics.\n [doi:10.1093/nargab/lqab071](https://doi.org/10.1093/nargab/lqab071) ##\n Description of the data and file structure ## Drosophila data Genome and\n genome skims used for real Drosophila data analysis are provided. 
###\n Before clean-up #### `Dros_fastq_af_bbmerge.tar` This file contains\n deduplicated reads for Drosophila species before clean-up It contains the\n following Drosophila species in fq format: *\n `sub_Drosophila_ananassae_2.fq.gz`: Drosophila ananassae *\n `sub_Drosophila_biarmipes_2.fq.gz`: Drosophila biarmipes *\n `sub_Drosophila_bipectinata_2.fq.gz`: Drosophila bipectinata *\n `sub_Drosophila_erecta_2.fq.gz`: Drosophila erecta *\n `sub_Drosophila_eugracilis_2.fq.gz`: Drosophila eugracilis *\n `sub_Drosophila_mauritiana_2.fq.gz`: Drosophila mauritiana *\n `sub_Drosophila_mojavensis_2.fq.gz`: Drosophila mojavensis *\n `sub_Drosophila_persimilis_2.fq.gz`: Drosophila persimilis *\n `sub_Drosophila_pseudoobscura_2.fq.gz`: Drosophila pseudoobscura *\n `sub_Drosophila_sechellia_2.fq.gz`: Drosophila sechellia *\n `sub_Drosophila_simulans_2.fq.gz`: Drosophila simulans *\n `sub_Drosophila_virilis_2.fq.gz`: Drosophila virilis *\n `sub_Drosophila_willistoni_2.fq.gz`: Drosophila willistoni *\n `sub_Drosophila_yakuba_2.fq.gz`: Drosophila yakuba ####\n `Dros_fastq_af_human_removed.tar` This file contains reads for Drosophila\n species before clean-up but after the removal of human reads. 
It contains\n the following Drosophila species in fq format: *\n `ucseq_sub_Drosophila_ananassae_2.fq.gz`: Drosophila ananassae *\n `ucseq_sub_Drosophila_biarmipes_2.fq.gz`: Drosophila biarmipes *\n `ucseq_sub_Drosophila_bipectinata_2.fq.gz`: Drosophila bipectinata *\n `ucseq_sub_Drosophila_erecta_2.fq.gz`: Drosophila erecta *\n `ucseq_sub_Drosophila_eugracilis_2.fq.gz`: Drosophila eugracilis *\n `ucseq_sub_Drosophila_mauritiana_2.fq.gz`: Drosophila mauritiana *\n `ucseq_sub_Drosophila_mojavensis_2.fq.gz`: Drosophila mojavensis *\n `ucseq_sub_Drosophila_persimilis_2.fq.gz`: Drosophila persimilis *\n `ucseq_sub_Drosophila_pseudoobscura_2.fq.gz`: Drosophila pseudoobscura *\n `ucseq_sub_Drosophila_sechellia_2.fq.gz`: Drosophila sechellia *\n `ucseq_sub_Drosophila_simulans_2.fq.gz`: Drosophila simulans *\n `ucseq_sub_Drosophila_virilis_2.fq.gz`: Drosophila virilis *\n `ucseq_sub_Drosophila_willistoni_2.fq.gz`: Drosophila willistoni *\n `ucseq_sub_Drosophila_yakuba_2.fq.gz`: Drosophila yakuba ### After\n filtering #### `Dros_fastq_af_consult_filt.tar` This file contains\n Drosophila fastq after filtering with CONSULT. 
*\n `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila ananassae *\n `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila biarmipes *\n `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila bipectinata *\n `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta *\n `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ####\n `Dros_fastq_af_consult_GTDBfilt_p3c1.tar.gz` This file contains Drosophila\n species filtered with consult against GTDB with settings p = 3, c = 1 *\n `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila ananassae *\n `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila biarmipes *\n `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila bipectinata *\n `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta *\n `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n 
`ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ####\n `Dros_fastq_af_consult_GTDBfilt_p3c2.tar.gz` This file contains Drosophila\n species filtered with consult against GTDB with settings p=3, c = 2 *\n `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila ananassae *\n `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila biarmipes *\n `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila bipectinata *\n `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta *\n `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ####\n `Dros_fastq_af_consult_GTDBfilt_p4c2.tar` This file contains Drosophila\n species filtered with consult against GTDB with settings p = 4, c = 2 *\n `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila ananassae *\n `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila biarmipes *\n `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila bipectinata *\n `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta *\n `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n 
`ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ####\n `Dros_fastq_af_kraken_filt.tar` This file contains Drosophila fastq after\n filtering with Kraken * `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`:\n Drosophila ananassae * `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`:\n Drosophila biarmipes * `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`:\n Drosophila bipectinata * `ucseq_ucseq_sub_Drosophila_erecta_2.fq`:\n Drosophila erecta * `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`:\n Drosophila eugracilis * `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`:\n Drosophila mauritiana * `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`:\n Drosophila mojavensis * `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`:\n Drosophila persimilis * `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`:\n Drosophila pseudoobscura * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`:\n Drosophila sechellia * `ucseq_ucseq_sub_Drosophila_simulans_2.fq`:\n Drosophila simulans * `ucseq_ucseq_sub_Drosophila_virilis_2.fq`:\n Drosophila virilis * `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`:\n Drosophila willistoni * `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`:\n Drosophila yakuba #### `Dros_fastq_af_kraken_GTDBfilt_c0.00.tar` This file\n contains Drosophila reads after filtering with Kraken against GTDB with\n confidence 0.0: * `dimtrx_kraken_gtdbcusTax_Drosfilt_conf0.0.txt`:\n Distance matrix * `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila\n ananassae * `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila\n biarmipes * `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila\n bipectinata * `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta\n 
* `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ####\n `Dros_fastq_af_kraken_GTDBfilt_c0.04.tar.gz` This file contains Drosophila\n reads after filtering with Kraken against GTDB with confidence 0.04: *\n `ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila ananassae *\n `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila biarmipes *\n `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila bipectinata *\n `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta *\n `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ####\n `Dros_fastq_af_kraken_GTDBfilt_c0.05.tar` This file contains Drosophila\n reads after filtering with Kraken against GTDB with confidence 0.05: *\n 
`ucseq_ucseq_sub_Drosophila_ananassae_2.fq`: Drosophila ananassae *\n `ucseq_ucseq_sub_Drosophila_biarmipes_2.fq`: Drosophila biarmipes *\n `ucseq_ucseq_sub_Drosophila_bipectinata_2.fq`: Drosophila bipectinata *\n `ucseq_ucseq_sub_Drosophila_erecta_2.fq`: Drosophila erecta *\n `ucseq_ucseq_sub_Drosophila_eugracilis_2.fq`: Drosophila eugracilis *\n `ucseq_ucseq_sub_Drosophila_mauritiana_2.fq`: Drosophila mauritiana *\n `ucseq_ucseq_sub_Drosophila_mojavensis_2.fq`: Drosophila mojavensis *\n `ucseq_ucseq_sub_Drosophila_persimilis_2.fq`: Drosophila persimilis *\n `ucseq_ucseq_sub_Drosophila_pseudoobscura_2.fq`: Drosophila pseudoobscura\n * `ucseq_ucseq_sub_Drosophila_sechellia_2.fq`: Drosophila sechellia *\n `ucseq_ucseq_sub_Drosophila_simulans_2.fq`: Drosophila simulans *\n `ucseq_ucseq_sub_Drosophila_virilis_2.fq`: Drosophila virilis *\n `ucseq_ucseq_sub_Drosophila_willistoni_2.fq`: Drosophila willistoni *\n `ucseq_ucseq_sub_Drosophila_yakuba_2.fq`: Drosophila yakuba ### GORG\n Dataset We provide query summary reports (Kraken output) for GORG samples\n searched against TOL, GTDB and Bact/Arch Kraken using Kraken: *\n `gorg_conf0.00_kraken.tar.gz`: conf = 0.00 *\n `gorg_conf0.02_kraken.tar.gz`: conf = 0.02 *\n `gorg_conf0.05_kraken.tar.gz`: conf = 0.05 *\n `gorg_conf0.04_krakenGTDB.tar.gz`: conf = 0.04; note that, as shown in the\n paper, this threshold was only run for the GTDB dataset, which did not\n have enough coverage with other thresholds. Thus, only GTDB is included in\n this directory. We also provide the query sets used during testing: *\n `gorg_all_queries.tar.gz`: Contains the fastq files including the GORG\n query set. Each `.fq` file is one query from GORG. ### Mitochondrial\n Dataset: #### Original sequencing data: * `filt_fastq.tar.gz`: Contains\n filtered reads. Each fastq file is one query. These are after filtering,\n as described in the paper. 
* `unfiltered_fastq.tar.gz`: Contains\n unfiltered reads ## Chloroplast Data: Original sequencing data: *\n `fastq_affilt.tar.gz`: Contains filtered reads Chloroplast assemblies\n using various assembly methods: * filtered_spades: Spades applied to\n filtered reads * seed: Seed and extend method * base directory: get\n organelle Each nested folder of the following directories includes results\n of getOrganelle (log files, assemblies in fasta format, etc.). **Note:**\n refer to\n [https://github.com/Kinggerm/GetOrganelle](https://github.com/Kinggerm/GetOrganelle) for the description of files included in GetOrganelle results. * `getorganelle_afterfilt.tar.gz`: obtained from filtered reads using getOrganelle * `getorganelle_beforefilt.tar.gz`: obtained from unfiltered reads using getOrganelle Chloroplast annotations: * `annotations.tar.gz`: outputs of annotation software GeSeq; * For each of the eight samples that failed to be assembled fully without filtering (ERR2114804, ERR2114804, SRR2531285, SRR5500897, SRR7685402, SRR2531285, SRR5500897, SRR7685402), we show results of both filtered `filt` and unfiltered `_unfilt` annotations. * `*.fa` files show assemblies, `.gb` shows annotation results, and `.jpg` are drawings of the annotations. 
### Bacterial Simulated queries #### `excluded_fna_fq_downSmpl10M.tar` This file contains query samples for TOL query set used in the study * `10x_Cca.fq`: Carya cathayensis * `10x_Cil.fq`: Carya illinoinensis * `10x_Oryza_sativa.fq`: Plant Oryza sativa * `10x_Prunus_persica.fq`: Plant Prunus persica * `15x_Arabidopsis_lyrata.fq`: Plant Arabidopsis lyrata * `20x_Arabidopsis_thaliana.fq`: Plant Arabidopsis thaliana * `250x_Bathycoccus_prasinos.fq`: Plant Bathycoccus prasinos * `2x_Nicotiana_sylvestris.fq`: Plant Nicotiana sylvestris * `2x_Zea_mays.fq`: Plant Zea mays * `5x_Coffee_arabica.fq`: Plant Coffee arabica * `G000007005.fq`: Bacterial/Archaeal species * `G000007185.fq`: Bacterial/Archaeal species * `G000009965.fq`: Bacterial/Archaeal species * `G000011125.fq`: Bacterial/Archaeal species * `G000016385.fq`: Bacterial/Archaeal species * `G000016525.fq`: Bacterial/Archaeal species * `G000017185.fq`: Bacterial/Archaeal species * `G000018365.fq`: Bacterial/Archaeal species * `G000019605.fq`: Bacterial/Archaeal species * `G000022365.fq`: Bacterial/Archaeal species * `G000024305.fq`: Bacterial/Archaeal species * `G000091665.fq`: Bacterial/Archaeal species * `G000145295.fq`: Bacterial/Archaeal species * `G000151105.fq`: Bacterial/Archaeal species * `G000166095.fq`: Bacterial/Archaeal species * `G000173675.fq`: Bacterial/Archaeal species * `G000186365.fq`: Bacterial/Archaeal species * `G000189555.fq`: Bacterial/Archaeal species * `G000190155.fq`: Bacterial/Archaeal species * `G000195935.fq`: Bacterial/Archaeal species * `G000204585.fq`: Bacterial/Archaeal species * `G000215995.fq`: Bacterial/Archaeal species * `G000220645.fq`: Bacterial/Archaeal species * `G000221185.fq`: Bacterial/Archaeal species * `G000223395.fq`: Bacterial/Archaeal species * `G000231015.fq`: Bacterial/Archaeal species * `G000242875.fq`: Bacterial/Archaeal species * `G000243455.fq`: Bacterial/Archaeal species * `G000245135.fq`: Bacterial/Archaeal species * `G000253055.fq`: Bacterial/Archaeal species 
* `G000264495.fq`: Bacterial/Archaeal species * `G000302455.fq`: Bacterial/Archaeal species * `G000307305.fq`: Bacterial/Archaeal species * `G000317795.fq`: Bacterial/Archaeal species * `G000363885.fq`: Bacterial/Archaeal species * `G000375685.fq`: Bacterial/Archaeal species * `G000389735.fq`: Bacterial/Archaeal species * `G000399765.fq`: Bacterial/Archaeal species * `G000402095.fq`: Bacterial/Archaeal species * `G000421185.fq`: Bacterial/Archaeal species * `G000422285.fq`: Bacterial/Archaeal species * `G000437835.fq`: Bacterial/Archaeal species * `G000446015.fq`: Bacterial/Archaeal species * `G000495715.fq`: Bacterial/Archaeal species * `G000730285.fq`: Bacterial/Archaeal species * `G000746745.fq`: Bacterial/Archaeal species * `G000770635.fq`: Bacterial/Archaeal species * `G000816105.fq`: Bacterial/Archaeal species * `G000830275.fq`: Bacterial/Archaeal species * `G000830295.fq`: Bacterial/Archaeal species * `G000875775.fq`: Bacterial/Archaeal species * `G000955905.fq`: Bacterial/Archaeal species * `G000966265.fq`: Bacterial/Archaeal species * `G001004105.fq`: Bacterial/Archaeal species * `G001189275.fq`: Bacterial/Archaeal species * `G001315825.fq`: Bacterial/Archaeal species * `G001316025.fq`: Bacterial/Archaeal species * `G001316045.fq`: Bacterial/Archaeal species * `G001316145.fq`: Bacterial/Archaeal species * `G001316265.fq`: Bacterial/Archaeal species * `G001317345.fq`: Bacterial/Archaeal species * `G001399695.fq`: Bacterial/Archaeal species * `G001399795.fq`: Bacterial/Archaeal species * `G001402855.fq`: Bacterial/Archaeal species * `G001412615.fq`: Bacterial/Archaeal species * `G001438895.fq`: Bacterial/Archaeal species * `G001481595.fq`: Bacterial/Archaeal species * `G001484685.fq`: Bacterial/Archaeal species * `G001507935.fq`: Bacterial/Archaeal species * `G001508175.fq`: Bacterial/Archaeal species * `G001510225.fq`: Bacterial/Archaeal species * `G001510275.fq`: Bacterial/Archaeal species * `G001510295.fq`: Bacterial/Archaeal species * `G001515215.fq`: 
Bacterial/Archaeal species * `G001516665.fq`: Bacterial/Archaeal species * `G001516725.fq`: Bacterial/Archaeal species * `G001516745.fq`: Bacterial/Archaeal species * `G001560165.fq`: Bacterial/Archaeal species * `G001560565.fq`: Bacterial/Archaeal species * `G001563335.fq`: Bacterial/Archaeal species * `G001577775.fq`: Bacterial/Archaeal species * `G001587655.fq`: Bacterial/Archaeal species * `G001593925.fq`: Bacterial/Archaeal species * `G001595885.fq`: Bacterial/Archaeal species * `G001627075.fq`: Bacterial/Archaeal species * `G001628455.fq`: Bacterial/Archaeal species * `G001628475.fq`: Bacterial/Archaeal species * `G001674955.fq`: Bacterial/Archaeal species * `G001679155.fq`: Bacterial/Archaeal species * `G001685465.fq`: Bacterial/Archaeal species * `G001717005.fq`: Bacterial/Archaeal species * `G001723845.fq`: Bacterial/Archaeal species * `G001729285.fq`: Bacterial/Archaeal species * `G001776015.fq`: Bacterial/Archaeal species * `G001856825.fq`: Bacterial/Archaeal species * `G001870125.fq`: Bacterial/Archaeal species * `G001887595.fq`: Bacterial/Archaeal species * `G001914405.fq`: Bacterial/Archaeal species * `G001918455.fq`: Bacterial/Archaeal species * `G001918475.fq`: Bacterial/Archaeal species * `G001919175.fq`: Bacterial/Archaeal species * `G001920575.fq`: Bacterial/Archaeal species * `G001940645.fq`: Bacterial/Archaeal species * `G001940655.fq`: Bacterial/Archaeal species * `G001940665.fq`: Bacterial/Archaeal species * `G002009975.fq`: Bacterial/Archaeal species * `G002011035.fq`: Bacterial/Archaeal species * `G002011075.fq`: Bacterial/Archaeal species * `G900109425.fq`: Bacterial/Archaeal species * `G900156635.fq`: Bacterial/Archaeal species ### Reference Libraries Custom Kraken libraries constructed using different genomic reference sets are provided * `kraken_db_gtdb_genomes_reps_r95_k35l31s7_cp.tar.gz`: GTDB datasets with default Kraken taxonomy; this file is too big to be included here and is instead made available on 
[https://skmer.ucsd.edu/data/consult/kraken/](https://skmer.ucsd.edu/data/consult/kraken/) * `tree_of_life_noViral_unmasked_k35_l31_s7_cp.tar.gz`: TOL with default Kraken taxonomy * `tree_of_life_noViral_unmasked_k35_l31_s7_customtax_cp.tar.gz`: TOL with custom taxonomy ## Sharing/Access information See more on * [CONSULT software repository](https://github.com/noraracht/CONSULT) * [Raw data repository](https://github.com/noraracht/lsh_raw_data) * [Scripts repository](https://github.com/noraracht/lsh_scripts) * [Website with reference files](https://skmer.ucsd.edu/data/consult/)","descriptionType":"TechnicalInfo"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IIS-1815485","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"ACI-1548562","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D19P44","contentUrl":null,"metadataVersion":5,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":33,"downloadCount":1,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-10-04T22:46:56Z","registered":"2023-10-04T22:46:57Z","published":null,"updated":"2026-01-28T15:40:56Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1fg62","type":"dois","attributes":{"doi":"10.6076/d1fg62","identifiers":[],"creators":[{"name":"Rachtman, Eleonora","nameType":"Personal","givenName":"Eleonora","familyName":"Rachtman","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-6104-5750","nameIdentifierScheme":"ORCID"}]},{"name":"Balaban, 
Metin","nameType":"Personal","givenName":"Metin","familyName":"Balaban","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Bafna, Vineet","nameType":"Personal","givenName":"Vineet","familyName":"Bafna","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-5810-6241","nameIdentifierScheme":"ORCID"}]},{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"Genetics","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Ecology, Evolution, Behavior and Systematics"},{"subject":"Biotechnology","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"FOS: Biological 
sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"}],"contributors":[],"dates":[{"date":"2023-08-16T04:44:26Z","dateType":"Created"},{"date":"2023-08-16T07:30:11Z","dateType":"Submitted"},{"date":"2023-08-23T00:00:00Z","dateType":"Issued"},{"date":"2023-08-23T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1111/1755-0998.13135","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["97516203146 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"The ability to detect the identity of a sample obtained from its\n environment is a cornerstone of molecular ecological research. Thanks to\n the falling price of shotgun sequencing, genome skimming, the acquisition\n of short reads spread across the genome at low coverage, is emerging as an\n alternative to traditional barcoding. By obtaining far more data across\n the whole genome, skimming has the promise to increase the precision of\n sample identification beyond traditional barcoding while keeping the costs\n manageable. While methods for assembly-free sample identification based on\n genome skims are now available, little is known about how these methods\n react to the presence of DNA from organisms other than the target species.\n In this paper, we show that the accuracy of distances computed between a\n pair of genome skims based on k-mer similarity can degrade dramatically if\n the skims include contaminant reads; i.e., any reads originating from\n other organisms. 
We establish a theoretical model of the impact of\n contamination. We then suggest and evaluate a solution to the\n contamination problem: Query reads in a genome skim against an extensive\n database of possible contaminants (e.g., all microbial organisms) and\n filter out any read that matches. We evaluate the effectiveness of this\n strategy when implemented using Kraken-II, in detailed analyses. Our\n results show substantial improvements in accuracy as a result of filtering\n but also point to limitations, including a need for relatively close\n matches in the contaminant database.","descriptionType":"Abstract"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IIS-1815485","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1FG62","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":124,"downloadCount":2,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-08-23T21:59:58Z","registered":"2023-08-23T21:59:59Z","published":null,"updated":"2026-01-28T15:40:10Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1tp4s","type":"dois","attributes":{"doi":"10.6076/d1tp4s","identifiers":[],"creators":[{"name":"Pommier, Anne","nameType":"Personal","givenName":"Anne","familyName":"Pommier","affiliation":["Carnegie Institution for Science"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0003-3182-1912","nameIdentifierScheme":"ORCID"}]},{"name":"Tauber, Michael","nameType":"Personal","givenName":"Michael","familyName":"Tauber","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Pirotte, 
Hadrien","nameType":"Personal","givenName":"Hadrien","familyName":"Pirotte","affiliation":["University of Liège"],"nameIdentifiers":[]},{"name":"Cody, George","nameType":"Personal","givenName":"George","familyName":"Cody","affiliation":["Carnegie Institution for Science"],"nameIdentifiers":[]},{"name":"Steele, Andrew","nameType":"Personal","givenName":"Andrew","familyName":"Steele","affiliation":["Carnegie Institution for Science"],"nameIdentifiers":[]},{"name":"Bullock, Emma","nameType":"Personal","givenName":"Emma","familyName":"Bullock","affiliation":["Carnegie Institution for Science"],"nameIdentifiers":[]},{"name":"Charlier, Bernard","nameType":"Personal","givenName":"Bernard","familyName":"Charlier","affiliation":["University of Liège"],"nameIdentifiers":[]},{"name":"Mysen, Bjorn","nameType":"Personal","givenName":"Bjorn","familyName":"Mysen","affiliation":["Carnegie Institution for Science"],"nameIdentifiers":[]}],"titles":[{"title":"Transport and structural properties of reduced S-bearing glasses and melts"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Earth and related environmental sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Earth and related environmental sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Mercury"},{"subject":"silicate glasses"},{"subject":"silicate melts"},{"subject":"Sulfur","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Impedance spectroscopy","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"NMR spectroscopy","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Raman 
spectroscopy","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"}],"contributors":[],"dates":[{"date":"2023-07-05T16:36:47Z","dateType":"Created"},{"date":"2023-09-28T09:21:16Z","dateType":"Submitted"},{"date":"2023-10-03T00:00:00Z","dateType":"Issued"},{"date":"2023-10-03T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1016/j.gca.2023.10.027","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["1095728 bytes"],"formats":[],"version":"6","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Elucidating the role of sulfur on the structure of silicate glasses and\n melts at elevated pressures and temperatures is important for\n understanding transport properties, such as electrical conductivity and\n viscosity, of magma oceans and mantle-derived melts. These properties are\n fundamental to modeling the evolution of terrestrial planets and moons.\n Despite several investigations of sulfur speciation in glasses, questions\n remain regarding the effect of S on complex glasses at highly reducing\n conditions relevant to Mercury. Glasses were synthetized with compositions\n representative of the Northern Volcanic Plains of Mercury and containing\n quantities of S as high as 5 wt.%. Multiple spectroscopic methods and\n microprobe analyses were employed to probe the glasses, including in situ\n impedance spectroscopy at 2- and 4-GPa pressures and temperatures up to\n 1740 K using a multi-anvil press, 29Si NMR spectroscopy, and Raman\n spectroscopy. 
Electrical activation energies (Ea) in the glassy state\n range from 0.56 to 1.10 eV, in agreement with sodium as the main charge\n carrier. The electrical measurements suggest that sulfide improves Na+\n transport and may overcome a known impeding effect of the divalent cation\n Ca2+. The glass transition temperature lies between 700 and 750 K, and for\n temperatures up to 970 K Ea decreases (0.35-0.68 eV) and the\n conductivities of the samples converge (~5-8 ×10-3 S/m). At Tquench, the\n melt fraction is 50-70% and melt conductivity varies from 0.7 to 2.2 S/m,\n with the sample containing 5 wt.% S the most conductive among the set.\n 29Si NMR spectra reveal that a high fraction of S bonds with Si in these\n complex glasses, an important insight that has not been recognized\n previously. Raman spectra and maps reveal regions rich in Ca-S or Mg-S\n bonds. The evidence of sulfide interactions with both Si and Ca/Mg suggests\n that alkaline earth sulfides can be considered weak network modifiers in\n these glasses, under highly reduced conditions.  ","descriptionType":"Abstract"},{"description":"Experimental data from impedance spectroscopy, NMR spectroscopy,\n Raman spectroscopy and electron microprobe analyses. \n The experiments and analyses are described in the\n manuscript.","descriptionType":"Methods"},{"description":"# Title of Dataset Data used in manuscript GCA-D-23-00571 entitled\n “Experimental Investigation of the Bonding of Sulfur in Highly Reduced\n Silicate Glasses and Melts” by A. Pommier, M. J. Tauber, H. Pirotte, G.D.\n Cody, A. Steele, E.S. Bullock, B. Charlier, and B.O. Mysen (Geochimica et\n Cosmochimica Acta). The spreadsheet lists all the measurements shown in\n the figures of the manuscript. Each tab corresponds to a figure: -Figures 2\n and 3: electron microprobe analyses. 
-Figures 4 and 5: impedance\n spectroscopy -Figure 6: NMR spectroscopy -Figures 8 and 9: Raman\n spectroscopy -Figure 10: NBO/T estimates The reader is referred to the\n manuscript for details about the experimental procedures and results. ##\n Description of the data and file structure * Figure 2 tab: Each sample\n name starts with BBC. For each sample, electron microprobe analyses are\n shown for traverses across the sample and the content of each oxide is in\n wt.%. * Figure 3 tab: Microprobe traverse in sample BBC16. The content of\n each oxide is in wt.%. * Figure 4: Impedance spectra of samples BBC13 and\n BBC17 at selected temperatures. For each temperature, the different\n columns correspond to the frequency, time, real component (Z') and\n imaginary component (Z\"). * Figure 5 tab: electrical resistance (R)\n and conductivity (EC) of different samples as a function of temperature\n (T). Each sample name starts with BBC. For each sample, the different\n columns correspond to temperature (in degC and K), inverse T, resistance,\n conductivity and Ln (conductivity). * Figure 6: NMR spectra for four\n starting glasses (VT48, 52, 53, 54). The first column is the chemical\n shift, and the other columns correspond to the intensity of each sample. *\n Figure 7: Raman spectrum of starting glass VT55. The first column is the\n Raman shift, and the second column corresponds to the intensity of the\n sample. * Figure 8: Raman spectra of sulfide components in starting glass\n VT52 and samples from experiments BBC17 and BBC18. For each sample, the\n first column is the Raman shift, and the second column corresponds to the\n intensity of the sample.","descriptionType":"TechnicalInfo"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"U.S. 
National Science Foundation","awardNumber":"EAR-1750746","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1TP4S","contentUrl":null,"metadataVersion":8,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":69,"downloadCount":8,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-10-03T08:03:23Z","registered":"2023-10-03T08:03:24Z","published":null,"updated":"2026-01-28T15:21:59Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d16w2h","type":"dois","attributes":{"doi":"10.6076/d16w2h","identifiers":[],"creators":[{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]},{"name":"Yin, John","nameType":"Personal","givenName":"John","familyName":"Yin","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Zhang, Chao","nameType":"Personal","givenName":"Chao","familyName":"Zhang","affiliation":["University of California San Diego"],"nameIdentifiers":[]}],"titles":[{"title":"ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"ASTARL"},{"subject":"species tree estimation"},{"subject":"Simphy simulations"},{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology 
(FOS)"}],"contributors":[],"dates":[{"date":"2023-06-30T00:44:07Z","dateType":"Created"},{"date":"2023-06-30T00:55:22Z","dateType":"Submitted"},{"date":"2023-07-05T00:00:00Z","dateType":"Issued"},{"date":"2023-07-05T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/bioinformatics/btz211","relatedIdentifierType":"DOI"},{"relationType":"IsDerivedFrom","relatedIdentifier":"https://github.com/smirarab/ASTRAL/tree/MP","relatedIdentifierType":"URL"}],"relatedItems":[],"sizes":["7966541184 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Motivation Evolutionary histories can change from one part of the genome\n to another. The potential for discordance between the gene trees has\n motivated the development of summary methods that reconstruct a species\n tree from an input collection of gene trees. ASTRAL is a widely used\n summary method and has been able to scale to relatively large datasets.\n However, the size of genomic datasets is quickly growing. Despite its\n relative efficiency, the current single-threaded implementation of ASTRAL\n is falling behind the data growth trends and is not able to analyze the\n largest available datasets in a reasonable time.   Results ASTRAL\n uses dynamic programing and is not trivially parallel. In this paper, we\n introduce ASTRAL-MP, the first version of ASTRAL that can exploit\n parallelism and also uses randomization techniques to speed up some of its\n steps. 
Importantly, ASTRAL-MP can take advantage of not just multiple CPU\n cores but also one or several graphics processing units (GPUs). The\n ASTRAL-MP code scales very well with increasing CPU cores, and its GPU\n version, implemented in OpenCL, can have up to 158× speedups compared to\n ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze\n datasets with 10,000 species or datasets with more than 100,000 genes in\n \u0026lt;2 days.  Availability and implementation ASTRAL-MP is\n available at https://github.com/smirarab/ASTRAL/tree/MP. ","descriptionType":"Abstract"},{"description":"In testing the efficiency of\n ASTRAL-MP, we use several simulated and real datasets (see Table). The\n datasets range in the number of species (\u003cem\u003en\u003c/em\u003e) between\n 48 and 1,000 and have between 1,000 and 14,446 gene trees\n (\u003cem\u003ek\u003c/em\u003e).  \n Name\n Original publication # Species\n (n) # Genes (k) Type\n # Generations Contraction\n threshold # Reps. \n SV \n Mirarab and\n Warnow (2015)  100, 200, 500,\n 1000  1000 \n Simulated  2×10\u003csup\u003e6\u003c/sup\u003e  Fully resolved  10  Avian  Mirarab \u003cem\u003eet al.\u003c/em\u003e (2014a)  48  14,446, 1000  Real  Unknown (order: 10\u003csup\u003e7\u003c/sup\u003e)  Full, 0, 33, 50, 75%  1, 10  Insects  Sayyari \u003cem\u003eet al.\u003c/em\u003e (2017)  144  1478  Real  Unknown  Fully resolved  1  \u003cem\u003eNote\u003c/em\u003e: For SV, some outlier replicates have fewer than 1,000 genes because poorly resolved gene trees are removed. For avian, the full dataset is subsampled randomly to create 10 inputs with 1,000 gene trees. In addition, to test limits of \u003cem\u003en\u003c/em\u003e, we used an existing simulated dataset (20 replicates) with 10\u003csup\u003e4\u003c/sup\u003e species and 1000 gene trees similarly to the SV1000 dataset.  
To test limits of \u003cem\u003ek\u003c/em\u003e, we used an insect transcriptomic dataset (Misof \u003cem\u003eet al.\u003c/em\u003e, 2014; Sayyari \u003cem\u003eet al.\u003c/em\u003e, 2017) with 144 taxa and 1,478 genes, each with 100 bootstrapped gene trees. ","descriptionType":"Methods"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"1565862","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"ACI-1053575","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D16W2H","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":177,"downloadCount":27,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-07-05T14:41:04Z","registered":"2023-07-05T14:41:05Z","published":null,"updated":"2026-01-28T15:17:24Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1bp4f","type":"dois","attributes":{"doi":"10.6076/d1bp4f","identifiers":[],"creators":[{"name":"Healy, Timothy","nameType":"Personal","givenName":"Timothy","familyName":"Healy","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-1880-6913","nameIdentifierScheme":"ORCID"}]},{"name":"Burton, Ronald","nameType":"Personal","givenName":"Ronald","familyName":"Burton","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-6995-5329","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Data from: Genetic incompatibilities in reciprocal hybrids between 
populations of Tigriopus californicus with low to moderate mitochondrial sequence divergence"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"copepod"},{"subject":"Mitonuclear"},{"subject":"Mitochondria","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"Coevolution","schemeUri":"https://github.com/PLOS/plos-thesaurus","subjectScheme":"PLOS Subject Area Thesaurus"},{"subject":"PoolSeq"},{"subject":"intergenomic"}],"contributors":[],"dates":[{"date":"2023-06-28T17:29:16Z","dateType":"Created"},{"date":"2023-06-28T17:31:33Z","dateType":"Submitted"},{"date":"2023-07-02T00:00:00Z","dateType":"Issued"},{"date":"2023-07-02T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/2022.09.19.508600","relatedIdentifierType":"DOI"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/evolut/qpad122","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["2204575570 bytes"],"formats":[],"version":"2","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"All mitochondrial-encoded proteins and RNAs function through interactions\n with nuclear-encoded proteins, which are critical for mitochondrial\n performance and eukaryotic fitness. 
Coevolution maintains inter-genomic\n (i.e., mitonuclear) compatibility within a taxon, but hybridization can\n disrupt coevolved interactions, resulting in hybrid breakdown. Thus,\n mitonuclear incompatibilities may be important mechanisms underlying\n reproductive isolation and, potentially, speciation. Here we utilize\n Pool-seq to assess the effects of mitochondrial genotype on nuclear allele\n frequencies in fast- and slow-developing reciprocal inter-population F2\n hybrids between relatively low-divergence populations of the intertidal\n copepod Tigriopus californicus. We show that mitonuclear interactions lead\n to elevated frequencies of coevolved (i.e., maternal) nuclear alleles on\n two chromosomes in crosses between populations with 1.5% or 9.6% fixed\n differences in mitochondrial DNA nucleotide sequence. However, we also\n find evidence of excess mismatched (i.e., non-coevolved) alleles on three\n or four chromosomes per cross, respectively, and of allele frequency\n differences consistent with effects involving only nuclear loci (i.e.,\n unaffected by mitochondrial genotype). 
Thus, our results for\n low-divergence crosses suggest an underlying role for mitonuclear\n interactions in variation in hybrid developmental rate, but despite\n substantial effects of mitonuclear coevolution on individual chromosomes,\n no clear bias favouring coevolved interactions overall.","descriptionType":"Abstract"},{"description":"Pool-seq count data - collected by isolating DNA from pools of\n fast- or slow-developing \u003cem\u003eTigriopus\n californicus \u003c/em\u003ecopepodids from reciprocal hybrids lines; DNA was\n sequenced by whole-genome sequencing on a NovaSeq 6000; reads were mapped\n to a hybrid reference genome and counted using \u003cem\u003ebwa\u003c/em\u003e;\n allele-specific read counts were determined with\n \u003cem\u003eSamtools\u003c/em\u003e and\n \u003cem\u003epopoolation.\u003c/em\u003e Population-specific\n reference genomes - re-sequenced genome sequences for individual\n \u003cem\u003eTigriopus californicus\u003c/em\u003e populations; obtained by\n mapping whole-genome sequencing reads to the previously published San\n Diego, California reference sequence; population-specific consensus\n sequences were determined\n with \u003cem\u003ebwa\u003c/em\u003e, \u003cem\u003eSamtools\u003c/em\u003e\n and \u003cem\u003epopoolation\u003c/em\u003e. PE mtDNA sequence\n - consensus sequence from WGS reads from the Pescadero Beach, California\n \u003cem\u003eTigriopus californicus\u003c/em\u003e population; reads were mapped\n to a previously published mtDNA sequence for a population from Santa Cruz,\n California with \u003cem\u003eCLC Genomics Workbench\u003c/em\u003e which was also\n used to determine the consensus PE sequence. Hybrid\n reference genomes - assembled from previously published sequences and the\n PE mtDNA sequence using \u003cem\u003eR\u003c/em\u003e. 
Fixed\n SNP lists - list of single-nucleotide polymorphisms displaying fixed\n differences between pairs of \u003cem\u003eTigriopus californicus\u003c/em\u003e\n populations; determined by reciprocal whole-genome sequencing read mapping\n and processing\n with \u003cem\u003ebwa\u003c/em\u003e, \u003cem\u003eSamtools\u003c/em\u003e\n and \u003cem\u003epopoolation\u003c/em\u003e.","descriptionType":"Methods"},{"description":"All files can be opened in text editors, spreadsheet programs or\n the statistical software R.","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"DEB1556466","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IOS1754347","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1BP4F","contentUrl":null,"metadataVersion":7,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":150,"downloadCount":20,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-07-03T01:44:54Z","registered":"2023-07-03T01:44:55Z","published":null,"updated":"2026-01-28T15:17:00Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1m59n","type":"dois","attributes":{"doi":"10.6076/d1m59n","identifiers":[],"creators":[{"name":"Balaban, Metin","nameType":"Personal","givenName":"Metin","familyName":"Balaban","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-6947-5915","nameIdentifierScheme":"ORCID"}]},{"name":"Jiang, Yueyu","nameType":"Personal","givenName":"Yueyu","familyName":"Jiang","affiliation":["University of 
California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-8425-7556","nameIdentifierScheme":"ORCID"}]},{"name":"Roush, Daniel","nameType":"Personal","givenName":"Daniel","familyName":"Roush","affiliation":["Arizona State University"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-8025-2117","nameIdentifierScheme":"ORCID"}]},{"name":"Zhu, Qiyun","nameType":"Personal","givenName":"Qiyun","familyName":"Zhu","affiliation":["Arizona State University"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0002-3568-6271","nameIdentifierScheme":"ORCID"}]},{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Fast and accurate distance‐based phylogenetic placement using divide and conquer"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"phylogenetic placement"},{"subject":"Distance-based 
methods"},{"subject":"divide-and-conquer"}],"contributors":[],"dates":[{"date":"2023-06-27T17:00:51Z","dateType":"Created"},{"date":"2023-06-27T17:22:32Z","dateType":"Submitted"},{"date":"2023-07-06T00:00:00Z","dateType":"Issued"},{"date":"2023-07-06T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1111/1755-0998.13527","relatedIdentifierType":"DOI"},{"relationType":"IsDerivedFrom","relatedIdentifier":"https://github.com/balabanmetin/apples2-data","relatedIdentifierType":"URL"}],"relatedItems":[],"sizes":["12879371295 bytes"],"formats":[],"version":"4","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Phylogenetic placement of query samples on an existing phylogeny is\n increasingly used in molecular ecology, including sample identification\n and microbiome environmental sampling. As the size of available reference\n trees used in these analyses continues to grow, there is a growing need\n for methods that place sequences on ultra-large trees with high accuracy.\n Distance-based placement methods have recently emerged as a path to\n provide such scalability while allowing flexibility to analyse both\n assembled and unassembled environmental samples. In this study, we\n introduce a distance-based phylogenetic placement method, APPLES-2, that\n is more accurate and scalable than existing distance-based methods and\n even some of the leading maximum-likelihood methods. 
This scalability is\n owed to a divide-and-conquer technique that limits distance calculation\n and phylogenetic placement to parts of the tree most relevant to each\n query. The increased scalability and accuracy enable us to study the\n effectiveness of APPLES-2 for placing microbial genomes on a data set of\n 10,575 microbial species using subsets of 381 marker genes. APPLES-2 has\n very high accuracy in this setting, placing 97% of query genomes within\n three branches of the optimal position in the species tree using 50 marker\n genes. Our proof-of-concept results show that APPLES-2 can quickly place\n metagenomic scaffolds on ultra-large backbone trees with high accuracy as\n long as a scaffold includes tens of marker genes. These results pave the\n path for a more scalable and widespread use of distance-based placement in\n various areas of molecular ecology.","descriptionType":"Abstract"},{"description":"See README file for details. We include both simulated datasets\n and real datasets modified from the WoL resources. 
","descriptionType":"Methods"},{"description":"See the README file for links to tools, referenced here: https://github.com/balabanmetin/apples2-data","descriptionType":"Other"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IIS-1845967","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"1815485","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1M59N","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":144,"downloadCount":9,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-07-06T16:12:46Z","registered":"2023-07-06T16:12:47Z","published":null,"updated":"2026-01-28T15:03:19Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1qw25","type":"dois","attributes":{"doi":"10.6076/d1qw25","identifiers":[],"creators":[{"name":"Balaban, Metin","nameType":"Personal","givenName":"Metin","familyName":"Balaban","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]}],"titles":[{"title":"Phylogenetic double placement of mixed samples"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological 
sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"genome skimming"},{"subject":"hybrid detection"},{"subject":"mixture analyses"},{"subject":"phlogenetic placement"},{"subject":"Distance-based phylogenetics"}],"contributors":[],"dates":[{"date":"2023-09-07T15:52:27Z","dateType":"Created"},{"date":"2023-09-07T18:39:02Z","dateType":"Submitted"},{"date":"2023-11-17T00:00:00Z","dateType":"Issued"},{"date":"2023-11-17T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/bioinformatics/btaa489","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["124438363255 bytes"],"formats":[],"version":"5","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Motivation Consider a simple computational problem. The inputs are (i) the\n set of mixed reads generated from a sample that combines two organisms and\n (ii) separate sets of reads for several reference genomes of known\n origins. The goal is to find the two organisms that constitute the mixed\n sample. When constituents are absent from the reference set, we seek to\n phylogenetically position them with respect to the underlying tree of the\n reference species. This simple yet fundamental problem (which we call\n phylogenetic double-placement) has enjoyed surprisingly little attention\n in the literature. 
As genome skimming (low-pass sequencing of genomes at\n low coverage, precluding assembly) becomes more prevalent, this problem\n finds wide-ranging applications in areas as varied as biodiversity\n research, food production and provenance, and evolutionary reconstruction.\n Results We introduce a model that relates distances between a mixed sample\n and reference species to the distances between constituents and reference\n species. Our model is based on Jaccard indices computed between each\n sample represented as k-mer sets. The model, built on several assumptions\n and approximations, allows us to formalize the phylogenetic\n double-placement problem as a non-convex optimization problem that\n decomposes mixture distances and performs phylogenetic placement\n simultaneously. Using a variety of techniques, we are able to solve this\n optimization problem numerically. We test the resulting method, called\n MIxed Sample Analysis tool (MISA), on a varied set of simulated and\n biological datasets. Despite all the assumptions used, the method performs\n remarkably well in practice. ","descriptionType":"Abstract"},{"description":"# Data from: Phylogenetic double placement of mixed samples ## Citation\n Balaban, M., \u0026amp; Mirarab, S. (2020). Phylogenetic double placement\n of mixed samples. 
Bioinformatics (Oxford, England), 36(1), i335–i343.\n [doi:10.1093/bioinformatics/btaa489](https://doi.org/10.1093/bioinformatics/btaa489) ## Description of the data and file structure In all the datasets, files called `*results*.csv` have the following columns: * 1st column: `query` gives the query name, * 2nd column: is one of * `alien` is when both parents are removed from ref * `partial` is when one parent is removed from ref * `present` is when neither parent is removed from ref * 3rd column: the name of the method * 4th column: Either Primary or Secondary, for the two placements; primary is always the one with lower error * 5th column: Placement error in edges * [optional] 6th column: the `k` value used ### Columbicola (Lice) dataset (simulated mixture) To evaluate the accuracy of our method on genome skimming data, we use a set of 61 genome skims by Boyd et al. (2017) (PRJNA296666), including 45 known Lice species (some represented multiple times) and seven undescribed species. We use randomly subsampled genome-skims of 4 Gb. We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and remove duplicated reads. Then, we create five replicates each containing 20 organisms sampled from the full dataset at random. For each replicate, we simulate five mixtures with A and B chosen uniformly at random. We simulate mixtures by simply combining preprocessed genome skims of the two constituents. The exact coverage of the genome skims is unknown but is estimated to range between 4X and 15X by Skmer. The following archives are provided: * Each `SRR*.fastq.bz2` gives the preprocessed genome skim of one lice sample. These are the genome skims of lice used in this study, adapted from [Boyd et al](http://www.ncbi.nlm.nih.gov/pubmed/28108601). In contrast to the original genomes, these files are preprocessed using BBTools. 
* `gold.tree`: The reference tree of the samples, used as the gold standard #### `lice.tar.gz` Once you untar the file, all the files are under a `oasis/projects/nsf/uot138/balaban/mixture/` folder. These files are related to actual leave-out experiments with different values of `k` (e.g., 21, 27, ..., 31). Recall that for each, we do 5 replicates of subsampling of backbone and for each, we have 5 replicates of queries. Under this folder, we have the following files. * `lice/ktest/all_results.csv`: A summary of all placement accuracy results for all methods across all tests with k=21 and k=31 * `lice/ktest/additivity_eror.sh`: A small tester script used to find additivity error * `lice/ktest/[k]/skmer:` * `diagreport.txt`: Error using APPLES various criteria (FM, etc.) * `dist.mat`: Skmer distance matrix * `jaccard.txt`: similarity matrix according to Jaccard * `meta_backbone.tree`: backbone tree * `library`: includes a `CONFIG` file giving skmer configuration. In addition, for each skim, we have: * `*.dat`: the skmer estimation of parameters such as coverage, length, etc. * `*.hist`: repeat spectra * `*.msh`: mash sketch * `lice/ktest/[k]/exp-data/[sample replicate]`: * `species.txt`: list of species included in this sample replicate * `diameter.txt`: Diameter of the tree * `true.tree`: true tree in newick format * `meta_backbone.tree`: true tree with branch lengths recomputed * `queries/[query rep]`: * `things.txt`: name of query genomes (mixture) in this replicate * `dist.mat` and/or `dist.txt`: the distance from the mixture to each reference * Three folders: 1. `alien` is when both parents are removed from ref 2. `partial` is when one parent is removed from ref 3. 
`present` is when neither parent is removed from ref Each of these folders includes these files: * `results_[method].csv`: placement error of different methods * `[method].nwk`: results of all methods in newick format * `backbone.tree`: the backbone tree used in analyses * `baseline.*`: the best * `lice/scripts/`: helper scripts used to run analyses, packaged for future reference * `extract_error_from_jplace.py`: given jplace, extracts the error field output by APPLES * `misa-lice.sh`: runs misa on lice * `j2d.py`: translated Jaccard to phylogenetic distance * `push_backbones.sh`: creates the backbone for each replicate * `reference-skim-parallel.sh`: Run skmer to create the skmer libraries ### Yeast dataset (real hybridization) In addition to simulated mixtures, we create a dataset of real hybrid yeast species. We select representative genomes for eight non-hybrid Saccharomyces species with assemblies available on NCBI. We also created a second extended dataset where we included seven more species from Genera Naumovozyma, Nakaseomyces, and Candida (see Supplementary Table S2 for accession numbers). We curate four assembled and two unassembled strains of hybrid yeast species, some of which were previously analyzed by Langdon et al. (2018). Unassembled hybrid strains muri (Krogerus et al., 2018) and YMD3265 are subsampled from NCBI SRA to 100Mb and filtered for contaminants in the same fashion as the previous dataset. We do not include strains such as Saccharomyces bayanus which are conjectured to be a hybrid of three species (Libkind et al., 2011). For each hybrid species, the hypothesized ancestors are known from the literature (Krogerus et al., 2018; Langdon et al., 2018, 2019) and NCBI Taxonomy annotation, and we use these postulated ancestors as the ground truth. The archive `yeast.tar.gz` is provided. All experiment intermediate and output files, scripts, and Skmer sketches for all k=[21,23,25,27,29,31]. 
The archive has the following subdirectories: The file includes (all prefixed by `oasis/projects/nsf/uot138/balaban/mixture/yeast/`): * k-mer size `k=[21,23,25,27,29,31]`, * `[query]` being one of the genomes, * `cond` being either `present` (both ancestors present) or `partial` (one ancestor present) or `alien` (no ancestor present). * `[db]` is either `base` for the smaller datasets of relevant yeast, or `extended` for the larger dataset with all the yeasts * `method` being one of the methods, APPLES, MISA, or TOP2 * `[data type]` being one of `assembly` for assemblies and `genome-skim` for genome skims. The files provided include: * `ktest`: Each experiment directory for parameters: * `ktest/all_results.csv`: the errors of methods across all the analyses * `ktest/meta_backbone.tree`: please ignore this file. Backbone trees specific to each k are given under `skmer` library. * `ktest/[k]/exp-data/all_results.csv`: the error values for this particular value of `k` * `ktest/[k]/exp-data/[data type]/[query]/dist.\\*.mat`or`dist.\\*txt`: gives the full distance matrix from this query to all references * `ktest/[k]/exp-data/[data type]/[query]/[query].fna`or`[query].fastq`: The genome in fna or genome skims in `fastq` formats * `ktest/[k]/exp-data/[data type]/[query]/things.txt`: name of query genomes (mixture) in this replicate * `ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/results\\_[method].csv`: gives the error for a condition * `ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/[method].nwk`or`[method].jplace`: gives the actual result of each method in newick or jplace formats * `ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/backbone.tree`: the backbone tree after removing queries * `ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/true.tree`: the tree with correct placements marked for queries * `ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/log.out`or`log.err`: the log file giving details of each run * `ktest/[k]/skmer`: Has the skmer 
library used in the analyses, including the * `.dat` files (library info), * library config (`CONFIG`), * the mash sketches (`.msh`), * reference trees (`meta_backbone.tree`), * distance matrices (`ref-dist-mat.txt`) * `genomes`: Yeast genome assemblies. * For each genome, we give the `.fna` file * For hybrid genomes (named in `genomes/hybrids.txt`) we also give the names of the ancestors (`genomes/[genome]/things.txt`). For non-hybrids (`genomes/nonhybrids.txt`) this is meaningless. * `genomes/nonhybrids_and_outlier` lists non-hybrids including the outgroups (the extended set mentioned above), which are species not in the Saccharomyces genus. * The genomes are also available at [doi: 10.5281/zenodo.6974987](https://doi.org/10.5281/zenodo.6974987) * SRA-subsample: Genome created by subsampling SRAs for the genome assemblies; here, * `dist-[query].txt` gives the distance matrix obtained * `misa.jplace`: the MISA results in jplace format * `log.out`, `[query].log` and `log.err`: log files of the experiment * `fastq` and `meta_backbone.tree` files give the input data (subsampled reads and the backbone tree) ### Drosophila dataset (simulated mixture) We use a set of 14 Drosophila assemblies published by Miller et al. (2018) (Supplementary Table S1) to evaluate the accuracy of our approach in an ideal setting where the mixed sample consists of the concatenation of the assemblies. We test 20 simulated mixtures of randomly chosen species in three scenarios where none, one, or both of the constituents are present in the reference library. The following archives are provided under `oasis/projects/nsf/uot138/balaban/mixture/drosophila` in `drosophila.tar.gz`. All experiment intermediate and output files, scripts, and Skmer sketches for all * `k` is one of 21,23,25,27,29, or 31 * `cond` being either `present` (both ancestors present) or `partial` (one ancestor present) or `alien` (no ancestor present). 
* `method` being one of the methods, APPLES, MISA, or TOP2; note that `baseline` also represents APPLES. The archive has the following subdirectories: * `assembly`: Drosophila genomes published by Miller et al. (2018). * `topo.tree`: The gold standard phylogeny for Drosophila (i.e., the backbone tree). * `ktest`: Each experiment directory for parameters: * `ktest/all_results.csv`: the errors of methods across all the analyses * `ktest/dist.mat`: please ignore. The distance matrices for each analysis are given below. * `ktest/[k]/exp-data/all_results.csv`: the error values of the analyses for this particular `k` * `ktest/[k]/exp-data/[query]/all_results.csv`: error values pertaining to this query * `ktest/[k]/exp-data/[query]/species.txt`: list of all the species, same order * `ktest/[k]/exp-data/[query]/dist.*.mat` or `dist.*txt`: gives the full distance matrix from this query to all references * `ktest/[k]/exp-data/[query]/things.txt`: name of query genomes (mixture) in this replicate * `ktest/[k]/exp-data/[query]/[cond]/results_[method].csv`: gives the error for a condition * `ktest/[k]/exp-data/[query]/[cond]/[method].nwk` or `[method].jplace`: gives the actual result of each method in newick or jplace formats * `ktest/[k]/exp-data/[query]/[cond]/backbone.tree`: the backbone tree after removing queries * `ktest/[k]/exp-data/[query]/[cond]/log.out` or `log.err`: the log file giving details of each run * `ktest/[k]/skmer`: Has the skmer library used in the analyses, including: * the library info such as coverage (`.dat`), * the mash sketches (`.msh`), * library config (`CONFIG`), * reference trees (`meta_backbone.tree`), * the FASTME log file (`dist.mat_fastme_stat.txt`), * distance matrices (`ref-dist-mat.txt`) ## Sharing/Access information See more on: * [MISA code](https://github.com/balabanmetin/misa) * [MISA dataset 
github](https://github.com/balabanmetin/misa-data).","descriptionType":"TechnicalInfo"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"NSF-1815485","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"NSF-1845967","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1QW25","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":96,"downloadCount":29,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-10-04T22:45:22Z","registered":"2023-10-04T22:45:23Z","published":null,"updated":"2026-01-28T15:02:24Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d10c7c","type":"dois","attributes":{"doi":"10.6076/d10c7c","identifiers":[],"creators":[{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]},{"name":"Warnow, Tandy","nameType":"Personal","givenName":"Tandy","familyName":"Warnow","affiliation":["University of Illinois Urbana-Champaign"],"nameIdentifiers":[]}],"titles":[{"title":"ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological 
sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Phylogenomics"},{"subject":"ASTRAL"},{"subject":"Simphy"}],"contributors":[],"dates":[{"date":"2023-06-03T00:27:31Z","dateType":"Submitted"},{"date":"2023-06-08T00:00:00Z","dateType":"Issued"},{"date":"2023-06-08T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/bioinformatics/btv234","relatedIdentifierType":"DOI"},{"relationType":"IsDerivedFrom","relatedIdentifier":"https://github.com/smirarab/ASTRAL","relatedIdentifierType":"URL"},{"relationType":"IsDerivedFrom","relatedIdentifier":"https://github.com/smirarab/astral2sims","relatedIdentifierType":"URL"}],"relatedItems":[],"sizes":["16834952730 bytes"],"formats":[],"version":"4","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Motivation: The estimation of species phylogenies requires multiple loci,\n since different loci can have different trees due to incomplete lineage\n sorting, modeled by the multi-species coalescent model. We recently\n developed a coalescent-based method, ASTRAL, which is statistically\n consistent under the multi-species coalescent model and which is more\n accurate than other coalescent-based methods on the datasets we examined.\n ASTRAL runs in polynomial time, by constraining the search space using a\n set of allowed ‘bipartitions’. Despite the limitation to allowed\n bipartitions, ASTRAL is statistically consistent. Results: We present a\n new version of ASTRAL, which we call ASTRAL-II. 
We show that ASTRAL-II has\n substantial advantages over ASTRAL: it is faster, can analyze much larger\n datasets (up to 1000 species and 1000 genes) and has substantially better\n accuracy under some conditions. ASTRAL’s running time is O(n^2k|X|^2), and\n ASTRAL-II’s running time is O(nk|X|^2), where n is the number of species,\n k is the number of loci and X is the set of allowed bipartitions for the\n search space.","descriptionType":"Abstract"},{"description":"We used SimPhy (https://github.com/adamallo/SimPhy) to simulate species trees and gene trees and used Indelible (Fletcher and Yang, 2009) to simulate nucleotide sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses. We simulated 11 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We used SimPhy to simulate species trees according to the Yule process, characterized by the number of taxa, maximum tree length, and the speciation rate (this combination defines a model condition). Dataset 1: In six model conditions, we fixed the number of taxa to 200 and varied tree length (500 K, 2 M and 10 M generations) and speciation rates (1e-6 and 1e-7 per generation). The tree length impacts the amount of ILS, with lower length resulting in shorter branches, and therefore higher levels of ILS. Speciation rate impacts whether speciation events tend to happen close to the tips (1e-6) or close to the base (1e-7). Different tree shapes (i.e. combinations of tree length and speciation rate) produce different levels of ILS starting from relatively low [roughly 10% distance between true gene trees and the species tree, measured by the Robinson–Foulds (RF) distance] and going up to very high (roughly 70% RF). Dataset II: we fixed the tree shape to 2M/1e-6 and set the number of taxa to 10, 50, 100, 200, 500, and 1000.  
The model condition with 200 taxa and the 2 M/1e-6 tree shape appears in both datasets. For each model condition, we simulated 50 species trees, forming 50 replicates. On each species tree, 1000 gene trees were simulated according to the multi-species coalescent model with the population size fixed to 200 000. We simulated indel-free gene alignments using Indelible under the GTR + Γ model. First, for each replicate, two parameters, μ and σ, were drawn uniformly from (5.7, 7.3) and (0, 0.3), respectively. Then, the sequence length for each gene in that replicate was drawn from a log-normal distribution with μ and σ parameters (the average sequence length is uniformly distributed between 300 bp and 1500 bp). GTR + Γ parameters were drawn from Dirichlet distributions that had parameters estimated using ML from a collection of real biological datasets (details given in the paper). We used FastTree to estimate the 550 000 gene trees ranging from 10 to 1000 species.","descriptionType":"Methods"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"0733029","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"1461364","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science 
Foundation","awardNumber":"1062335","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D10C7C","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":371,"downloadCount":80,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-06-08T15:36:07Z","registered":"2023-06-08T15:36:08Z","published":null,"updated":"2026-01-28T14:54:44Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d14599","type":"dois","attributes":{"doi":"10.6076/d14599","identifiers":[],"creators":[{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]},{"name":"Sayyari, Erfan","nameType":"Personal","givenName":"Erfan","familyName":"Sayyari","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Whitfield, James B.","nameType":"Personal","givenName":"James B.","familyName":"Whitfield","affiliation":["University of Illinois Urbana-Champaign"],"nameIdentifiers":[]}],"titles":[{"title":"Fragmentary gene sequences negatively impact gene tree and species tree reconstruction"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"species tree estimation"},{"subject":"gene trees"},{"subject":"Phylogenomics"},{"subject":"fragmentary 
data"},{"subject":"insect phylogeny"},{"subject":"ASTRAL"}],"contributors":[],"dates":[{"date":"2023-06-02T22:27:30Z","dateType":"Submitted"},{"date":"2023-06-07T00:00:00Z","dateType":"Issued"},{"date":"2023-06-07T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/molbev/msx261","relatedIdentifierType":"DOI"},{"relationType":"IsSupplementedBy","relatedIdentifier":"https://github.com/esayyari/Fragments","relatedIdentifierType":"URL"}],"relatedItems":[],"sizes":["3290461524 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Species tree reconstruction from genome-wide data is increasingly being\n attempted, in most cases using a two-step approach of first estimating\n individual gene trees and then summarizing them to obtain a species tree.\n The accuracy of this approach, which promises to account for gene tree\n discordance, depends on the quality of the inferred gene trees. At the\n same time, phylogenomic and phylotranscriptomic analyses typically use\n involved bioinformatics pipelines for data preparation. Errors and\n shortcomings resulting from these preprocessing steps may impact the\n species tree analyses at the other end of the pipeline. In this article,\n we first show that the presence of fragmentary data for some species in a\n gene alignment, as often seen on real data, can result in substantial\n deterioration of gene trees, and as a result, the species tree. 
We then\n investigate a simple filtering strategy where individual fragmentary\n sequences are removed from individual genes but the rest of the gene is\n retained. Both in simulations and by reanalyzing a large insect\n phylotranscriptomic data set, we show the effectiveness of this simple\n filtering strategy.","descriptionType":"Abstract"},{"description":"Analysis Pipeline for Insect Data\n Set We used the amino\n acid sequence data provided by Misof et al. (2014) as\n “supplementary 7.” Filtering Strategy Before identifying fragmentary sequences, we first remove extremely gappy sites, defined as those with \u0026gt;90% gaps. We then remove species that have \u0026lt;20% (1/5), 25% (1/4), 33% (1/3), 50% (1/2), 66% (2/3), 75% (3/4), or 80% (4/5) amino-acids (i.e., characters other than gaps). We use a tool called seqtools, implemented as part of the PASTA. Next, we reestimate gene trees. In order to track the occupancy and bootstrap support, we use in-house scripts, available online https://github.com/esayyari/discoVista. Gene Trees and Species Trees Gene trees are estimated using FastTree2 (Price et al. 2010) using its default amino acid substitution model JTT (Jones et al. 1992) or RAxML (Stamatakis 2014) with the automatic amino acid model selection. We use RAxML (Stamatakis 2014), version 8.2.9 with ten runs of inference using different starting trees. We used RAxML’s automatic model selection approach. When several species have identical sequences for a gene, we keep only one of them (i.e., remove redundant ones) in our RAxML runs and add the removed species back to the final inferred gene tree as a polytomy. For performing gene tree bootstrapping using FastTree, we first generate bootstrap sequences using RAxML and then run FastTree on those to estimate the bootstrapped gene trees. We then draw those bootstrap gene trees on ML gene tree branches using the Newick utility (Junier and Zdobnov 2010). 
For RAxML gene trees, we use the rapid bootstrapping option on reduced sequences (after removing identical sequences). After gene tree estimations, we add back the identical species and draw these bootstrap gene trees on the best ML gene trees (RAxML) following the same procedure using the Newick utility. We use ASTRAL-II to estimate the species trees summarizing gene trees with at least four taxa left after filtering. Simulation Procedure We use one model condition of a previously simulated data set from Mirarab and Warnow (2015) ASTRAL-II paper with 100 ingroup taxa and one outgroup. For each of the 50 replicates in this data set, Simphy (Mallo et al. 2016) was used to simulate a species tree according to the Yule model, and then 1,000 gene trees were simulated using the MSC model which captures ILS. The data set has moderate levels of ILS; the average distance between true gene trees and true species trees is 0.33. We subsampled genes to create three different data sets with 50, 200, or 1,000 genes. DNA sequences of varying lengths were simulated down the gene trees using Indelible (Fletcher and Yang 2009) with GTR parameters and stationary distributions estimated from published biological data sets, as detailed by Mirarab and Warnow (2015). Note that simulated sequences did not include any indels and thus were already aligned. Mirarab and Warnow (2015) suggested removing two replicates that include almost no phylogenetic signal, and we use the same strategy, leaving us with 48 replicates. This creates our unfiltered base data set. We add fragmentation to our complete simulated data set using a procedure that seeks to emulate patterns of fragmentation in the insect biological data set. 1) For each replicate, we order species in the biological data set and the simulated data set with respect to the tip-to-root distances. 2) We randomly select 100 of the biological species and map them to the simulated species with the same position in the order. 
The main outgroup (\u003cem\u003eIxodes scapularis\u003c/em\u003e) in the biological and simulated data sets always map to each other. 3) For each replicate in the simulated data set, we randomly sample (with replacement) 1,000 genes in the insect data sets that have at least 101 species, including the main outgroup. 4) For each species in each simulated gene, we compute the portion of gap sites in the corresponding gene alignment for the corresponding species in the biological data and remove the same portion of sites in the simulated data set at random positions. When a species is missing from a gene in the biological data set, we use the same species from another randomly chosen gene. Filtering Fragments and Gene Trees and Species Trees We first remove sites with \u0026gt;90% gaps, removing between 0.0% and 2.0% of the total number of characters in all sequences. We then remove from each gene any species that has less than a certain fraction (e.g., 10–80%) of the full gene. For example, at 10%, we remove only sequences that have 90% or more gaps. For each threshold, we then estimate gene trees using both RAxML (Stamatakis 2014) version 8.2.9 with two starting trees and FastTree (Price et al. 2010) version 2.1.9 Double precision using the GTR + Γ model of sequence evolution (Tavaré 1986). We infer the species tree using ASTRAL-II (Mirarab and Warnow 2015) version 4.11.1. 
We build species trees using all 1,000 genes or using randomly chosen subsets of 200 or 50 genes.","descriptionType":"Methods"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IIS-1565862","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D14599","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":147,"downloadCount":13,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-06-07T16:09:58Z","registered":"2023-06-07T16:09:59Z","published":null,"updated":"2026-01-28T14:54:33Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d17w2t","type":"dois","attributes":{"doi":"10.6076/d17w2t","identifiers":[],"creators":[{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]},{"name":"Rabiee, Maryam","nameType":"Personal","givenName":"Maryam","familyName":"Rabiee","affiliation":["University of California San Diego"],"nameIdentifiers":[]},{"name":"Sayyari, Erfan","nameType":"Personal","givenName":"Erfan","familyName":"Sayyari","affiliation":["University of California San Diego"],"nameIdentifiers":[]}],"titles":[{"title":"Multi-allele species reconstruction using ASTRAL"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological 
sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"Phylogenomics"},{"subject":"species tree inference"},{"subject":"ASTRAL"}],"contributors":[],"dates":[{"date":"2023-07-15T00:05:38Z","dateType":"Created"},{"date":"2023-06-02T22:29:35Z","dateType":"Submitted"},{"date":"2023-06-08T00:00:00Z","dateType":"Issued"},{"date":"2023-06-08T00:00:00Z","dateType":"Available"},{"date":"2023-07-17T00:00:00Z","dateType":"Updated"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1016/j.ympev.2018.10.033","relatedIdentifierType":"DOI"},{"relationType":"IsSupplementedBy","relatedIdentifier":"https://gitlab.com/mrabiee/ASTRAL-multiind/","relatedIdentifierType":"URL"},{"relationType":"IsCitedBy","relatedIdentifier":"10.1101/439489","relatedIdentifierType":"DOI"}],"relatedItems":[],"sizes":["24002922594 bytes"],"formats":[],"version":"5","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Genome-wide phylogeny reconstruction is becoming increasingly common, and\n one driving factor behind these phylogenomic studies is the promise that\n the potential discordance between gene trees and the species tree can be\n modeled. Incomplete lineage sorting is one cause of discordance that\n bridges population genetic and phylogenetic processes. ASTRAL is a species\n tree reconstruction method that seeks to find the tree with minimum\n quartet distance to an input set of inferred gene trees. However, the\n published ASTRAL algorithm only works with one sample per species. 
To\n account for polymorphisms in present-day species, one can sample multiple\n individuals per species to create multi-allele datasets. Here, we\n introduce how ASTRAL can handle multi-allele datasets. We show that the\n quartet-based optimization problem extends naturally, and we introduce\n heuristic methods for building the search space specifically for the case\n of multi-individual datasets. We study the accuracy and scalability of the\n multi-individual version of ASTRAL-III using extensive simulation studies\n and compare it to NJst, the only other scalable method that can handle\n these datasets. We do not find strong evidence that using multiple\n individuals dramatically improves accuracy. When we study the trade-off\n between sampling more genes versus more individuals, we find that sampling\n more genes is more effective than sampling more individuals, even under\n conditions that we study where trees are shallow (median length: ≈ 1Ne)\n and ILS is extremely high.","descriptionType":"Abstract"},{"description":"We simulate two new datasets\n (see Table below): a heterogeneous dataset\n (\u003cem\u003eD1\u003c/em\u003e) where many parameters are simultaneously changed\n and ILS levels are extremely high, and a more homogeneous\n dataset (\u003cem\u003eD2\u003c/em\u003e) where parameters are less varied and the\n amount of ILS is controlled to create three model conditions.\n We use SimPhy to generate species trees and gene\n trees according to the MSC model. All replicates have 5 individuals per\n species. \u003cem\u003eD1.\u003c/em\u003e This dataset includes 330\n replicates. The number of genes was uniformly sampled between 50 and 1000\n per replicate. The number of species was uniformly sampled between 20 and\n 200. 
The species tree birth rate parameter is randomly sampled from a log\n uniform distribution in [10\u003csup\u003e-7\u003c/sup\u003e,10\u003csup\u003e-6\u003c/sup\u003e], and the death rate is sampled from a log uniform distribution, bounded from below by 10\u003csup\u003e-7\u003c/sup\u003e and bounded by the birth rate parameter from above. The population size is sampled from a uniform distribution in [10\u003csup\u003e5\u003c/sup\u003e,10\u003csup\u003e6\u003c/sup\u003e]. We sampled a maximum species tree height for each replicate from a log-normal distribution with an expected value of 500000 generations (ranging between 0.19M and 1M in 90% of replicates).  \u003cem\u003eD2.\u003c/em\u003e This dataset has three model conditions, varying number of generations: 0.5M, 1M, or 2M. Each has 50 replicates with 200 species (and one outgroup) and 1000 genes with species birth rate set to 10\u003csup\u003e-6\u003c/sup\u003e under a birth-only model. The population size is 200,000. We also create two versions of it where only one or two individuals per species are randomly sub-sampled.","descriptionType":"Methods"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IIS- 
1565862","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D17W2T","contentUrl":null,"metadataVersion":7,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":164,"downloadCount":26,"referenceCount":0,"citationCount":2,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-06-08T15:35:54Z","registered":"2023-06-08T15:35:54Z","published":null,"updated":"2026-01-28T14:54:27Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}},{"id":"10.6076/d1cp4r","type":"dois","attributes":{"doi":"10.6076/d1cp4r","identifiers":[],"creators":[{"name":"Mirarab, Siavash","nameType":"Personal","givenName":"Siavash","familyName":"Mirarab","affiliation":["University of California San Diego"],"nameIdentifiers":[{"schemeUri":"https://orcid.org","nameIdentifier":"https://orcid.org/0000-0001-5410-1518","nameIdentifierScheme":"ORCID"}]},{"name":"Rabiee, Maryam","nameType":"Personal","givenName":"Maryam","familyName":"Rabiee","affiliation":["University of California San Diego"],"nameIdentifiers":[]}],"titles":[{"title":"QuCo: quartet-based co-estimation of species trees and gene trees"}],"publisher":"Dryad","container":{},"publicationYear":2023,"subjects":[{"subject":"FOS: Biological sciences","schemeUri":"https://web-archive.oecd.org/2012-06-15/138575-38235147.pdf","subjectScheme":"fos"},{"subject":"FOS: Biological sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"species tree inference"},{"subject":"Co-Estimation Methods"},{"subject":"Bayesian 
phylogenetics"},{"subject":"Phylogenomics"}],"contributors":[],"dates":[{"date":"2023-11-20T18:58:10Z","dateType":"Created"},{"date":"2023-11-20T19:06:46Z","dateType":"Submitted"},{"date":"2023-11-29T00:00:00Z","dateType":"Issued"},{"date":"2023-11-29T00:00:00Z","dateType":"Available"}],"language":"en","types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceType":"dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[{"relationType":"IsCitedBy","relatedIdentifier":"10.1093/bioinformatics/btac265","relatedIdentifierType":"DOI"},{"relationType":"IsDerivedFrom","relatedIdentifier":"https://github.com/maryamrabiee/quco","relatedIdentifierType":"URL"}],"relatedItems":[],"sizes":["6048100889 bytes"],"formats":[],"version":"3","rightsList":[{"rights":"Creative Commons Zero v1.0 Universal","rightsUri":"https://creativecommons.org/publicdomain/zero/1.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc0-1.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Motivation: Phylogenomics faces a dilemma: on the one hand, the most\n accurate species and gene tree estimation methods are those that\n co-estimate them; on the other hand, these co-estimation methods do not\n scale to moderately large numbers of species. The summary-based methods,\n which first infer gene trees independently and then combine them, are much\n more scalable but are prone to gene tree estimation error, which is\n inevitable when inferring trees from limited-length data. Gene tree\n estimation error is not just random noise and can create biases such as\n long-branch attraction. Results: We introduce a scalable likelihood-based\n approach to co-estimation under the multi-species coalescent model. 
The\n method, called quartet co-estimation (QuCo), takes as input independently\n inferred distributions over gene trees and computes the most likely\n species tree topology and internal branch length for each quartet,\n marginalizing over gene tree topologies and ignoring branch lengths by\n making several simplifying assumptions. It then updates the gene tree\n posterior probabilities based on the species tree. The focus on gene tree\n topologies and the heuristic division to quartets enables fast likelihood\n calculations. We benchmark our method with extensive simulations for\n quartet trees in zones known to produce biased species trees and further\n with larger trees. We also run QuCo on a biological dataset of bees. Our\n results show better accuracy than the summary-based approach ASTRAL run on\n estimated gene trees.  Availability and implementation: QuCo is\n available on https://github.com/maryamrabiee/quco. Supplementary\n information: Supplementary data are available at Bioinformatics online.","descriptionType":"Abstract"},{"description":"Data are simulated by us and provided here for\n reproducibility. ","descriptionType":"Methods"},{"description":"# QuCo Dataset --- Data belonging\n to the following paper: Rabiee, Maryam, and Siavash Mirarab. “QuCo:\n Quartet-Based Co-Estimation of Species Trees and Gene Trees.”\n Bioinformatics 38, no. Supplement1 (June 24, 2022): i413–21.\n . ## Description of\n the data and file structure There are several files: ###\n `Quartet-simulation-sequences.tar.gz` The main simulations presented in\n the paper, which involve Felsenstein’s zone quartets. Here, we provide the\n simulated sequences. 
Files are of the form:\n `rep.[CU]d/R[long]l-[short]s/[rep]/seq[seqlength]/sequences.tar.gz` where\n * `[seqlength]` is the sequence length and is one of 1600, 800, 400, or 200 *\n `[rep]` is the replicate number, which is between 1 and 20 * `[short]` is\n the length of the short branch and is one of 0.01, 0.02, 0.04, or 0.08.\n * `[long]` is the length of the long branch and is one of 0.1, 0.2,\n 0.3, or 0.4. * `[CU]` is the length of the internal branch in\n coalescent units (CU) and is one of 0.1, 0.2, or 0.3. Each file\n includes the simulated sequences in the fasta format for all genes of each\n replicate. ### `Anomaly-simulations-mrbayes-outputs.tar.gz` This includes\n the results of the anomaly zone simulations, and specifically the output\n of MrBayes. For each of the 50 replicated simulations, we include: *\n `[id]/mrbayes-outputs.tar.gz` Inside each archive, we have MrBayes MCMC\n samples from 200 loci. The files are named as follows, where `locus id` is\n the name of the locus and we have results of four chains (runs 1--4). Each\n .t file includes the MCMC samples in nexus format, as generated by\n MrBayes. * `seq600/[locus id]/[locus id].nex.run[1/2/3/4].t` ###\n `BiologicalDataset_rj_tree_distribution.tar.gz` Biological dataset of\n Bossert et al. (2021) with 32 species and 1291 UCEs\n (). Here, for\n reproducibility, we provide MrBayes tree distributions. For each of the\n 1291 loci, you can find the following files: * `uce-[ucid].run[1/2].t`:\n The MCMC sample in nexus format from MrBayes, for chain (run) 1 or 2. 
##\n Sharing/Access information The rest of the data is available on *\n [GitLab](https://gitlab.com/mrabiee/quo-data) ## Code/Software The\n simulated data are generated using SimPhy","descriptionType":"TechnicalInfo"}],"geoLocations":[],"fundingReferences":[{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"IIS-1845967","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"},{"schemeUri":"https://ror.org","funderName":"National Science Foundation","awardNumber":"ACI-1053575","funderIdentifier":"https://ror.org/021nxhr62","funderIdentifierType":"ROR"}],"url":"https://datadryad.org/dataset/doi:10.6076/D1CP4R","contentUrl":null,"metadataVersion":6,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":49,"downloadCount":9,"referenceCount":0,"citationCount":1,"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2023-10-04T22:21:45Z","registered":"2023-10-04T22:21:45Z","published":null,"updated":"2026-01-28T14:54:11Z"},"relationships":{"client":{"data":{"id":"dryad.dryad","type":"clients"}}}}],"meta":{"total":90,"totalPages":4,"page":1},"links":{"self":"https://api.datacite.org/dois?prefix=10.6076","next":"https://api.datacite.org/dois?page%5Bnumber%5D=2\u0026page%5Bsize%5D=25\u0026prefix=10.6076"}}