10.5061/DRYAD.MN338
Borner, Janus
University of Hamburg
Burmester, Thorsten
University of Hamburg
Data from: Parasite infection of public databases: a data mining approach
to identify apicomplexan contaminations in animal genome and transcriptome
assemblies
Dryad
dataset
2018
database analysis
Haemosporida
Parasites
Coccidia
Gregarinasina
Malaria
Piroplasmida
contamination
2018-01-13T00:00:00Z
2018-01-13T00:00:00Z
en
https://doi.org/10.1186/s12864-017-3504-1
73002656 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Background: Contaminations from various exogenous sources are a common
problem in next-generation sequencing. Endogenous parasites are another
possible source of contaminating DNA. On the one hand, undiscovered
contaminations of animal sequence assemblies may lead to erroneous
interpretation of data; on the other hand, when identified,
parasite-derived sequences may provide a valuable source of information.
Results: Here we show that sequences deriving from apicomplexan parasites
can be found in many animal genome and transcriptome projects and in
most cases derive from an infection of the sequenced host specimen. The
apicomplexan sequences were extracted from the sequence assemblies using a
newly developed bioinformatic pipeline (ContamFinder) and tentatively
assigned to distinct taxa employing phylogenetic methods. We analysed 920
assemblies and found 20,907 contigs of apicomplexan origin in 51 of the
datasets. The contaminating species were identified as members of the
apicomplexan taxa Gregarinasina, Coccidia, Piroplasmida, and Haemosporida.
For example, in the platypus genome assembly, we found a high number of
contigs derived from a piroplasmid parasite (presumably Theileria
ornithorhynchi). For most of the infecting parasite species, no molecular
data had been available previously, and some of the datasets contain
sequences representing a large part of the parasite’s gene repertoire.
Conclusion: Our study suggests that parasite-derived contaminations
represent a valuable source of information that can help to discover and
identify new parasites, and provide information on previously unknown
host-parasite interactions. We, therefore, argue that uncurated assembly
data should routinely be made available in addition to the final
assemblies.
extracted contigs
FASTA files containing the extracted, parasite-derived contigs. Contigs
from each assembly are stored in a separate file.
extracted_contigs.zip

predicted amino acids
FASTA files containing the predicted amino acid sequences based on the
extracted contigs. Sequences from each assembly are stored in a separate
file.
predicted_aa.zip

dataset 1 single genes
FASTA files containing the single gene amino acid alignments of dataset 1
prior to processing by Gblocks.
dataset_1_single_genes.zip

dataset 1 single genes after gblocks
FASTA files containing the single gene amino acid alignments of dataset 1
after processing by Gblocks.
dataset_1_single_genes_gblocks.zip

dataset 2 single genes
FASTA files containing the single gene amino acid alignments of dataset 2
prior to processing by Gblocks.
dataset_2_single_genes.zip

dataset 2 single genes after gblocks
FASTA files containing the single gene amino acid alignments of dataset 2
after processing by Gblocks.
dataset_2_single_genes_gblocks.zip

dataset 1 superalignment in FASTA format
Concatenated superalignment of all 1420 single gene amino acid alignments
of dataset 1 after processing by Gblocks.
dataset_1_superalignment.fa

dataset 2 superalignment in FASTA format
Concatenated superalignment of all 301 single gene amino acid alignments
of dataset 2 after processing by Gblocks.
dataset_2_superalignment.fa

mitochondrial sequences from gorilla Plasmodium in FASTA format
Nucleotide alignment of mitochondrial Plasmodium sequences, including two
sequences that were extracted from the gorilla genome. The alignment is
based on data from Liu et al. (2010) and only contains sequences from
Clades G1 and C1.
mito_gorilla.fa

18S rRNA Piroplasmida in FASTA format
Nucleotide alignment of 18S rRNA sequences from Piroplasmida, including a
sequence that was extracted from the platypus genome. The alignment is
based on data from Paparini et al. (2015) and was processed by Gblocks.
18s_piroplasmida.fa
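
For readers who want to work with the archives described above, the
following is a minimal Python sketch, not part of the dataset and not the
authors' ContamFinder pipeline. It reads every FASTA file in a zip archive
such as extracted_contigs.zip and illustrates how single gene alignments
could be concatenated into a superalignment. The function names
(parse_fasta, read_fasta_zip, concatenate_alignments) are hypothetical, and
the sketch assumes that FASTA headers are identical across single gene
files for the same taxon and that taxa missing from a gene are padded with
gaps. Neither assumption is confirmed by the record.

```python
import zipfile


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one first.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def read_fasta_zip(source):
    """Return {member name: {header: sequence}} for each file in a zip.

    `source` may be a path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = archive.read(info).decode("utf-8", errors="replace")
            result[info.filename] = parse_fasta(text)
    return result


def concatenate_alignments(alignments, gap="-"):
    """Concatenate aligned genes (list of {taxon: sequence}) per taxon.

    Assumption: a taxon absent from a gene is filled with gap characters.
    """
    taxa = sorted({name for aln in alignments for name in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) > 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop() if lengths else 0
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

A call such as read_fasta_zip("extracted_contigs.zip") would return one
dictionary of sequences per assembly file, assuming the archive has been
downloaded to the working directory.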