10.5061/DRYAD.RF7G5
Breinholt, Jesse W.
Florida Museum of Natural History
Earl, Chandra
Florida Museum of Natural History
Lemmon, Alan R.
Department of Scientific Computing,
Lemmon, Emily Moriarty
Florida State University
Xiao, Lei
Florida Museum of Natural History
Kawahara, Akito Y.
Florida Museum of Natural History
Data from: Resolving relationships among the megadiverse butterflies and
moths with a novel pipeline for Anchored Phylogenomics
Dryad
dataset
2017
hybrid enrichment
Bombycoidea
exon
2017-05-02T15:51:40Z
2017-05-02T15:51:40Z
en
https://doi.org/10.1093/sysbio/syx048
49057380538 bytes
2
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
The advent of next-generation sequencing technology has allowed for the
collection of large portions of the genome for phylogenetic analysis.
Hybrid enrichment and transcriptomics are two techniques that leverage
next-generation sequencing and have shown much promise. However, methods
for processing hybrid enrichment data are still limited. We developed a
pipeline for anchored hybrid enrichment (AHE) read assembly, orthology
determination, contamination screening, and data processing for sequences
flanking the target “probe” region. We apply this approach to study the
phylogeny of butterflies and moths (Lepidoptera), a megadiverse group of
more than 157,000 described species with poorly understood deep-level
phylogenetic relationships. We introduce a new, 855 locus anchored hybrid
enrichment kit for Lepidoptera phylogenetics and compare resulting trees
to those from transcriptomes. The enrichment kit was designed from
existing genomes, transcriptomes and expressed sequence tag (EST) data and
was used to capture sequence data from 54 species from 23 lepidopteran
families. Phylogenies estimated from AHE data were largely congruent with
trees generated from transcriptomes, with strong support for relationships
at all but the deepest taxonomic levels. We combine AHE and transcriptomic
data to generate a new Lepidoptera phylogeny, representing 76 exemplar
species in 42 families. The tree provides robust support for many
relationships, including those among the seven butterfly families. The
addition of AHE data to an existing transcriptomic dataset lowers node
support along the Lepidoptera backbone, but firmly places taxa with AHE
data on the phylogeny. To examine the efficacy of AHE at different
taxonomic levels, phylogenetic analyses were also conducted on a sister
group representing a more recent divergence, the Saturniidae and
Sphingidae. These analyses utilized sequences from the probe region and
data flanking it, which nearly doubled the size of the dataset; all resulting
trees were well supported. We hope that our data processing pipeline,
hybrid enrichment gene set, and approach of combining AHE data with
transcriptomes will be useful for the broader systematics community.
README
README file containing a list of the files and scripts in this Dryad package.

Breinholt_et_al_Supplementary_Figure_S1
Supplementary Figure S1 from Breinholt et al. (2017).

Breinholt_et_al_Supplementary_Figure_S2
Supplementary Figure S2 from Breinholt et al. (2017).

Breinholt_et_al_Supplementary_Figure_S3
Supplementary Figure S3 from Breinholt et al. (2017).

Breinholt_et_al_Supplementary_Figure_S4
Supplementary Figure S4 from Breinholt et al. (2017).

Breinholt_et_al_Supplementary_Figure_S5
Supplementary Figure S5 from Breinholt et al. (2017).

Breinholt_et_al_Supplementary_File_1__S1-S11
Supplementary File 1: Microsoft Excel document containing Supplementary Tables S1-S11 from Breinholt et al. (2017).
File name: Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx

Breinholt_et_al_Supplementary_File_2_Lep1
Specification file for the Lep1 probe set, used to order probes from Agilent Technologies (http://www.agilent.com/).

Breinholt_et_al_Supplementary_File_3
Word document that expands on the discussion in Breinholt et al. (2017) and discusses lepidopteran relationships in more detail.

Lep1_ref
Compressed file containing the data for each reference for each locus in the Lep1 kit, which were also used in the IBA assembly.

JAVA_SourceCode
Compressed directory holding Java source code by A.R.L (alemmon@evotutor.org). The directory contains a readme and instructions for compiling and using the Java code for IdentifySpacedKmers7, QuickScan5, and ShallowMapper4. It also contains the Lep1_ProbeDesign directory used with the Java programs to design the Lep1 probe set.
Contents: IdentifySpacedKmers7, IdentifySpacedKmers7_readme.txt, Lep1_ProbeDesign, LepRefFiles.txt, QuickScan5_readme.txt, QuickScan5.java, ShallowMapper4_readme.txt, ShallowMapper4.java
ShallowMapper4: Java program by A.R.L used to identify intron boundaries in genes for five reference taxa by mapping raw genomic reads to the corresponding transcriptomic sequences.
QuickScan5: Java program by A.R.L used to scan the additional 23 transcriptomes and ESTs by generating reference kmers from the 5-species alignments and using those kmers to map contig sequences from the transcriptomes to the candidate locus set.

Breinholt_et_al_LOG_COMMANDS
Set of commands used to run the bioinformatic pipeline to generate data for Breinholt et al. (2017).

Scripts_README
Description of the Python scripts and directions on how to run them.

IBA
Python script to assemble AHE data locus by locus.

IBA_trans
Python script to assemble AHE data locus by locus using a FASTQ file from transcriptome data.

extract_probe_region
Python script to split an alignment into head, probe, and tail regions based on the beginning and end of a reference sequence in the alignment.

s_hit_checker
Python script to process BLAST output to find sequences that fit the single-hit criteria.

ortholog_filter
Python script to process BLAST output to determine whether the location of the best hit on the genome is the same as the location of the probe target from that genome.

split
Python script to split a single-line FASTA file with many loci into locus-specific FASTA files.

alignment_DE_trim
Python script to trim alignments by density and entropy.

flank_dropper
Python script to remove poorly aligned sequences in the flanking head and tail regions.

counting_monster
Python script to count the loci per taxon and write the counts to a tab-separated matrix.

removelist
Python script to remove a list of sequences from a FASTA file.

getlist
Python script to get a list of sequences from a FASTA file.

contamination_filter
Python script to process the results of blasting the sequences from each locus against themselves with usearch, in order to identify contamination.

remove_duplicates
Python script to identify and remove sequences for each taxon that had more than one sequence per locus.

taxa_list
List of sample IDs used in the nexus files and the corresponding species names, in tab-delimited text.

Breinholt_et_al_RAW_DATA.tar.gz
Compressed file containing the raw Illumina (2X100) AHE data.
File name: Breinholtetal_RAW_DATA.tar.gz

final_soap_FG120036B
Assembly of Apatelodes pithala from GenBank SRA accession #SRR1794032, using multiple kmers (13, 23, 33, 43, 63) with SOAPdenovo-Trans v1.01. The different kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.

final_soap_calo2
Assembly of Caloptilia triadicae from GenBank SRA accession #SRR1794032, using multiple kmers (13, 23, 33, 43, 63) with SOAPdenovo-Trans v1.01. The different kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.

final_soap_GV120010B
Assembly of Urbanus proteus from GenBank SRA accession #SRR1794082, using multiple kmers (13, 23, 33, 43, 63) with SOAPdenovo-Trans v1.01. The different kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See Breinholt et al. (2017) for more details.

Breinholt_et_al_acrossLep_full_assemblies_all_loci
FASTA-formatted sequence file containing the sequences that pass pipeline steps 1-6 for all loci and taxa in datasets 1-3. This file can be split with split.py into FASTA files of individual loci.

Breinholt_et_al_shallow_full_assemblies_all_loci
FASTA-formatted sequence file containing the sequences that pass pipeline steps 1-6 for all loci and taxa in datasets 4-6. This file can be split with split.py into FASTA files of individual loci.

Breinholt_et_al_allcodonpostion123_acrossLep
Nexus file containing codon positions 1, 2, and 3 for 557 loci and 75 taxa, used to make datasets 1-3. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene, each starting with codon position 1. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_degen12_DS1
Dataset 1 (acrossLEP_AHE). Nexus file containing codon positions 1 and 2 for 557 loci and 23 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene, each starting with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_aminoacid_DS1
Nexus file containing codon positions 1 and 2 for 557 loci and 23 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene, each starting with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_degen12_DS2
Dataset 2 (acrossLEP_AHE+PARTtrans). Nexus file containing codon positions 1 and 2 for 557 loci and 75 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene, each starting with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_aminoacid_DS2
Nexus file containing amino acid data for 557 loci and 75 taxa. See taxa_list.txt for the species name of each taxon. This is an amino acid nexus file with a CHARSET that defines each locus. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_degen12_DS3
Dataset 3 (acrossLEP_AHE+ALLtrans). Nexus file combining the AHE data with the transcriptomic data of Kawahara and Breinholt 2015. The file contains codon positions 1 and 2 for 2948 loci and 76 taxa. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene, each starting with codon position 1. Synonymous signal was removed using the degen v1.4 Perl script (http://www.phylotools.com), and the third codon position has been removed. Locus names correspond to the locus numbers in the Lep1 enrichment kit included in this Dryad package. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_DS4
Dataset 4 (shallow_probe+flanks). Nexus file containing 749 loci and 48 taxa. Alignments were trimmed with a density of 60% and an entropy of 1.5 using alignment_DE_trim.py, and flanking regions were processed with flank_dropper.py to remove head or tail sequences, using 2 standard deviations for both the head and the tail. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_DS5
Dataset 5 (shallow_probe). Nexus file containing 749 loci and 48 taxa. The Extract_probe_region.py script was used on Dataset 4 to isolate the data from the probe region. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.

Breinholt_et_al_DS6
Dataset 6 (shallow_flanks). Nexus file containing 749 loci and 35 taxa. The Extract_probe_region.py script was used on Dataset 4 to isolate the data from the flanking regions. See taxa_list.txt for the species name of each taxon. This is a nucleotide nexus file with a CHARSET that defines each gene. For further information see Breinholt et al. (2017) and Breinholt_et_al_Supplementary_File_1_S1-S11.xlsx in this Dryad package.
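
The file descriptions note that the combined full-assembly FASTA files can be separated into per-locus files with split.py. The sketch below illustrates that kind of operation; it is not the authors' script. It assumes, hypothetically, that each header has the form ">LOCUS__TAXON" with a double underscore as separator. The real header layout in the Dryad files may differ, so the `sep` argument would need to be adjusted after inspecting them. The function names `split_by_locus` and `write_locus_files` are illustrative only.

```python
from collections import defaultdict
from pathlib import Path


def split_by_locus(fasta_text, sep="__"):
    """Group single-line FASTA records by locus ID.

    Assumption (hypothetical): headers look like '>LOCUS__TAXON', so the
    locus ID is everything between '>' and the first occurrence of `sep`.
    """
    groups = defaultdict(list)
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    # Single-line FASTA: every header line is followed by exactly one sequence line.
    if len(lines) % 2:
        raise ValueError("single-line FASTA should have an even number of lines")
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a FASTA header, got: {header!r}")
        locus = header[1:].split(sep, 1)[0]
        groups[locus].append(f"{header}\n{seq}\n")
    return {locus: "".join(records) for locus, records in groups.items()}


def write_locus_files(groups, out_dir):
    """Write one '<locus>.fasta' file per locus into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for locus, records in groups.items():
        (out / f"{locus}.fasta").write_text(records)


if __name__ == "__main__":
    # Made-up records for illustration only; not taken from the dataset.
    example = ">L1__taxonA\nACGT\n>L2__taxonA\nGGCC\n>L1__taxonB\nACGA\n"
    for locus, records in split_by_locus(example).items():
        print(locus, records.count(">"))
```

Grouping in memory before writing keeps the parsing step testable on plain strings; for files as large as the ones in this package, streaming records to per-locus file handles would likely be a better fit.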