10.5061/DRYAD.RN745
Meyer, Wynn K.
University of Chicago
Venkat, Aarti
University of Chicago
Kermany, Amir R.
University of Chicago
van de Geijn, Bryce
University of Chicago
Zhang, Sidi
University of Chicago
Przeworski, Molly
University of Chicago
Data from: Evolutionary history inferred from the de novo assembly of a
non-model organism, the blue-eyed black lemur
Dryad
dataset
2015
Eulemur macaco
Eulemur flavifrons
2015-07-22T14:45:19Z
2015-07-22T14:45:19Z
en
https://doi.org/10.1111/mec.13327
3382131528 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Lemurs, the living primates most distantly related to humans, demonstrate
incredible diversity in behaviour, life history patterns and adaptive
traits. Although many lemur species are endangered within their native
Madagascar, there is no high-quality genome assembly from this taxon,
limiting population and conservation genetic studies. One critically
endangered lemur is the blue-eyed black lemur Eulemur flavifrons. This
species is fixed for blue irises, a convergent trait that evolved at least
four times in primates and was subject to positive selection in humans,
where 5′ regulatory variation of OCA2 explains most of the brown/blue eye
colour differences. We built a de novo genome assembly for E. flavifrons,
providing the most complete lemur genome to date, and a high confidence
consensus sequence for close sister species E. macaco, the (brown-eyed)
black lemur. From diversity and divergence patterns across the genomes, we
estimated a recent split time of the two species (160 Kya) and temporal
fluctuations in effective population sizes that accord with known
environmental changes. By looking for regions of unusually low diversity,
we identified potential signals of directional selection in E. flavifrons
at MITF, a melanocyte development gene that regulates OCA2 and has
previously been associated with variation in human iris colour, as well as
at several other genes involved in melanin biosynthesis in mammals. Our
study thus illustrates how whole-genome sequencing of a few individuals
can illuminate the demographic and selection history of nonmodel species.
EfEmSangerSequencingForDryadSanger sequencing results for up to 16
individuals (9 blue-eyed black lemurs and 7 black lemurs from the Duke
Lemur Center) sequenced at 17 amplicons throughout scaffold2503 (the
scaffold containing the orthologs of HERC2/OCAs) and aligned using
EBioXVariable sites VCF filesThese files contain sites with a high
posterior probability (> 0.8) of being polymorphic in the 8-sample
dataset (4 samples from each species), along with their genotype
likelihoods in these samples as estimated by ANGSD
(http://popgen.dk/wiki/index.php/ANGSD). Genotype likelihoods were
estimated using the GATK model (-GL 2). The probability of a site being
variable was estimated using ngsStat
(https://github.com/mfumagalli/ngsTools), based on posterior probabilities
calculated using the genotype likelihoods and the site frequency spectrum
estimated from high quality sites (minimum mapping quality 1, minimum
quality 20, minimum 3 samples with data, minimum depth 9) within each
species.VariableSitesVCFFiles1.tar.gzVariable sites VCF filesThese files
contain sites with a high posterior probability (> 0.8) of being
polymorphic in the 8-sample dataset (4 samples from each species), along
with their genotype likelihoods in these samples as estimated by ANGSD
(http://popgen.dk/wiki/index.php/ANGSD). Genotype likelihoods were
estimated using the GATK model (-GL 2). The probability of a site being
variable was estimated using ngsStat
(https://github.com/mfumagalli/ngsTools), based on posterior probabilities
calculated using the genotype likelihoods and the site frequency spectrum
estimated from high quality sites (minimum mapping quality 1, minimum
quality 20, minimum 3 samples with data, minimum depth 9) within each
species.VariableSitesVCFFiles2.tar.gzEulemur flavifrons fastq file for
PSMCThis fastq file was generated from the all-sites vcf file for Harlow
(the E. flavifrons individual used for the de novo assembly), and was used
to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was
generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on
the filtered bam file (mapping quality 20 or above, PE reads aligned to
the same scaffold with correctly oriented read pairs mapping within three
standard deviations of the mean, duplicates removed), with the
EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was
generated from this all-sites vcf using the vcf2fq function within the
vcfutils.pl script, using a minimum depth of 26 and a maximum depth of 104
(0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ)
of
20.Lemur.allLibs.mergedtoInsert_Sorted_Quake_PutativeSE_K33.gapClosed.scafSeqUG1AllSites.dpfilterL26H104Q20.fq.gzEulemur macaco fastq files for PSMCThese two files should be combined into one fastq representing the whole genome. The combined fastq file was generated from the all-sites vcf file for Harmonia (the E. macaco individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 10 or above, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using a modification of the vcf2fq function within the vcfutils.pl script (vcf2fqnonref, in https://github.com/sorrywm/genome_analysis/vcfutils_mod.pl), using a minimum depth of 10 and a maximum depth of 41 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20.MergedBamsFromBWADefaultOneRound_PEandSE_sortedtoInsert_Sorted_Quake_PutativeSE_K33.gapClosed.scafSeqUG1AllSites.dpfilterL10H41Q20.1.fastq.gzEulemur macaco fastq files for PSMCThese two files should be combined into one fastq representing the whole genome. The combined fastq file was generated from the all-sites vcf file for Harmonia (the E. macaco individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 10 or above, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using a modification of the vcf2fq function within the vcfutils.pl script (vcf2fqnonref, in https://github.com/sorrywm/genome_analysis/vcfutils_mod.pl), using a minimum depth of 10 and a maximum depth of 41 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20.MergedBamsFromBWADefaultOneRound_PEandSE_sortedtoInsert_Sorted_Quake_PutativeSE_K33.gapClosed.scafSeqUG1AllSites.dpfilterL10H41Q20.2.fastq.gzDataset S1: Annotated transcripts in the 1% FST tailThis list contains all transcripts mapped to regions in the 1% tail of FST from the full dataset in either species. Columns represent Ensembl transcript ID, region of high FST where the transcript was annotated, start position of the transcript within the region, end position of the transcript within the region, proportion of the transcript mapped to the region, mean percent identity for parts of the transcript that mapped, Ensembl gene ID, and gene symbol. We annotated genes in the 1% tail of all 20 kb non-overlapping windows by aligning human protein sequences to the blue-eyed black lemur genome. We obtained protein sequences for human genome build hg18 and used TBLASTN version 2.2.22+ (Altschul et al. 1990, 1997), with an e-value threshold of 5 x 10-5 to identify orthologs within the regions of the blue-eyed black lemur reference genome corresponding to the 1% FST tail. We then took the list of all human proteins with hits within candidate regions and performed TBLASTN for these proteins against the entire lemur genome. We retained proteins whose best genome-wide match (containing the lowest e-value or maximum mean percent identity) for any subset of the protein sequence overlapped the candidate region. In cases in which multiple proteins mapped to the same location (>50% protein length overlapping, presumably representing multiple transcripts of the same gene or multiple genes in the same family), we retained the protein with the largest total length spanned by initial TBLASTN hits or the largest mean percent identity.MeyerVenkatEtAlDatasetS1New.txt