{
"id": "https://doi.org/10.6084/m9.figshare.9900401.v1",
"doi": "10.6084/M9.FIGSHARE.9900401.V1",
"url": "https://figshare.com/articles/dataset/Genome-wide_copy_numbers_for_enomes_of_nine_Macaca_species_based_on_1kb_windows/9900401/1",
"types": {
"ris": "DATA",
"bibtex": "misc",
"citeproc": "dataset",
"schemaOrg": "Dataset",
"resourceType": "Dataset",
"resourceTypeGeneral": "Dataset"
},
"creators": [
{
"name": "Li, Jing",
"nameType": "Personal",
"givenName": "Jing",
"familyName": "Li",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Zhenxin Fan",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Feichen Shen",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Pendleton, Amanda",
"nameType": "Personal",
"givenName": "Amanda",
"familyName": "Pendleton",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Song, Yang",
"nameType": "Personal",
"givenName": "Yang",
"familyName": "Song",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Jinchuan Xing",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Bisong Yue",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Kidd, Jeffrey M",
"nameType": "Personal",
"givenName": "Jeffrey M",
"familyName": "Kidd",
"affiliation": [],
"nameIdentifiers": []
},
{
"name": "Li, Jing",
"nameType": "Personal",
"givenName": "Jing",
"familyName": "Li",
"affiliation": [],
"nameIdentifiers": []
}
],
"titles": [
{
"title": "Genome-wide copy numbers for genomes of nine Macaca species based on 1kb windows"
}
],
"publisher": {
"name": "figshare"
},
"container": {},
"subjects": [
{
"subject": "60102 Bioinformatics",
"subjectScheme": "FOR"
},
{
"subject": "FOS: Computer and information sciences",
"schemeUri": "http://www.oecd.org/science/inno/38235147.pdf",
"subjectScheme": "Fields of Science and Technology (FOS)"
}
],
"contributors": [],
"dates": [
{
"date": "2019-09-25",
"dateType": "Created"
},
{
"date": "2020-09-12",
"dateType": "Updated"
},
{
"date": "2019",
"dateType": "Issued"
}
],
"publicationYear": 2019,
"identifiers": [],
"sizes": [
"105216175 Bytes"
],
"formats": [],
"rightsList": [
{
"rights": "Creative Commons Attribution 4.0 International",
"rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode",
"schemeUri": "https://spdx.org/licenses/",
"rightsIdentifier": "cc-by-4.0",
"rightsIdentifierScheme": "SPDX"
}
],
"descriptions": [
{
"description": "This dataset contains genome-wide copy numbers for every 1-kb window in the genomes of nine Macaca species: the Chinese rhesus (M. mulatta lasiota, CR), cynomolgus (M. fascicularis, CE), Tibetan (M. thibetana, TM), stump-tailed (M. arctoides, SM), southern pig-tailed (M. nemestrina, PM), Japanese (M. fuscata, JM), Taiwanese (M. cyclopis, TwM), Barbary (M. sylvanus, BM), and lion-tailed (M. silenus, LM) macaques.
Methods: We used FastQC (v0.11.8) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) to run quality-control checks on the raw resequencing data, then used Trimmomatic (v0.36) (Bolger et al. 2014) to filter and trim the reads. The cleaned reads were aligned to the Mmul_8 reference genome (Zimin et al. 2014) with BWA mem (Li and Durbin 2009). We estimated copy numbers (CNs) in two steps with the fastCN pipeline, which uses read-depth (RD) information. fastCN was designed to efficiently estimate genome copy number from short-read data (https://github.com/KiddLab/fastCN) (Pendleton et al. 2018). The method is built on the mrsFAST aligner (Hach et al. 2014): it divides reads into 36-bp subreads, determines all possible matching locations on the reference genome with fewer than two substitutions, and reports per-bp read depth in an efficient compressed binary format. First, we performed GC correction using custom-defined control regions. The aim of GC correction is to remove the GC bias introduced by PCR during library preparation and sequencing. Because suitable control regions were lacking across macaques, we created a two-step process to retrieve copy-number-invariable (control) regions in the diploid genome. Second, we estimated genome-wide copy numbers from the RD of the aligned BAM files in one-kilobase (kb) windows. RD values were converted to CNs using a correction factor (CF) calculated from the average RD of the control regions: CF = RDctl / 2 and CN = RD / CF, where RD is the read depth of a specific genomic window and RDctl is the mean read depth of the control regions. Unplaced contigs were merged as ‘chrUn’ during data processing to reduce CPU time.
References: Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114-2120. Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, et al. (2014) mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Research 42: W494-W500. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25: 1754-1760. Pendleton AL, Shen F, Taravella AM, Emery S, Veeramah KR, et al. (2018) Comparison of village dog and wolf genomes highlights the role of the neural crest in dog domestication. BMC Biology 16: 64. Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, et al. (2014) A new rhesus macaque assembly and annotation for next-generation sequencing analyses. Biology Direct 9: 1-15.",
"descriptionType": "Abstract"
}
],
"geoLocations": [],
"fundingReferences": [],
"relatedIdentifiers": [
{
"relationType": "IsPreviousVersionOf",
"relatedIdentifier": "10.6084/m9.figshare.9900401",
"relatedIdentifierType": "DOI"
}
],
"schemaVersion": "http://datacite.org/schema/kernel-4",
"providerId": "otjm",
"clientId": "figshare.ars",
"agency": "datacite",
"state": "findable"
}