10.5061/DRYAD.P4S57
Hurst, Laurence D.
University of Bath
Ghanbarian, Avazeh T.
University of Bath
Forrest, Alistair R. R.
FANTOM Consortium
Huminiecki, Lukasz
Stockholm University
Science for Life Laboratory
Karolinska Institute
Data from: The constrained maximal expression level owing to haploidy
shapes gene content on the mammalian X chromosome
Dryad
dataset
2016
expression pattern evolution
maximal expression
Homo sapiens
Sex chromosomes
chromosome X
expression breadth
2016-11-04T00:00:00Z
2016-11-04T00:00:00Z
en
https://doi.org/10.1371/journal.pbio.1002315
297940752 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
X chromosomes are unusual in many regards, not least of which is their
nonrandom gene content. The causes of this bias are commonly discussed in
the context of sexual antagonism and the avoidance of activity in the male
germline. Here, we examine the notion that, at least in some taxa,
functionally biased gene content may more profoundly be shaped by limits
imposed on gene expression owing to haploid expression of the X
chromosome. Notably, if the X, as in primates, is transcribed at rates
comparable to the ancestral rate (per promoter) prior to the X chromosome
formation, then the X is not a tolerable environment for genes with very
high maximal net levels of expression, owing to transcriptional traffic
jams. We test this hypothesis using The Encyclopedia of DNA Elements
(ENCODE) and data from the Functional Annotation of the Mammalian Genome
(FANTOM5) project. As predicted, the maximal expression of human X-linked
genes is much lower than that of genes on autosomes: on average, maximal
expression is three times lower on the X chromosome than on autosomes.
Similarly, autosome-to-X retroposition events are associated with lower
maximal expression of retrogenes on the X than seen for X-to-autosome
retrogenes on autosomes. Also as expected, X-linked genes have a lesser
degree of increase in gene expression than autosomal ones (compared to the
human/chimpanzee common ancestor) if highly expressed, but not if lowly
expressed. The traffic jam model also explains the known lower breadth of
expression for genes on the X (and the Z of birds), as genes with broad
expression are, on average, those with high maximal expression. As then
further predicted, highly expressed tissue-specific genes are also rare on
the X and broadly expressed genes on the X tend to be lowly expressed,
both indicating that the trend is shaped by the maximal expression level
not the breadth of expression per se. Importantly, a limit to the maximal
expression level explains biased tissue of expression profiles of X-linked
genes. Tissues whose tissue-specific genes are very highly expressed
(e.g., secretory tissues, tissues abundant in structural proteins) are
also tissues in which gene expression is relatively rare on the X
chromosome. These trends cannot be fully accounted for in terms of
alternative models of biased expression. In conclusion, the notion that it
is hard for genes on the Therian X to be highly expressed, owing to
transcriptional traffic jams, provides a simple yet robustly supported
rationale of many peculiar features of X’s gene content, gene expression,
and evolution.
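The two quantities the abstract relies on, maximal expression and breadth of expression, can be illustrated with a minimal R sketch. It assumes a TPM matrix with one row per RefSeq transcript and one column per sample, and it assumes that "on" means a TPM at or above the ten-TPM cutoff described for the human FANTOM5 files and that breadth is the fraction of samples in which a transcript is "on". The transcript names, sample names, and values are invented for illustration and are not taken from the archived data.

```r
# Toy TPM matrix: rows are transcripts, columns are samples.
# All identifiers and values are hypothetical.
tpm <- matrix(
  c(250, 12,  0, 95,
      4,  9, 11,  2,
     30, 30, 30, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("NM_A", "NM_B", "NM_C"),
                  c("liver", "brain", "testis", "muscle"))
)
chrom <- c(NM_A = "autosome", NM_B = "chrX", NM_C = "chrX")

# Assumed "on" threshold in TPM (human data); the chicken file uses 60.
on_cutoff <- 10

# Maximal expression: highest TPM a transcript reaches in any sample.
max_expr <- apply(tpm, 1, max)

# Breadth of expression (assumed definition): fraction of samples
# in which the transcript is at or above the cutoff.
breadth <- rowMeans(tpm >= on_cutoff)

# Mean maximal expression per chromosome class.
mean_max <- tapply(max_expr, chrom[names(max_expr)], mean)
```

Comparing `mean_max` between the X and the autosomes mirrors the kind of contrast reported in the abstract, although the published analysis may differ in its exact definitions and statistics.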
chicken.all_samples.galGal3.tpm.refgene.osc
Data for the analysis of the chicken chromosome Z. FANTOM5 chicken
libraries consisted of 25 CAGE libraries, including chicken aortic smooth
muscles, hepatocytes, mesenchymal stem cells, leg buds, wing buds, embryo
extra-embryonic tissue (day 7 and day 15), and a whole body developmental
time course (from 5 hours 30 minutes to 20 days). The number of available
datapoints to which TPM was normalized was limited by the number of
annotated chicken RefSeq transcripts (approximately six times fewer than
in human; N = 4,426 on autosomes and N = 241 on chromosome Z).
Consequently, the cutoff for a gene to be classified as “on” was adjusted
six times higher, to 60 TPM.

human.primary_cell.hCAGE.hg19.tpm.refgene.osc
The FANTOM5 dataset for human primary cells.

human.cell_line.hCAGE.hg19.tpm.refgene.osc
The FANTOM5 dataset for human cancer cell-lines.

human.tissue.hCAGE.hg19.tpm.refgene.osc
The FANTOM5 dataset for human tissue. CAGE tags were mapped to RefSeq
transcripts +/-500 base pairs (bps) from their TSSes and normalized to
tags per million (TPM), as previously described [37,45]. A signal of ten
TPM was chosen as the cutoff for a gene to be classified as “on” (this
cutoff was accepted as the standard for human data throughout the
consortium). FANTOM5 is the most comprehensive expression dataset ever
generated, including 952 human and 396 mouse tissues, primary cells and
cancer cell-lines. FANTOM5 is based on cap analysis of gene expression
(CAGE), a unique technology that characterizes TSSes across the entire
genome in an unbiased fashion and at single-base resolution [21]. CAGE
automatically sums expression
levels of all transcripts beginning at a given transcription start
site.

raw_Z_Exp_Anc_L
Data for Fig 2 "The comparison of change in gene expression (Z) since the
human-chimpanzee common ancestor for five somatic tissues."

SUPPLEMENTARY TABLES
Data in Table S3 underlie Figure 4. Data in Table S7 partially underlie
Fig 1. Data in Table S4 underlie Fig 3. Data in Tables S10-12 underlie
Fig S1.

data for Fig1
R environment
containing data underlying Fig1. The environment contains the following
variables, sorted identically to the gene list in refSeqs: chromosome
(chromosomal location), chromosome_short (location on autosomes, chrX, or
chrY?), data_matrix (F5 data matrix in TPM for human tissues), MAX
(maximal expression for each RefSeq), max (maximal expression for each
tissue), strata_classification (strata classification for genes on
chromosome X), refSeqs_2entrezIDs (entrez ids mapped to refseqs), boe (the
breadth of expression).
env_fig1

GC-contents data for Fig S6 and S7
This R
environment contains GC-contents data for either proximal promoters or
the isochore around the TSS (marked as big). The data are calculated for
either masked or unmasked genome sequence.
env_gc_contents

data for Fig S3
Numbers of ENCODE transcription factor binding sites mapped to TSSes of
RefSeq genes in symmetrical windows of different sizes (from 250 to 20000
bps) and depending on ENCODE quality cut-off (strict or all).
FigS3_data.txt

data underlying Fig S8
Breadth of expression and maximal expression are compared
in three groups of observations: (1) autosomal paralogs of X-linked genes,
(2) other autosomal paralogs matched by age, (3) X-linked paralogs. Newly
formed paralogs are defined as those mapped by phylogenetic timing to taxa
Theria or younger. Pre-existing duplications are defined as those
descending from duplication nodes mapped by phylogenetic timing to taxa
Amniota or older.
FigS8_data.txt

data underlying Fig7
Fig7_data.txt

TreeFam data for timing of gene duplications in R environments
These files are R environments. Use load() to load them into your R
session. Use ls() to view contents. You may use attach() to make the
members available by name, or access data members of the environment
using the "$" reference operator. There is no warranty for this software.
env_duplicator_base

Additional TreeFam gene duplication data with duplication timing
env_duplicator_vectors
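
The load(), ls() and "$" workflow recommended for the R environment files can be sketched as follows. The internal layout of the archived files is not documented here, so the example first builds and saves a toy environment as a stand-in. The object name env_demo, its members, and the temporary file path are all hypothetical, and the real files may expose their variables differently.

```r
# Toy stand-in for an archived environment; names and values are invented.
env_demo <- new.env()
env_demo$refSeqs <- c("NM_A", "NM_B", "NM_C")
env_demo$MAX <- c(250, 11, 30)

# Save it to a temporary file, then remove it from the session.
path <- tempfile(fileext = ".RData")
save(env_demo, file = path)
rm(env_demo)

# load() restores the saved objects and returns their names.
# Loading into a separate holder keeps the workspace uncluttered.
holder <- new.env()
loaded_names <- load(path, envir = holder)

# ls() lists the members; "$" reads one member without attach().
restored <- holder$env_demo
members <- ls(restored)
top_max <- max(restored$MAX)
```

Reading members with "$" rather than attach() avoids masking objects that already exist in the session, which is one reason to prefer it when several environments are loaded at once.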