Genome Statistics

From GersteinInfo
  • From ENCODE paper

These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161).

To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs. 14, 15), many (63%) TxFrags reside outside of GENCODE annotations.

399 protein-coding loci (those loci found entirely in ENCODE regions)

Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies.

To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly from ChIP-chip ...

From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes. Of these, known GENCODE 5' ends: 1,730.

Fig. 10: 5% of bases constrained. Of these: 40% unannotated, 32% coding, 8% UTRs, 20% expt. annotation.

Fig. 11: Fraction of experimental annotation overlapping constrained sequence: ~90% CDS, ~10% Un.TxFrags, 55% RFBRs.


  • Zhang et al. paper

Analysed 105 datasets: ~15K total "hits", averaging ~150 per set. Open question: how many total base pairs are covered?
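As a quick sanity check on the figures in this note, the average can be recomputed from the two totals. This is only a sketch: both inputs are the rounded values quoted above, and the variable names are illustrative.

```python
# Rounded figures quoted in the note above ("105 datasets", "~15K hits").
n_datasets = 105
total_hits = 15_000

# Average number of hits per data set.
hits_per_set = total_hits / n_datasets
print(f"average hits per data set: {hits_per_set:.0f}")
```

The result lands a little under 150, which is consistent with the "~150/set" shorthand given that the total is itself rounded.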


  • DART paper

By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus).


  • SNPs

1/300 between people; 1/100 between human and chimp

Sequencing error rate: 1/10,000 (ref?)

  • GRGENEREV

(The GENCODE annotation currently contains on average 5.4 transcripts per locus.) This was underscored by the subsequent sequencing of the human genome, where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001) ... Moreover, comparison of the human, dog, mouse, and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).


Is this true? Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) is located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions), suggesting that the latter are at least as functionally important as the former.
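One way to approach the question is to redo the arithmetic against the Fig. 10 breakdown noted earlier. The sketch below assumes that the shorthand "8%+20% UTRs + expt. annotation" means 8% UTRs and 20% other experimentally annotated sequence; that reading, and the dictionary keys, are assumptions rather than something the notes state outright.

```python
# Shares of constrained bases (percent), as read from the Fig. 10 note.
# Assumption: 8% = UTRs, 20% = other experimentally annotated sequence.
fig10 = {"coding": 32, "utr": 8, "experimental": 20, "unannotated": 40}

# The four categories should account for all constrained bases.
assert sum(fig10.values()) == 100

# The paragraph's own arithmetic: 100% - 40% coding - 20% regulatory.
coding_in_paragraph = 40
regulatory = 20
unannotated = 100 - coding_in_paragraph - regulatory

print("unannotated share from the paragraph:", unannotated)
print("coding + UTR share from Fig. 10:", fig10["coding"] + fig10["utr"])
```

Under this reading the subtraction is internally consistent, but the paragraph's "40% coding" matches Fig. 10 only if UTRs are counted together with coding sequence; coding sequence alone is 32%.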


Some human genome statistics

compiled by ZDZ


Nuclear genome

  • Size: ~ 3.2 Gb
  • Chromosomes: 1–22, X, Y, all linear
  • Associated protein: several classes of histone and nonhistone protein
  • Euchromatin: ~ 2.9–3.0 Gb
  • Constitutive heterochromatin: > 0.2 Gb
  • Highly conserved:
    • Coding DNA: ~ 50 Mb (~1.5%)
    • Other (regulatory etc.): ~ 100 Mb (3%)
  • Repetitive DNA: > 50%
  • Segmental duplication: > 150 Mb (> 5%)
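The percentages in this list can be cross-checked against the ~3.2 Gb genome size. The following is a minimal sketch using only the megabase values quoted above; the dictionary labels are illustrative, and the values marked as lower bounds come from the ">" entries.

```python
GENOME_MB = 3200  # ~3.2 Gb nuclear genome, from the list above

# Component sizes in Mb, copied from the list above.
components_mb = {
    "coding DNA": 50,
    "other highly conserved": 100,
    "segmental duplication (lower bound)": 150,
    "constitutive heterochromatin (lower bound)": 200,
}

# Each component as a percentage of the nuclear genome.
percent = {name: 100 * mb / GENOME_MB for name, mb in components_mb.items()}

for name, pct in percent.items():
    print(f"{name}: {pct:.1f}%")
```

The coding and other-conserved shares come out close to the quoted ~1.5% and 3%. For segmental duplication, 150 Mb is slightly under 5% of 3.2 Gb, so the two lower bounds in that entry are not quite equivalent: "> 5%" would correspond to more than 160 Mb.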


Gene number

  • Mitochondrial genome: 37
  • Nuclear genome: 30,000
    • ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type
    • ~ 60 per chromosome band in a 550-band chromosome preparation


Gene density

  • one per 1 kb in the mitochondrial genome
  • one per 100 kb in the nuclear genome


Gene size

  • ~ 27 kb, but enormous variation


Intergenic distance

  • ~75 kb in nuclear genome
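The gene density, gene size, and intergenic distance figures can be checked against each other in two ways: dividing genome size by gene number, and adding average gene size to average intergenic distance. The sketch below uses only values quoted in the lists above; the constant names are illustrative.

```python
GENOME_BP = 3.2e9     # nuclear genome size (~3.2 Gb)
N_GENES = 30_000      # nuclear gene count quoted above
GENE_KB = 27          # average gene size
INTERGENIC_KB = 75    # average intergenic distance

# Spacing implied by genome size and gene number.
kb_per_gene_from_counts = GENOME_BP / N_GENES / 1_000

# Spacing implied by one gene plus one intergenic stretch.
kb_per_gene_from_sizes = GENE_KB + INTERGENIC_KB

print(f"{kb_per_gene_from_counts:.0f} kb per gene from genome size / gene number")
print(f"{kb_per_gene_from_sizes} kb per gene from gene size + intergenic distance")
```

Both estimates fall close to the quoted density of one gene per 100 kb, so the three nuclear figures are mutually consistent at this level of rounding.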


Exon number

  • ~ 9, generally correlated with gene length
  • Wide variation from small genes with a single exon to large genes with numerous exons
  • The dystrophin gene (DMD) has 79 exons


Exon size

  • ~ 122 bp with comparatively little length variation
  • Coding sequence exons are a bit shorter on average
  • Exons containing 3' UTR sequences are considerably longer
  • Some exceptionally long exons have been reported:
    • exon 26 of the apoB gene (APOB), 7.6 kb
    • exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb
    • exon 11 of the BRCA1 breast cancer gene, 3.4 kb


Intron size

  • Enormous variation
  • Strong direct correlation with gene size
  • Examples of typical intron sizes are as follows (gene size in parentheses):
    • β-globin gene (HBB; 1.6 kb): 0.5 kb
    • myoglobin gene (MB; 10.4 kb): 4.7 kb
    • dystrophin gene (DMD; 2.5 Mb): 30.0 kb


mRNA size

  • ~ 2.5 kb, but considerable variation
  • 5' UTR: ~ 0.2–0.3 kb
  • CDS: 1.5–1.8 kb (500–600 codons)
  • 3' UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3' UTRs)
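The component sizes above can be summed to see whether they agree with the ~2.5 kb total, and the codon range can be converted to kilobases to check the CDS entry. This is a sketch under the assumption that the single 3' UTR value serves as both its lower and upper bound; the variable names are illustrative.

```python
# (low, high) bounds in kb, from the list above.
# Assumption: the 3' UTR has a single estimate, used for both bounds.
utr5 = (0.2, 0.3)
cds = (1.5, 1.8)
utr3 = (0.8, 0.8)

total_low = utr5[0] + cds[0] + utr3[0]
total_high = utr5[1] + cds[1] + utr3[1]

# 500-600 codons at 3 nt per codon should reproduce the CDS range.
cds_from_codons = tuple(n * 3 / 1000 for n in (500, 600))

print(f"summed parts: {total_low:.1f}-{total_high:.1f} kb")
print(f"CDS from codon counts: {cds_from_codons[0]:.1f}-{cds_from_codons[1]:.1f} kb")
```

The codon counts reproduce the quoted CDS range, and the quoted ~2.5 kb mRNA size sits at the low end of the summed range, which fits the note that 3' UTR length is probably underestimated.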


  • At (Arabidopsis thaliana) statistics (src is DZ)
    • >21K genes
    • ~26K pgenes from pipeline (3K from TAIR)
    • 5 chr
    • 260 Mb
  • rice
    • 12 chr, 390 Mb


General genome statistics

Genome sizes

  • E.coli 4.6 million bp
  • Yeast 12 million bp
  • Worm 100 million bp
  • Fruit Fly 133 million bp
  • Human 3.3 billion bp
  • Mouse 3.4 billion bp
  • Red Viscacha Rat 8.2 billion bp
  • Mountain Grasshopper 16.5 billion bp

Number of genes per genome

  • Yeast 6,530 (known) 167 (novel) 21 (pseudogenes)
  • Worm 20,049 (known) 20 (novel) 1,150 (pseudogenes)
  • Fruit Fly 4,751 (known) 9,288 (novel) 52 (pseudogenes)
  • Human 21,667 (known) 1,013 (novel) 1,040 (pseudogenes)
  • Mouse 22,723 (known) 1,395 (novel) 1,350 (pseudogenes)

(src Ensembl)
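Combining the two lists above gives a rough gene density per megabase for each organism. The sketch below copies the numbers as listed and counts known plus novel genes, excluding pseudogenes; that choice, and the function name `genes_per_mb`, are assumptions made for illustration.

```python
# Genome sizes (Mb) and gene counts copied from the two lists above.
genome_mb = {"Yeast": 12, "Worm": 100, "Fruit Fly": 133, "Human": 3300, "Mouse": 3400}

# (known, novel, pseudogenes) per organism.
genes = {
    "Yeast": (6530, 167, 21),
    "Worm": (20049, 20, 1150),
    "Fruit Fly": (4751, 9288, 52),
    "Human": (21667, 1013, 1040),
    "Mouse": (22723, 1395, 1350),
}


def genes_per_mb(organism):
    """Known + novel genes per Mb of genome; pseudogenes are excluded."""
    known, novel, _pseudo = genes[organism]
    return (known + novel) / genome_mb[organism]


for organism in genome_mb:
    print(f"{organism}: {genes_per_mb(organism):.1f} genes per Mb")
```

The spread is large: yeast comes out at several hundred genes per Mb, while human and mouse are in the single digits, in line with the one-gene-per-100-kb figure quoted for the human nuclear genome.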