Genome Statistics

From GersteinInfo

  • From ENCODE paper

These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161).

To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside of GENCODE annotations.

399 protein-coding loci (those loci found entirely in ENCODE regions)

Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies.

To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly from ChIP-chip ...

From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes. Of these: known GENCODE 5' ends, 1,730.

Fig 10: 5% of bases constrained. Of these: 40% unannotated, 32% coding, 8% UTRs, 20% experimental annotation.

Fig 11: Fraction of experimental annotation overlapping constrained sequence: ~90% CDS, ~10% Un.TxFrags, 55% RFBRs


  • Zhang et al. paper

Analysed 105 datasets; ~15K total "hits", averaging ~150/set. How many total base pairs are covered?
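The per-dataset average follows from the two totals quoted. A quick sketch of that arithmetic, treating "~15K" as exactly 15,000 (an assumption, since the notes give only the rounded figure):

```python
# Rough check of the per-dataset average quoted in the notes.
total_hits = 15_000   # "~15K" taken as exactly 15,000 (assumption)
n_datasets = 105

hits_per_set = total_hits / n_datasets
print(f"{hits_per_set:.0f} hits per dataset on average")
```

This comes out at roughly 143 per set, which agrees with "~150/set" at the precision of the notes.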


  • DART paper

By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus).


  • SNPs

1/300 between people

1/100 between human and chimp

Sequencing error rate 1/10000 (ref?)

  • GRGENEREV

(The GENCODE annotation currently contains on average 5.4 transcripts per locus.) This was underscored by the subsequent sequencing of the human genome, where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001) ... Moreover, comparison of the human, dog, mouse, and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).


Is this true? Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) is located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions), suggesting that the latter are at least as functionally important as the former.
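The arithmetic in this paragraph can be compared with the Fig 10 breakdown recorded earlier in these notes. The sketch below checks only the numbers, not the biological conclusion, and it assumes the Fig 10 line means 8% UTRs and 20% experimental annotation:

```python
# Fig 10 breakdown of constrained bases (percent), as recorded in these notes.
# Assumption: "8%+20% UTRs + expt. annotation" means 8% UTRs, 20% experimental.
fig10 = {
    "unannotated": 40,
    "coding": 32,
    "UTR": 8,
    "experimental annotation": 20,
}
total = sum(fig10.values())

# The paragraph's version: 100% - 40% "coding" - 20% regulatory.
paragraph_unannotated = 100 - 40 - 20

# The paragraph's 40% "coding" matches Fig 10 only if UTRs are counted in.
coding_plus_utr = fig10["coding"] + fig10["UTR"]

print(total, paragraph_unannotated, coding_plus_utr)
```

Under that reading the two sources agree on 40% unannotated, but the 40% "protein-coding" share includes the 8% in UTRs; coding sequence alone is 32%.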


Some human genome statistics

compiled by ZDZ


Nuclear genome

  • Size: ~ 3.2 Gb
  • Chromosomes: 1–22, X, Y, all linear
  • Associated protein: several classes of histone and nonhistone protein
  • Euchromatin: ~ 2.9–3.0 Gb
  • Constitutive heterochromatin: > 0.2 Gb
  • Highly conserved:
    • Coding DNA: ~ 50 Mb (~1.5%)
    • Other (regulatory etc.): ~ 100 Mb (3%)
  • Repetitive DNA: > 50%
  • Segmental duplication: > 150 Mb (> 5%)
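The percentages in this list depend on which denominator is used. A minimal sketch that recomputes them against both the whole genome and the euchromatic portion, using the lower bounds given above (the choice of 2.9 Gb for euchromatin is an assumption taken from the low end of the quoted range):

```python
# Sizes in Mb, taken from the list above; lower bounds are used where the
# notes give ">" values or a range.
genome_mb = 3200        # ~3.2 Gb nuclear genome
euchromatin_mb = 2900   # low end of ~2.9-3.0 Gb

features_mb = {
    "coding DNA": 50,
    "other conserved": 100,
    "segmental duplication": 150,
}

percent = {}
for name, size_mb in features_mb.items():
    percent[name] = (
        100 * size_mb / genome_mb,       # share of the whole genome
        100 * size_mb / euchromatin_mb,  # share of euchromatin only
    )
    whole, euchr = percent[name]
    print(f"{name}: {whole:.1f}% of genome, {euchr:.1f}% of euchromatin")
```

With these inputs, 150 Mb is about 4.7% of 3.2 Gb and about 5.2% of 2.9 Gb, so the "> 5%" figure fits a euchromatic denominator, or a duplication total somewhat above 150 Mb.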


Gene number

  • Mitochondrial genome: 37
  • Nuclear genome: 30,000
    • ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type
    • ~ 60 per chromosome band in a 550-band chromosome preparation


Gene density

  • one per 1 kb in the mitochondrial genome
  • one per 100 kb in the nuclear genome


Gene size

  • ~ 27 kb, but enormous variation


Intergenic distance

  • ~75 kb in nuclear genome
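The gene density, gene size, and intergenic distance figures can be cross-checked against each other. A small sketch using only the approximate values listed in these notes:

```python
# All inputs are the approximate figures from the notes above.
genome_kb = 3.2e6      # ~3.2 Gb nuclear genome, in kb
n_genes = 30_000       # nuclear gene number
gene_kb = 27           # average gene size
intergenic_kb = 75     # average intergenic distance

# Two independent estimates of the average spacing between gene starts.
spacing_from_counts = genome_kb / n_genes       # genome size / gene number
spacing_from_sizes = gene_kb + intergenic_kb    # gene + following gap

print(f"from counts: {spacing_from_counts:.0f} kb per gene")
print(f"from sizes:  {spacing_from_sizes} kb per gene")
```

Both estimates land near 100 kb (about 107 kb and 102 kb), consistent with the quoted density of one gene per 100 kb.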


Exon number

  • ~ 9, generally correlated with gene length
  • Wide variation from small genes with a single exon to large genes with numerous exons
  • The dystrophin gene (DMD) has 79 exons


Exon size

  • ~ 122 bp with comparatively little length variation
  • Coding sequence exons are a bit shorter on average
  • Exons containing 3' UTR sequences are considerably longer
  • Some exceptionally long exons have been reported:
    • exon 26 of the apoB gene (APOB), 7.6 kb
    • exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb
    • exon 11 of the BRCA1 breast cancer gene, 3.4 kb


Intron size

  • Enormous variation
  • Strong direct correlation with gene size
  • Examples of typical intron sizes are as follows:
    • β-globin gene (HBB; 1.6 kb): 0.5 kb
    • myoglobin gene (MB; 10.4 kb): 4.7 kb
    • dystrophin gene (DMD; 2.5 Mb): 30.0 kb


mRNA size

  • ~ 2.5 kb, but considerable variation
  • 5' UTR: ~ 0.2–0.3 kb
  • CDS: 1.5–1.8 kb (500–600 codons)
  • 3' UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3' UTRs)
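The component sizes above can be summed as a sanity check on the ~2.5 kb total. A minimal sketch in base pairs, using the ranges as listed:

```python
# (low, high) sizes in bp, from the list above.
utr5_bp = (200, 300)
cds_bp = (1500, 1800)
utr3_bp = (800, 800)     # single value given; noted as a likely underestimate

mrna_low = utr5_bp[0] + cds_bp[0] + utr3_bp[0]
mrna_high = utr5_bp[1] + cds_bp[1] + utr3_bp[1]

# 500-600 codons at 3 bp per codon should reproduce the CDS range.
cds_from_codons = (500 * 3, 600 * 3)

print(f"mRNA: {mrna_low}-{mrna_high} bp; CDS from codons: {cds_from_codons}")
```

The parts sum to 2.5-2.9 kb, so the quoted ~2.5 kb average sits at the low end of what the components imply, and the codon counts match the CDS range exactly.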


  • At (A. thaliana) statistics (src is DZ)
    • >21K genes
    • ~26K pgenes from pipeline (3K from TAIR)
    • 5 chr
    • 260 Mb
  • rice
    • 12 chr, 390 Mb


General genome statistics

Genome sizes

  • E. coli 4.6 million bp
  • Yeast 12 million bp
  • Worm 100 million bp
  • Fruit Fly 133 million bp
  • Human 3.3 billion bp
  • Mouse 3.4 billion bp
  • Red Viscacha Rat 8.2 billion bp
  • Mountain Grasshopper 16.5 billion bp

Number of genes per genome

  • Yeast 6,530 (known) 167 (novel) 21 (pseudogenes)
  • Worm 20,049 (known) 20 (novel) 1,150 (pseudogenes)
  • Fruit Fly 4,751 (known) 9,288 (novel) 52 (pseudogenes)
  • Human 21,667 (known) 1,013 (novel) 1,040 (pseudogenes)
  • Mouse 22,723 (known) 1,395 (novel) 1,350 (pseudogenes)

(src Ensembl)
