Genome Statistics
From GersteinInfo
Latest revision as of 13:58, 10 June 2010
- From ENCODE paper
 
These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161).
To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work [14,15], many (63%) TxFrags reside outside of GENCODE annotations.
399 protein-coding loci (those loci found entirely in ENCODE regions)
Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies.
To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly from ChIP-chip ...
From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes. Of these, known GENCODE 5' ends: 1,730.
Fig. 10: 5% of bases are constrained. Of these, 40% are unannotated, 32% are coding, and 8% + 20% fall in UTRs and other experimental annotation, respectively.
Fig. 11: Fraction of experimental annotation overlapping constrained sequence: ~90% of CDS, ~10% of Un.TxFrags, 55% of RFBRs.
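The ENCODE figures above can be cross-checked with a few lines of arithmetic. This is only a sketch over the numbers recorded in these notes; the variable names are made up for illustration.

```python
# Figures copied from the ENCODE notes above.
total_mb = 30
manual_regions, manual_mb = 14, 15   # regions with substantial prior knowledge
random_regions, random_mb = 30, 15   # stratified random-sampling picks

# The two groups should add up to the stated totals.
assert manual_regions + random_regions == 44
assert manual_mb + random_mb == total_mb

avg_manual_mb = manual_mb / manual_regions   # mean size of a manually chosen region
avg_random_mb = random_mb / random_regions   # mean size of a randomly chosen region

total_tss = 4491
known_gencode_5p_ends = 1730
known_fraction = known_gencode_5p_ends / total_tss

print(f"manual: {avg_manual_mb:.2f} Mb/region, random: {avg_random_mb:.2f} Mb/region")
print(f"TSSs at known GENCODE 5' ends: {known_fraction:.1%}")
```

So the randomly chosen regions average 0.5 Mb each, and a bit over a third of the reported TSSs coincide with known GENCODE 5' ends.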
- Zhang et al. paper
 
Analysed 105 datasets; ~15K total "hits", averaging ~150 per set. How many total base pairs are covered?
- DART paper
 
By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus).
- SNPs
 
1/300 between people; 1/100 between human and chimp
Sequencing error rate 1/10000 (ref?)
- GRGENEREV
 
(The GENCODE annotation currently contains on average 5.4 transcripts per locus.) This was underscored by the subsequent sequencing of the human genome, where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001).... Moreover, comparison of the human, dog, mouse, and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).
Is this true?
Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) are located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions), suggesting that the latter are at least as functionally important as the former.
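One way to test the statement is to redo the subtraction with the Fig. 10 shares noted earlier on this page. The sketch below assumes the "8% + 20%" in that note means 8% UTRs and 20% other experimental annotation; the dictionary keys are illustrative labels, not terms from the paper.

```python
# Shares of constrained bases (percent), as read from the Fig. 10 note.
constrained_share = {
    "coding": 32,
    "utr": 8,
    "experimental_annotation": 20,
    "unannotated": 40,
}

total = sum(constrained_share.values())

# "40% coding" in the statement only matches Fig. 10 if UTRs are counted
# together with coding exons.
coding_plus_utr = constrained_share["coding"] + constrained_share["utr"]

# The subtraction used in the statement: 100% - coding - regulatory.
unannotated_by_subtraction = (
    100 - coding_plus_utr - constrained_share["experimental_annotation"]
)

print(total, coding_plus_utr, unannotated_by_subtraction)
```

Under that reading the arithmetic is internally consistent (the shares sum to 100% and the subtraction returns the 40% unannotated figure), but the "40% coding" is really 32% coding plus 8% UTR.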
Some human genome statistics
compiled by ZDZ
Nuclear genome
- Size: ~ 3.2 Gb
 - Chromosomes: 1–22, X, Y, all linear
 - Associated protein: several classes of histone and nonhistone protein
 - Euchromatin: ~ 2.9–3.0 Gb
 - Constitutive heterochromatin: > 0.2 Gb
 -  Highly conserved:
- Coding DNA: ~ 50 Mb (~1.5%)
 - Other (regulatory etc.): ~ 100 Mb (3%)
 
 - Repetitive DNA: > 50%
 - Segmental duplication: > 150 Mb (> 5%)
 
Gene number
- Mitochondrial genome: 37
 -  Nuclear genome: 30,000
- ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type
 - ~ 60 per chromosome band in a 550-band chromosome preparation
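A quick check of the per-chromosome and per-band averages, as a sketch. The divisor of 24 distinct chromosomes (22 autosomes plus X and Y) is an assumption; the notes do not say how the averages were derived.

```python
nuclear_genes = 30_000
chromosome_types = 24   # assumption: 22 autosomes + X + Y
bands = 550             # 550-band chromosome preparation

genes_per_chromosome = nuclear_genes / chromosome_types
genes_per_band = nuclear_genes / bands

# Gene counts that the quoted averages (~1,400 and ~60) would imply.
implied_genes_from_chromosomes = 1_400 * chromosome_types
implied_genes_from_bands = 60 * bands

print(f"{genes_per_chromosome:.0f} genes/chromosome, {genes_per_band:.1f} genes/band")
print(implied_genes_from_chromosomes, implied_genes_from_bands)
```

With these inputs the naive averages come out somewhat below the rounded figures quoted; the quoted values would fit a gene count nearer 33,000.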
 
 
Gene density
- one per 1 kb in the mitochondrial genome
 - one per 100 kb in the nuclear genome
 
Gene size
- ~ 27 kb, but enormous variation
 
Intergenic distance
- ~75 kb in nuclear genome
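The density, gene size, and intergenic distance figures can be checked against each other. The sketch below uses only numbers from these notes, except for the mitochondrial genome length (about 16.6 kb), which is an outside assumption not stated on this page.

```python
nuclear_genome_kb = 3.2e6   # ~3.2 Gb expressed in kb
nuclear_genes = 30_000
mean_gene_kb = 27
mean_intergenic_kb = 75

# Two independent routes to the spacing between nuclear genes.
kb_per_gene_from_counts = nuclear_genome_kb / nuclear_genes
kb_per_gene_from_spacing = mean_gene_kb + mean_intergenic_kb

# Assumption: mitochondrial genome of about 16.6 kb (not given in these notes).
mito_genome_kb = 16.6
mito_genes = 37
mito_kb_per_gene = mito_genome_kb / mito_genes

print(f"nuclear: {kb_per_gene_from_counts:.1f} kb/gene (counts), "
      f"{kb_per_gene_from_spacing} kb/gene (gene + intergenic)")
print(f"mitochondrial: {mito_kb_per_gene:.2f} kb/gene")
```

Both nuclear routes land close to the quoted one gene per 100 kb. The mitochondrial value comes out at roughly one gene per 0.45 kb, so "one per 1 kb" should be read as an order-of-magnitude figure.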
 
Exon number
- ~ 9, generally correlated with gene length
 - Wide variation from small genes with a single exon to large genes with numerous exons
 - The dystrophin gene (DMD) has 79 exons
 
Exon size
- ~ 122 bp with comparatively little length variation
 - Coding sequence exons are a bit shorter on average
 - Exons containing 3' UTR sequences are considerably longer
 -  Some exceptionally long exons have been reported:
- exon 26 of the apoB gene (APOB), 7.6 kb
 - exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb
 - exon 11 of the BRCA1 breast cancer gene, 3.4 kb
 
 
Intron size
- Enormous variation
 - Strong direct correlation with gene size
 -  Examples of typical intron sizes are as follows:
- globin gene (HBB; 1.6 kb) 0.5 kb
 - myoglobin gene (MB; 10.4 kb) 4.7 kb
 - dystrophin gene (DMD; 2.5 Mb) 30.0 kb
 
 
mRNA size
- ~ 2.5 kb, but considerable variation
 - 5' UTR: ~ 0.2–0.3 kb
 - CDS: 1.5–1.8 kb (500–600 codons)
 - 3' UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3' UTRs)
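As a sanity check, the mRNA components listed above can be summed. This is a minimal sketch that ignores the poly(A) tail and uses 3 bases per codon to convert the codon counts into kb.

```python
utr5_kb = (0.2, 0.3)      # 5' UTR range
cds_codons = (500, 600)   # CDS range in codons
utr3_kb = 0.8             # 3' UTR

# Convert codons to kb: 3 bases per codon, 1000 bases per kb.
cds_kb = tuple(n * 3 / 1000 for n in cds_codons)

mrna_low_kb = utr5_kb[0] + cds_kb[0] + utr3_kb
mrna_high_kb = utr5_kb[1] + cds_kb[1] + utr3_kb

print(f"CDS: {cds_kb[0]}-{cds_kb[1]} kb")
print(f"mRNA: {mrna_low_kb:.1f}-{mrna_high_kb:.1f} kb")
```

The components sum to roughly 2.5 to 2.9 kb, in line with the ~2.5 kb average quoted.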
 
- Arabidopsis thaliana (At) statistics (src is DZ)
 
- >21K genes
 - ~26K pgenes from pipeline (3K from TAIR)
 - 5 chr
 - 260 Mb
 
- rice
 
- 12 chr, 390 Mb
 
General genome statistics
Genome sizes
- E. coli 4.6 million bp
 - Yeast 12 million bp
 - Worm 100 million bp
 - Fruit Fly 133 million bp
 - Human 3.3 billion bp
 - Mouse 3.4 billion bp
 
- Red Viscacha Rat 8.2 billion bp
 - Mountain Grasshopper 16.5 billion bp
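To put the sizes on a common scale, the sketch below expresses each genome relative to human, using only the values in the list above (note that this list gives human as 3.3 billion bp, slightly above the ~3.2 Gb quoted earlier on the page).

```python
# Genome sizes in base pairs, copied from the list above.
genome_bp = {
    "E. coli": 4.6e6,
    "Yeast": 12e6,
    "Worm": 100e6,
    "Fruit Fly": 133e6,
    "Human": 3.3e9,
    "Mouse": 3.4e9,
    "Red Viscacha Rat": 8.2e9,
    "Mountain Grasshopper": 16.5e9,
}

human_bp = genome_bp["Human"]
relative_to_human = {name: bp / human_bp for name, bp in genome_bp.items()}

# Print from smallest to largest genome.
for name, ratio in sorted(relative_to_human.items(), key=lambda item: item[1]):
    print(f"{name:22s} {ratio:8.4f} x human")
```

The list spans more than three orders of magnitude, and the largest entry is about five times the human genome.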
 
Number of genes per genome
- Yeast 6,530 (known) 167 (novel) 21 (pseudogenes)
 - Worm 20,049 (known) 20 (novel) 1,150 (pseudogenes)
 - Fruit Fly 4,751 (known) 9,288 (novel) 52 (pseudogenes)
 - Human 21,667 (known) 1,013 (novel) 1,040 (pseudogenes)
 - Mouse 22,723 (known) 1,395 (novel) 1,350 (pseudogenes)
 
(src Ensembl)
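Combining the gene counts with the genome sizes listed earlier gives a rough gene density per organism. This is a sketch: it counts known plus novel genes, leaves pseudogenes out, and takes the genome sizes (in Mb) from the genome size list on this page.

```python
# (known, novel, pseudogenes) per organism, copied from the list above.
gene_counts = {
    "Yeast": (6530, 167, 21),
    "Worm": (20049, 20, 1150),
    "Fruit Fly": (4751, 9288, 52),
    "Human": (21667, 1013, 1040),
    "Mouse": (22723, 1395, 1350),
}

# Genome sizes in Mb, from the genome size list on this page.
genome_mb = {"Yeast": 12, "Worm": 100, "Fruit Fly": 133, "Human": 3300, "Mouse": 3400}

# Total genes = known + novel (pseudogenes excluded).
totals = {name: known + novel for name, (known, novel, _) in gene_counts.items()}
genes_per_mb = {name: totals[name] / genome_mb[name] for name in totals}

for name in gene_counts:
    print(f"{name:10s} {totals[name]:6d} genes  {genes_per_mb[name]:7.1f} genes/Mb")
```

On these numbers yeast carries several hundred genes per Mb, while human and mouse carry about seven, even though their total gene counts are only a few times larger.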
