Genome Statistics
From GersteinInfo
- From ENCODE paper
These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161).
To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside of GENCODE annotations.
399 protein-coding loci (those loci found entirely in ENCODE regions)
Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies.
To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly from ChIP-chip ...
From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes. Of these, 1,730 are known GENCODE 5' ends.
Fig. 10: 5% of bases are constrained. Of these, 40% are unannotated, 32% coding, 8% UTRs and 20% other experimental annotation.
Fig. 11: Fraction of experimental annotation overlapping constrained sequence: ~90% of CDS, ~10% of Un.TxFrags, 55% of RFBRs.
- Zhang et al. paper
Analysed 105 datasets: ~15K total "hits", averaging ~150 per set. How many total base pairs are covered?
- DART paper
By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus).
- SNPs
1/300 between people; 1/100 between human and chimp
Sequencing error rate 1/10000 (ref?)
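To put these rates on a genome-wide scale, here is a rough sketch that multiplies them by the ~3.2 Gb nuclear genome size quoted further down this page. The variable names are illustrative only, and the sequencing error rate is still unreferenced in these notes, so treat the result as a back-of-envelope figure.

```python
# Back-of-envelope scaling of the per-base rates quoted in these notes.
GENOME_BP = 3.2e9             # nuclear genome size (~3.2 Gb) from the statistics below
DIFF_BETWEEN_PEOPLE = 1 / 300
DIFF_HUMAN_CHIMP = 1 / 100
SEQ_ERROR_RATE = 1 / 10_000   # marked "(ref?)" in the notes, so unverified

expected_diffs_people = GENOME_BP * DIFF_BETWEEN_PEOPLE
expected_diffs_chimp = GENOME_BP * DIFF_HUMAN_CHIMP
expected_errors = GENOME_BP * SEQ_ERROR_RATE

# How large the error rate is relative to the between-people difference rate.
error_to_snp_ratio = SEQ_ERROR_RATE / DIFF_BETWEEN_PEOPLE

print(f"differences between two people: {expected_diffs_people:,.0f}")
print(f"differences human vs chimp:     {expected_diffs_chimp:,.0f}")
print(f"expected sequencing errors:     {expected_errors:,.0f}")
print(f"error rate / SNP rate:          {error_to_snp_ratio:.2f}")
```

At these rates the raw error rate is about 3% of the between-people difference rate, which is why the missing reference for the error figure matters.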
- GRGENEREV
(The GENCODE annotation currently contains on average 5.4 transcripts per locus). This was underscored by the subsequent sequencing of the human genome, where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001).... Moreover, comparison of the human, dog, mouse and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).
Is this true:
Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) are located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions), suggesting that the latter are at least as functionally important as the former.
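The arithmetic in this statement can be checked against the Fig. 10 numbers noted earlier. The sketch below assumes that the "40% coding" figure lumps CDS (32%) and UTRs (8%) together; that grouping is an interpretation of the notes, not something the paper excerpt states. It checks internal consistency only, not the biological conclusion.

```python
# Percentages of constrained bases, as recorded from Fig. 10 in the notes above.
fig10 = {
    "unannotated": 40,
    "coding": 32,
    "UTRs": 8,
    "experimental annotation": 20,
}

total = sum(fig10.values())

# Assumption: "protein-coding regions" in the statement means CDS plus UTRs.
protein_coding_regions = fig10["coding"] + fig10["UTRs"]

# The statement's own subtraction: 100% - coding - regulatory.
remainder = 100 - protein_coding_regions - fig10["experimental annotation"]

print(f"categories sum to {total}%")
print(f"protein-coding regions: {protein_coding_regions}%")
print(f"remainder (unannotated): {remainder}%")
```

Under that assumption the numbers are self-consistent: the remainder matches the unannotated share in Fig. 10.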
Some human genome statistics
compiled by ZDZ
Nuclear genome
- Size: ~ 3.2 Gb
- Chromosomes: 1–22, X, Y, all linear
- Associated protein: several classes of histone and nonhistone protein
- Euchromatin: ~ 2.9–3.0 Gb
- Constitutive heterochromatin: > 0.2 Gb
- Highly conserved:
- Coding DNA: ~ 50 Mb (~1.5%)
- Other (regulatory etc.): ~ 100 Mb (3%)
- Repetitive DNA: > 50%
- Segmental duplication: > 150 Mb (> 5%)
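As a quick consistency check, the percentages in this list can be recomputed from the megabase figures. This is only a sketch: the function name is made up for illustration, and the choice of denominator (the full ~3.2 Gb genome versus the ~2.9 Gb euchromatic portion) is an assumption, since the notes do not say which one the percentages refer to.

```python
# Recompute the quoted percentages from the sizes listed above.
GENOME_BP = 3.2e9        # full nuclear genome (~3.2 Gb)
EUCHROMATIN_BP = 2.9e9   # lower end of the quoted 2.9-3.0 Gb range

components_bp = {
    "coding DNA": 50e6,
    "other highly conserved": 100e6,
    "segmental duplication (lower bound)": 150e6,
}

def percent_of(bp, denominator_bp):
    """Return bp as a percentage of denominator_bp."""
    return 100.0 * bp / denominator_bp

of_genome = {k: percent_of(v, GENOME_BP) for k, v in components_bp.items()}
of_euchromatin = {k: percent_of(v, EUCHROMATIN_BP) for k, v in components_bp.items()}

for name in components_bp:
    print(f"{name}: {of_genome[name]:.2f}% of genome, "
          f"{of_euchromatin[name]:.2f}% of euchromatin")
```

The coding and "other" figures come out close to the quoted ~1.5% and 3% with either denominator. The "> 5%" for segmental duplication is only reached when dividing 150 Mb by the euchromatic size, which suggests (but does not prove) that this is the denominator behind that figure.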
Gene number
- Mitochondrial genome: 37
- Nuclear genome: 30,000
- ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type
- ~ 60 per chromosome band in a 550-band chromosome preparation
Gene density
- one per 1 kb in the mitochondrial genome
- one per 100 kb in the nuclear genome
Gene size
- ~ 27 kb, but enormous variation
Intergenic distance
- ~75 kb in nuclear genome
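The nuclear gene density, gene size, and intergenic distance figures above can be cross-checked against each other. The sketch below uses only numbers from this page; the variable names are illustrative.

```python
# Two independent estimates of nuclear-genome spacing per gene.
GENOME_KB = 3.2e6          # ~3.2 Gb nuclear genome, in kb
N_GENES = 30_000           # nuclear gene number quoted above
MEAN_GENE_KB = 27          # average gene size
MEAN_INTERGENIC_KB = 75    # average intergenic distance

# Estimate 1: genome size divided by gene count.
kb_per_gene_from_counts = GENOME_KB / N_GENES

# Estimate 2: one gene plus one intergenic gap.
kb_per_gene_from_spacing = MEAN_GENE_KB + MEAN_INTERGENIC_KB

print(f"from counts:  one gene per {kb_per_gene_from_counts:.0f} kb")
print(f"from spacing: one gene per {kb_per_gene_from_spacing} kb")
```

Both estimates land near the quoted "one per 100 kb", so these three statistics are mutually consistent at the level of rounding used here.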
Exon number
- ~ 9, generally correlated with gene length
- Wide variation from small genes with a single exon to large genes with numerous exons
- The dystrophin gene (DMD) has 79 exons
Exon size
- ~ 122 bp with comparatively little length variation
- Coding sequence exons are a bit shorter on average
- Exons containing 3' UTR sequences are considerably longer
- Some exceptionally long exons have been reported:
- exon 26 of the apoB gene (APOB), 7.6 kb
- exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb
- exon 11 of the BRCA1 breast cancer gene, 3.4 kb
Intron size
- Enormous variation
- Strong direct correlation with gene size
- Examples of typical intron sizes are as follows:
- globin gene (HBB; gene size 1.6 kb): 0.5 kb
- myoglobin gene (MB; gene size 10.4 kb): 4.7 kb
- dystrophin gene (DMD; gene size 2.5 Mb): 30.0 kb
mRNA size
- ~ 2.5 kb, but considerable variation
- 5' UTR: ~ 0.2–0.3 kb
- CDS: 1.5–1.8 kb (500–600 codons)
- 3' UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3' UTRs)
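The mRNA components above should add up to roughly the quoted ~2.5 kb total, and the CDS length should follow from the codon count. A minimal sketch of that arithmetic, using only the figures in this list (names are illustrative):

```python
# Sum the mRNA components and derive CDS length from codon counts.
utr5_kb = (0.2, 0.3)   # 5' UTR range
codons = (500, 600)    # CDS length in codons
utr3_kb = 0.8          # 3' UTR (noted above as a likely underestimate)

# Three nucleotides per codon, converted to kb.
cds_kb = tuple(3 * n / 1000 for n in codons)

mrna_kb_low = utr5_kb[0] + cds_kb[0] + utr3_kb
mrna_kb_high = utr5_kb[1] + cds_kb[1] + utr3_kb

print(f"CDS:  {cds_kb[0]:.1f}-{cds_kb[1]:.1f} kb")
print(f"mRNA: {mrna_kb_low:.1f}-{mrna_kb_high:.1f} kb")
```

The low end of the summed range matches the ~2.5 kb average; the high end is somewhat above it, which fits the "considerable variation" caveat.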
- A. thaliana (At) statistics (source: DZ)
- >21K genes
- ~26K pseudogenes (pgenes) from pipeline (3K from TAIR)
- 5 chr
- 260 Mb
- rice
- 12 chr, 390 Mb
General genome statistics
GENOME sizes
- E. coli 4.6 million bp
- Yeast 12 million bp
- Worm 100 million bp
- Fruit Fly 133 million bp
- Human 3.3 billion bp
- Mouse 3.4 billion bp
- Red Viscacha Rat 8.2 billion bp
- Mountain Grasshopper 16.5 billion bp
Number of genes per GENOME
- Yeast: 6,530 known, 167 novel, 21 pseudogenes
- Worm: 20,049 known, 20 novel, 1,150 pseudogenes
- Fruit Fly: 4,751 known, 9,288 novel, 52 pseudogenes
- Human: 21,667 known, 1,013 novel, 1,040 pseudogenes
- Mouse: 22,723 known, 1,395 novel, 1,350 pseudogenes
(src Ensembl)
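Combining the genome sizes and gene counts above gives a rough gene density per megabase for each organism. This is a sketch under stated assumptions: "genes" is taken here to mean known plus novel, excluding pseudogenes, and the dictionaries simply transcribe the figures listed on this page.

```python
# Gene density (genes per Mb) from the genome sizes and gene counts above.
genome_mb = {
    "Yeast": 12,
    "Worm": 100,
    "Fruit Fly": 133,
    "Human": 3300,
    "Mouse": 3400,
}

# (known, novel, pseudogenes) per organism.
gene_counts = {
    "Yeast": (6530, 167, 21),
    "Worm": (20049, 20, 1150),
    "Fruit Fly": (4751, 9288, 52),
    "Human": (21667, 1013, 1040),
    "Mouse": (22723, 1395, 1350),
}

# Assumption: count known + novel genes, leave pseudogenes out.
density = {
    org: (known + novel) / genome_mb[org]
    for org, (known, novel, _pseudo) in gene_counts.items()
}

for org, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{org}: {d:.1f} genes per Mb")
```

With these inputs, yeast comes out as the most gene-dense genome in the list, and human and mouse as the least, differing from yeast by roughly two orders of magnitude.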