Infoadmin: Created page with '* From ENCODE paper These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas…'

2010-06-10T13:58:17Z

* From ENCODE paper

These 30Mb are divided among 44 genomic regions;
approximately 15Mb reside in 14 regions for which there is already
substantial biological knowledge, whereas the other 15Mb reside in
30 regions chosen by a stratified random-sampling method (see
http://www.genome.gov/10506161).

To begin with, our studies show that 14.7% of the bases represented in
the unbiased tiling arrays are transcribed in at least one tissue sample.
Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside
of GENCODE annotations.

399 protein-coding loci
(those loci found entirely in ENCODE regions)

Remarkably, 93% of
bases are represented in a primary transcript identified by at least two
independent observations (but potentially using the same technology);
this figure is reduced to 74% in the case of primary transcripts
detected by at least two different technologies.

To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly
from ChIP-chip...

From analysing multiple
data sets, we find 4,491 known and novel TSSs in the ENCODE
regions, almost tenfold more than the number of established genes.
Of these, known GENCODE 5' ends: 1,730

Fig 10: 5% of bases constrained. Of these: 40% unannotated, 32% coding, 8% UTRs + 20% expt. annotation

Fig 11: Fraction of experimental annotation
overlapping constrained sequence: ~90% CDS, ~10% Un.TxFrags, 55% RFBRs

* Zhang et al. paper
analysed 105 datasets
~15K total "hits", averaging ~150/set
How many total base pairs are covered?
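
As a quick sanity check on the average quoted above, the per-set figure can be recomputed from the two totals in the note. This is only a sketch: 15,000 is a rounded reading of "~15K", so the result is approximate.

```python
# Values taken from the note above; 15_000 is a rounded reading of "~15K".
total_hits = 15_000
n_datasets = 105

# Mean number of hits per data set.
hits_per_set = total_hits / n_datasets
print(f"average hits per data set: ~{hits_per_set:.0f}")
```

The result comes out slightly under 150, which is consistent with the rounded "~150/set".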

* DART paper
By this method, 14% of Un.TxFrags could be assigned to annotated
loci, and 21% could be clustered into 200 novel loci (with an average
of ~7 TxFrags per locus).

* SNPs
1/300 between people
1/100 between human and chimp

Sequencing error rate: 1/10,000 (ref?)
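
To put these per-base rates on a genome-wide scale, they can be multiplied by the genome size. The sketch below assumes the ~3.2 Gb nuclear genome size listed under the human genome statistics on this page, and it treats each rate as uniform along the genome, which is a simplification.

```python
GENOME_SIZE_BP = 3.2e9  # assumption: nuclear genome size listed elsewhere on this page

# Per-base rates from the notes above.
rates = {
    "human vs human": 1 / 300,
    "human vs chimp": 1 / 100,
    "sequencing error": 1 / 10_000,
}

# Expected number of affected bases if each rate applied uniformly genome-wide.
expected = {name: rate * GENOME_SIZE_BP for name, rate in rates.items()}
for name, count in expected.items():
    print(f"{name}: ~{count / 1e6:.1f} million bases")

# Expected number of true human-human differences per sequencing error.
snp_to_error_ratio = rates["human vs human"] / rates["sequencing error"]
print(f"differences per sequencing error: ~{snp_to_error_ratio:.0f}")
```

Under these assumptions, a single pass at the quoted error rate would produce roughly one error for every 33 real differences between two people.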

* GRGENEREV

(The GENCODE annotation currently contains on average 5.4 transcripts per locus).
This was underscored by the subsequent sequencing of the human genome, where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001)... Moreover, comparison of the human, dog, mouse, and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).

Is this true?
Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) is located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions), suggesting that the latter are at least as functionally important as the former.
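
One way to check the arithmetic in the statement above is to tally the Fig 10 fractions noted in the ENCODE section of this page. The sketch assumes that note reads as 32% coding, 8% UTRs, 20% other experimental annotation and 40% unannotated, and that the statement's "40% coding" lumps coding and UTR bases together. Both readings are assumptions and not something the notes confirm.

```python
# Percent of constrained bases per category, as read from the Fig 10 note
# (assumed interpretation of "32% coding, 8%+20% UTRs + expt. annotation").
constrained = {
    "coding": 32,
    "UTR": 8,
    "experimental_annotation": 20,
    "unannotated": 40,
}

total = sum(constrained.values())

# "40% coding" in the statement is assumed to mean coding + UTR bases.
genic = constrained["coding"] + constrained["UTR"]

# Remainder after removing genic and experimentally annotated bases.
remainder = 100 - genic - constrained["experimental_annotation"]

print(f"total: {total}%, genic: {genic}%, remainder: {remainder}%")
```

With this reading the categories sum to 100% and the remainder matches the 40% unannotated fraction, so the subtraction is internally consistent. Strictly protein-coding bases would be 32% and not 40%.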

====Some human genome statistics====
compiled by ZDZ

'''Nuclear genome'''

* Size: ~ 3.2 Gb
* Chromosomes: 1–22, X, Y, all linear
* Associated protein: several classes of histone and nonhistone protein
* Euchromatin: ~ 2.9–3.0 Gb
* Constitutive heterochromatin: > 0.2 Gb
* Highly conserved:
** Coding DNA: ~ 50 Mb (~1.5%)
** Other (regulatory etc.): ~ 100 Mb (3%)
* Repetitive DNA: > 50%
* Segmental duplication: > 150 Mb (> 5%)
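
The percentages in this list can be re-derived from the sizes given. The sketch below computes each feature as a share of the ~3.2 Gb total and of the euchromatic portion. It takes 3.0 Gb (the upper end of the quoted range) as the euchromatin size, which is an assumption.

```python
GENOME_MB = 3200       # ~3.2 Gb total, from the list above
EUCHROMATIN_MB = 3000  # assumption: upper end of the ~2.9-3.0 Gb range above

# Feature sizes in Mb, from the list above (lower bounds where ">" is given).
features_mb = {
    "coding DNA": 50,
    "other conserved": 100,
    "segmental duplication": 150,
}

percent_of_total = {n: 100 * s / GENOME_MB for n, s in features_mb.items()}
percent_of_euchromatin = {n: 100 * s / EUCHROMATIN_MB for n, s in features_mb.items()}

for name in features_mb:
    print(f"{name}: {percent_of_total[name]:.1f}% of total, "
          f"{percent_of_euchromatin[name]:.1f}% of euchromatin")
```

With these inputs, 150 Mb is about 4.7% of the total genome and 5.0% of the euchromatin, so the "> 5%" figure for segmental duplication may have been quoted relative to euchromatic sequence.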

'''Gene number'''

* Mitochondrial genome: 37
* Nuclear genome: 30,000
** ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type
** ~ 60 per chromosome band in a 550-band chromosome preparation

'''Gene density'''

* one per 1 kb in the mitochondrial genome
* one per 100 kb in the nuclear genome

'''Gene size'''

* ~ 27 kb, but enormous variation

'''Intergenic distance'''

* ~75 kb in nuclear genome
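
The gene number, density, size and intergenic distance figures above can be cross-checked against each other. This is a rough sketch using only numbers from this page. Counting 24 distinct chromosomes (1-22, X, Y) is an assumption about how "per chromosome" was meant.

```python
GENOME_KB = 3.2e6    # ~3.2 Gb nuclear genome
N_GENES = 30_000     # nuclear gene number quoted above
N_CHROMOSOMES = 24   # assumption: 1-22, X and Y counted once each
N_BANDS = 550        # bands in a 550-band preparation
GENE_SIZE_KB = 27    # average gene size
INTERGENIC_KB = 75   # average intergenic distance

# Density implied by genome size and gene count.
kb_per_gene = GENOME_KB / N_GENES

# Density implied by average gene size plus average intergenic distance.
spacing_kb = GENE_SIZE_KB + INTERGENIC_KB

genes_per_chromosome = N_GENES / N_CHROMOSOMES
genes_per_band = N_GENES / N_BANDS

print(f"kb per gene (size/count): ~{kb_per_gene:.0f}")
print(f"kb per gene (gene + intergenic): ~{spacing_kb}")
print(f"genes per chromosome: ~{genes_per_chromosome:.0f}")
print(f"genes per band: ~{genes_per_band:.1f}")
```

Both density estimates land near the "one per 100 kb" figure. The per-chromosome and per-band values computed from 30,000 genes come out a little lower than the ~1,400 and ~60 quoted above.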

'''Exon number'''

* ~ 9, generally correlated with gene length
* Wide variation from small genes with a single exon to large genes with numerous exons
* The dystrophin gene (DMD) has 79 exons

'''Exon size'''

* ~ 122 bp with comparatively little length variation
* Coding sequence exons are a bit shorter on average
* Exons containing 3' UTR sequences are considerably longer
* Some exceptionally long exons have been reported:
** exon 26 of the apoB gene (APOB), 7.6 kb
** exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb
** exon 11 of the BRCA1 breast cancer gene, 3.4 kb

'''Intron size'''

* Enormous variation
* Strong direct correlation with gene size
* Examples of typical intron sizes are as follows:
** globin gene (HBB; 1.6 kb) 0.5 kb
** myoglobin gene (MB; 10.4 kb) 4.7 kb
** dystrophin gene (DMD; 2.5 Mb) 30.0 kb

'''mRNA size'''

* ~ 2.5 kb, but considerable variation
* 5' UTR: ~ 0.2–0.3 kb
* CDS: 1.5–1.8 kb (500–600 codons)
* 3' UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3' UTRs)
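
The component sizes above can be summed to see how they compare with the ~2.5 kb total. This is a simple sketch that treats each quoted range as a (low, high) pair and converts the codon counts to kb at 3 nucleotides per codon.

```python
# Component ranges in kb as (low, high), from the list above.
utr5 = (0.2, 0.3)
cds = (1.5, 1.8)
utr3 = (0.8, 0.8)

# Total mRNA length implied by the components.
mrna_low = utr5[0] + cds[0] + utr3[0]
mrna_high = utr5[1] + cds[1] + utr3[1]

# CDS length implied by 500-600 codons at 3 nt per codon.
cds_from_codons_kb = (500 * 3 / 1000, 600 * 3 / 1000)

print(f"mRNA from components: {mrna_low:.1f}-{mrna_high:.1f} kb")
print(f"CDS from codons: {cds_from_codons_kb[0]:.1f}-{cds_from_codons_kb[1]:.1f} kb")
```

The components sum to roughly 2.5-2.9 kb, so the ~2.5 kb average sits at the low end of that range, and the codon counts agree with the quoted CDS length.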

** At statistics (src is DZ)
* >21K genes
* ~26K pgenes from pipeline (3K from TAIR)
* 5 chr
* 260 Mb

** rice
* 12 chr, 390 Mb

====General genome statistics====

'''GENOME sizes'''

* E. coli 4.6 million bp
* Yeast 12 million bp
* Worm 100 million bp
* Fruit Fly 133 million bp
* Human 3.3 billion bp
* Mouse 3.4 billion bp

* Red Viscacha Rat 8.2 billion bp
* Mountain Grasshopper 16.5 billion bp

'''Number of genes per GENOME'''

* Yeast 6,530 (known) 167 (novel) 21 (pseudogenes)
* Worm 20,049 (known) 20 (novel) 1,150 (pseudogenes)
* Fruit Fly 4,751 (known) 9,288 (novel) 52 (pseudogenes)
* Human 21,667 (known) 1,013 (novel) 1,040 (pseudogenes)
* Mouse 22,723 (known) 1,395 (novel) 1,350 (pseudogenes)
(src Ensembl)
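
The counts above are easier to compare once they are held in a small table. The sketch below assumes that "known" plus "novel" gives the total gene count for each genome, which is an interpretation of the list and not something the source states.

```python
# (known, novel, pseudogenes) per genome, copied from the list above.
gene_counts = {
    "Yeast": (6530, 167, 21),
    "Worm": (20049, 20, 1150),
    "Fruit Fly": (4751, 9288, 52),
    "Human": (21667, 1013, 1040),
    "Mouse": (22723, 1395, 1350),
}

totals = {}
pseudogene_ratio = {}
for organism, (known, novel, pseudo) in gene_counts.items():
    genes = known + novel  # assumed definition of the total gene count
    totals[organism] = genes
    pseudogene_ratio[organism] = pseudo / genes
    print(f"{organism}: {genes:,} genes, "
          f"{pseudogene_ratio[organism]:.1%} pseudogenes relative to genes")
```

Under this assumption the human total is 22,680, noticeably lower than the 30,000 quoted in the human genome statistics section, so the two figures presumably come from different annotation sources or dates.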

Genome Statistics - Revision history
