Genome Statistics

From GersteinInfo


Latest revision as of 13:58, 10 June 2010

From ENCODE paper

These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161).

To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside of GENCODE annotations.

399 protein-coding loci (those loci found entirely in ENCODE regions)

Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies.

To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly from ChIP-chip ...

From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes. Of these, 1,730 are known GENCODE 5' ends.

Fig 10: 5% of bases constrained. Of these: 40% unannotated, 32% coding, 8% UTRs, 20% other experimental annotation.

Fig 11: Fraction of experimental annotation overlapping constrained sequence: ~90% CDS, ~10% Un.TxFrags, 55% RFBRs.

Zhang et al. paper

Analysed 105 datasets; ~15K total "hits", averaging ~150 per set. Open question: how many total base pairs are covered?

DART paper

By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus).

SNPs

1/300 between people
1/100 between human and chimp

Sequencing error rate 1/10000 (ref?)

GRGENEREV

(The GENCODE annotation currently contains on average 5.4 transcripts per locus). This was underscored by the subsequent sequencing of the human genome where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001) ... Moreover, comparison of the human, dog, mouse, and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).

Is this true? Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) is located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions = 40%), suggesting that the latter are at least as functionally important as the former.
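
The arithmetic behind the question above can be checked against the Fig 10 shares quoted earlier in these notes. This is only a sketch: it assumes the Fig 10 note means 8% UTRs and 20% other experimental annotation, and that the "40% coding" figure is coding plus UTR bases. The variable names are illustrative.

```python
# Shares of constrained bases as read from the Fig 10 note (percent).
# Assumption: the "8%+20%" entry means 8% UTRs and 20% other
# experimental annotation.
constrained_shares = {
    "unannotated": 40,
    "coding": 32,
    "UTR": 8,
    "experimental_annotation": 20,
}

# The four categories should account for all constrained bases.
total = sum(constrained_shares.values())

# Assumption: "40% coding" in the question is coding + UTR bases.
protein_coding_loci = constrained_shares["coding"] + constrained_shares["UTR"]

# Remainder after removing protein-coding and regulatory shares.
unannotated = 100 - protein_coding_loci - constrained_shares["experimental_annotation"]

print(f"total accounted for: {total}%")
print(f"protein-coding loci: {protein_coding_loci}%")
print(f"unannotated noncoding: {unannotated}%")
```

Under these assumptions the two shares come out equal, which is what the question asserts; whether equal base counts imply equal functional importance is a separate matter that the arithmetic cannot settle.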

Some human genome statistics

compiled by ZDZ

Nuclear genome

Size: ~ 3.2 Gb
Chromosomes: 1–22, X, Y, all linear
Associated protein: several classes of histone and nonhistone protein
Euchromatin: ~ 2.9–3.0 Gb
Constitutive heterochromatin: > 0.2 Gb
Highly conserved:
- Coding DNA: ~ 50 Mb (~1.5%)
- Other (regulatory etc.): ~ 100 Mb (3%)
Repetitive DNA: > 50%
Segmental duplication: > 150 Mb (> 5%)
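
The percentages in this list can be recomputed from the Mb figures. A minimal sketch, assuming the ~3.2 Gb total is the denominator (the notes do not say whether the total or the euchromatic size was used); the function name is illustrative.

```python
GENOME_MB = 3200  # ~3.2 Gb nuclear genome, from the list above


def percent_of_genome(size_mb, genome_mb=GENOME_MB):
    """Return size_mb as a percentage of genome_mb."""
    return 100.0 * size_mb / genome_mb


# Sizes in Mb as quoted in the list (the ">" entries are lower bounds).
features_mb = {
    "coding DNA": 50,
    "other highly conserved": 100,
    "segmental duplication": 150,
}

for name, size in features_mb.items():
    print(f"{name}: {size} Mb = {percent_of_genome(size):.1f}% of the genome")
```

With 3.2 Gb as the denominator, 150 Mb is a little under 5%; against a 3.0 Gb euchromatic genome (`percent_of_genome(150, 3000)`) it is exactly 5%, so the quoted "> 5%" depends on which size is taken as the whole.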

Gene number

Mitochondrial genome: 37
Nuclear genome: 30,000
- ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type
- ~ 60 per chromosome band in a 550-band chromosome preparation

Gene density

one per 1 kb in the mitochondrial genome
one per 100 kb in the nuclear genome
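
The nuclear figures for gene number and gene density follow from simple division. A rough sketch using only numbers quoted in these notes (3.2 Gb, 30,000 genes, chromosomes 1-22 plus X and Y, 550 bands); the constant names are illustrative.

```python
NUCLEAR_GENOME_BP = 3_200_000_000  # ~3.2 Gb
NUCLEAR_GENES = 30_000
CHROMOSOME_TYPES = 24              # 1-22, X, Y
BANDS = 550                        # 550-band chromosome preparation

# Average spacing between gene starts, in bp.
bp_per_gene = NUCLEAR_GENOME_BP / NUCLEAR_GENES

# Average gene counts per chromosome and per band.
genes_per_chromosome = NUCLEAR_GENES / CHROMOSOME_TYPES
genes_per_band = NUCLEAR_GENES / BANDS

print(f"one gene per {bp_per_gene / 1000:.0f} kb")
print(f"{genes_per_chromosome:.0f} genes per chromosome")
print(f"{genes_per_band:.1f} genes per band")
```

The results are of the same order as the rounded values in the notes (100 kb, ~1,400, ~60); the small differences presumably reflect slightly different inputs in the original compilation.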

Gene size

~ 27 kb, but enormous variation

Intergenic distance

~75 kb in nuclear genome

Exon number

~ 9, generally correlated with gene length
Wide variation from small genes with a single exon to large genes with numerous exons
The dystrophin gene (DMD) has 79 exons

Exon size

~ 122 bp with comparatively little length variation
Coding sequence exons are a bit shorter on average
Exons containing 3' UTR sequences are considerably longer
Some exceptionally long exons have been reported:
- exon 26 of the apoB gene (APOB), 7.6 kb
- exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb
- exon 11 of the BRCA1 breast cancer gene, 3.4 kb

Intron size

Enormous variation
Strong direct correlation with gene size
Examples of typical intron sizes are as follows:
- globin gene (HBB; gene size 1.6 kb): 0.5 kb
- myoglobin gene (MB; gene size 10.4 kb): 4.7 kb
- dystrophin gene (DMD; gene size 2.5 Mb): 30.0 kb
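
As a consistency check on the dystrophin entry, the gene size and the exon count quoted under "Exon number" give an upper bound on the mean intron size. A minimal sketch that deliberately ignores exonic sequence (which is small next to 2.5 Mb), so the result slightly overestimates the true mean.

```python
DMD_GENE_BP = 2_500_000  # dystrophin gene size from the list above
DMD_EXONS = 79           # exon count from the "Exon number" section

# A gene with n exons has n - 1 introns.
introns = DMD_EXONS - 1

# Upper bound on the mean intron size: treats the whole gene as intronic.
mean_intron_upper_bp = DMD_GENE_BP / introns

print(f"{introns} introns, mean size at most {mean_intron_upper_bp / 1000:.1f} kb")
```

The bound of roughly 32 kb sits close to the 30.0 kb listed for DMD.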

mRNA size

~ 2.5 kb, but considerable variation
5' UTR: ~ 0.2–0.3 kb
CDS: 1.5–1.8 kb (500–600 codons)
3' UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3' UTRs)
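
The component sizes above can be summed to see how they relate to the ~2.5 kb total. A small sketch using integer base pairs to avoid floating-point rounding; the tuple layout (low, high) is just a convenience.

```python
# (low, high) size ranges in bp, from the list above.
five_utr_bp = (200, 300)
cds_bp = (1500, 1800)
three_utr_bp = (800, 800)  # single quoted value, used for both ends

parts = (five_utr_bp, cds_bp, three_utr_bp)
mrna_low_bp = sum(p[0] for p in parts)
mrna_high_bp = sum(p[1] for p in parts)

# Three nucleotides per codon.
codons = (cds_bp[0] // 3, cds_bp[1] // 3)

print(f"mRNA size from components: {mrna_low_bp}-{mrna_high_bp} bp")
print(f"CDS length: {codons[0]}-{codons[1]} codons")
```

The low end of the summed range matches the ~2.5 kb figure, and the codon counts match the 500-600 quoted for the CDS.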

- Arabidopsis thaliana (At) statistics (source: DZ)
>21K genes
~26K pseudogenes from pipeline (3K from TAIR)
5 chromosomes
260 Mb

- Rice
12 chromosomes, 390 Mb

General genome statistics

Genome sizes

E. coli                4.6 million bp
Yeast                  12 million bp
Worm                   100 million bp
Fruit Fly              133 million bp
Human                  3.3 billion bp
Mouse                  3.4 billion bp

Red Viscacha Rat       8.2 billion bp
Mountain Grasshopper   16.5 billion bp

Number of genes per genome

Species      Known     Novel    Pseudogenes
Yeast        6,530     167      21
Worm         20,049    20       1,150
Fruit Fly    4,751     9,288    52
Human        21,667    1,013    1,040
Mouse        22,723    1,395    1,350

(source: Ensembl)
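
Combining the two tables above gives a rough gene density per species. This is a sketch under stated assumptions: "genes" here means known plus novel (pseudogenes excluded), genome sizes are the rounded values listed under genome sizes, and the dictionary and function names are illustrative.

```python
# Genome sizes in Mb, from the genome size list above.
genome_mb = {
    "Yeast": 12,
    "Worm": 100,
    "Fruit Fly": 133,
    "Human": 3300,
    "Mouse": 3400,
}

# (known, novel, pseudogenes) per species, from the Ensembl table above.
gene_counts = {
    "Yeast": (6530, 167, 21),
    "Worm": (20049, 20, 1150),
    "Fruit Fly": (4751, 9288, 52),
    "Human": (21667, 1013, 1040),
    "Mouse": (22723, 1395, 1350),
}


def genes_per_mb(species):
    """Known + novel genes per Mb of genome (pseudogenes excluded)."""
    known, novel, _pseudogenes = gene_counts[species]
    return (known + novel) / genome_mb[species]


for species in gene_counts:
    print(f"{species}: {genes_per_mb(species):.1f} genes per Mb")
```

On these numbers gene density falls by roughly two orders of magnitude from yeast to human and mouse, while the gene counts themselves differ by less than a factor of four.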
