<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://info.gersteinlab.org/index.php?action=history&amp;feed=atom&amp;title=Genome_Statistics</id>
	<title>Genome Statistics - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://info.gersteinlab.org/index.php?action=history&amp;feed=atom&amp;title=Genome_Statistics"/>
	<link rel="alternate" type="text/html" href="https://info.gersteinlab.org/index.php?title=Genome_Statistics&amp;action=history"/>
	<updated>2026-05-30T09:09:32Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://info.gersteinlab.org/index.php?title=Genome_Statistics&amp;diff=96&amp;oldid=prev</id>
		<title>Infoadmin: Created page with &#039;* From ENCODE paper  These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas…&#039;</title>
		<link rel="alternate" type="text/html" href="https://info.gersteinlab.org/index.php?title=Genome_Statistics&amp;diff=96&amp;oldid=prev"/>
		<updated>2010-06-10T13:58:17Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;#039;* From ENCODE paper  These 30Mb are divided among 44 genomic regions; approximately 15Mb reside in 14 regions for which there is already substantial biological knowledge, whereas…&amp;#039;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;* From ENCODE paper&lt;br /&gt;
&lt;br /&gt;
These 30Mb are divided among 44 genomic regions;&lt;br /&gt;
approximately 15Mb reside in 14 regions for which there is already&lt;br /&gt;
substantial biological knowledge, whereas the other 15Mb reside in&lt;br /&gt;
30 regions chosen by a stratified random-sampling method (see&lt;br /&gt;
http://www.genome.gov/10506161).&lt;br /&gt;
&lt;br /&gt;
To begin with, our studies show that 14.7% of the bases represented in&lt;br /&gt;
the unbiased tiling arrays are transcribed in at least one tissue sample.&lt;br /&gt;
Consistent with previous work [14,15], many (63%) TxFrags reside outside&lt;br /&gt;
of GENCODE annotations&lt;br /&gt;
&lt;br /&gt;
399 protein-coding loci&lt;br /&gt;
(those loci found entirely in ENCODE regions)&lt;br /&gt;
&lt;br /&gt;
Remarkably, 93% of&lt;br /&gt;
bases are represented in a primary transcript identified by at least two&lt;br /&gt;
independent observations (but potentially using the same technology);&lt;br /&gt;
this figure is reduced to 74% in the case of primary transcripts&lt;br /&gt;
detected by at least two different technologies.&lt;br /&gt;
&lt;br /&gt;
To understand better transcriptional regulation, ... We analysed over 150 data sets, mainly&lt;br /&gt;
from ChIP-chip...&lt;br /&gt;
&lt;br /&gt;
From analysing multiple&lt;br /&gt;
data sets, we find 4,491 known and novel TSSs in the ENCODE&lt;br /&gt;
regions, almost tenfold more than the number of established genes.&lt;br /&gt;
Of these, 1,730 are known GENCODE 5&amp;#039; ends.&lt;br /&gt;
&lt;br /&gt;
Fig 10: 5% of bases constrained. Of these, 40% unannotated, 32% coding, 8% UTRs, 20% other experimental annotation&lt;br /&gt;
&lt;br /&gt;
Fig 11: Fraction of experimental annotation&lt;br /&gt;
overlapping constrained sequence: ~90% CDS, ~10% Un.TxFrags, 55% RFBRs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Zhang et al. paper&lt;br /&gt;
analysed 105 datasets &lt;br /&gt;
~15K total &amp;quot;hits&amp;quot;, averaging ~150/set&lt;br /&gt;
How many total base pairs are covered?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* DART paper&lt;br /&gt;
By this method, 14% of Un.TxFrags could be assigned to annotated&lt;br /&gt;
loci, and 21% could be clustered into 200 novel loci (with an average&lt;br /&gt;
of ~7 TxFrags per locus).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* SNPs &lt;br /&gt;
1/300 between people&lt;br /&gt;
1/100 between human and chimp&lt;br /&gt;
&lt;br /&gt;
Sequencing error rate 1/10000 (ref?)&lt;br /&gt;
&lt;br /&gt;
* GRGENEREV&lt;br /&gt;
&lt;br /&gt;
(The GENCODE annotation currently contains on average 5.4 transcripts per locus).&lt;br /&gt;
This was underscored by the subsequent sequencing of the human genome, where it was shown that only 1.2% of the DNA bases code for exons (Lander et al., 2001; Venter et al., 2001). ... Moreover, comparison of the human, dog, mouse, and other vertebrate genomes showed that a large fraction of these was conserved, with ~5% under negative selection since the divergence of these species (Waterston et al., 2002; Lindblad-Toh et al., 2005).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Is this true?&lt;br /&gt;
Approximately another 20% of the constrained elements overlap with experimentally annotated regulatory regions. Therefore, a similar fraction of constrained elements (40% in terms of bases) is located in protein-coding regions as in unannotated noncoding regions (100% - 40% coding - 20% regulatory regions), suggesting that the latter are at least as functionally important as the former.&lt;br /&gt;
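&lt;br /&gt;
The arithmetic behind this question can be checked with a short sketch. It assumes the Fig 10 note is read as 32% coding, 8% UTRs, 20% other experimental annotation and 40% unannotated; that reading and the variable names are assumptions for illustration, not taken from the paper itself.&lt;br /&gt;
&lt;pre&gt;
```python
# Shares of constrained bases (percent), as read from the Fig 10 note.
# Assumption: "8%+20%" means 8% UTRs plus 20% other experimental annotation.
constrained_shares = {
    "coding": 32,
    "utr": 8,
    "experimental_annotation": 20,
    "unannotated": 40,
}

# The four categories should account for all constrained bases.
total = sum(constrained_shares.values())

# "Protein-coding regions" in the broad sense: coding sequence plus UTRs.
protein_coding_regions = constrained_shares["coding"] + constrained_shares["utr"]

# Remainder after removing the coding and regulatory shares.
unannotated_remainder = (
    100 - protein_coding_regions - constrained_shares["experimental_annotation"]
)

print(total, protein_coding_regions, unannotated_remainder)  # 100 40 40
```
&lt;/pre&gt;
Under this reading, the 40% figure for protein-coding regions only holds if UTRs are counted together with coding sequence; with coding sequence alone it would be 32%.&lt;br /&gt;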
&lt;br /&gt;
&lt;br /&gt;
====Some human genome statistics====&lt;br /&gt;
compiled by ZDZ&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Nuclear genome&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* Size: ~ 3.2 Gb&lt;br /&gt;
* Chromosomes: 1–22, X, Y, all linear&lt;br /&gt;
* Associated protein: several classes of histone and nonhistone protein&lt;br /&gt;
* Euchromatin: ~ 2.9–3.0 Gb&lt;br /&gt;
* Constitutive heterochromatin: &amp;gt; 0.2 Gb&lt;br /&gt;
* Highly conserved:&lt;br /&gt;
** Coding DNA: ~ 50 Mb (~1.5%)&lt;br /&gt;
** Other (regulatory etc.):  ~ 100 Mb (3%)&lt;br /&gt;
* Repetitive DNA: &amp;gt; 50%&lt;br /&gt;
* Segmental duplication: &amp;gt; 150 Mb (&amp;gt; 5%)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Gene number&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* Mitochondrial genome: 37&lt;br /&gt;
* Nuclear genome: 30,000&lt;br /&gt;
** ~ 1,400 per chromosome; but dependent on chromosome length and also on chromosome type&lt;br /&gt;
** ~ 60 per chromosome band in a 550-band chromosome preparation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Gene density&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* one per 1 kb in the mitochondrial genome&lt;br /&gt;
* one per 100 kb in the nuclear genome&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Gene size&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* ~ 27 kb, but enormous variation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Intergenic distance&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* ~75 kb in nuclear genome&lt;br /&gt;
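&lt;br /&gt;
As a rough consistency check, the nuclear gene density can be approached two ways from the figures listed above: genome size divided by gene number, and average gene size plus average intergenic distance. This is a back-of-the-envelope sketch using only those rounded values; the constant names are illustrative.&lt;br /&gt;
&lt;pre&gt;
```python
# Rounded values taken from the lists above.
GENOME_SIZE_BP = 3.2e9      # nuclear genome, ~3.2 Gb
GENE_COUNT = 30_000         # nuclear gene number
AVG_GENE_SIZE_KB = 27       # average gene size
AVG_INTERGENIC_KB = 75      # average intergenic distance

# Estimate 1: genome length available per gene.
kb_per_gene = GENOME_SIZE_BP / GENE_COUNT / 1_000

# Estimate 2: one gene plus one intergenic spacer.
gene_plus_spacer_kb = AVG_GENE_SIZE_KB + AVG_INTERGENIC_KB

print(round(kb_per_gene))   # 107
print(gene_plus_spacer_kb)  # 102
```
&lt;/pre&gt;
Both estimates land near the quoted figure of one gene per 100 kb.&lt;br /&gt;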
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Exon number&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* ~ 9, generally correlated with gene length&lt;br /&gt;
* Wide variation from small genes with a single exon to large genes with numerous exons&lt;br /&gt;
* The dystrophin gene (DMD) has 79 exons&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Exon size&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* ~ 122 bp with comparatively little length variation&lt;br /&gt;
* Coding sequence exons are a bit shorter on average&lt;br /&gt;
* Exons containing 3&amp;#039; UTR sequences are considerably longer&lt;br /&gt;
* Some exceptionally long exons have been reported:&lt;br /&gt;
** exon 26 of the apoB gene (APOB), 7.6 kb&lt;br /&gt;
** exon 15 of the adenomatous polyposis coli gene (APC), 6.5 kb&lt;br /&gt;
** exon 11 of the BRCA1 breast cancer gene, 3.4 kb&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Intron size&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* Enormous variation&lt;br /&gt;
* Strong direct correlation with gene size&lt;br /&gt;
* Examples of typical intron sizes are as follows:&lt;br /&gt;
** globin gene (HBB; 1.6 kb) 0.5 kb&lt;br /&gt;
** myoglobin gene (MB; 10.4 kb) 4.7 kb&lt;br /&gt;
** dystrophin gene (DMD; 2.5 Mb) 30.0 kb&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;mRNA size&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* ~ 2.5 kb, but considerable variation&lt;br /&gt;
* 5&amp;#039; UTR: ~ 0.2–0.3 kb&lt;br /&gt;
* CDS:  1.5–1.8 kb (500–600 codons)&lt;br /&gt;
* 3&amp;#039; UTR: ~ 0.8 kb (a likely underestimation due to underreporting of genes with long 3&amp;#039; UTRs)&lt;br /&gt;
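&lt;br /&gt;
The component sizes above can be summed to see how they relate to the quoted overall mRNA size, and the CDS range can be converted to codons. A minimal sketch using only the rounded values from this list; the variable names are illustrative.&lt;br /&gt;
&lt;pre&gt;
```python
# Approximate mRNA component sizes in bp, taken from the list above.
# Ranges are (low, high); the 3' UTR is listed as a single value.
five_prime_utr = (200, 300)
cds = (1500, 1800)
three_prime_utr = (800, 800)

parts = (five_prime_utr, cds, three_prime_utr)
total_low = sum(low for low, _ in parts)
total_high = sum(high for _, high in parts)

# Three nucleotides per codon.
codons_low = cds[0] // 3
codons_high = cds[1] // 3

print(total_low, total_high)    # 2500 2900
print(codons_low, codons_high)  # 500 600
```
&lt;/pre&gt;
The summed range of 2.5 to 2.9 kb sits at or slightly above the quoted ~2.5 kb, and the codon counts match the 500 to 600 given for the CDS.&lt;br /&gt;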
&lt;br /&gt;
&lt;br /&gt;
** Arabidopsis thaliana (At) statistics (source: DZ)&lt;br /&gt;
* &amp;gt;21K genes&lt;br /&gt;
* ~26K pseudogenes from pipeline (3K from TAIR)&lt;br /&gt;
* 5 chr&lt;br /&gt;
* 260 Mb &lt;br /&gt;
&lt;br /&gt;
** rice&lt;br /&gt;
* 12 chr, 390 Mb&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====General genome statistics====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Genome sizes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* E. coli 4.6 million bp&lt;br /&gt;
* Yeast 12 million bp &lt;br /&gt;
* Worm 100 million bp&lt;br /&gt;
* Fruit Fly 133 million bp&lt;br /&gt;
* Human 3.3 billion bp&lt;br /&gt;
* Mouse 3.4 billion bp&lt;br /&gt;
&lt;br /&gt;
* Red Viscacha Rat 8.2 billion bp&lt;br /&gt;
* Mountain Grasshopper 16.5 billion bp&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Number of genes per genome&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
* Yeast  6,530 (known) 167 (novel) 21 (pseudogenes)&lt;br /&gt;
* Worm 20,049 (known) 20 (novel) 1,150 (pseudogenes)&lt;br /&gt;
* Fruit Fly 4,751 (known) 9,288 (novel) 52 (pseudogenes)&lt;br /&gt;
* Human 21,667 (known) 1,013 (novel) 1,040 (pseudogenes)&lt;br /&gt;
* Mouse 22,723 (known) 1,395 (novel) 1,350 (pseudogenes)&lt;br /&gt;
(src Ensembl)&lt;/div&gt;</summary>
		<author><name>Infoadmin</name></author>
	</entry>
</feed>