VAT
From GersteinInfo
Revision as of 17:19, 6 March 2011
Data formats
Variant Call Format (VCF)
Interval Format
The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of BIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1].
1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)
Example file:
uc001aaw.1 chr1 + 357521 358460 1 357521 358460
uc001aax.1 chr1 + 410068 411702 3 410068,410854,411258 410159,411121,411702
uc001aay.1 chr1 - 552622 554252 3 552622,553203,554161 553066,553466,554252
uc001aaz.1 chr1 + 556324 557910 1 556324 557910
uc001aba.1 chr1 + 558011 558705 1 558011 558705
In this example, the intervals represent transcripts, while the sub-intervals denote exons.
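Note: the coordinates in the Interval format are zero-based and the end coordinate is not included, so each interval is half-open: [start, end). For instance, the first interval in the example (start 357521, end 358460) covers 358460 - 357521 = 939 bases.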
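The eight columns and the half-open coordinate convention can be illustrated with a short Python sketch. This is not the BIOS intervalFind module: the names Interval, parse_interval_line and overlaps are hypothetical and chosen for illustration only, and the overlap test is a plain pairwise comparison rather than the NCList algorithm cited above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Interval:
    """One record of the Interval format (hypothetical helper type)."""
    name: str
    chromosome: str
    strand: str
    start: int  # zero-based, inclusive
    end: int    # zero-based, exclusive
    sub_intervals: List[Tuple[int, int]]  # (start, end) pairs, same convention


def parse_interval_line(line: str) -> Interval:
    """Parse one tab-delimited line with the eight columns described above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-delimited columns, got {len(fields)}")
    name, chromosome, strand, start, end, count, sub_starts, sub_ends = fields
    starts = [int(s) for s in sub_starts.split(",") if s]
    ends = [int(e) for e in sub_ends.split(",") if e]
    # Column 6 must agree with the number of entries in columns 7 and 8.
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError(f"sub-interval count mismatch for {name}")
    return Interval(name, chromosome, strand, int(start), int(end),
                    list(zip(starts, ends)))


def overlaps(interval: Interval, chromosome: str, start: int, end: int) -> bool:
    """Return True if a half-open query [start, end) overlaps the interval."""
    return (interval.chromosome == chromosome
            and interval.start < end
            and start < interval.end)


if __name__ == "__main__":
    # Second line of the example file, written with explicit tabs.
    line = ("uc001aax.1\tchr1\t+\t410068\t411702\t3\t"
            "410068,410854,411258\t410159,411121,411702")
    transcript = parse_interval_line(line)
    # With half-open coordinates, a length is simply end - start.
    exonic_bases = sum(e - s for s, e in transcript.sub_intervals)
    print(transcript.name, transcript.end - transcript.start, exonic_bases)
    # A query starting exactly at the end coordinate does not overlap.
    print(overlaps(transcript, "chr1", 411702, 411800))
```

Because the end coordinate is excluded, lengths need no "+ 1" correction, and two intervals that merely touch (one ends where the other starts) are not reported as overlapping.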