VAT
Introduction
The Variant Annotation Tool (VAT) consists of a set of modules to annotate genetic variants, including SNPs and indels. The package also contains a program that aggregates SNP and indel variants at the gene level and then generates an image for each gene to visualize the functional impact of these variants. This information can be viewed and shared through a web interface. In addition to annotating coding variants, VAT integrates allele frequencies and genotype data, providing population-specific information from published, high-quality variation databases such as the 1000 Genomes Project.
Data formats
Variant Call Format (VCF)
The Variant Call Format (VCF) is a tab-delimited text file format for representing a number of different genetic variants, including SNPs and indels. This format was developed as part of the 1000 Genomes Project. A detailed summary of this file format is available in the VCF specification.
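As an illustration of the tab-delimited layout, the following minimal Python sketch reads the first five fixed VCF columns (CHROM, POS, ID, REF, ALT) of a data line and labels each alternate allele as a SNP or an indel. It is not part of VAT: the names `VcfRecord`, `parse_vcf_line` and `variant_type` are hypothetical, and the classification rule is a deliberately simple assumption that ignores most structural-variant notation.

```python
from typing import List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int            # 1-based position, as written in the VCF file
    ref: str
    alts: List[str]     # one entry per comma-separated ALT allele


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the leading fixed columns of one VCF data line (not a header)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _identifier, ref, alt = fields[:5]
    return VcfRecord(chrom, int(pos), ref, alt.split(","))


def variant_type(ref: str, alt: str) -> str:
    """Label one REF/ALT pair as 'snp', 'indel' or 'other' (simplified rule)."""
    # Missing, spanning-deletion, symbolic and breakend alleles are not handled.
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"


if __name__ == "__main__":
    record = parse_vcf_line("chr1\t357530\t.\tA\tG,AT\t50\tPASS\t.")
    for alt in record.alts:
        print(record.chrom, record.pos, record.ref, alt, variant_type(record.ref, alt))
```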
Interval
The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of BIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1].
1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)
Example file:
uc001aaw.1	chr1	+	357521	358460	1	357521	358460
uc001aax.1	chr1	+	410068	411702	3	410068,410854,411258	410159,411121,411702
uc001aay.1	chr1	-	552622	554252	3	552622,553203,554161	553066,553466,554252
uc001aaz.1	chr1	+	556324	557910	1	556324	557910
uc001aba.1	chr1	+	558011	558705	1	558011	558705
In this example, the intervals represent transcripts, while the sub-intervals denote exons.
Note: the coordinates in the Interval format are zero-based and the end coordinate is not included.
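The eight columns and the zero-based, end-exclusive coordinates can be made concrete with a short Python sketch. This is not the BIOS intervalFind parser: the names `Interval`, `parse_interval_line`, `exonic_length` and `contains_vcf_position` are hypothetical. Because the end coordinate is excluded, the length of a sub-interval is simply end minus start, and a 1-based VCF position has to be shifted by one before it is compared with an interval.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    strand: str
    start: int                              # zero-based, inclusive
    end: int                                # zero-based, exclusive
    sub_intervals: List[Tuple[int, int]]    # (start, end) pairs, same convention


def parse_interval_line(line: str) -> Interval:
    """Parse one tab-delimited line in the eight-column Interval format."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 tab-delimited columns, got %d" % len(fields))
    name, chrom, strand, start, end, count, sub_starts, sub_ends = fields
    starts = [int(s) for s in sub_starts.split(",") if s]
    ends = [int(e) for e in sub_ends.split(",") if e]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval count does not match the coordinate lists")
    return Interval(name, chrom, strand, int(start), int(end), list(zip(starts, ends)))


def exonic_length(interval: Interval) -> int:
    """Sum of sub-interval lengths; no +1 is needed because ends are exclusive."""
    return sum(end - start for start, end in interval.sub_intervals)


def contains_vcf_position(interval: Interval, vcf_pos: int) -> bool:
    """Check whether a 1-based VCF position falls inside the interval."""
    zero_based = vcf_pos - 1
    return interval.start <= zero_based < interval.end


if __name__ == "__main__":
    line = "uc001aax.1\tchr1\t+\t410068\t411702\t3\t410068,410854,411258\t410159,411121,411702"
    transcript = parse_interval_line(line)
    print(transcript.name, exonic_length(transcript))
```

For the second transcript of the example file, the three exons have lengths 91, 267 and 444, so the exonic length is 802 bases.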
List of programs
VAT Core Modules
snpMapper
snpMapper a
Usage:
snpMapper <annotation.interval> <annotation.fa>
- Inputs:
- Outputs:
- Required arguments
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated using the interval2sequences program.
- Optional arguments
- None
snpMapperGeneric
txt
Usage:
cmd
- Inputs:
- Outputs:
- Required arguments
- None
- Optional arguments
- None
indelMapper
txt
Usage:
cmd
- Inputs:
- Outputs:
- Required arguments
- None
- Optional arguments
- None
vcfSummary
txt
Usage:
cmd
- Inputs:
- Outputs:
- Required arguments
- None
- Optional arguments
- None
vcf2images
txt
Usage:
cmd
- Inputs:
- Outputs:
- Required arguments
- None
- Optional arguments
- None
vcfSubsetByGene
txt
Usage:
cmd
- Inputs:
- Outputs:
- Required arguments
- None
- Optional arguments
- None
vcfModifyHeader
txt
Usage:
cmd
- Inputs:
- Outputs:
- Required arguments
- None
- Optional arguments
- None
Auxiliary Programs
gencode2interval
gencode2interval converts a GENCODE annotation file (in GTF format) to the Interval format.
Usage:
gencode2interval
- Inputs: Takes a GENCODE annotation file in GTF format from STDIN
- Outputs: Outputs the GENCODE annotation file in Interval format to STDOUT
- Required arguments
- None
- Optional arguments
- None
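The kind of conversion described here can be sketched in Python as follows. This is not the gencode2interval implementation: the function name `gtf_to_interval_lines` is hypothetical, and the sketch assumes that each interval is built from the `exon` records of one transcript and named after its `transcript_id` attribute, which may differ from what the actual program does. The one detail that is certain is the coordinate shift: GTF coordinates are 1-based and inclusive, whereas the Interval format is zero-based with the end excluded, so only the start is decremented.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_to_interval_lines(gtf_lines: Iterable[str]) -> List[str]:
    """Group GTF exon records by transcript and emit Interval-format lines."""
    exons: Dict[Tuple[str, str, str], List[Tuple[int, int]]] = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # assumption: only exon features define sub-intervals
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        chrom, strand = fields[0], fields[6]
        # 1-based inclusive (GTF) -> zero-based, end-exclusive (Interval)
        exons[(match.group(1), chrom, strand)].append((int(fields[3]) - 1, int(fields[4])))

    lines = []
    for (name, chrom, strand), blocks in exons.items():
        blocks.sort()  # sub-intervals are listed with respect to the "+" strand
        starts = ",".join(str(start) for start, _ in blocks)
        ends = ",".join(str(end) for _, end in blocks)
        interval_start = blocks[0][0]
        interval_end = max(end for _, end in blocks)
        lines.append("\t".join([name, chrom, strand, str(interval_start),
                                str(interval_end), str(len(blocks)), starts, ends]))
    return lines


if __name__ == "__main__":
    import sys
    # Mirrors the STDIN -> STDOUT convention described above.
    for interval_line in gtf_to_interval_lines(sys.stdin):
        print(interval_line)
```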
interval2sequences
Module to retrieve genomic/exonic sequences for an annotation set in Interval format.
Usage:
interval2sequences <file.2bit> <file.annotation> <exonic|genomic>
- Inputs: None
- Outputs: Reports the extracted sequences in FASTA format
- Required arguments
- file.2bit - genome reference sequence in 2bit format
- file.annotation - annotation set in Interval format (each line represents one transcript)
- <exonic|genomic> - exonic means that only the exonic regions are extracted, while genomic means that the intronic sequences are extracted as well
- Optional arguments
- None
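The difference between the exonic and genomic modes can be illustrated with a small Python sketch that works on an in-memory chromosome string instead of a 2bit file. It is not the interval2sequences implementation: `extract_sequence` and `to_fasta` are hypothetical names, reading 2bit files is left out, and strand handling (for example reverse-complementing transcripts on the "-" strand) is deliberately omitted because the description above does not specify it.

```python
from typing import List, Tuple


def extract_sequence(chrom_seq: str, start: int, end: int,
                     sub_intervals: List[Tuple[int, int]], mode: str) -> str:
    """Return the sequence of one interval in 'exonic' or 'genomic' mode.

    Coordinates are zero-based with the end excluded, so they map directly
    onto Python slices.
    """
    if mode == "genomic":
        # Whole span from interval start to end, introns included.
        return chrom_seq[start:end]
    if mode == "exonic":
        # Concatenate only the sub-intervals (exons), skipping introns.
        return "".join(chrom_seq[s:e] for s, e in sorted(sub_intervals))
    raise ValueError("mode must be 'exonic' or 'genomic'")


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one named sequence as a FASTA record with fixed-width lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(chunks)


if __name__ == "__main__":
    chromosome = "AAACCCGGGTTT"           # toy reference sequence
    exons = [(3, 6), (9, 12)]             # two exons separated by one intron
    for mode in ("exonic", "genomic"):
        print(to_fasta("tx1_" + mode, extract_sequence(chromosome, 3, 12, exons, mode)))
```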