VAT

From GersteinInfo


Revision as of 14:03, 7 March 2011

VAT Main Page



Introduction

The Variant Annotation Tool (VAT) consists of a set of modules to annotate genetic variants, including SNPs and indels. The package also contains a program to aggregate SNP and indel variants at the gene level; an image is then generated for each gene to visualize the functional impact of these variants. This information can be viewed and shared through a web interface. In addition to annotating coding variants, the tool integrates allele frequencies and genotype data, providing population-specific information from published high-quality variation databases such as the 1000 Genomes Project.



Data formats


VCF

The Variant Call Format (VCF) is a tab-delimited text file format for representing a number of different genetic variants, including SNPs and indels. This format was developed as part of the 1000 Genomes Project. A detailed summary of the file format is given in the VCF format specification.
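
As an illustration of the tab-delimited layout, the following Python sketch splits one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not part of VAT; the function name parse_vcf_line and the sample record are made up for this example, and genotype columns are ignored.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),        # VCF positions are 1-based
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": qual,
        "filter": flt,
        "info": info_dict,
    }


# Hypothetical record, for illustration only.
record = parse_vcf_line("chr1\t357530\t.\tA\tG\t50\tPASS\tDP=14;DB")
```

Annotation added by a program such as snpMapper would appear as additional entries in the INFO dictionary.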



Interval

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of BIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1].

1.   Name of the interval
2.   Chromosome
3.   Strand
4.   Interval start (with respect to the "+" strand)
5.   Interval end (with respect to the "+" strand)
6.   Number of sub-intervals
7.   Sub-interval starts (with respect to the "+" strand, comma-delimited)
8.   Sub-interval ends (with respect to the "+" strand, comma-delimited)

Example file:

uc001aaw.1      chr1    +       357521  358460  1       357521  358460
uc001aax.1      chr1    +       410068  411702  3       410068,410854,411258    410159,411121,411702
uc001aay.1      chr1    -       552622  554252  3       552622,553203,554161    553066,553466,554252
uc001aaz.1      chr1    +       556324  557910  1       556324  557910
uc001aba.1      chr1    +       558011  558705  1       558011  558705  

In this example the intervals represent transcripts, while the sub-intervals denote exons.

Note: the coordinates in the Interval format are zero-based and the end coordinate is not included.
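
The eight columns and the zero-based, end-exclusive convention can be sketched in Python as follows. This is an illustrative reader, not the BIOS intervalFind code; the names Interval, parse_interval_line, exonic_length and overlaps_sub_interval are hypothetical. The last helper assumes the query position is 1-based, as in VCF, and converts it before comparing.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chrom: str
    strand: str
    start: int                            # zero-based, inclusive
    end: int                              # zero-based, exclusive
    sub_intervals: List[Tuple[int, int]]  # (start, end) pairs, same convention


def parse_interval_line(line):
    """Parse one tab-delimited line of an Interval file."""
    name, chrom, strand, start, end, count, sub_starts, sub_ends = (
        line.strip().split("\t")
    )
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval count does not match column 6")
    return Interval(name, chrom, strand, int(start), int(end),
                    list(zip(starts, ends)))


def exonic_length(interval):
    """Total length of all sub-intervals; end minus start, since end is excluded."""
    return sum(e - s for s, e in interval.sub_intervals)


def overlaps_sub_interval(interval, pos_1based):
    """True if a 1-based position (as used in VCF) falls inside a sub-interval."""
    pos0 = pos_1based - 1
    return any(s <= pos0 < e for s, e in interval.sub_intervals)


# Second line of the example file above.
tx = parse_interval_line(
    "uc001aax.1\tchr1\t+\t410068\t411702\t3\t"
    "410068,410854,411258\t410159,411121,411702"
)
```

Because the end coordinate is excluded, the 1-based position 410159 is the last base of the first exon, while 410160 already lies in the intron.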



List of programs

VAT Core Modules


snpMapper

snpMapper is a program to annotate a set of SNPs in VCF format.

Usage:

snpMapper <annotation.interval> <annotation.fa>
  • Inputs: Takes a VCF input from STDIN
  • Outputs: Outputs annotated SNPs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
  • Required arguments
    • annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
    • annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
  • Optional arguments
    • None

Note: The name field in the Interval file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the gencode2interval program ensures proper formatting:

ENSG00000002726|ENST00000360937|ABP1|ABP1-001	chr7	+	150553558	150558294	4	150553558,150555850,150557588,150558030	150555128,150556136,150557721,150558294
ENSG00000002726|ENST00000388969|ABP1|ABP1-201	chr7	+	150554980	150558046	5	150554980,150555850,150557588,150557674,150558030	150555128,150556136,150557670,150557721,150558046
ENSG00000002726|ENST00000416793|ABP1|ABP1-002	chr7	+	150553558	150558294	4	150553558,150555850,150557531,150558030	150555128,150556136,150557721,150558294
ENSG00000002726|ENST00000437714|ABP1|ABP1-202	chr7	+	150553558	150558294	6	150553558,150554888,150555850,150556896,150557588,150558030	150554654,150555128,150555900,150556904,150557721,150558294
ENSG00000002726|ENST00000460213|ABP1|ABP1-012	chr7	+	150553558	150554305	1	150553558	150554305
ENSG00000002726|ENST00000467291|ABP1|ABP1-010	chr7	+	150553558	150558294	4	150553558,150555850,150557588,150558030	150555128,150556136,150557721,150558294
ENSG00000002726|ENST00000483043|ABP1|ABP1-007	chr7	+	150553558	150554567	1	150553558	150554567
ENSG00000002726|ENST00000487631|ABP1|ABP1-011	chr7	+	150553558	150558294	4	150553558,150555850,150557588,150558030	150555128,150556136,150557721,150558294
ENSG00000002726|ENST00000493429|ABP1|ABP1-009	chr7	+	150553558	150558294	4	150553558,150555850,150557588,150558030	150555128,150556136,150557721,150558294
ENSG00000002745|ENST00000222462|WNT16|WNT16-001	chr7	+	120969346	120979396	4	120969346,120969620,120971731,120978934	120969441,120969871,120972018,120979396
ENSG00000002745|ENST00000361301|WNT16|WNT16-002	chr7	+	120965469	120979396	4	120965469,120969620,120971731,120978934	120965534,120969871,120972018,120979396
ENSG00000002745|ENST00000414945|WNT16|WNT16-201	chr7	+	120965454	120979396	4	120965454,120969620,120971731,120978934	120965534,120969871,120972018,120979396

The geneId is utilized to determine if multiple transcripts belong to the same gene model.
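
A minimal Python sketch of how the four-part name field could be split and how transcripts could be grouped by geneId. This only illustrates the naming convention and is not snpMapper's own code; the function name group_transcripts_by_gene is hypothetical.

```python
from collections import defaultdict


def group_transcripts_by_gene(names):
    """Map each geneId to the transcriptIds that share it."""
    genes = defaultdict(list)
    for name in names:
        parts = name.split("|")
        if len(parts) != 4:
            raise ValueError("expected geneId|transcriptId|geneName|transcriptName")
        gene_id, transcript_id, _gene_name, _transcript_name = parts
        genes[gene_id].append(transcript_id)
    return dict(genes)


# Name fields taken from the example lines above.
names = [
    "ENSG00000002726|ENST00000360937|ABP1|ABP1-001",
    "ENSG00000002726|ENST00000388969|ABP1|ABP1-201",
    "ENSG00000002745|ENST00000222462|WNT16|WNT16-001",
]
genes = group_transcripts_by_gene(names)
```

Here the two ABP1 transcripts end up under the same geneId, which is the grouping the note above describes.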



snpMapperGeneric

txt

Usage:

cmd
  • Inputs:
  • Outputs:
  • Required arguments
    • None
  • Optional arguments
    • None



indelMapper

txt

Usage:

cmd
  • Inputs:
  • Outputs:
  • Required arguments
    • None
  • Optional arguments
    • None



vcfSummary

txt

Usage:

cmd
  • Inputs:
  • Outputs:
  • Required arguments
    • None
  • Optional arguments
    • None



vcf2images

txt

Usage:

cmd
  • Inputs:
  • Outputs:
  • Required arguments
    • None
  • Optional arguments
    • None



vcfSubsetByGene

txt

Usage:

cmd
  • Inputs:
  • Outputs:
  • Required arguments
    • None
  • Optional arguments
    • None



vcfModifyHeader

txt

Usage:

cmd
  • Inputs:
  • Outputs:
  • Required arguments
    • None
  • Optional arguments
    • None


Auxiliary Programs


gencode2interval

gencode2interval converts a GENCODE annotation file (in GTF format) to the Interval format.

Usage:

gencode2interval
  • Inputs: Takes a GENCODE annotation file in GTF format from STDIN
  • Outputs: Outputs the GENCODE annotation file in Interval format to STDOUT
  • Required arguments
    • None
  • Optional arguments
    • None

Note: To obtain the coding sequences of the elements with gene_type protein_coding and transcript_type protein_coding the following command should be used:

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
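
For readers less familiar with awk and grep, the selection performed by the first command can be sketched in Python. This is a rough equivalent, not a replacement for the command: it assumes the standard nine-column GTF layout (source in column 2, feature in column 3, attributes in column 9), and the function name and sample lines are made up for illustration.

```python
def is_protein_coding_cds(gtf_line):
    """Keep CDS features from HAVANA or ENSEMBL whose gene_type and
    transcript_type attributes are both protein_coding."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # comment lines or malformed records
    source, feature, attributes = fields[1], fields[2], fields[8]
    return (
        source in ("HAVANA", "ENSEMBL")
        and feature == "CDS"
        and 'gene_type "protein_coding"' in attributes
        and 'transcript_type "protein_coding"' in attributes
    )


# Hypothetical GTF records, for illustration only.
keep = (
    "chr7\tHAVANA\tCDS\t100\t200\t.\t+\t0\t"
    'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; '
    'transcript_type "protein_coding";'
)
drop = (
    "chr7\tHAVANA\texon\t100\t200\t.\t+\t.\t"
    'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; '
    'transcript_type "protein_coding";'
)
kept_lines = [line for line in (keep, drop) if is_protein_coding_cds(line)]
```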



interval2sequences

Module to retrieve genomic/exonic sequences for an annotation set in Interval format.

Usage:

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>
  • Inputs: None
  • Outputs: Reports the extracted sequences in FASTA format
  • Required arguments
    • file.2bit - genome reference sequence in 2bit format
    • file.annotation - annotation set in Interval format (each line represents one transcript)
    • < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
  • Optional arguments
    • None
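
The difference between the two modes can be illustrated on a toy sequence. The Python sketch below is not the interval2sequences implementation: it works on an in-memory string rather than a 2bit file, ignores strand (no reverse complementing), and the function name extract_sequence is hypothetical. Coordinates follow the zero-based, end-exclusive convention of the Interval format.

```python
def extract_sequence(genome, start, end, sub_starts, sub_ends, mode):
    """Return the sequence for one annotation entry in the given mode."""
    if mode == "genomic":
        # Whole interval, intronic sequence included.
        return genome[start:end]
    if mode == "exonic":
        # Concatenate the sub-intervals (exons) only.
        return "".join(genome[s:e] for s, e in zip(sub_starts, sub_ends))
    raise ValueError("mode must be 'exonic' or 'genomic'")


# Toy chromosome with two "exons" at [3, 6) and [12, 15).
toy = "AAACCCGGGTTTAAACCC"
exonic = extract_sequence(toy, 3, 15, [3, 12], [6, 15], "exonic")
genomic = extract_sequence(toy, 3, 15, [3, 12], [6, 15], "genomic")
```

In this toy case the exonic result is the six bases of the two exons joined together, while the genomic result also contains the six intronic bases between them.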

