VAT

From GersteinInfo

(Difference between revisions)

Latest revision as of 14:23, 16 June 2011

Data formats

VCF

The Variant Call Format (VCF) is a tab-delimited text file format that represents several kinds of genetic variants, including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the 1000 Genomes Project. A detailed summary of this file format can be found in the VCF version 4.0 description on the 1000 Genomes wiki (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40). The annotation information is captured as part of the INFO field using the VA (Variant Annotation) tag. The string with the variant information has the following format:

AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs (annotated by snpMapper), the type takes one of the following values: synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (annotated by indelMapper), the type takes one of the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS denotes 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).
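As a rough illustration of the format string above, the following Python sketch splits a single VA entry into its gene-level fields and its list of transcripts. It is not part of VAT: the names `Transcript`, `VaEntry`, and `parse_va_entry` are hypothetical, and the code assumes that each transcript contributes exactly three colon-separated fields (name, identifier, and a detail string), which matches the SNP and indel examples on this page but may not hold for SV entries. The detail string is kept unparsed because its layout depends on the variant class and type.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    name: str           # e.g. "EPHB2-001"
    transcript_id: str  # e.g. "ENST00000400191"
    details: str        # kept as-is; layout depends on variant class and type


class VaEntry(NamedTuple):
    allele_number: int  # 0 = reference allele, 1.. = alternate alleles
    gene_name: str
    gene_id: str
    strand: str
    variant_type: str
    affected: int       # numerator of FractionOfTranscriptsAffected
    total: int          # denominator of FractionOfTranscriptsAffected
    transcripts: List[Transcript]


def parse_va_entry(entry: str) -> VaEntry:
    """Parse one VA entry (no commas) into gene-level fields and transcripts."""
    fields = entry.split(":")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    allele, gene_name, gene_id, strand, variant_type, fraction = fields[:6]
    affected, total = fraction.split("/")
    rest = fields[6:]
    # Assumption: three fields per transcript (name, id, details).
    if len(rest) % 3 != 0:
        raise ValueError("transcript list is not a multiple of three fields")
    transcripts = [Transcript(*rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return VaEntry(int(allele), gene_name, gene_id, strand, variant_type,
                   int(affected), int(total), transcripts)


# VA string taken from Example 1 on this page.
entry = parse_va_entry(
    "1:EPHB2:ENSG00000133216:+:prematureStop:1/5:"
    "EPHB2-001:ENST00000400191:3165_3055_1019_K->*"
)
print(entry.gene_name, entry.variant_type, f"{entry.affected}/{entry.total}")
for transcript in entry.transcripts:
    print(transcript.name, transcript.transcript_id, transcript.details)
```

Splitting on ':' works here because none of the fields shown in the examples (including the strand and substitutions such as K->*) contain a colon.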

Example 1: A SNP is introducing a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1	23112837	.	A	T	.	PASS	AA=A;AC=7;AN=118;DP=168;SF=2;VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*

Example 2: A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1	1110357	.	G	A	.	PASS	AA=G;AC=3;AN=118;DP=203;SF=2;VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H

Example 3: A SNP causing a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9	35819390	rs2381409	C	T	.	PASS	AA=N;AC=157;AN=240;DP=49;SF=0,1;VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106

Example 4: An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7	140118541	.	TACAACAACA	T,TACA	.	PASS	HP=1;VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
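The comma-separated layout can be unpacked with a few lines of Python. The sketch below is only an illustration and not part of VAT; the function names `va_entries` and `types_by_alt_allele` are hypothetical. It pulls the VA tag out of the INFO column, splits it into entries, and uses the allele number to map each entry back to the alternate allele in the ALT column. The INFO column is split on ';' first, because other tags (for example SF=0,1 in Example 3) also contain commas.

```python
from typing import Dict, List


def va_entries(info: str) -> List[str]:
    """Return the comma-separated entries of the VA tag in an INFO string."""
    for item in info.split(";"):
        if item.startswith("VA="):
            return item[len("VA="):].split(",")
    return []  # no VA tag: the variant was not annotated


def types_by_alt_allele(vcf_line: str) -> Dict[str, List[str]]:
    """Map each alternate allele to the variant types annotated for it."""
    columns = vcf_line.rstrip("\n").split("\t")
    alt_alleles = columns[4].split(",")   # ALT is the 5th VCF column
    result: Dict[str, List[str]] = {}
    for entry in va_entries(columns[7]):  # INFO is the 8th VCF column
        fields = entry.split(":")
        allele_number = int(fields[0])
        if allele_number < 1:
            raise ValueError("annotation entries refer to alternate alleles")
        variant_type = fields[4]
        # Alternate alleles are numbered from one, Python lists from zero.
        alt = alt_alleles[allele_number - 1]
        result.setdefault(alt, []).append(variant_type)
    return result


# Data line from Example 4 (two alternate alleles).
example4 = "\t".join([
    "chr7", "140118541", ".", "TACAACAACA", "T,TACA", ".", "PASS",
    "HP=1;VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:"
    "AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,"
    "2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:"
    "AC006344.1-201:ENST00000434223:66_23_8_LQQ->L",
])
print(types_by_alt_allele(example4))
```

For Example 3, the same function returns both types (nonsynonymous and spliceOverlap) under the single alternate allele T.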

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using vcfModifyHeader. Specifically, each sample name is prefixed by a group identifier using the ':' character as a delimiter.
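To make the group:sample convention concrete, here is a minimal Python sketch of the transformation on the VCF column header line. It only illustrates the naming scheme and is not the implementation of vcfModifyHeader; the names `add_groups` and `split_group` are hypothetical. It assumes the standard nine fixed VCF columns (#CHROM through FORMAT) precede the sample columns, and it leaves samples without a group assignment unchanged, which is a choice made for this example only.

```python
from typing import Dict, Optional, Tuple

FIXED_COLUMNS = 9  # #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def add_groups(header_line: str, groups: Dict[str, str]) -> str:
    """Prefix each sample column with its group, e.g. NA06984 -> CEU:NA06984."""
    columns = header_line.rstrip("\n").split("\t")
    fixed, samples = columns[:FIXED_COLUMNS], columns[FIXED_COLUMNS:]
    # Assumption: samples missing from the mapping keep their original name.
    labelled = [f"{groups[s]}:{s}" if s in groups else s for s in samples]
    return "\t".join(fixed + labelled)


def split_group(column: str) -> Tuple[Optional[str], str]:
    """Split a sample column into (group, sample); group is None if absent."""
    group, sep, sample = column.partition(":")
    return (group, sample) if sep else (None, column)


header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "HG00629", "NA06984"])
groups = {"HG00629": "CHS", "NA06984": "CEU"}

new_header = add_groups(header, groups)
print(new_header.split("\t")[FIXED_COLUMNS:])
print(split_group("CEU:NA06984"))
```

A downstream program can recover the grouping by applying `split_group` to every sample column of the modified header.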

@@ Line 3: / Line 3: @@
 __TOC__
-== Introduction ==
-The Variant Annotation Tool (VAT) consists of a set of modules to annotate genetic variants including SNPs and indels. This software package also contains a program to aggregate SNP and indel variants at the gene level.  Subsequently, an image is generated  for each gene to visualize the functional impact of these variants.  This information can then be viewed and shared using a web-interface. In addition to annotation of the coding variants, this tool also integrates allele frequencies and genotype data providing population-specific information from published high quality variation databases such as [http://www.1000genomes.org 1000 Genomes Project].
-<br><br>
 == Data formats ==
@@ Line 16: / Line 10: @@
 === VCF ===
-The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including SNPs and Indels. This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project].  A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the INFO field using the VA (Variant Annotation) tag. The string with the variant information has the following format:
+The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project].  A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format:
+ '''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}'''
+All annotated variants use the above format to capture information about the gene.  The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:
- AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}TranscriptName:TranscriptId:TranscriptLength_RelativePositionOfVariant_RelativePositionOfAminoAcid_Substitution
+[[File:VariantFormat.png|1000px]]
 The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele).  The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap.  For indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS,  where FS denotes 'frameshift' and NFS indicates 'non-frameshift'.  The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).
@@ Line 45: / Line 43: @@
 Notice that multiple annotation entries are comma-separated.  Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
+VAT also enables the grouping of samples.  For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, each sample name is prefixed by a group identifier using the ':' character as a delimiter.
 <br>
@@ Line 52: / Line 52: @@
 === Interval ===
-The Interval format consists of '''eight''' tab-delimited columns and is used to represent genomic intervals such as genes.
+The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes.
-This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of [http://homes.gersteinlab.org/people/lh372/SOFT/bios/index.html BIOS]. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].
+This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].
 .   Name of the interval
@@ Line 64: / Line 64: @@
 .   Sub-interval end (with respect to the "+", comma-delimited)
-Note: For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using [[#gencode2interval|gencode2interval]] program ensures proper formatting:
+'''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting.
 Example file:
@@ Line 76: / Line 76: @@
 In this example, each interval (line) represents a transcript, while the sub-intervals denote exons.  The geneId is utilized to determine if multiple transcripts belong to the same gene model.
-Note: the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
+'''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
 <br><br>
@@ Line 106: / Line 106: @@
 <center>[[#top|Top]]</center>
-==== snpMapperGeneric ====
+==== indelMapper ====
-snpMapperGeneric is a program to annotate a set of SNPs in [[#VCF|VCF]] format. The program checks for containment of a SNP in the specified annotation set.
+indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.
 '''Usage''':
-  snpMapper <annotation.interval> <nameFeature>
+  indelMapper <annotation.interval> <annotation.fa>
 * Inputs: Takes a [[#VCF|VCF]] input from STDIN
-* Outputs:  Outputs annotated SNPs in [[#VCF|VCF]] format.  The annotation information is captured as part of  the INFO field.  For details refer to the [[#VCF|VCF]] format specification.
+* Outputs:  Outputs annotated indels in [[#VCF|VCF]] format.  The annotation information is captured as part of  the INFO field.  For details refer to the [[#VCF|VCF]] format specification.
 * ''Required arguments''
 ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
-** nameFeature - Specifies the type of the annotation feature (for example promotor regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
+** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
 * ''Optional arguments''
 ** None
@@ Line 126: / Line 126: @@
 <center>[[#top|Top]]</center>
-==== indelMapper ====
+==== svMapper ====
-indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.
+svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene.
 '''Usage''':
-  indelMapper <annotation.interval> <annotation.fa>
+  svMapper <annotation.interval>
 * Inputs: Takes a [[#VCF|VCF]] input from STDIN
-* Outputs:  Outputs annotated indels in [[#VCF|VCF]] format.  The annotation information is captured as part of  the INFO field.  For details refer to the [[#VCF|VCF]] format specification.
+* Outputs:  Outputs annotated SVs in [[#VCF|VCF]] format.  The annotation information is captured as part of  the INFO field.  For details refer to the [[#VCF|VCF]] format specification.
 * ''Required arguments''
 ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
-** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
+* ''Optional arguments''
+** None
+<br>
+<center>[[#top|Top]]</center>
+==== genericMapper ====
+genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).
+'''Usage''':
+ genericMapper <annotation.interval> <nameFeature>
+* Inputs: Takes a [[#VCF|VCF]] input from STDIN
+* Outputs:  Outputs the annotated variants in [[#VCF|VCF]] format.  The annotation information is captured as part of  the INFO field.
+* ''Required arguments''
+** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]].
+** nameFeature - Specifies the type of the annotation feature (for example promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
 * ''Optional arguments''
 ** None
@@ Line 155: / Line 174: @@
 * Inputs: None
-* Outputs:  Generates two output files. The first file, named file.geneSummary.txt, contains the number of variants categorized by type for each gene.  A second file, named file.sampleSummary.txt, summarizes number of variants categorized by type for each sample.
+* Outputs:  Generates two output files. The first file, named ''file.geneSummary.txt'', contains the number of variants categorized by type for each gene.  The second file, named ''file.sampleSummary.txt'', summarizes the number of variants categorized by type for each sample.
 * ''Required arguments''
 ** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs).  This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
@@ Line 177: / Line 196: @@
 * Outputs:  Generates an image in PNG format for each gene that has at least one annotated variant.
 * ''Required arguments''
-** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs).  This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
+** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs).  This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
 ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
 ** outputDir - The output directory where the images are stored
@@ Line 193: / Line 212: @@
 '''Usage''':
-  vcf2images <file.vcf.gz> <annotation.interval> <outputDir>
+  vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>
 * Inputs: None
@@ Line 205: / Line 224: @@
 <br>
+<center>[[#top|Top]]</center>
+==== vcfModifyHeader ====
+vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a [[#VCF|VCF]] file.  Specifically, it assigns each sample to a group or population (these assignments are used by other programs including [[#vcfSummary|vcfSummary]]).
+'''Usage''':
+ vcfModifyHeader <oldHeader.vcf> <groups.txt>
+* Inputs: None
+* Outputs:  Generates a [[#VCF|VCF]] header file.
+* ''Required arguments''
+** oldHeader.vcf - The meta lines of a [[#VCF|VCF]] file. It can be obtained by using the following command:
+ grep '^#' file.vcf > file.header.vcf
+** groups.txt - This tab-delimited file assigns each sample present in the [[#VCF|VCF]] to a group/population. Here is a small sample file:
+ HG00629	CHS
+ HG00634	CHS
+ HG00635	CHS
+ HG00637	PUR
+ HG00638	PUR
+ HG00640	PUR
+ NA06984	CEU
+ NA06985	CEU
+ NA06986	CEU
+ NA06989	CEU
+ NA06994	CEU
+* ''Optional arguments''
+** None
+<br><br>
 === Auxiliary programs ===
@@ Line 228: / Line 279: @@
   awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
-  gencode2interval < encode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > encode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
+  gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
 <br>
@@ Line 254: / Line 305: @@
 === External programs ===
 <center>[[#top|Top]]</center>
@@ Line 263: / Line 313: @@
 <br>
+<center>[[#top|Top]]</center>
+==== VCF tools ====
+[http://vcftools.sourceforge.net/ VCF tools] consists of a suite of very useful modules to manipulate [[#VCF|VCF]] files. For more information consult the [http://vcftools.sourceforge.net/docs.html documentation page].
+<br><br>
 == Example workflow ==
-=== Preprocessing of the annotation files ===
+This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed.
-=== Annotation of variants ===
+<br>
-=== Generation of summary statistics ===
+<center>[[#top|Top]]</center>
+=== Prerequisites ===
+Download the GENCODE annotation set (version 3c, hg19):
+ $ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz
+Download the human genome (hg19) in 2bit format.  This is used by [[#interval2sequences|interval2sequences]] to extract the genomic sequences for the entries specified in the annotation set:
+ $ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
+Download the SNP files in [[#VCF|VCF]] format and a third file that assigns each sample to a population:
+ $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
+ $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
+ $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel
+Extract variants on chromosome 22:
+ $ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz
+<br>
+<center>[[#top|Top]]</center>
+=== Preprocessing of the annotation file ===
+Decompress the annotation file:
+ $ gunzip gencode.v3c.annotation.GRCh37.gtf.gz
+Extract the coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
+ $ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
+Convert the GENCODE GTF file into [[#Interval|Interval]] format:
+ $ [[#gencode2interval|gencode2interval]] < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
+Retrieve the genomic sequences for the transcripts specified in the annotation file.
+ $ [[#interval2sequences|interval2sequences]] hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa
+<br>
+<center>[[#top|Top]]</center>
+=== Annotation of the SNPs ===
+Annotate the variants using [[#snpMapper|snpMapper]]
+ $ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf
+<br>
+<center>[[#top|Top]]</center>
+=== Modification of the VCF header line ===
+Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups).  This is done by using the following syntax: group:sample (e.g. CEU:NA0705).
+First get the old meta-data lines (every line that starts with '#'):
+ $ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf
+Store the annotated variants (all remaining lines) in a separate file:
+ $ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf
+Create the new meta-data lines:
+ $ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf
+Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'':
+ $ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf
+Compress the newly created [[#VCF|VCF]] file with the annotated variants:
+ $ bgzip ALL.2of4intersection.20100804.chr22.vcf
+Index the newly created [[#VCF|VCF]] file with the annotated variants:
+ $ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz
+<br>
+<center>[[#top|Top]]</center>
+=== Generation of summaries and images ===
+Generate gene and sample summaries for the annotated variants
+ $ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
+ Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt''
+Make a new directory to store the images and [[#VCF|VCF]] files for each gene.
+ $ mkdir ALL.2of4intersection.20100804.chr22
+Generate an image for each gene with at least one annotated variant.
+ $ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22
+Subset the [[#VCF|VCF]] file with the annotated variants by gene.
+ $ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22
+<br>
+<center>[[#top|Top]]</center>
 === Setting up the web server ===
+Make a TAR ball of the relevant files:
+* Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
+* File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
+* File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
+* Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
+* Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)
+ $ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
+    ALL.2of4intersection.20100804.chr22 \
+    ALL.2of4intersection.20100804.chr22.geneSummary.txt \
+    ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
+    ALL.2of4intersection.20100804.chr22.vcf.gz \
+    ALL.2of4intersection.20100804.chr22.vcf.gz.tbi
+Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file])
+ $ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR
+Unpack the TAR ball on the web server
+ $ tar -xvf ALL.2of4intersection.20100804.chr22.tar
+'''View the results''': [http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding Link to web server]

Contents

Data formats

VCF

Interval

List of programs

VAT Core Modules

snpMapper

indelMapper

svMapper

genericMapper

vcfSummary

vcf2images

vcfSubsetByGene

vcfModifyHeader

Auxiliary programs

gencode2interval

interval2sequences

External programs

bgzip/tabix

VCF tools

Example workflow

Prerequisites

Preprocessing of the annotation file

Annotation of the SNPs

Modification of the VCF header line

Generation of summaries and images

Setting up the web server
