GersteinInfo - User contributions [en]

VAT

2011-06-16T14:23:34Z

Lukas.habegger: /* gencode2interval */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data formats ==

<center>[[#top|Top]]</center>

=== VCF ===

The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project]. A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format:

'''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}'''

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:

[[File:VariantFormat.png|1000px]]

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs (generated by [[#snpMapper|snpMapper]]), the type can take on the following values: synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by [[#indelMapper|indelMapper]]), the type can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, and deletionNFS, where FS denotes 'frameshift' and NFS denotes 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).

'''Example 1''': A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;'''VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*'''

'''Example 2''': A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;'''VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H'''

'''Example 3''': A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;'''VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106'''

'''Example 4''': An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;'''VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L'''

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
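
The VA layout described above can be unpacked with a few string splits. The following is a minimal Python sketch and not part of VAT: the names VaEntry and parse_va are hypothetical, and the code assumes that gene names and transcript details contain neither ':' nor ','. The transcript-specific fields are kept as a raw list because their layout depends on the variant class and type.

```python
from typing import List, NamedTuple


class VaEntry(NamedTuple):
    """One comma-separated entry of the VA tag (hypothetical helper type)."""
    allele_number: int
    gene_name: str
    gene_id: str
    strand: str
    variant_type: str
    fraction: str                 # e.g. "1/5" (affected / total transcripts)
    transcript_fields: List[str]  # raw remainder; layout depends on the variant


def parse_va(info: str) -> List[VaEntry]:
    """Extract all VA entries from a VCF INFO string; return [] if there is no VA tag."""
    for item in info.split(";"):
        if not item.startswith("VA="):
            continue
        entries = []
        for raw in item[len("VA="):].split(","):
            fields = raw.split(":")
            entries.append(
                VaEntry(int(fields[0]), fields[1], fields[2], fields[3],
                        fields[4], fields[5], fields[6:])
            )
        return entries
    return []


# INFO field taken from Example 1 above
example_info = ("AA=A;AC=7;AN=118;DP=168;SF=2;"
                "VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:"
                "EPHB2-001:ENST00000400191:3165_3055_1019_K->*")
for entry in parse_va(example_info):
    print(entry.gene_name, entry.variant_type, entry.fraction)
```

Splitting on ';' first keeps commas in other INFO tags (such as SF=0,1 in Example 3) from being mistaken for VA entry separators.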

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, each sample name is prefixed by its group identifier, using the ':' character as a delimiter (group:sample).

 

<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

'''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting.

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096

In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model.

'''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
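
To make the column layout and the coordinate convention concrete, here is a minimal Python sketch that reads one line of an Interval file. It is not taken from libBIOS; the names Interval, parse_interval_line, and exonic_length are hypothetical, and tolerating a trailing comma in the sub-interval lists is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    strand: str
    start: int                            # zero-based, inclusive
    end: int                              # zero-based, exclusive
    sub_intervals: List[Tuple[int, int]]  # (start, end) pairs, same convention


def parse_interval_line(line: str) -> Interval:
    """Parse one tab-delimited line with the eight Interval columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError(f"expected 8 columns, found {len(cols)}")
    name, chrom, strand, start, end, count, sub_starts, sub_ends = cols
    # Assumption: empty items (e.g. from a trailing comma) are skipped.
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval count does not match the listed coordinates")
    return Interval(name, chrom, strand, int(start), int(end), list(zip(starts, ends)))


def exonic_length(interval: Interval) -> int:
    """Because the end coordinate is excluded, each length is simply end - start."""
    return sum(sub_end - sub_start for sub_start, sub_end in interval.sub_intervals)


# The MMP25-201 entry from the example file above
line = "\t".join([
    "ENSG00000008516|ENST00000325800|MMP25|MMP25-201", "chr16", "+",
    "3097544", "3105947", "4",
    "3097544,3100009,3100254,3105830",
    "3097548,3100145,3100546,3105947",
])
transcript = parse_interval_line(line)
gene_id, transcript_id, gene_name, transcript_name = transcript.name.split("|")
print(gene_name, transcript_name, exonic_length(transcript))
```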

 

== List of programs ==

=== VAT Core Modules ===

<center>[[#top|Top]]</center>

==== snpMapper ====

snpMapper is a program to annotate a set of SNPs in [[#VCF|VCF]] format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

'''Usage''':

snpMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SNPs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None
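
The SNP categories listed above can be illustrated by comparing the reference codon with the altered codon. The following Python sketch is not snpMapper's implementation; the function name classify_codon_change is hypothetical. It assumes the standard genetic code and codons given in transcript orientation, and it does not cover spliceOverlap, which depends on exon boundaries rather than on the codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Return the SNP type implied by a single codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "prematureStop"
    if ref_aa == "*":
        return "removedStop"
    return "nonsynonymous"


# A lysine codon changed into a stop codon (a K->* change)
print(classify_codon_change("AAG", "TAG"))
```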

 

<center>[[#top|Top]]</center>

==== indelMapper ====

indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

'''Usage''':

indelMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated indels in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None
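
The distinction between frameshift (FS) and non-frameshift (NFS) indels comes down to whether the length change is a multiple of three. A minimal Python sketch follows; it is not indelMapper's implementation, the function name classify_indel is hypothetical, and it assumes the indel lies entirely within the coding sequence, so spliceOverlap, startOverlap, and endOverlap are not detected.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify an indel from the VCF REF and ALT alleles by its length change."""
    length_change = len(alt) - len(ref)
    if length_change == 0:
        raise ValueError("REF and ALT have the same length; not an indel")
    kind = "insertion" if length_change > 0 else "deletion"
    # A length change divisible by three preserves the reading frame.
    frame = "NFS" if length_change % 3 == 0 else "FS"
    return kind + frame


# The two alternate alleles of Example 4 in the VCF section
for alt_allele in "T,TACA".split(","):
    print(alt_allele, classify_indel("TACAACAACA", alt_allele))
```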

 

<center>[[#top|Top]]</center>

==== svMapper ====

svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene.

'''Usage''':

svMapper <annotation.interval>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SVs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== genericMapper ====

genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

'''Usage''':

genericMapper <annotation.interval> <nameFeature>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs the annotated variants in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]].
** nameFeature - Specifies the type of the annotation feature (for example, promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
* ''Optional arguments''
** None
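
An overlap check of this kind has to reconcile two coordinate conventions: VCF positions are one-based, whereas [[#Interval|Interval]] coordinates are zero-based with the end excluded. The Python sketch below illustrates the conversion with a naive linear scan; it is not genericMapper's implementation (the intervalFind module uses containment sublists instead), and the function name overlapping_names and the feature names are hypothetical.

```python
from typing import Iterable, List, Tuple


def overlapping_names(pos: int, ref_length: int,
                      intervals: Iterable[Tuple[str, int, int]]) -> List[str]:
    """Return the names of intervals on the same chromosome that overlap a variant.

    pos        -- one-based position from the VCF POS column
    ref_length -- length of the REF allele
    intervals  -- (name, start, end) tuples, zero-based with the end excluded
    """
    variant_start = pos - 1                  # convert to zero-based
    variant_end = variant_start + ref_length  # exclusive end
    return [
        name
        for name, start, end in intervals
        if variant_start < end and start < variant_end
    ]


# Hypothetical features that touch at coordinate 200
features = [("promoterA", 100, 200), ("promoterB", 200, 300)]
print(overlapping_names(200, 1, features))
print(overlapping_names(201, 1, features))
```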

 

<center>[[#top|Top]]</center>

==== vcfSummary ====

vcfSummary is a program to aggregate annotated variants across genes and samples.

'''Usage''':

vcfSummary <file.vcf.gz> <annotation.interval>

* Inputs: None
* Outputs: Generates two output files. The first file, named ''file.geneSummary.txt'', contains the number of variants categorized by type for each gene. The second file, named ''file.sampleSummary.txt'', summarizes the number of variants categorized by type for each sample.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcf2images ====

vcf2images is a program to generate an image for each gene to visualize the effect of the annotated variants.

'''Usage''':

vcf2images <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where the images are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSubsetByGene ====

vcfSubsetByGene is a program to subset a [[#VCF|VCF]] file with annotated variants by gene.

'''Usage''':

vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] file for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where [[#VCF|VCF]] files are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfModifyHeader ====

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a [[#VCF|VCF]] file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including [[#vcfSummary|vcfSummary]]).

'''Usage''':

vcfModifyHeader <oldHeader.vcf> <groups.txt>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] header file.
* ''Required arguments''
** oldHeader.vcf - The meta lines of a [[#VCF|VCF]] file. It can be obtained by using the following command:
grep '^#' file.vcf > file.header.vcf
** groups.txt - A tab-delimited file that assigns each sample present in the [[#VCF|VCF]] file to a group/population. Here is a small sample file:
HG00629 CHS
HG00634 CHS
HG00635 CHS
HG00637 PUR
HG00638 PUR
HG00640 PUR
NA06984 CEU
NA06985 CEU
NA06986 CEU
NA06989 CEU
NA06994 CEU
* ''Optional arguments''
** None
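
The effect of the group assignment on the header line can be sketched in Python as follows. This is an illustration of the group:sample convention only, not the vcfModifyHeader implementation; the function names read_groups and add_groups_to_header are hypothetical, and leaving unassigned samples unchanged is an assumption.

```python
from typing import Dict, Iterable


def read_groups(lines: Iterable[str]) -> Dict[str, str]:
    """Map sample name -> group from lines of the form 'sample<TAB>group'."""
    groups = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            groups[parts[0]] = parts[1]
    return groups


def add_groups_to_header(header_line: str, groups: Dict[str, str]) -> str:
    """Prefix each sample name in the '#CHROM' header line with 'group:'."""
    columns = header_line.rstrip("\n").split("\t")
    # The first nine columns (#CHROM ... FORMAT) are fixed; samples follow.
    fixed, samples = columns[:9], columns[9:]
    renamed = [
        f"{groups[sample]}:{sample}" if sample in groups else sample
        for sample in samples
    ]
    return "\t".join(fixed + renamed)


groups = read_groups(["HG00629\tCHS", "HG00637\tPUR", "NA06984\tCEU"])
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "HG00629", "HG00637", "NA06984"])
print(add_groups_to_header(header, groups))
```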

 

=== Auxiliary programs ===

<center>[[#top|Top]]</center>

==== gencode2interval ====

gencode2interval converts a GENCODE annotation file (in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format) to the [[#Interval|Interval]] format.

'''Usage''':

gencode2interval

* Inputs: Takes a GENCODE annotation file in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format from STDIN
* Outputs: Outputs the GENCODE annotation file in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Note: To obtain the coding sequences of the elements with gene_type ''protein_coding'' and transcript_type ''protein_coding'' the following command should be used:

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
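
For readers who prefer to see the filter criteria spelled out, the selection performed by the awk/grep pipeline above corresponds roughly to the following Python predicate. This is a sketch only; the function name is_protein_coding_cds is hypothetical, and it checks the attribute column (column 9) rather than the whole line, which is assumed to be equivalent for well-formed GTF input.

```python
def is_protein_coding_cds(gtf_line: str) -> bool:
    """True for CDS records from HAVANA or ENSEMBL whose gene_type and
    transcript_type are both protein_coding."""
    cols = gtf_line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return False  # comment line or malformed record
    source, feature, attributes = cols[1], cols[2], cols[8]
    return (
        source in ("HAVANA", "ENSEMBL")
        and feature == "CDS"
        and 'gene_type "protein_coding"' in attributes
        and 'transcript_type "protein_coding"' in attributes
    )


# Made-up record, used only to exercise the predicate
record = "\t".join([
    "chr1", "HAVANA", "CDS", "100", "200", ".", "+", "0",
    'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; '
    'transcript_type "protein_coding";',
])
print(is_protein_coding_cds(record))
```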

 

<center>[[#top|Top]]</center>

==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set in [[#Interval|Interval]] format.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== External programs ===

<center>[[#top|Top]]</center>

==== bgzip/tabix ====

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information consult the [http://samtools.sourceforge.net/tabix.shtml tabix documentation page].

 

<center>[[#top|Top]]</center>

==== VCF tools ====

[http://vcftools.sourceforge.net/ VCF tools] consists of a suite of very useful modules to manipulate [[#VCF|VCF]] files. For more information consult the [http://vcftools.sourceforge.net/docs.html documentation page].

 

== Example workflow ==

This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed.

 

<center>[[#top|Top]]</center>

=== Prerequisites ===

Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by [[#interval2sequences|interval2sequences]] to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP files in [[#VCF|VCF]] format and a third file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz

 

<center>[[#top|Top]]</center>

=== Preprocessing of the annotation file ===

Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
$ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf

Convert the GENCODE GTF file into [[#Interval|Interval]] format:
$ [[#gencode2interval|gencode2interval]] < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Retrieve the exonic sequences for the transcripts specified in the annotation file:
$ [[#interval2sequences|interval2sequences]] hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa

 

<center>[[#top|Top]]</center>

=== Annotation of the SNPs ===

Annotate the variants using [[#snpMapper|snpMapper]]:

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf

 

<center>[[#top|Top]]</center>

=== Modification of the VCF header line ===

Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705).

First get the old meta-data lines:
$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:
$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf

Create the new meta-data lines:
$ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf

Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'':
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf

Compress the newly created [[#VCF|VCF]] file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created [[#VCF|VCF]] file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz

 

<center>[[#top|Top]]</center>

=== Generation of summaries and images ===

Generate gene and sample summaries for the annotated variants
$ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt''

Make a new directory to store the images and [[#VCF|VCF]] files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant.
$ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Subset the [[#VCF|VCF]] file with the annotated variants by gene.
$ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

 

<center>[[#top|Top]]</center>

=== Setting up the web server ===

Make a TAR ball of the relevant files:

* Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
* File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
* File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
* Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
* Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)

$ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
ALL.2of4intersection.20100804.chr22 \
ALL.2of4intersection.20100804.chr22.geneSummary.txt \
ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
ALL.2of4intersection.20100804.chr22.vcf.gz \
ALL.2of4intersection.20100804.chr22.vcf.gz.tbi

Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file])
$ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR

Unpack the TAR ball on the web server
$ tar -xvf ALL.2of4intersection.20100804.chr22.tar

'''View the results''': [http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding Link to web server]

VAT/download

2011-06-15T15:33:30Z

Lukas.habegger: /* Setup of the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. A TAR ball of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)]. 
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow the build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure needs an environment variable named DATAPATH to be set to the directory of thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add the jar file path to the CLASSPATH environment variable.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT contains a configuration file ('''vatConfirgurationTemplate.txt'''), which defines a set of variables that are used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt
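
The layout of the configuration file (whitespace-delimited name/value pairs, with empty lines and '//' lines ignored) can be read with a few lines of Python. This is a sketch for checking a configuration by hand and not the parser used by VAT; the function name parse_vat_config is hypothetical.

```python
from typing import Dict, Iterable


def parse_vat_config(lines: Iterable[str]) -> Dict[str, str]:
    """Collect name/value pairs, skipping empty lines and '//' comment lines."""
    settings = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        # Split on the first run of spaces or tabs: name, then value.
        parts = stripped.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


template = """
// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// CGI base URL (where the CGIs are located)
WEB_URL_CGI\thttp://webserver.org/path
"""
config = parse_vat_config(template.splitlines())
print(config["TABIX_DIR"])
```

In practice the lines would come from the file named by the VAT_CONFIG_FILE environment variable, for example via open(os.environ["VAT_CONFIG_FILE"]).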

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Pre-processed annotation sets'''
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
** The construction of the vat_cgi URL requires the following item/value pairs (for example: http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding):
*** mode=process
*** dataSet=SampleData
*** annotationSet=nameOfAnnotationSet
*** type=coding or nonCoding
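
The item/value pairs listed above can be assembled into a vat_cgi URL programmatically. The Python sketch below uses the standard library's urllib.parse.urlencode; the function name build_vat_url is hypothetical, and the base URL is assumed to be the value of WEB_URL_CGI from the configuration file.

```python
from urllib.parse import urlencode


def build_vat_url(cgi_base_url: str, data_set: str, annotation_set: str,
                  variant_type: str = "coding") -> str:
    """Build the vat_cgi URL for a processed data set."""
    if variant_type not in ("coding", "nonCoding"):
        raise ValueError("type must be 'coding' or 'nonCoding'")
    query = urlencode({
        "mode": "process",
        "dataSet": data_set,
        "annotationSet": annotation_set,
        "type": variant_type,
    })
    return f"{cgi_base_url.rstrip('/')}/vat_cgi?{query}"


print(build_vat_url("http://dynamic.gersteinlab.org/people/lh372",
                    "ALL.2of4intersection.20100804.chr22", "gencode3c"))
```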

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/download

2011-06-15T15:33:09Z

Lukas.habegger: /* Setup of the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT uses VARNA to visualize the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. Tarballs of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires the DATAPATH environment variable to be set to the directory containing the thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.
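
The three environment variables could be set as in the following sketch. Both paths are placeholders, and the <code>exe</code> and <code>data_tables</code> subdirectory names are assumptions based on a typical RNAstructure build; please verify them against the downloaded package.

```shell
# Assumptions: both paths are placeholders, and the exe/ and data_tables/
# subdirectories reflect a typical RNAstructure build; check your copy.
RNASTRUCTURE_HOME=/path/to/RNAstructure
VARNA_JAR=/path/to/VARNA.jar

export PATH="$RNASTRUCTURE_HOME/exe:$PATH"
export DATAPATH="$RNASTRUCTURE_HOME/data_tables"
# Keep any existing CLASSPATH entries; the ${VAR:+...} form avoids a
# dangling ':' when CLASSPATH was previously unset.
export CLASSPATH="$VARNA_JAR${CLASSPATH:+:$CLASSPATH}"
```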

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>
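
To illustrate the file format described above, the following sketch looks up a single value with awk. <code>vat_config_get</code> is a hypothetical helper and does not reflect how the VAT programs themselves parse the file; it assumes that values contain no whitespace.

```shell
# Hypothetical helper (not part of VAT): print the value stored under a name.
# $1 = variable name, $2 = configuration file.
vat_config_get() {
    awk -v key="$1" '
        /^[ \t]*$/ { next }            # skip empty lines
        /^\/\//    { next }            # skip lines starting with "//"
        $1 == key  { print $2; exit }  # fields are space- or tab-delimited
    ' "$2"
}
```

For example, <code>vat_config_get TABIX_DIR "$VAT_CONFIG_FILE"</code> would print the Tabix directory once VAT_CONFIG_FILE has been set.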

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets'''
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
** The vat_cgi URL is constructed from the following item/value pairs (for example: http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding):
*** mode=process
*** dataSet=SampleData
*** annotationSet=nameOfAnnotationSet
*** type=coding or nonCoding

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]
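
The download links above follow a regular naming pattern, so the two URLs for a given annotation set can be generated rather than copied by hand. The sketch below relies only on that pattern; <code>gencode_urls</code> is a hypothetical helper, not part of VAT.

```shell
# Hypothetical helper (not part of VAT): print the two download URLs for one
# annotation set. $1 = GENCODE version (3b, 3c, 4, 5, 6 or 7),
# $2 = optional "miRNA" to select the miRNA set instead of the CDS set.
gencode_urls() {
    base=http://homes.gersteinlab.org/people/lh372/VAT
    stem="gencode$1${2:+.$2}"
    printf '%s/%s.interval\n%s/%s.fa\n' "$base" "$stem" "$base" "$stem"
}

gencode_urls 7 miRNA
```

Each printed line can then be passed to a download tool of your choice.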

VAT/download

2011-06-15T15:32:27Z

Lukas.habegger: /* Setup of the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT uses VARNA to visualize the RNA secondary structures.
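
Several of the tools above must be part of the PATH. A quick way to check this is sketched below; <code>missing_tools</code> is a hypothetical helper, and the executable names passed to it are assumptions that may need adjusting for your installation.

```shell
# Hypothetical helper (not part of VAT): print each named program that
# cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: these executable names are typical; adjust them to your install.
missing_tools blat tabix bgzip
```

If the call prints nothing, all of the named programs were found.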
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. Tarballs of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires the DATAPATH environment variable to be set to the directory containing the thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets'''
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
** The vat_cgi URL is constructed from the following item/value pairs (for example: http://dynamic.gersteinlab.org/people/lh372/vat_cgi?'''mode=process'''&'''dataSet=ALL.2of4intersection.20100804.chr22'''&'''annotationSet=gencode3c'''&'''type=coding'''):
*** mode=process
*** dataSet=SampleData
*** annotationSet=nameOfAnnotationSet
*** type=coding or nonCoding
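
Before opening the vat_cgi URL, it can help to confirm that the web data directory contains the entries listed above. The following sketch does this for everything except the annotation sets, whose file names depend on the sets chosen; <code>missing_web_data</code> is a hypothetical helper, not part of VAT.

```shell
# Hypothetical helper (not part of VAT): list expected entries that are
# missing from the web data directory.
# $1 = web data directory (WEB_DATA_DIR), $2 = data set name.
missing_web_data() {
    for entry in tabix bgzip check.png processing.gif \
                 "$2" "$2.vcf.gz" "$2.vcf.gz.tbi" \
                 "$2.sampleSummary.txt" "$2.geneSummary.txt"; do
        [ -e "$1/$entry" ] || echo "$entry"
    done
}
```

For example, <code>missing_web_data /path/to/public_html/path/to/VAT SampleData</code> prints one line per missing entry and nothing if the directory is complete.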

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/download

2011-06-15T15:30:54Z

Lukas.habegger: /* Setup of the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT uses VARNA to visualize the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. Tarballs of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires the DATAPATH environment variable to be set to the directory containing the thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>
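
Because the prefix is set to the source directory itself, the installed programs should end up beneath it. The sketch below assumes the usual autotools layout, in which executables are placed in <code>bin</code> under the prefix; please check where <code>make install</code> actually put them before relying on it.

```shell
# Assumption: "make install" with --prefix=`pwd` placed the executables in
# the bin/ subdirectory of the VAT source directory (the autotools default).
VAT_PREFIX=/path/to/vat-x.x.x
export PATH="$VAT_PREFIX/bin:$PATH"
```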

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt
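The configuration format described above (name/value pairs separated by spaces or tabs, with empty lines and '//' lines ignored) can be read with a few lines of Python. This is only an illustrative sketch and not part of VAT; the function names <code>parse_vat_config</code> and <code>load_vat_config</code> are hypothetical.

```python
import os


def parse_vat_config(text):
    """Parse name/value pairs, skipping empty lines and '//' comment lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        # Name and value are separated by spaces or tabs; split only once
        # so that the value itself is kept intact.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        config[name] = value
    return config


def load_vat_config(path=None):
    """Read the file named by VAT_CONFIG_FILE unless a path is given."""
    path = path or os.environ["VAT_CONFIG_FILE"]
    with open(path) as handle:
        return parse_vat_config(handle.read())


# Small in-memory example using values from the template above.
example = """
// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

WEB_URL_CGI\thttp://webserver.org/path
"""
print(parse_vat_config(example))
```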

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets'''
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
** The construction of the vat_cgi URL requires the following item/value pairs:
*** mode=process
*** dataSet=SampleData
*** annotationSet=nameOfAnnotationSet
*** type=coding or nonCoding
http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding
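The item/value pairs listed above can be assembled into a vat_cgi URL programmatically. The following Python sketch is an illustration only; the helper name <code>build_vat_cgi_url</code> is hypothetical, and it assumes that the base URL is the location where vat_cgi is served (the WEB_URL_CGI value in the configuration file).

```python
from urllib.parse import urlencode


def build_vat_cgi_url(base_url, data_set, annotation_set, variant_type):
    """Build a vat_cgi URL from the four item/value pairs."""
    if variant_type not in ("coding", "nonCoding"):
        raise ValueError("type must be 'coding' or 'nonCoding'")
    query = urlencode({
        "mode": "process",
        "dataSet": data_set,
        "annotationSet": annotation_set,
        "type": variant_type,
    })
    return f"{base_url.rstrip('/')}/vat_cgi?{query}"


url = build_vat_cgi_url(
    "http://dynamic.gersteinlab.org/people/lh372",
    "ALL.2of4intersection.20100804.chr22",
    "gencode3c",
    "coding",
)
print(url)
```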

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/download

2011-06-15T15:24:20Z

Lukas.habegger: /* Setup of the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. A TAR ball of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)]. 
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires an environment variable named DATAPATH, which must point to the directory containing the thermodynamic parameter files distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets'''
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/dataSets

2011-06-14T18:25:13Z

Lukas.habegger:

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data sets ==

<center>[[#top|Top]]</center>

=== 1000 Genomes Pilot Project: Low coverage samples ===

* Data files
** Source: pilot_data, release: 2010_07, FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/
** Indels
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/CEU.low_coverage.2010_07.indel.genotypes.vcf.gz CEU.low_coverage.2010_07.indel.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/JPTCHB.low_coverage.2010_07.indel.genotypes.vcf.gz JPTCHB.low_coverage.2010_07.indel.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/YRI.low_coverage.2010_07.indel.genotypes.vcf.gz YRI.low_coverage.2010_07.indel.genotypes.vcf.gz]
** SNPs
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/CEU.low_coverage.2010_07.genotypes.vcf.gz CEU.low_coverage.2010_07.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/CHBJPT.low_coverage.2010_07.genotypes.vcf.gz CHBJPT.low_coverage.2010_07.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/YRI.low_coverage.2010_07.genotypes.vcf.gz YRI.low_coverage.2010_07.genotypes.vcf.gz]
* Annotation file: [ftp://ftp.sanger.ac.uk/pub/gencode/release_3b/gencode.v3b.annotation.NCBI36.gtf.gz GENCODE (version 3b, hg18)] using CDS elements where ''gene_type = protein_coding'' and ''transcript_type = protein_coding''
* Results: [http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=1000genomes_lowCoverage&annotationSet=gencode3b&type=coding VAT]

 

<center>[[#top|Top]]</center>

=== 1000 Genomes Project, Phase I, chr22, SNP calls ===

* Data files
** Source: release: 20100804, FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/
** SNPs: [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ALL.2of4intersection.20100804.genotypes.vcf.gz]
* Annotation file: [ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz GENCODE (version 3c, hg19)] using CDS elements where ''gene_type = protein_coding'' and ''transcript_type = protein_coding''
* Results: [http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding VAT]
* [http://info.gersteinlab.org/VAT#Example_workflow Detailed workflow]

 

<center>[[#top|Top]]</center>

== Pre-processed GENCODE annotation sets ==

The pre-processed GENCODE annotation sets can be downloaded [http://info.gersteinlab.org/VAT/download#Download_of_pre-processed_annotation_sets here].

VAT

2011-06-14T18:23:54Z

Lukas.habegger: /* Setting up the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data formats ==

<center>[[#top|Top]]</center>

=== VCF ===

The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project]. A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format:

'''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}'''

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:

[[File:VariantFormat.png|1000px]]

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).

'''Example 1''': A SNP introducing a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;'''VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*'''

'''Example 2''': A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;'''VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H'''

'''Example 3''': A SNP causing a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;'''VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106'''

'''Example 4''': An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;'''VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L'''

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
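To make the layout of the VA tag concrete, the following Python sketch splits the INFO field of Example 3 into its annotation entries. It is an illustration, not VAT code: the function name <code>parse_va</code> is hypothetical, and the sketch assumes that gene and transcript names contain neither ':' nor ','. Because the per-transcript layout depends on the variant class and type (see the table above), the transcript part is returned as an uninterpreted list of fields.

```python
def parse_va(info):
    """Return the annotation entries found in the VA tag of a VCF INFO string."""
    va_value = None
    for field in info.split(";"):
        if field.startswith("VA="):
            va_value = field[len("VA="):]
            break
    if va_value is None:
        return []

    entries = []
    # Multiple annotation entries are comma-separated.
    for raw in va_value.split(","):
        parts = raw.split(":")
        affected, total = parts[5].split("/")
        entries.append({
            "allele": int(parts[0]),          # 1 = first alternate allele
            "gene_name": parts[1],
            "gene_id": parts[2],
            "strand": parts[3],
            "type": parts[4],
            "affected": int(affected),        # transcripts affected
            "total": int(total),              # transcripts in the gene
            "transcript_fields": parts[6:],   # layout depends on the type
        })
    return entries


# INFO field of Example 3 (wiki bold markers removed).
info_example = (
    "AA=N;AC=157;AN=240;DP=49;SF=0,1;"
    "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
    "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
    "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
    "TMEM8B-001:ENST00000450762:2106"
)
for entry in parse_va(info_example):
    print(entry["gene_name"], entry["type"], entry["transcript_fields"])
```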

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, each sample name is prefixed by a group identifier using the ':' character as a delimiter.

 

<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+")
5. Interval end (with respect to the "+")
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+", comma-delimited)
8. Sub-interval ends (with respect to the "+", comma-delimited)

'''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting.

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096

In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model.

'''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
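The eight columns and the zero-based, end-exclusive coordinate convention can be illustrated with a short Python sketch. It is not part of VAT or libBIOS; the names <code>Interval</code>, <code>parse_interval_line</code>, <code>exonic_length</code> and <code>overlaps</code> are hypothetical, and the overlap test is a plain comparison rather than the containment-sublist algorithm used by the intervalFind module.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    strand: str
    start: int
    end: int
    sub_starts: List[int]
    sub_ends: List[int]


def parse_interval_line(line):
    """Parse one tab-delimited line of an Interval file."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError("expected 8 tab-delimited columns")
    count = int(cols[5])
    sub_starts = [int(x) for x in cols[6].split(",") if x]
    sub_ends = [int(x) for x in cols[7].split(",") if x]
    if len(sub_starts) != count or len(sub_ends) != count:
        raise ValueError("sub-interval count does not match columns 7 and 8")
    return Interval(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                    sub_starts, sub_ends)


def exonic_length(interval):
    """Sum of sub-interval lengths; end is excluded, so length = end - start."""
    return sum(e - s for s, e in zip(interval.sub_starts, interval.sub_ends))


def overlaps(interval, query_start, query_end):
    """True if the half-open query [query_start, query_end) overlaps the interval."""
    return interval.start < query_end and query_start < interval.end


# Fourth line of the example file above, with explicit tab separators.
example_line = "\t".join([
    "ENSG00000008516|ENST00000325800|MMP25|MMP25-201", "chr16", "+",
    "3097544", "3105947", "4",
    "3097544,3100009,3100254,3105830",
    "3097548,3100145,3100546,3105947",
])
transcript = parse_interval_line(example_line)
gene_id, transcript_id, gene_name, transcript_name = transcript.name.split("|")
print(gene_name, exonic_length(transcript))
```

Because the end coordinate is excluded, a query starting exactly at the interval end (3105947 in this example) does not overlap the interval.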

 

== List of programs ==

=== VAT Core Modules ===

<center>[[#top|Top]]</center>

==== snpMapper ====

snpMapper is a program to annotate a set of SNPs in [[#VCF|VCF]] format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

'''Usage''':

snpMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SNPs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None
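The four coding effects reported for SNPs (synonymous, nonsynonymous, prematureStop, removedStop) can be illustrated with a small Python sketch that compares a reference codon with its mutated counterpart using the standard genetic code. This is not the snpMapper implementation; the function name <code>classify_snp</code> is hypothetical, and the sketch ignores strand handling, transcript coordinates and the spliceOverlap type.

```python
def build_codon_table():
    """Standard genetic code, with codons enumerated in T, C, A, G order."""
    bases = "TCAG"
    amino_acids = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
                   "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
    table = {}
    index = 0
    for b1 in bases:
        for b2 in bases:
            for b3 in bases:
                table[b1 + b2 + b3] = amino_acids[index]
                index += 1
    return table


CODON_TABLE = build_codon_table()


def classify_snp(ref_codon, offset, alt_base):
    """Classify a single-base substitution at position `offset` (0-2) of a codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "prematureStop"
    elif ref_aa == "*":
        effect = "removedStop"
    else:
        effect = "nonsynonymous"
    return effect, f"{ref_aa}->{alt_aa}"


print(classify_snp("AAA", 0, "T"))  # lysine codon changed to the stop codon TAA
print(classify_snp("CGC", 1, "A"))  # arginine codon changed to histidine
```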

 

<center>[[#top|Top]]</center>

==== indelMapper ====

indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

'''Usage''':

indelMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated indels in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== svMapper ====

svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene.

'''Usage''':

svMapper <annotation.interval>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SVs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== genericMapper ====

genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

'''Usage''':

genericMapper <annotation.interval> <nameFeature>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs the annotated variants in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]].
** nameFeature - Specifies the type of the annotation feature (for example promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSummary ====

vcfSummary is a program to aggregate annotated variants across genes and samples.

'''Usage''':

vcfSummary <file.vcf.gz> <annotation.interval>

* Inputs: None
* Outputs: Generates two output files. The first file, named ''file.geneSummary.txt'', contains the number of variants categorized by type for each gene. The second file, named ''file.sampleSummary.txt'', summarizes the number of variants categorized by type for each sample.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcf2images ====

vcf2images is a program to generate an image for each gene to visualize the effect of the annotated variants.

'''Usage''':

vcf2images <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where the images are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSubsetByGene ====

vcfSubsetByGene is a program to subset a [[#VCF|VCF]] file with annotated variants by gene.

'''Usage''':

vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] file for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where [[#VCF|VCF]] files are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfModifyHeader ====

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a [[#VCF|VCF]] file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including [[#vcfSummary|vcfSummary]]).

'''Usage''':

vcfModifyHeader <oldHeader.vcf> <groups.txt>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] header file.
* ''Required arguments''
** oldHeader.vcf - The meta lines of a [[#VCF|VCF]] file. It can be obtained by using the following command:
grep '#' file.vcf > file.header.vcf
** groups.txt - A tab-delimited file that assigns each sample present in the [[#VCF|VCF]] file to a group/population. Here is a small sample file:
HG00629 CHS
HG00634 CHS
HG00635 CHS
HG00637 PUR
HG00638 PUR
HG00640 PUR
NA06984 CEU
NA06985 CEU
NA06986 CEU
NA06989 CEU
NA06994 CEU
* ''Optional arguments''
** None
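
The effect on the header line can be illustrated with a short awk sketch that applies the group:sample syntax described in the example workflow. This is not the vcfModifyHeader implementation: the function name ''prefix_samples_with_group'' is hypothetical, and the sketch assumes that the sample names occupy column 10 onward of the ''#CHROM'' line, as in standard VCF files.

```shell
# Hypothetical helper (not part of VAT): rewrite each sample name on the
# #CHROM line as group:sample, using a two-column tab-delimited groups file.
prefix_samples_with_group() {
  # $1: groups file (sample<TAB>group), $2: file with the VCF meta lines
  awk -F '\t' -v OFS='\t' '
    NR == FNR { group[$1] = $2; next }   # first file: remember the group of each sample
    /^#CHROM/ {                          # column header line
      for (i = 10; i <= NF; i++)         # assumption: samples start at column 10
        if ($i in group) $i = group[$i] ":" $i
    }
    { print }                            # all other meta lines pass through unchanged
  ' "$1" "$2"
}
```

Samples that are missing from the groups file are left untouched in this sketch, which makes them easy to spot in the output.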

 

=== Auxiliary programs ===

<center>[[#top|Top]]</center>

==== gencode2interval ====

gencode2interval converts a GENCODE annotation file (in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format) to the [[#Interval|Interval]] format.

'''Usage''':

gencode2interval

* Inputs: Takes a GENCODE annotation file in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format from STDIN
* Outputs: Outputs the GENCODE annotation file in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Note: To obtain the coding sequences of the elements with gene_type ''protein_coding'' and transcript_type ''protein_coding'' the following command should be used:

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
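
To see what the filter keeps, it can be wrapped in a function and run on a few toy records. This is only a sketch: the function name ''filter_cds_protein_coding'' is hypothetical and the three GTF records are made up for illustration.

```shell
# Hypothetical wrapper around the filter shown above; reads GTF records from STDIN.
filter_cds_protein_coding() {
  awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ { print }' |
    grep 'gene_type "protein_coding"' |
    grep 'transcript_type "protein_coding"'
}

# Toy input (made up): a protein-coding CDS, an exon, and a pseudogene CDS.
# Only the first record passes all three filters.
printf 'chr22\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; gene_type "protein_coding"; transcript_type "protein_coding";\nchr22\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; transcript_type "protein_coding";\nchr22\tENSEMBL\tCDS\t300\t400\t.\t-\t0\tgene_id "G2"; gene_type "pseudogene"; transcript_type "pseudogene";\n' |
  filter_cds_protein_coding
```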

 

<center>[[#top|Top]]</center>

==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set in [[#Interval|Interval]] format.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== External programs ===

<center>[[#top|Top]]</center>

==== bgzip/tabix ====

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information consult the [http://samtools.sourceforge.net/tabix.shtml tabix documentation page].

 

<center>[[#top|Top]]</center>

==== VCF tools ====

[http://vcftools.sourceforge.net/ VCF tools] consists of a suite of very useful modules to manipulate [[#VCF|VCF]] files. For more information consult the [http://vcftools.sourceforge.net/docs.html documentation page].

 

== Example workflow ==

This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed.

 

<center>[[#top|Top]]</center>

=== Prerequisites ===

Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by [[#interval2sequences|interval2sequences]] to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP files in [[#VCF|VCF]] format and a third file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz

 

<center>[[#top|Top]]</center>

=== Preprocessing of the annotation file ===

Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
$ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf

Convert the GENCODE GTF file into [[#Interval|Interval]] format:
$ [[#gencode2interval|gencode2interval]] < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Retrieve the exonic sequences (introns removed) for the transcripts specified in the annotation file:
$ [[#interval2sequences|interval2sequences]] hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa

 

<center>[[#top|Top]]</center>

=== Annotation of the SNPs ===

Annotate the variants using [[#snpMapper|snpMapper]]:

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf
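
A quick sanity check after this step is to count how many records received an annotation. The sketch below is not part of VAT: the function name ''count_annotated'' is hypothetical, and it assumes that the annotation appears as a ''VA='' key in the INFO field (column 8) of an uncompressed VCF file.

```shell
# Hypothetical helper (not part of VAT): count the variant records whose
# INFO field (column 8) carries a VA tag.
count_annotated() {
  # $1: uncompressed annotated VCF file
  awk -F '\t' '!/^#/ && $8 ~ /(^|;)VA=/ { n++ } END { print n + 0 }' "$1"
}

# Usage (file name from the step above):
# count_annotated ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf
```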

 

<center>[[#top|Top]]</center>

=== Modification of the VCF header line ===

Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705).

First get the old meta-data lines:
$ grep "#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:
$ grep "#" -v ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf

Create the new meta-data lines:
$ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf

Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'':
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf

Compress the newly created [[#VCF|VCF]] file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created [[#VCF|VCF]] file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz
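
The split into meta-data lines and variant lines can also be written with an anchored pattern. The sketch below is a variation on the commands above, not a replacement mandated by VAT: the function name ''split_vcf'' is hypothetical. The pattern ''^#'' only matches lines that begin with '#', so a variant line that happens to contain a '#' elsewhere stays with the variants.

```shell
# Hypothetical helper: split a VCF file into meta-data lines and variant lines.
split_vcf() {
  # $1: input VCF, $2: output file for meta-data lines, $3: output file for variant lines
  grep '^#' "$1" > "$2"
  grep -v '^#' "$1" > "$3"
}

# Usage sketch: concatenating the two parts reproduces the input file.
# split_vcf in.vcf header.vcf variants.vcf
# cat header.vcf variants.vcf | cmp - in.vcf
```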

 

<center>[[#top|Top]]</center>

=== Generation of summaries and images ===

Generate gene and sample summaries for the annotated variants:
$ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt''
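
The names of the two summary files follow from the name of the input file. As a small sketch (the function name ''summary_names'' is hypothetical, and the naming rule is inferred from the example above rather than taken from the vcfSummary source code), they can be derived with shell parameter expansion:

```shell
# Hypothetical helper: print the expected summary file names for a
# bgzip-compressed VCF file by replacing the .vcf.gz suffix.
summary_names() {
  # $1: name of the compressed VCF file
  printf '%s\n' "${1%.vcf.gz}.geneSummary.txt" "${1%.vcf.gz}.sampleSummary.txt"
}

summary_names ALL.2of4intersection.20100804.chr22.vcf.gz
```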

Make a new directory to store the images and [[#VCF|VCF]] files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant.
$ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Subset the [[#VCF|VCF]] file with the annotated variants by gene.
$ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

 

<center>[[#top|Top]]</center>

=== Setting up the web server ===

Make a TAR ball of the relevant files:

* Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
* File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
* File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
* Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
* Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)

$ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
ALL.2of4intersection.20100804.chr22 \
ALL.2of4intersection.20100804.chr22.geneSummary.txt \
ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
ALL.2of4intersection.20100804.chr22.vcf.gz \
ALL.2of4intersection.20100804.chr22.vcf.gz.tbi
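
Before creating the TAR ball it can help to confirm that all five items listed above exist. The following sketch is not part of VAT; the function name ''missing_bundle_files'' is hypothetical, and it simply prints every expected item that is absent from the current directory.

```shell
# Hypothetical helper: list the items of a data set bundle that are missing.
missing_bundle_files() {
  # $1: data set prefix, e.g. ALL.2of4intersection.20100804.chr22
  [ -d "$1" ] || printf '%s\n' "$1"                 # directory with images and per-gene VCF files
  for suffix in geneSummary.txt sampleSummary.txt vcf.gz vcf.gz.tbi; do
    [ -f "$1.$suffix" ] || printf '%s\n' "$1.$suffix"
  done
}

# Usage: no output means that the bundle is complete.
# missing_bundle_files ALL.2of4intersection.20100804.chr22
```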

Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file])
$ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR

Unpack the TAR ball on the web server
$ tar -xvf ALL.2of4intersection.20100804.chr22.tar

'''View the results''': [http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding Link to web server]

VAT/download

2011-06-14T18:23:19Z

Lukas.habegger: /* Executables */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. TAR balls of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)]. 
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow the build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure needs an environment variable named DATAPATH to be set to the directory of thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add the path of the jar file to the CLASSPATH environment variable.
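
In a Bourne-style shell the three variables could be set as sketched below. All paths and the jar file name are placeholders (assumptions), so they must be replaced with the actual locations on the system.

```shell
# Placeholder paths (assumptions): adjust to the local installation.
export PATH="/path/to/RNAstructure/exe:$PATH"            # directory with the RNAstructure executables
export DATAPATH="/path/to/RNAstructure/data_tables"      # thermodynamic parameter files
# Append any existing CLASSPATH only if it is already set and non-empty.
export CLASSPATH="/path/to/VARNA.jar${CLASSPATH:+:$CLASSPATH}"
```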

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT contains a configuration file ('''vatConfirgurationTemplate.txt'''), which contains a set of variables that are used by a number of different programs. The name/value pairs are space or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt
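
The file format described above (name/value pairs, with empty lines and lines starting with '//' ignored) can be read with a few lines of awk, for example to check a value before running the programs. This is only a sketch and not how VAT itself parses the file: the function name ''vat_config_get'' is hypothetical, and values are assumed to contain no whitespace.

```shell
# Hypothetical helper (not part of VAT): print the value of one variable
# from a configuration file in the format described above.
vat_config_get() {
  # $1: configuration file, $2: variable name
  awk -v key="$2" '
    /^[ \t]*$/ { next }            # skip empty lines
    /^\/\//    { next }            # skip comment lines starting with //
    $1 == key  { print $2; exit }  # fields are separated by spaces or tabs
  ' "$1"
}

# Usage: vat_config_get "$VAT_CONFIG_FILE" TABIX_DIR
```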

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets''' (gencode3b.interval, gencode3b.fa, gencode3c.interval, gencode3c.fa)
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/download

2011-06-14T18:20:15Z

Lukas.habegger:

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. TAR balls of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_32bit.zip vat-1.0.0_32bit.zip] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0_64bit.zip vat-1.0.0_64bit.zip] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)]. 
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow the build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure needs an environment variable named DATAPATH to be set to the directory of thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add the path of the jar file to the CLASSPATH environment variable.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT contains a configuration file ('''vatConfirgurationTemplate.txt'''), which contains a set of variables that are used by a number of different programs. The name/value pairs are space or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets''' (gencode3b.interval, gencode3b.fa, gencode3c.interval, gencode3c.fa)
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/download

2011-06-14T18:18:20Z

Lukas.habegger: /* Source code */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. Tarballs of libBIOS and VAT can be downloaded here:
* [http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz libbios-1.0.0.tar.gz] - Initial upload (6/14/2011)
* [http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz vat-1.0.0.tar.gz] - Initial upload (6/14/2011)

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://archive.gersteinlab.org/proj/VAT/src VAT-0.5-UNIX.tar.gz] - 64bit version

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires an environment variable named DATAPATH, set to the directory containing the thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.
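
The three environment variables could be set along the following lines, for example in a shell startup file. This is only a sketch: the install locations, the '''exe''' and '''data_tables''' subdirectory names, and the jar file name are assumptions and should be replaced with the paths of the local installation.

```shell
# Hypothetical install locations; adjust to the local setup.
RNASTRUCTURE_HOME=/path/to/RNAstructure
VARNA_JAR=/path/to/VARNA/VARNA.jar

# Make the RNAstructure executables visible to VAT
# (assumes they were built into the exe/ subdirectory).
export PATH="$RNASTRUCTURE_HOME/exe:$PATH"

# Directory with the thermodynamic parameter files
# (assumed here to be data_tables/ inside the package).
export DATAPATH="$RNASTRUCTURE_HOME/data_tables"

# Prepend the VARNA jar to the Java class path, keeping any existing entries.
export CLASSPATH="$VARNA_JAR${CLASSPATH:+:$CLASSPATH}"
```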

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt
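
As a quick way to check that the file is picked up and follows the format described above, the variable can be exported and individual values read back. The helper '''vat_config_get''' below is hypothetical and not part of VAT; it is a minimal sketch that assumes each value is a single token without embedded whitespace.

```shell
# Hypothetical location of the edited configuration file.
export VAT_CONFIG_FILE=/path/to/vat/vatConfirgurationTemplate.txt

# vat_config_get NAME [FILE]: print the value stored under NAME.
# Illustrative helper (not shipped with VAT). It skips empty lines and
# lines starting with '//', and splits the rest on spaces or tabs.
vat_config_get() {
    awk -v key="$1" '
        NF == 0 { next }                     # empty or whitespace-only line
        substr($1, 1, 2) == "//" { next }    # comment line
        $1 == key { print $2; exit }         # first match wins
    ' "${2:-$VAT_CONFIG_FILE}"
}
```

For example, <code>vat_config_get TABIX_DIR</code> would print the Tabix directory configured in the file that VAT_CONFIG_FILE points to.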

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets''' (gencode3b.interval, gencode3b.fa, gencode3c.interval, gencode3c.fa)
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].
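
The layout of the web data directory can be verified before pointing vat_cgi at it. The function below is a hypothetical helper, not part of VAT; it only checks for the entries listed above (the annotation sets are left out because the versions in use vary) and reports any that are missing.

```shell
# check_web_data_dir DIR DATASET: report expected entries missing from DIR.
# Illustrative helper (not shipped with VAT). Returns 0 if everything is present.
check_web_data_dir() {
    dir=$1
    set_name=$2
    missing=0
    for entry in tabix bgzip check.png processing.gif \
                 "$set_name" "$set_name.vcf.gz" "$set_name.vcf.gz.tbi" \
                 "$set_name.sampleSummary.txt" "$set_name.geneSummary.txt"; do
        if [ ! -e "$dir/$entry" ]; then
            echo "missing: $dir/$entry"
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as <code>check_web_data_dir /path/to/public_html/path/to/VAT SampleData</code> prints one line per missing file or directory.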

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]
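
The links above follow a regular naming scheme, so the pair of files for a given GENCODE version can be derived from the version label. The function '''annotation_urls''' is a hypothetical helper that only prints the two URLs; whether the files are still served at these addresses is not guaranteed.

```shell
# Base URL taken from the links listed above.
VAT_ANNOTATION_BASE=http://homes.gersteinlab.org/people/lh372/VAT

# annotation_urls VERSION [SUBSET]: print the .interval and .fa URLs.
# VERSION is a label such as 3b, 3c, 4, 5, 6 or 7; SUBSET is empty for the
# protein-coding set or "miRNA" for the miRNA set. Illustrative helper only.
annotation_urls() {
    stem="$VAT_ANNOTATION_BASE/gencode$1${2:+.$2}"
    printf '%s.interval\n%s.fa\n' "$stem" "$stem"
}
```

The output can be passed to a download tool of choice, for instance <code>annotation_urls 7 miRNA | xargs -n 1 wget</code>.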

VAT/download

2011-06-14T17:45:17Z

Lukas.habegger:

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. Tarballs of libBIOS and VAT can be downloaded here:
* [http://archive.gersteinlab.org/proj/VAT/src libBIOS-1.1.0.tar.gz]
* [http://archive.gersteinlab.org/proj/VAT/src VAT-0.5.tar.gz] - Initial upload

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://archive.gersteinlab.org/proj/VAT/src VAT-0.5-UNIX.tar.gz] - 64bit version

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires an environment variable named DATAPATH, set to the directory containing the thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets''' (gencode3b.interval, gencode3b.fa, gencode3c.interval, gencode3c.fa)
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT/download

2011-06-14T17:24:53Z

Lukas.habegger:

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== External Software ==

<center>[[#top|Top]]</center>

=== Required ===

* [http://www.gnu.org/software/gsl/ GSL] - GNU Scientific Library (version-1.14; required for libBIOS, which is a general C library).
* [http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip BlatSuite] - BLAT and a collection of utility programs. These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://www.libgd.org/Main_Page GD library] - The GD library is used to create an image for each gene model and its associated variants (version-2.0.35; required by VAT).
* [http://samtools.sourceforge.net/tabix.shtml Tabix] - Tabix (version-0.2.3) is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval ([http://sourceforge.net/projects/samtools/files/tabix/ download]). These tools are utilized by VAT. Note: these executables must be part of the PATH.
* [http://rna.urmc.rochester.edu/RNAstructure.html RNAstructure] - RNAstructure is a software package for RNA structure prediction and analysis. This tool is utilized by VAT for prediction of structures for RNA sequences with and without the variants.
* [http://varna.lri.fr/downloads.html VARNA] - VARNA is a Java applet for producing high-quality RNA secondary structure plots. VAT utilizes VARNA for visualization of the RNA secondary structures.
 

<center>[[#top|Top]]</center>

=== Optional ===

* [http://vcftools.sourceforge.net/index.html VCF tools] - VCF tools consists of a suite of useful modules to manipulate VCF files.

 

== VAT Download ==

 

<pre>
Important Note
==============

THIS PACKAGE (VAT) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>

 

<center>[[#top|Top]]</center>

=== Source code ===

VAT is based on a general C library called libBIOS. Tarballs of libBIOS and VAT can be downloaded here:
* [http://archive.gersteinlab.org/proj/VAT/src libBIOS-1.1.0.tar.gz]
* [http://archive.gersteinlab.org/proj/VAT/src VAT-0.5.tar.gz] - Initial upload

 

<center>[[#top|Top]]</center>

=== Executables ===

Statically built binaries for UNIX can be found here:
* [http://archive.gersteinlab.org/proj/VAT/src VAT-0.5-UNIX.tar.gz] - 64bit version

 

<center>[[#top|Top]]</center>

=== License information ===

The software package is released under the [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].
For more details please refer to the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.

 

== Installation ==

<center>[[#top|Top]]</center>

=== Installation of the external GSL and GD libraries ===

In order to install VAT two external libraries must be installed first. The libBIOS library depends on GSL, whereas VAT makes use of the GD library. Please follow the instructions provided by each package. The GSL library can be installed on most systems using the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gsl-1.14/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

Similarly, the [http://www.libgd.org/Main_Page GD library] can be installed on most systems with the following commands:
<pre>
$ cd /path/to/gd-2.0.35/
$ ./configure --prefix=`pwd` --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

After they are installed, the first step to install VAT is the installation and configuration of libBIOS.

 

<center>[[#top|Top]]</center>

=== Installation and Configuration of libBIOS ===

Depending on where the three libraries (GSL, libBIOS, and GD) are installed, the following variables need to be set:

<pre>
export CPPFLAGS="-I/path/to/gsl-1.14/include -I/path/to/libbios/include -I/path/to/gd-2.0.35/include"
export LDFLAGS="-L/path/to/gsl-1.14/lib -L/path/to/libbios/lib -L/path/to/gd-2.0.35/lib"
</pre>

libBIOS can be installed on most systems with the following commands:
<pre>
$ cd /path/to/libbios-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

 

<center>[[#top|Top]]</center>

=== Installation of RNAstructure and VARNA ===

Download RNAstructure and follow its build instructions. Make sure that the build directory is included in the PATH environment variable. In addition, RNAstructure requires an environment variable named DATAPATH, set to the directory containing the thermodynamic parameter files that are distributed with the RNAstructure package.

Download the VARNA jar file and add its path to the CLASSPATH environment variable.

<center>[[#top|Top]]</center>

=== Installation and Configuration of VAT ===

A few simple steps are required to install VAT:
<pre>
$ cd /path/to/vat-x.x.x/
$ ./configure --prefix=`pwd`
$ make
$ make install
</pre>

VAT includes a configuration file ('''vatConfirgurationTemplate.txt''') that defines a set of variables used by a number of different programs. The name/value pairs are space- or tab-delimited. Empty lines and lines starting with '//' are ignored.
<pre>

// ===============================================================================
// REQUIRED
// ===============================================================================

// Tabix directory (includes both tabix and bgzip)
TABIX_DIR /path/to/tabix-0.2.3

// ===============================================================================
// OPTIONAL (required only for CGIs)
// ===============================================================================

// CGI base URL (where the CGIs are located)
WEB_URL_CGI http://webserver.org/path

// Path to the web data directory where the preprocessed files are stored
WEB_DATA_DIR /path/to/public_html/path/to/VAT
// URL to preprocessed files
WEB_DATA_URL http://webserver.org/path/to/VAT

</pre>

This file has to be '''configured properly''' by filling in the required information. Subsequently, the following environment variable ('''VAT_CONFIG_FILE''') has to be set:

VAT_CONFIG_FILE=/pathTo/vat/vatConfirgurationTemplate.txt

 

== Setup of the web server ==

 

This step is optional, but useful for visualizing the results of processed data sets. The following steps are required:

* The executable '''vat_cgi''' has to be located in the cgi-bin directory on the web server
* The configuration file ('''vatConfirgurationTemplate.txt''') must contain the pertinent information
* The following .htaccess file should be added to the cgi-bin:
SetEnv VAT_CONFIG_FILE /path/to/vatConfirgurationTemplate.txt
* The web data directory (defined by WEB_DATA_DIR in the configuration file) requires the following information:
** '''Preprocessed annotation sets''' (gencode3b.interval, gencode3b.fa, gencode3c.interval, gencode3c.fa)
** The '''tabix''' and '''bgzip''' executables
** Two images provided by the VAT source code: '''check.png''' and '''processing.gif''' (referred to by vat_cgi)
** Directory that has the same name as the data set (in this example: SampleData). This directory contains the images for each gene (created by [http://info.gersteinlab.org/VAT#vcf2images vcf2images]) and a VCF file for each gene (created by [http://info.gersteinlab.org/VAT#vcfSubsetByGene vcfSubsetByGene])
*** SampleData
*** SampleData.vcf.gz
*** SampleData.vcf.gz.tbi
*** SampleData.sampleSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])
*** SampleData.geneSummary.txt (generated by [http://info.gersteinlab.org/VAT#vcfSummary vcfSummary])

For additional information please refer to the [http://info.gersteinlab.org/VAT#Setting_up_the_web_server example workflow].

 

== Download of pre-processed annotation sets ==

 

The following annotation sets are derived from the [http://www.gencodegenes.org/ GENCODE] project. Each entry has a set of '''transcript coordinates''' (in [http://info.gersteinlab.org/VAT#Interval Interval] format) and a set of '''transcript sequences''' (introns removed; sequence with respect to the '+' strand; in FASTA format).

* Coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.fa Transcript sequences]

 

* miRNAs where ''gene_type'' is ''miRNA'':
** GENCODE version 3b (hg18): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3b.miRNA.fa Transcript sequences]
** GENCODE version 3c (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode3c.miRNA.fa Transcript sequences]
** GENCODE version 4 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode4.miRNA.fa Transcript sequences]
** GENCODE version 5 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode5.miRNA.fa Transcript sequences]
** GENCODE version 6 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode6.miRNA.fa Transcript sequences]
** GENCODE version 7 (hg19): [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.interval Transcript coordinates], [http://homes.gersteinlab.org/people/lh372/VAT/gencode7.miRNA.fa Transcript sequences]

VAT

2011-06-13T14:35:46Z

Lukas.habegger: /* Setting up the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data formats ==

<center>[[#top|Top]]</center>

=== VCF ===

The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project]. A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format:

'''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}'''

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:

[[File:VariantFormat.png|1000px]]

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).

'''Example 1''': A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;'''VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*'''

'''Example 2''': A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;'''VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H'''

'''Example 3''': A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;'''VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106'''

'''Example 4''': An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;'''VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L'''

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
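
The VA layout described above can be read back with a few lines of Python. The following is only a minimal sketch, not part of VAT: the names <code>VariantAnnotation</code> and <code>parse_va</code> are hypothetical, and the sketch assumes that no VA entry contains a ',' or ';' character. Because the layout of the transcript list depends on the variant class and type, those fields are kept as an unparsed list.

```python
from typing import List, NamedTuple


class VariantAnnotation(NamedTuple):
    allele_number: int
    gene_name: str
    gene_id: str
    strand: str
    variant_type: str
    transcripts_affected: int
    transcripts_total: int
    transcript_fields: List[str]  # left unparsed: layout depends on class/type


def parse_va(info: str) -> List[VariantAnnotation]:
    """Return one VariantAnnotation per comma-separated VA entry of an INFO string."""
    va_value = None
    for tag in info.split(";"):
        if tag.startswith("VA="):
            va_value = tag[len("VA="):]
            break
    if va_value is None:
        return []  # variant without annotation

    annotations = []
    for entry in va_value.split(","):
        fields = entry.split(":")
        affected, total = fields[5].split("/")
        annotations.append(
            VariantAnnotation(
                allele_number=int(fields[0]),
                gene_name=fields[1],
                gene_id=fields[2],
                strand=fields[3],
                variant_type=fields[4],
                transcripts_affected=int(affected),
                transcripts_total=int(total),
                transcript_fields=fields[6:],
            )
        )
    return annotations


# INFO column of Example 3 (without the wiki bold markup)
info = (
    "AA=N;AC=157;AN=240;DP=49;SF=0,1;"
    "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
    "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
    "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
    "TMEM8B-001:ENST00000450762:2106"
)
annotations = parse_va(info)
for annotation in annotations:
    print(annotation.allele_number, annotation.gene_name, annotation.variant_type)
```

Splitting on ';' first matters here, because other INFO tags (such as SF=0,1 in Example 3) may contain commas of their own.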

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, the sample is prefixed by a group identifier using the ':' character as a delimiter.

 

<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

'''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting.

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096

In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model.

'''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
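
To make the column layout and the coordinate convention concrete, here is a small Python sketch that parses one line of the example file. It is an illustration only and not part of libBIOS or VAT; the names <code>Interval</code> and <code>parse_interval_line</code> are hypothetical. Because the coordinates are zero-based with an excluded end, the length of each sub-interval is simply end minus start.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    strand: str
    start: int
    end: int
    sub_intervals: List[Tuple[int, int]]  # (start, end) pairs, end excluded


def parse_interval_line(line: str) -> Interval:
    """Parse one tab-delimited line with the eight Interval columns."""
    name, chrom, strand, start, end, count, starts, ends = line.rstrip("\n").split("\t")
    sub_starts = [int(value) for value in starts.split(",")]
    sub_ends = [int(value) for value in ends.split(",")]
    if not (len(sub_starts) == len(sub_ends) == int(count)):
        raise ValueError("sub-interval count does not match columns 7 and 8")
    return Interval(name, chrom, strand, int(start), int(end),
                    list(zip(sub_starts, sub_ends)))


# Fourth line of the example file, written with explicit tabs
line = "\t".join([
    "ENSG00000008516|ENST00000325800|MMP25|MMP25-201",
    "chr16", "+", "3097544", "3105947", "4",
    "3097544,3100009,3100254,3105830",
    "3097548,3100145,3100546,3105947",
])
transcript = parse_interval_line(line)

# The name field carries geneId|transcriptId|geneName|transcriptName
gene_id, transcript_id, gene_name, transcript_name = transcript.name.split("|")

# Half-open coordinates: length = end - start, no "+ 1" needed
exon_lengths = [end - start for start, end in transcript.sub_intervals]
print(gene_name, transcript_name, exon_lengths, sum(exon_lengths))
```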

 

== List of programs ==

=== VAT Core Modules ===

<center>[[#top|Top]]</center>

==== snpMapper ====

snpMapper is a program to annotate a set of SNPs in [[#VCF|VCF]] format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

'''Usage''':

snpMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SNPs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None
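
The four codon-level categories can be illustrated with a short Python sketch. This is not snpMapper's implementation; it is a minimal sketch that assumes the standard genetic code and takes the reference and alternate codons as given (the hypothetical function <code>classify_codon_change</code>). The spliceOverlap type is not covered, since it depends on the position of the SNP relative to the exon boundaries rather than on the codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change using the type names from the VA tag."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "prematureStop"
    if ref_aa == "*":
        return "removedStop"
    return "nonsynonymous"


print(classify_codon_change("AAA", "TAA"))  # K -> *, as in Example 1 of the VCF section
print(classify_codon_change("CGC", "CAC"))  # R -> H, as in Example 2 of the VCF section
```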

 

<center>[[#top|Top]]</center>

==== indelMapper ====

indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

'''Usage''':

indelMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated indels in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None
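
The distinction between frameshift (FS) and non-frameshift (NFS) indels comes down to whether the length difference between the reference and alternate allele is a multiple of three. The Python sketch below illustrates only that rule and is not indelMapper's implementation; the function <code>classify_indel</code> is hypothetical and deliberately ignores the spliceOverlap, startOverlap, and endOverlap types, which depend on the position of the indel within the transcript.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Return insertionFS, insertionNFS, deletionFS, or deletionNFS."""
    length_change = len(alt) - len(ref)
    if length_change == 0:
        raise ValueError("alleles have equal length; not an indel")
    kind = "insertion" if length_change > 0 else "deletion"
    # A change that is a multiple of three keeps the reading frame intact
    frame = "NFS" if length_change % 3 == 0 else "FS"
    return kind + frame


# The two alternate alleles of Example 4 in the VCF section
print(classify_indel("TACAACAACA", "T"))     # 9 bases removed
print(classify_indel("TACAACAACA", "TACA"))  # 6 bases removed
```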

 

<center>[[#top|Top]]</center>

==== svMapper ====

svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene.

'''Usage''':

svMapper <annotation.interval>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SVs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== genericMapper ====

genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

'''Usage''':

genericMapper <annotation.interval> <nameFeature>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs the annotated variants in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]].
** nameFeature - Specifies the type of the annotation feature (for example promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
* ''Optional arguments''
** None
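
An overlap check of this kind has to reconcile two coordinate systems: VCF positions are one-based, whereas the [[#Interval|Interval]] format is zero-based with an excluded end. The Python sketch below shows the conversion and a simple linear scan. It is an assumption-laden illustration, not genericMapper's code (the intervalFind module uses a containment-sublist algorithm instead of a scan), and the function names and the <code>promoter_1</code> entry are hypothetical.

```python
from typing import List, Tuple


def vcf_to_half_open(pos: int, ref: str) -> Tuple[int, int]:
    """Convert a one-based VCF position and REF allele to a zero-based, end-excluded span."""
    start = pos - 1
    return start, start + len(ref)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two zero-based, end-excluded spans share at least one base."""
    return a_start < b_end and b_start < a_end


def find_overlapping(intervals: List[Tuple[str, str, int, int]],
                     chrom: str, pos: int, ref: str) -> List[str]:
    """Return the names of all intervals (name, chromosome, start, end) hit by the variant."""
    start, end = vcf_to_half_open(pos, ref)
    return [
        name
        for name, interval_chrom, interval_start, interval_end in intervals
        if interval_chrom == chrom and overlaps(start, end, interval_start, interval_end)
    ]


# Hypothetical annotation entry covering zero-based positions 100..199
intervals = [("promoter_1", "chr1", 100, 200)]
print(find_overlapping(intervals, "chr1", 101, "A"))  # first base of the interval
print(find_overlapping(intervals, "chr1", 201, "A"))  # first base after the interval
```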

 

<center>[[#top|Top]]</center>

==== vcfSummary ====

vcfSummary is a program to aggregate annotated variants across genes and samples.

'''Usage''':

vcfSummary <file.vcf.gz> <annotation.interval>

* Inputs: None
* Outputs: Generates two output files. The first file, named ''file.geneSummary.txt'', contains the number of variants categorized by type for each gene. A second file, named ''file.sampleSummary.txt'', summarizes the number of variants categorized by type for each sample.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None
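
The idea behind the gene summary can be sketched in Python as a tally of VA entries by gene and type. This is a rough illustration under stated assumptions, not vcfSummary itself: the hypothetical function <code>count_types_per_gene</code> counts every VA entry once, whereas the exact counting rules of vcfSummary (and the layout of its output files) are not described here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def count_types_per_gene(info_fields: Iterable[str]) -> Dict[str, Counter]:
    """Tally VA entries by gene name and variant type."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for info in info_fields:
        for tag in info.split(";"):
            if not tag.startswith("VA="):
                continue
            for entry in tag[len("VA="):].split(","):
                fields = entry.split(":")
                gene_name, variant_type = fields[1], fields[4]
                counts[gene_name][variant_type] += 1
    return counts


# INFO columns of Examples 1 and 3 from the VCF section
info_fields = [
    "AA=A;AC=7;AN=118;DP=168;SF=2;"
    "VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:"
    "EPHB2-001:ENST00000400191:3165_3055_1019_K->*",
    "AA=N;AC=157;AN=240;DP=49;SF=0,1;"
    "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
    "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
    "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
    "TMEM8B-001:ENST00000450762:2106",
]
summary = count_types_per_gene(info_fields)
for gene_name, type_counts in sorted(summary.items()):
    print(gene_name, dict(type_counts))
```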

 

<center>[[#top|Top]]</center>

==== vcf2images ====

vcf2images is a program to generate an image for each gene to visualize the effect of the annotated variants.

'''Usage''':

vcf2images <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where the images are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSubsetByGene ====

vcfSubsetByGene is a program to subset a [[#VCF|VCF]] file with annotated variants by gene.

'''Usage''':

vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] file for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where [[#VCF|VCF]] files are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfModifyHeader ====

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a [[#VCF|VCF]] file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including [[#vcfSummary|vcfSummary]]).

'''Usage''':

vcfModifyHeader <oldHeader.vcf> <groups.txt>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] header file.
* ''Required arguments''
** oldHeader.vcf - The meta lines of a [[#VCF|VCF]] file. It can be obtained by using the following command:
grep '^#' file.vcf > file.header.vcf
** groups.txt - Tab-delimited file that assigns each sample present in the [[#VCF|VCF]] to a group/population. Here is a small sample file:
HG00629 CHS
HG00634 CHS
HG00635 CHS
HG00637 PUR
HG00638 PUR
HG00640 PUR
NA06984 CEU
NA06985 CEU
NA06986 CEU
NA06989 CEU
NA06994 CEU
* ''Optional arguments''
** None
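
The effect on the header line can be sketched in Python as follows. This is not the source of vcfModifyHeader; it is a minimal sketch that assumes the group:sample syntax used in the example workflow, the nine fixed VCF columns (#CHROM through FORMAT) preceding the sample columns, and a groups file whose first two columns are sample and group. The function names are hypothetical.

```python
from typing import Dict, Iterable


def read_groups(lines: Iterable[str]) -> Dict[str, str]:
    """Map sample name to group from tab-delimited 'sample<TAB>group' lines."""
    groups = {}
    for line in lines:
        line = line.strip()
        if line:
            sample, group = line.split("\t")[:2]
            groups[sample] = group
    return groups


def add_group_prefixes(header_line: str, groups: Dict[str, str]) -> str:
    """Prefix each sample column of the '#CHROM ...' line with its group."""
    columns = header_line.rstrip("\n").split("\t")
    fixed, samples = columns[:9], columns[9:]  # #CHROM .. FORMAT, then samples
    prefixed = [
        f"{groups[sample]}:{sample}" if sample in groups else sample
        for sample in samples
    ]
    return "\t".join(fixed + prefixed)


groups = read_groups(["HG00629\tCHS", "HG00637\tPUR", "NA06984\tCEU"])
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "HG00629", "HG00637", "NA06984"]
)
new_header = add_group_prefixes(header, groups)
print(new_header)
```

Samples missing from the groups file are left unprefixed in this sketch; how vcfModifyHeader treats such samples is not specified here.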

 

=== Auxiliary programs ===

<center>[[#top|Top]]</center>

==== gencode2interval ====

gencode2interval converts a GENCODE annotation file (in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format) to the [[#Interval|Interval]] format.

'''Usage''':

gencode2interval

* Inputs: Takes a GENCODE annotation file in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format from STDIN
* Outputs: Outputs the GENCODE annotation file in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Note: To obtain the coding sequences of the elements with gene_type ''protein_coding'' and transcript_type ''protein_coding'' the following command should be used:

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
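
Conceptually, the conversion groups the GTF rows by transcript and rewrites the one-based, end-inclusive GTF coordinates as zero-based, end-excluded [[#Interval|Interval]] coordinates. The Python sketch below illustrates that idea only; it is not the gencode2interval source, and it assumes that every row carries the attributes gene_id, transcript_id, gene_name, and transcript_name. The identifiers in the sample rows are made up.

```python
import re
from typing import Dict, Iterable, List, Tuple

ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def gtf_to_interval_lines(gtf_lines: Iterable[str]) -> List[str]:
    """Group GTF rows by transcript and emit one Interval line per transcript."""
    transcripts: Dict[str, Tuple[str, str, List[Tuple[int, int]]]] = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        (chrom, _source, _feature, start, end,
         _score, strand, _frame, attributes) = line.rstrip("\n").split("\t")
        attrs = dict(ATTRIBUTE.findall(attributes))
        name = "|".join([attrs["gene_id"], attrs["transcript_id"],
                         attrs["gene_name"], attrs["transcript_name"]])
        entry = transcripts.setdefault(name, (chrom, strand, []))
        # GTF: one-based, end included -> Interval: zero-based, end excluded
        entry[2].append((int(start) - 1, int(end)))

    lines = []
    for name, (chrom, strand, subs) in transcripts.items():
        subs.sort()
        starts = ",".join(str(s) for s, _ in subs)
        ends = ",".join(str(e) for _, e in subs)
        lines.append("\t".join([
            name, chrom, strand,
            str(subs[0][0]), str(max(e for _, e in subs)),
            str(len(subs)), starts, ends,
        ]))
    return lines


# Two made-up CDS rows of one hypothetical transcript, listed out of order
attributes = ('gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; '
              'gene_name "GENEX"; transcript_type "protein_coding"; '
              'transcript_name "GENEX-001";')
rows = [
    "\t".join(["chr1", "HAVANA", "CDS", "201", "300", ".", "+", "0", attributes]),
    "\t".join(["chr1", "HAVANA", "CDS", "101", "150", ".", "+", "0", attributes]),
]
interval_lines = gtf_to_interval_lines(rows)
print(interval_lines[0])
```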

 

<center>[[#top|Top]]</center>

==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set in [[#Interval|Interval]] format.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None
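
The difference between the two modes can be shown with a toy Python sketch. It is only an illustration, not the interval2sequences source: reading the 2bit file is omitted, the chromosome is passed in as a plain string, and the hypothetical function <code>extract_sequence</code> assumes zero-based, end-excluded coordinates and reports the sequence with respect to the '+' strand.

```python
from typing import List, Tuple


def extract_sequence(chromosome_sequence: str, start: int, end: int,
                     sub_intervals: List[Tuple[int, int]], mode: str) -> str:
    """Return the exonic (sub-intervals only) or genomic (whole span) sequence."""
    if mode == "exonic":
        # Concatenate the sub-intervals; the sequence between them is skipped
        return "".join(chromosome_sequence[s:e] for s, e in sub_intervals)
    if mode == "genomic":
        # Everything from interval start to interval end, introns included
        return chromosome_sequence[start:end]
    raise ValueError("mode must be 'exonic' or 'genomic'")


# Toy chromosome with two 'exons' (AAA and GGG) separated by an 'intron' (CCC)
chromosome_sequence = "AAACCCGGGTTT"
sub_intervals = [(0, 3), (6, 9)]
exonic = extract_sequence(chromosome_sequence, 0, 9, sub_intervals, "exonic")
genomic = extract_sequence(chromosome_sequence, 0, 9, sub_intervals, "genomic")
print(exonic, genomic)
```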

 

=== External programs ===

<center>[[#top|Top]]</center>

==== bgzip/tabix ====

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information consult the [http://samtools.sourceforge.net/tabix.shtml tabix documentation page].

 

<center>[[#top|Top]]</center>

==== VCF tools ====

[http://vcftools.sourceforge.net/ VCF tools] consists of a suite of very useful modules to manipulate [[#VCF|VCF]] files. For more information consult the [http://vcftools.sourceforge.net/docs.html documentation page].

 

== Example workflow ==

This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed.

 

<center>[[#top|Top]]</center>

=== Prerequisites ===

Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by [[#interval2sequences|interval2sequences]] to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP files in [[#VCF|VCF]] format and a third file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz

 

<center>[[#top|Top]]</center>

=== Preprocessing of the annotation file ===

Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements where the both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
$ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
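
For readers who prefer to see the filter spelled out, the selection performed by the command above can be sketched in Python. This is an equivalent under stated assumptions rather than a replacement: the hypothetical function <code>is_protein_coding_cds</code> checks the source and feature columns explicitly, which is slightly stricter than the pattern match in the awk/grep pipeline, and the sample rows are made up.

```python
def is_protein_coding_cds(gtf_line: str) -> bool:
    """True for CDS rows from HAVANA or ENSEMBL whose gene and transcript are protein coding."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # comment or malformed row
    source, feature, attributes = fields[1], fields[2], fields[8]
    return (
        source in {"HAVANA", "ENSEMBL"}
        and feature == "CDS"
        and 'gene_type "protein_coding"' in attributes
        and 'transcript_type "protein_coding"' in attributes
    )


# Made-up rows for illustration
coding = ('gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; '
          'transcript_type "protein_coding";')
non_coding = ('gene_id "G2"; transcript_id "T2"; gene_type "miRNA"; '
              'transcript_type "miRNA";')
rows = [
    "\t".join(["chr1", "HAVANA", "CDS", "101", "150", ".", "+", "0", coding]),
    "\t".join(["chr1", "HAVANA", "exon", "101", "150", ".", "+", ".", coding]),
    "\t".join(["chr1", "ENSEMBL", "CDS", "501", "560", ".", "-", "0", non_coding]),
]
kept = [row for row in rows if is_protein_coding_cds(row)]
print(len(kept))
```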

Convert the GENCODE GTF file into [[#Interval|Interval]] format:
$ [[#gencode2interval|gencode2interval]] < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Retrieve the exonic sequences (introns removed) for the transcripts specified in the annotation file:
$ [[#interval2sequences|interval2sequences]] hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa

 

<center>[[#top|Top]]</center>

=== Annotation of the SNPs ===

Annotate the variants using [[#snpMapper|snpMapper]]

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf

 

<center>[[#top|Top]]</center>

=== Modification of the VCF header line ===

Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705).

First get the old meta-data lines (lines that start with '#'):
$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:
$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf

Create the new meta-data lines:
$ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf

Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'':
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf

Compress the newly created [[#VCF|VCF]] file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created [[#VCF|VCF]] file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz

 

<center>[[#top|Top]]</center>

=== Generation of summaries and images ===

Generate gene and sample summaries for the annotated variants
$ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt''

Make a new directory to store the images and [[#VCF|VCF]] files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant.
$ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Subset the [[#VCF|VCF]] file with the annotated variants by gene.
$ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

 

<center>[[#top|Top]]</center>

=== Setting up the web server ===

Make a TAR ball of the relevant files:

* Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
* File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
* File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
* Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
* Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)

$ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
ALL.2of4intersection.20100804.chr22 \
ALL.2of4intersection.20100804.chr22.geneSummary.txt \
ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
ALL.2of4intersection.20100804.chr22.vcf.gz \
ALL.2of4intersection.20100804.chr22.vcf.gz.tbi

Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file])
$ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR

Unpack the TAR ball on the web server
$ tar -xvf ALL.2of4intersection.20100804.chr22.tar

'''View the results''': [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding Link to web server]

VAT

2011-06-07T14:24:35Z

Lukas.habegger:

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data formats ==

<center>[[#top|Top]]</center>

=== VCF ===

The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project]. A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format:

'''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}'''

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:

[[File:VariantFormat.png|1000px]]

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).

'''Example 1''': A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;'''VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*'''

'''Example 2''': A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;'''VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H'''

'''Example 3''': A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;'''VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106'''

'''Example 4''': An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;'''VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L'''

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, the sample is prefixed by a group identifier using the ':' character as a delimiter.

 

<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

'''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting.

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096

In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model.

'''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.

 

== List of programs ==

=== VAT Core Modules ===

<center>[[#top|Top]]</center>

==== snpMapper ====

snpMapper is a program to annotate a set of SNPs in [[#VCF|VCF]] format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

'''Usage''':

snpMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SNPs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== indelMapper ====

indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

'''Usage''':

indelMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated indels in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== svMapper ====

svMapper is a program to annotate a set of SVs in [[#VCF|VCF]] format. The program determines whether an SV overlaps with the different transcript isoforms of a gene.

'''Usage''':

svMapper <annotation.interval>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SVs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== genericMapper ====

genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

'''Usage''':

genericMapper <annotation.interval> <nameFeature>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs the annotated variants in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]].
** nameFeature - Specifies the type of the annotation feature (for example, promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSummary ====

vcfSummary is a program to aggregate annotated variants across genes and samples.

'''Usage''':

vcfSummary <file.vcf.gz> <annotation.interval>

* Inputs: None
* Outputs: Generates two output files. The first file, named ''file.geneSummary.txt'', contains the number of variants categorized by type for each gene. The second file, named ''file.sampleSummary.txt'', summarizes the number of variants categorized by type for each sample.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcf2images ====

vcf2images is a program to generate an image for each gene to visualize the effect of the annotated variants.

'''Usage''':

vcf2images <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where the images are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSubsetByGene ====

vcfSubsetByGene is a program to subset a [[#VCF|VCF]] file with annotated variants by gene.

'''Usage''':

vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] file for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where [[#VCF|VCF]] files are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfModifyHeader ====

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a [[#VCF|VCF]] file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including [[#vcfSummary|vcfSummary]]).

'''Usage''':

vcfModifyHeader <oldHeader.vcf> <groups.txt>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] header file.
* ''Required arguments''
** oldHeader.vcf - The meta-lines of a [[#VCF|VCF]] file. They can be obtained with the following command:
grep '^#' file.vcf > file.header.vcf
** groups.txt - A tab-delimited file that assigns each sample present in the [[#VCF|VCF]] file to a group/population. Here is a small sample file:
HG00629 CHS
HG00634 CHS
HG00635 CHS
HG00637 PUR
HG00638 PUR
HG00640 PUR
NA06984 CEU
NA06985 CEU
NA06986 CEU
NA06989 CEU
NA06994 CEU
* ''Optional arguments''
** None
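
The effect of the group assignment on the header line can be sketched in Python as follows. This is only an illustration of the group:sample naming convention, not the vcfModifyHeader source code; the function names are hypothetical. Two assumptions are made: only the first two columns of the groups file are used, and samples without a group assignment are left unchanged.

```python
from typing import Dict, Iterable


def read_groups(lines: Iterable[str]) -> Dict[str, str]:
    """Map sample name -> group from lines such as 'HG00629<TAB>CHS'."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            groups[fields[0]] = fields[1]
    return groups


def modify_header_line(header_line: str, groups: Dict[str, str]) -> str:
    """Prefix each sample in the #CHROM header line with 'group:'."""
    fields = header_line.rstrip("\n").split("\t")
    # A VCF header line has nine fixed columns (#CHROM ... FORMAT);
    # the sample names follow.
    fixed, samples = fields[:9], fields[9:]
    # Assumption: samples missing from the groups file keep their name.
    renamed = [f"{groups[s]}:{s}" if s in groups else s for s in samples]
    return "\t".join(fixed + renamed)


groups = read_groups(["HG00629\tCHS", "HG00637\tPUR", "NA06984\tCEU"])
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00629\tNA06984"
print(modify_header_line(header, groups))
```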

 

=== Auxiliary programs ===

<center>[[#top|Top]]</center>

==== gencode2interval ====

gencode2interval converts a GENCODE annotation file (in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format) to the [[#Interval|Interval]] format.

'''Usage''':

gencode2interval

* Inputs: Takes a GENCODE annotation file in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format from STDIN
* Outputs: Outputs the GENCODE annotation file in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Note: To obtain the coding sequences of the elements with gene_type ''protein_coding'' and transcript_type ''protein_coding'', the following commands should be used:

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
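
For readers who prefer to see the selection criteria spelled out, the awk and grep filter above can be expressed as a small Python predicate. This is a sketch of the same three conditions, not a replacement for the command; the function name and the two sample GTF lines are made up for illustration.

```python
import re

# Same pattern as the awk expression: source column is HAVANA or ENSEMBL
# and the feature column is CDS.
CDS_PATTERN = re.compile(r"\t(HAVANA|ENSEMBL)\tCDS\t")


def is_protein_coding_cds(gtf_line: str) -> bool:
    """True if the line passes the awk filter and both grep filters."""
    return (CDS_PATTERN.search(gtf_line) is not None
            and 'gene_type "protein_coding"' in gtf_line
            and 'transcript_type "protein_coding"' in gtf_line)


# Hypothetical GTF lines, used only to exercise the predicate.
kept = ('chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1"; '
        'gene_type "protein_coding"; transcript_type "protein_coding";')
dropped = ('chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; '
           'gene_type "protein_coding"; transcript_type "protein_coding";')
print(is_protein_coding_cds(kept), is_protein_coding_cds(dropped))
```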

 

<center>[[#top|Top]]</center>

==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set in [[#Interval|Interval]] format.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None
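
The difference between the two modes can be illustrated with a toy Python example. This is a sketch, not the interval2sequences implementation: the function name is hypothetical, the sequence is an in-memory string rather than a 2bit file, and strand handling (for example reverse-complementing transcripts on the "-" strand) is deliberately left out because it is not described here.

```python
from typing import List, Tuple


def extract_sequence(chromosome_seq: str,
                     sub_intervals: List[Tuple[int, int]],
                     mode: str) -> str:
    """Extract sequence for zero-based, end-exclusive sub-intervals."""
    if mode == "exonic":
        # Concatenate only the sub-intervals (exons).
        return "".join(chromosome_seq[s:e] for s, e in sub_intervals)
    if mode == "genomic":
        # Take the whole span, including the regions between sub-intervals.
        start = min(s for s, _ in sub_intervals)
        end = max(e for _, e in sub_intervals)
        return chromosome_seq[start:end]
    raise ValueError("mode must be 'exonic' or 'genomic'")


toy_chromosome = "AAACCCGGGTTTAAA"
toy_exons = [(0, 3), (6, 9)]
print(extract_sequence(toy_chromosome, toy_exons, "exonic"))
print(extract_sequence(toy_chromosome, toy_exons, "genomic"))
```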

 

=== External programs ===

<center>[[#top|Top]]</center>

==== bgzip/tabix ====

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. The companion program bgzip compresses files in the blocked gzip format (BGZF) that tabix requires. Tabix was developed by Heng Li. For more information consult the [http://samtools.sourceforge.net/tabix.shtml tabix documentation page].

 

<center>[[#top|Top]]</center>

==== VCF tools ====

[http://vcftools.sourceforge.net/ VCF tools] is a suite of modules for manipulating [[#VCF|VCF]] files. For more information consult the [http://vcftools.sourceforge.net/docs.html documentation page].

 

== Example workflow ==

This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed.

 

<center>[[#top|Top]]</center>

=== Prerequisites ===

Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by [[#interval2sequences|interval2sequences]] to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP calls in [[#VCF|VCF]] format, the corresponding tabix index, and a third file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz

 

<center>[[#top|Top]]</center>

=== Preprocessing of the annotation file ===

Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
$ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf

Convert the GENCODE GTF file into [[#Interval|Interval]] format:
$ [[#gencode2interval|gencode2interval]] < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Retrieve the exonic sequences for the transcripts specified in the annotation file:
$ [[#interval2sequences|interval2sequences]] hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa

 

<center>[[#top|Top]]</center>

=== Annotation of the SNPs ===

Annotate the variants using [[#snpMapper|snpMapper]]

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf

 

<center>[[#top|Top]]</center>

=== Modification of the VCF header line ===

Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705).

First get the old meta-data lines:
$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:
$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf

Create the new meta-data lines:
$ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf

Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'':
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf

Compress the newly created [[#VCF|VCF]] file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created [[#VCF|VCF]] file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz

 

<center>[[#top|Top]]</center>

=== Generation of summaries and images ===

Generate gene and sample summaries for the annotated variants
$ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt''

Make a new directory to store the images and [[#VCF|VCF]] files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant.
$ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Subset the [[#VCF|VCF]] file with the annotated variants by gene.
$ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

 

<center>[[#top|Top]]</center>

=== Setting up the web server ===

Make a TAR ball of the relevant files:

* Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
* File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
* File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
* Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
* Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)

$ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
ALL.2of4intersection.20100804.chr22 \
ALL.2of4intersection.20100804.chr22.geneSummary.txt \
ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
ALL.2of4intersection.20100804.chr22.vcf.gz \
ALL.2of4intersection.20100804.chr22.vcf.gz.tbi

Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file])
$ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR

Unpack the TAR ball on the web server
$ tar -xvf ALL.2of4intersection.20100804.chr22.tar

'''View the results''': [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c Link to web server]

VAT

2011-06-06T16:08:21Z

Lukas.habegger: /* Setting up the web server */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data formats ==

<center>[[#top|Top]]</center>

=== VCF ===

The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (Indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project]. A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format:

'''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}'''

All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, Indel, or SV) and the variant type as shown in the table below:

[[File:VariantFormat.png|1000px]]

The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For Indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and Indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).

'''Example 1''': A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.

chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;'''VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*'''

'''Example 2''': A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.

chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;'''VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H'''

'''Example 3''': A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.

chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;'''VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106'''

'''Example 4''': An Indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.

chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;'''VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L'''

Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
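
The structure of the VA tag can be made concrete with a short Python sketch that splits the INFO field into annotation entries. This is an illustration only, not code from VAT; the function name and dictionary keys are hypothetical. Because the layout of the transcript list depends on the variant class and type, the sketch keeps everything after the sixth field as an uninterpreted list.

```python
from typing import Dict, List


def parse_va_tag(info: str) -> List[Dict[str, object]]:
    """Return one dictionary per comma-separated entry of the VA tag."""
    va_value = None
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "VA":
            va_value = value
            break
    if va_value is None:
        return []  # variant carries no annotation

    entries = []
    for entry in va_value.split(","):
        fields = entry.split(":")
        affected, total = fields[5].split("/")
        entries.append({
            "allele_number": int(fields[0]),
            "gene_name": fields[1],
            "gene_id": fields[2],
            "strand": fields[3],
            "type": fields[4],
            "transcripts_affected": int(affected),
            "transcripts_total": int(total),
            # Left as-is: the format depends on variant class and type.
            "transcript_fields": fields[6:],
        })
    return entries


# INFO field of Example 3 (wiki bold markup removed).
info_example3 = ("AA=N;AC=157;AN=240;DP=49;SF=0,1;"
                 "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
                 "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
                 "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
                 "TMEM8B-001:ENST00000450762:2106")
for annotation in parse_va_tag(info_example3):
    print(annotation["gene_name"], annotation["type"], annotation["transcript_fields"])
```

Applied to Example 3, the sketch yields two entries for the same allele and gene, one of type nonsynonymous and one of type spliceOverlap.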

VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, each sample name is prefixed by a group identifier using the ':' character as a delimiter.

 

<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+")
5. Interval end (with respect to the "+")
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+", comma-delimited)
8. Sub-interval ends (with respect to the "+", comma-delimited)

'''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting.

Example file:

ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096

In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model.

'''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.

 

== List of programs ==

=== VAT Core Modules ===

<center>[[#top|Top]]</center>

==== snpMapper ====

snpMapper is a program to annotate a set of SNPs in [[#VCF|VCF]] format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.

'''Usage''':

snpMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SNPs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== indelMapper ====

indelMapper is a program to annotate a set of Indels in [[#VCF|VCF]] format. The program determines the effect of an Indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.

'''Usage''':

indelMapper <annotation.interval> <annotation.fa>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated Indels in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== svMapper ====

svMapper is a program to annotate a set of SVs in [[#VCF|VCF]] format. The program determines whether an SV overlaps with the different transcript isoforms of a gene.

'''Usage''':

svMapper <annotation.interval>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs annotated SVs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== genericMapper ====

genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).

'''Usage''':

genericMapper <annotation.interval> <nameFeature>

* Inputs: Takes a [[#VCF|VCF]] input from STDIN
* Outputs: Outputs the annotated variants in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field.
* ''Required arguments''
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]].
** nameFeature - Specifies the type of the annotation feature (for example, promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSummary ====

vcfSummary is a program to aggregate annotated variants across genes and samples.

'''Usage''':

vcfSummary <file.vcf.gz> <annotation.interval>

* Inputs: None
* Outputs: Generates two output files. The first file, named ''file.geneSummary.txt'', contains the number of variants categorized by type for each gene. The second file, named ''file.sampleSummary.txt'', summarizes the number of variants categorized by type for each sample.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of Indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcf2images ====

vcf2images is a program to generate an image for each gene to visualize the effect of the annotated variants.

'''Usage''':

vcf2images <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, Indels, and SVs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where the images are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfSubsetByGene ====

vcfSubsetByGene is a program to subset a [[#VCF|VCF]] file with annotated variants by gene.

'''Usage''':

vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] file for each gene that has at least one annotated variant.
* ''Required arguments''
** file.vcf.gz - VCF file with annotated variants (this can be a mixture of Indels and SNPs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program.
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program.
** outputDir - The output directory where [[#VCF|VCF]] files are stored
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== vcfModifyHeader ====

vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a [[#VCF|VCF]] file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including [[#vcfSummary|vcfSummary]]).

'''Usage''':

vcfModifyHeader <oldHeader.vcf> <groups.txt>

* Inputs: None
* Outputs: Generates a [[#VCF|VCF]] header file.
* ''Required arguments''
** oldHeader.vcf - The meta-lines of a [[#VCF|VCF]] file. They can be obtained with the following command:
grep '^#' file.vcf > file.header.vcf
** groups.txt - A tab-delimited file that assigns each sample present in the [[#VCF|VCF]] file to a group/population. Here is a small sample file:
HG00629 CHS
HG00634 CHS
HG00635 CHS
HG00637 PUR
HG00638 PUR
HG00640 PUR
NA06984 CEU
NA06985 CEU
NA06986 CEU
NA06989 CEU
NA06994 CEU
* ''Optional arguments''
** None

 

=== Auxiliary programs ===

<center>[[#top|Top]]</center>

==== gencode2interval ====

gencode2interval converts a GENCODE annotation file (in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format) to the [[#Interval|Interval]] format.

'''Usage''':

gencode2interval

* Inputs: Takes a GENCODE annotation file in [http://genome.ucsc.edu/FAQ/FAQformat.html#format4 GTF] format from STDIN
* Outputs: Outputs the GENCODE annotation file in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Note: To obtain the coding sequences of the elements with gene_type ''protein_coding'' and transcript_type ''protein_coding'' the following command should be used:

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
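
To see what the filter keeps, it can be applied to a toy file. The sketch below is only an illustration: the four GTF records are hypothetical (made-up coordinates and identifiers), and only the awk/grep pipeline is taken from the note above.

```shell
# Four hypothetical GTF records (nine tab-delimited fields each).
gtf=$(mktemp)
{
  printf 'chr22\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; transcript_type "protein_coding";\n'
  printf 'chr22\tHAVANA\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; transcript_type "protein_coding";\n'
  printf 'chr22\tENSEMBL\tCDS\t300\t400\t.\t-\t0\tgene_id "G2"; transcript_id "T2"; gene_type "protein_coding"; transcript_type "nonsense_mediated_decay";\n'
  printf 'chr22\tENSEMBL\tCDS\t500\t600\t.\t+\t0\tgene_id "G3"; transcript_id "T3"; gene_type "pseudogene"; transcript_type "pseudogene";\n'
} > "$gtf"

# Same filter as in the note above, applied to the toy file.
kept=$(awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' "$gtf" \
  | grep 'gene_type "protein_coding"' \
  | grep 'transcript_type "protein_coding"')

printf '%s\n' "$kept"
rm -f "$gtf"
```

Only the first record survives: the second is an exon rather than a CDS element, the third has a different transcript_type, and the fourth has a different gene_type.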

 

<center>[[#top|Top]]</center>

==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set in [[#Interval|Interval]] format.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== External programs ===

<center>[[#top|Top]]</center>

==== bgzip/tabix ====

Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information consult the [http://samtools.sourceforge.net/tabix.shtml tabix documentation page].

 

<center>[[#top|Top]]</center>

==== VCF tools ====

[http://vcftools.sourceforge.net/ VCF tools] is a suite of modules for manipulating [[#VCF|VCF]] files. For more information consult the [http://vcftools.sourceforge.net/docs.html documentation page].

 

== Example workflow ==

This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed.

 

<center>[[#top|Top]]</center>

=== Prerequisites ===

Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz

Download the human genome (hg19) in 2bit format. This is used by [[#interval2sequences|interval2sequences]] to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Download the SNP calls in [[#VCF|VCF]] format, the corresponding tabix index, and a panel file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel

Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz

 

<center>[[#top|Top]]</center>

=== Preprocessing of the annotation file ===

Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz

Extract the coding sequence (CDS) elements where both the ''gene_type'' and ''transcript_type'' are ''protein_coding'':
$ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf

Convert the GENCODE GTF file into [[#Interval|Interval]] format:
$ [[#gencode2interval|gencode2interval]] < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Retrieve the exonic sequences for the transcripts specified in the annotation file:
$ [[#interval2sequences|interval2sequences]] hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa

 

<center>[[#top|Top]]</center>

=== Annotation of the SNPs ===

Annotate the variants using [[#snpMapper|snpMapper]]:

$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf

 

<center>[[#top|Top]]</center>

=== Modification of the VCF header line ===

Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705).
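
As an illustration of the group:sample syntax, the sketch below relabels the sample columns of a hypothetical two-sample header. It is not the vcfModifyHeader implementation; it assumes that the relabeling applies to the sample columns of the #CHROM line (field 10 onward), and all file and variable names are made up for the example.

```shell
# Hypothetical two-sample inputs; real files come from the workflow steps.
tmp=$(mktemp -d)
printf 'NA06984\tCEU\nHG00629\tCHS\n' > "$tmp/groups.txt"
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA06984\tHG00629\n' > "$tmp/oldHeader.vcf"

# Assumption: only the sample columns of the #CHROM line are relabeled.
new_header=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { group[$1] = $2; next }          # first file: sample to group map
  /^#CHROM/ { for (i = 10; i <= NF; i++)      # sample columns start at field 10
                if ($i in group) $i = group[$i] ":" $i }
  { print }' "$tmp/groups.txt" "$tmp/oldHeader.vcf")

printf '%s\n' "$new_header"
rm -r "$tmp"
```

Samples that are missing from the groups file are left unchanged in this sketch; how vcfModifyHeader itself treats them is not covered here.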

First get the old meta-data lines (the lines beginning with "#"):
$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf

Store the annotated variants in a separate file:
$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf
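
The split relies on meta-data lines beginning with "#". A small sketch on a hypothetical three-line VCF shows that a pattern anchored at the start of the line ("^#") separates the two parts cleanly, whereas an unanchored "#" would also match a variant line whose INFO field happens to contain that character. The file contents and the NOTE tag are invented for the example.

```shell
# Hypothetical VCF: two meta-data lines and one variant with "#" in INFO.
vcf=$(mktemp)
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t100\t.\tA\tG\t.\tPASS\tNOTE=a#b\n' > "$vcf"

header_lines=$(grep -c '^#' "$vcf")        # lines starting with "#"
variant_lines=$(grep -vc '^#' "$vcf")      # all remaining lines
loose_header_lines=$(grep -c '#' "$vcf")   # unanchored pattern, for comparison

printf 'header=%s variants=%s unanchored=%s\n' \
  "$header_lines" "$variant_lines" "$loose_header_lines"
rm -f "$vcf"
```

With the anchored pattern, the header and variant counts add up to the total number of lines in the file.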

Create the new meta-data lines:
$ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf

Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'':
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf

Compress the newly created [[#VCF|VCF]] file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf

Index the newly created [[#VCF|VCF]] file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz

 

<center>[[#top|Top]]</center>

=== Generation of summaries and images ===

Generate gene and sample summaries for the annotated variants:
$ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval

Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt''

Make a new directory to store the images and [[#VCF|VCF]] files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22

Generate an image for each gene with at least one annotated variant.
$ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

Subset the [[#VCF|VCF]] file with the annotated variants by gene.
$ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22

 

<center>[[#top|Top]]</center>

=== Setting up the web server ===

Make a TAR ball of the relevant files:

* Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
* File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
* File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
* Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
* Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)

$ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
ALL.2of4intersection.20100804.chr22 \
ALL.2of4intersection.20100804.chr22.geneSummary.txt \
ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
ALL.2of4intersection.20100804.chr22.vcf.gz \
ALL.2of4intersection.20100804.chr22.vcf.gz.tbi

Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file]):
$ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR

Unpack the TAR ball on the web server:
$ tar -xvf ALL.2of4intersection.20100804.chr22.tar
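
Before copying, the contents of the TAR ball can be listed to confirm that nothing is missing. The sketch below is an optional check and not part of the original workflow; it uses empty placeholder files and a hypothetical prefix (demo.chr22) in place of ALL.2of4intersection.20100804.chr22.

```shell
# Build a toy archive from empty placeholder files in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1
prefix=demo.chr22   # hypothetical stand-in for the real data set name
mkdir "$prefix"
touch "$prefix/geneA.png" "$prefix.geneSummary.txt" "$prefix.sampleSummary.txt" \
      "$prefix.vcf.gz" "$prefix.vcf.gz.tbi"
tar -cf "$prefix.tar" "$prefix" "$prefix.geneSummary.txt" \
    "$prefix.sampleSummary.txt" "$prefix.vcf.gz" "$prefix.vcf.gz.tbi"

# List the archive and count expected entries that are absent.
listing=$(tar -tf "$prefix.tar")
missing=0
for f in "$prefix/geneA.png" "$prefix.geneSummary.txt" \
         "$prefix.sampleSummary.txt" "$prefix.vcf.gz" "$prefix.vcf.gz.tbi"; do
  printf '%s\n' "$listing" | grep -qxF "$f" || missing=$((missing + 1))
done
printf 'missing entries: %s\n' "$missing"

cd / && rm -rf "$work"
```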

'''View the results''': [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c Link to web server]

VAT/dataSets

2011-06-06T16:07:50Z

Lukas.habegger: /* Data sets */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data sets ==

<center>[[#top|Top]]</center>

=== 1000 Genomes Pilot Project: Low coverage samples ===

* Data files
** Source: pilot_data, release: 2010_07, FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/
** Indels
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/CEU.low_coverage.2010_07.indel.genotypes.vcf.gz CEU.low_coverage.2010_07.indel.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/JPTCHB.low_coverage.2010_07.indel.genotypes.vcf.gz JPTCHB.low_coverage.2010_07.indel.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/YRI.low_coverage.2010_07.indel.genotypes.vcf.gz YRI.low_coverage.2010_07.indel.genotypes.vcf.gz]
** SNPs
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/CEU.low_coverage.2010_07.genotypes.vcf.gz CEU.low_coverage.2010_07.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/CHBJPT.low_coverage.2010_07.genotypes.vcf.gz CHBJPT.low_coverage.2010_07.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/YRI.low_coverage.2010_07.genotypes.vcf.gz YRI.low_coverage.2010_07.genotypes.vcf.gz]
* Annotation file: [ftp://ftp.sanger.ac.uk/pub/gencode/release_3b/gencode.v3b.annotation.NCBI36.gtf.gz GENCODE (version 3b, hg18)] using CDS elements where ''gene_type = protein_coding'' and ''transcript_type = protein_coding''
* Results: [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=1000genomes_lowCoverage&annotationSet=gencode3b&type=coding VAT]

 

<center>[[#top|Top]]</center>

=== 1000 Genomes Project, Phase I, chr22, SNP calls ===

* Data files
** Source: release: 20100804, FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/
** SNPs: [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ALL.2of4intersection.20100804.genotypes.vcf.gz]
* Annotation file: [ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz GENCODE (version 3c, hg19)] using CDS elements where ''gene_type = protein_coding'' and ''transcript_type = protein_coding''
* Results: [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding VAT]
* [http://info.gersteinlab.org/VAT#Example_workflow Detailed workflow]

 

<center>[[#top|Top]]</center>

== Pre-processed GENCODE annotation sets ==

The pre-processed GENCODE annotation sets can be downloaded [http://info.gersteinlab.org/VAT/download#Download_of_pre-processed_annotation_sets here].

VAT/dataSets

2011-06-06T16:07:12Z

Lukas.habegger: /* Data sets */

<center>[http://archive.gersteinlab.org/proj/VAT '''VAT Main Page''']</center>

__TOC__

== Data sets ==

=== 1000 Genomes Project ===

<center>[[#top|Top]]</center>

==== 1000 Genomes Pilot Project: Low coverage samples ====

* Data files
** Source: pilot_data, release: 2010_07, FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/
** Indels
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/CEU.low_coverage.2010_07.indel.genotypes.vcf.gz CEU.low_coverage.2010_07.indel.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/JPTCHB.low_coverage.2010_07.indel.genotypes.vcf.gz JPTCHB.low_coverage.2010_07.indel.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/YRI.low_coverage.2010_07.indel.genotypes.vcf.gz YRI.low_coverage.2010_07.indel.genotypes.vcf.gz]
** SNPs
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/CEU.low_coverage.2010_07.genotypes.vcf.gz CEU.low_coverage.2010_07.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/CHBJPT.low_coverage.2010_07.genotypes.vcf.gz CHBJPT.low_coverage.2010_07.genotypes.vcf.gz]
*** [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/snps/YRI.low_coverage.2010_07.genotypes.vcf.gz YRI.low_coverage.2010_07.genotypes.vcf.gz]
* Annotation file: [ftp://ftp.sanger.ac.uk/pub/gencode/release_3b/gencode.v3b.annotation.NCBI36.gtf.gz GENCODE (version 3b, hg18)] using CDS elements where ''gene_type = protein_coding'' and ''transcript_type = protein_coding''
* Results: [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=1000genomes_lowCoverage&annotationSet=gencode3b&type=coding VAT]

 

<center>[[#top|Top]]</center>

==== 1000 Genomes Project, Phase I, chr22, SNP calls ====

* Data files
** Source: release: 20100804, FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/
** SNPs: [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ALL.2of4intersection.20100804.genotypes.vcf.gz]
* Annotation file: [ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz GENCODE (version 3c, hg19)] using CDS elements where ''gene_type = protein_coding'' and ''transcript_type = protein_coding''
* Results: [http://dynamic.gersteinlab.org/people/lh372/dev/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding VAT]
* [http://info.gersteinlab.org/VAT#Example_workflow Detailed workflow]

 

<center>[[#top|Top]]</center>

== Pre-processed GENCODE annotation sets ==

The pre-processed GENCODE annotation sets can be downloaded [http://info.gersteinlab.org/VAT/download#Download_of_pre-processed_annotation_sets here].

VAT/download

2011-06-06T16:04:28Z