VAT
From GersteinInfo
(58 intermediate revisions not shown) | |||
Line 3: | Line 3: | ||
__TOC__ | __TOC__ | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
== Data formats == | == Data formats == | ||
Line 16: | Line 10: | ||
=== VCF === | === VCF === | ||
- | The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including SNPs and | + | The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the [http://www.1000genomes.org 1000 Genomes Project]. A detailed summary of this file format can be found [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 here]. The annotation information is captured as part of the '''INFO field''' using the '''VA (Variant Annotation) tag'''. The string with the variant information has the following format: |
- | AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts} | + | '''AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}''' |
+ | |||
+ | All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below: | ||
+ | |||
+ | [[File:VariantFormat.png|1000px]] | ||
The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate alleles). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon). | The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate alleles). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by [[#snpMapper|snpMapper]]): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by [[#indelMapper|indelMapper]]), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon). | ||
Line 44: | Line 42: | ||
Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4). | Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4). | ||
+ | |||
+ | |||
+ | VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using [[#vcfModifyHeader|vcfModifyHeader]]. Specifically, the sample is prefixed by a group identifier using the ':' character as a delimiter. | ||
<br> | <br> | ||
Line 51: | Line 52: | ||
=== Interval === | === Interval === | ||
- | The Interval format consists of | + | The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. |
- | This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of | + | This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386]. |
1. Name of the interval | 1. Name of the interval | ||
Line 63: | Line 64: | ||
8. Sub-interval end (with respect to the "+", comma-delimited) | 8. Sub-interval end (with respect to the "+", comma-delimited) | ||
- | Note: For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using [[#gencode2interval|gencode2interval]] program ensures proper formatting | + | '''Note''': For the purpose of VAT, the name field in the [[#Interval|Interval]] file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the [[#gencode2interval|gencode2interval]] program ensures proper formatting. |
Example file: | Example file: | ||
Line 75: | Line 76: | ||
In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model. | In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model. | ||
- | Note: the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''. | + | '''Note''': the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''. |
<br><br> | <br><br> | ||
Line 105: | Line 106: | ||
<center>[[#top|Top]]</center> | <center>[[#top|Top]]</center> | ||
- | ==== | + | ==== indelMapper ==== |
- | + | indelMapper is a program to annotate a set of indels in [[#VCF|VCF]] format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene. | |
'''Usage''': | '''Usage''': | ||
- | + | indelMapper <annotation.interval> <annotation.fa> | |
* Inputs: Takes a [[#VCF|VCF]] input from STDIN | * Inputs: Takes a [[#VCF|VCF]] input from STDIN | ||
- | * Outputs: Outputs annotated | + | * Outputs: Outputs annotated indels in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification. |
* ''Required arguments'' | * ''Required arguments'' | ||
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program. | ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program. | ||
- | ** | + | ** annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the [[#interval2sequences|interval2sequences]] program using the 'exonic' mode. |
* ''Optional arguments'' | * ''Optional arguments'' | ||
** None | ** None | ||
Line 125: | Line 126: | ||
<center>[[#top|Top]]</center> | <center>[[#top|Top]]</center> | ||
- | ==== | + | ==== svMapper ==== |
- | + | svMapper is a program to annotate a set of SVs in VCF format. The program determines if a SV overlaps with different transcript isoforms of a gene. | |
'''Usage''': | '''Usage''': | ||
- | + | svMapper <annotation.interval> | |
* Inputs: Takes a [[#VCF|VCF]] input from STDIN | * Inputs: Takes a [[#VCF|VCF]] input from STDIN | ||
- | * Outputs: Outputs annotated | + | * Outputs: Outputs annotated SVs in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. For details refer to the [[#VCF|VCF]] format specification. |
* ''Required arguments'' | * ''Required arguments'' | ||
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program. | ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program. | ||
- | ** | + | * ''Optional arguments'' |
+ | ** None | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <center>[[#top|Top]]</center> | ||
+ | |||
+ | ==== genericMapper ==== | ||
+ | |||
+ | genericMapper is a program to annotate a number of different variants in [[#VCF|VCF]] format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential). | ||
+ | |||
+ | '''Usage''': | ||
+ | |||
+ | genericMapper <annotation.interval> <nameFeature> | ||
+ | |||
+ | * Inputs: Takes a [[#VCF|VCF]] input from STDIN | ||
+ | * Outputs: Outputs the annotated variants in [[#VCF|VCF]] format. The annotation information is captured as part of the INFO field. | ||
+ | * ''Required arguments'' | ||
+ | ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. This can be a generic [[#Interval|Interval]]. | ||
+ | ** nameFeature - Specifies the type of the annotation feature (for example promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file. | ||
* ''Optional arguments'' | * ''Optional arguments'' | ||
** None | ** None | ||
Line 176: | Line 196: | ||
* Outputs: Generates an image in PNG format for each gene that has at least one annotated variant. | * Outputs: Generates an image in PNG format for each gene that has at least one annotated variant. | ||
* ''Required arguments'' | * ''Required arguments'' | ||
- | ** file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and | + | ** file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using [[#bgzip/tabix|bgzip]] and indexed using the [[#bgzip/tabix|tabix]] program. |
** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program. | ** annotation.interval - Annotation file representing the genomic coordinates of the gene models in [[#Interval|Interval]] format. Each line in this file represents a transcript. This file is typically generated using the [[#gencode2interval|gencode2interval]] program. | ||
** outputDir - The output directory where the images are stored | ** outputDir - The output directory where the images are stored | ||
Line 259: | Line 279: | ||
awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf | awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf | ||
- | gencode2interval < | + | gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval |
<br> | <br> | ||
Line 304: | Line 324: | ||
== Example workflow == | == Example workflow == | ||
- | This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, SNP calls''] data set was processed. | + | This workflow shows how the [http://info.gersteinlab.org/VAT/dataSets ''1000 Genomes Project, Phase I, chr22, SNP calls''] data set was processed. |
<br> | <br> | ||
Line 318: | Line 338: | ||
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit | $ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit | ||
- | Download the | + | Download the SNP files in [[#VCF|VCF]] format and a third file that assigns each sample to a population: |
- | $ ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz | + | $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz |
- | $ ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel | + | $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi |
+ | $ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel | ||
+ | |||
+ | Extract variants on chromosome 22: | ||
+ | $ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | ||
<br> | <br> | ||
Line 344: | Line 368: | ||
<center>[[#top|Top]]</center> | <center>[[#top|Top]]</center> | ||
- | === Annotation of | + | === Annotation of the SNPs === |
+ | |||
+ | |||
Annotate the variants using [[#snpMapper|snpMapper]] | Annotate the variants using [[#snpMapper|snpMapper]] | ||
- | $ zcat ALL.2of4intersection.20100804.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.genotypes.annotated.vcf | + | |
+ | $ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | [[#snpMapper|snpMapper]] gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf | ||
<br> | <br> | ||
Line 353: | Line 380: | ||
<center>[[#top|Top]]</center> | <center>[[#top|Top]]</center> | ||
- | === | + | === Modification of the VCF header line === |
Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705). | Modify the [[#VCF|VCF]] header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705). | ||
First get the old meta-data lines: | First get the old meta-data lines: | ||
- | $ grep "#" ALL.2of4intersection.20100804.genotypes.annotated.vcf > ALL.2of4intersection.20100804.genotypes.annotated.oldHeader.vcf | + | $ grep "#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf |
Store the annotated variants in a separate file: | Store the annotated variants in a separate file: | ||
- | $ grep "#" -v ALL.2of4intersection.20100804.genotypes.annotated.vcf > ALL.2of4intersection.20100804.genotypes.annotated.variants.vcf | + | $ grep "#" -v ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf |
Create the new meta-data lines: | Create the new meta-data lines: | ||
- | $ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.genotypes.annotated.newHeader.vcf | + | $ [[#vcfModifyHeader|vcfModifyHeader]] ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf |
- | Merge the new meta-data lines with the annotated variants and create a new file called | + | Merge the new meta-data lines with the annotated variants and create a new file called ''ALL.2of4intersection.20100804.chr22.vcf'': |
- | $ cat ALL.2of4intersection.20100804.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.vcf | + | $ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf |
Compress the newly created [[#VCF|VCF]] file with the annotated variants: | Compress the newly created [[#VCF|VCF]] file with the annotated variants: | ||
- | $ bgzip ALL.2of4intersection.20100804.vcf | + | $ bgzip ALL.2of4intersection.20100804.chr22.vcf |
Index the newly created [[#VCF|VCF]] file with the annotated variants: | Index the newly created [[#VCF|VCF]] file with the annotated variants: | ||
- | $ tabix -p vcf ALL.2of4intersection.20100804.vcf.gz | + | $ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz |
<br> | <br> | ||
Line 382: | Line 409: | ||
Generate gene and sample summaries for the annotated variants | Generate gene and sample summaries for the annotated variants | ||
- | $ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval | + | $ [[#vcfSummary|vcfSummary]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval |
- | Resulting files: ''ALL.2of4intersection.20100804.geneSummary.txt'' and ''ALL.2of4intersection.20100804.sampleSummary.txt'' | + | Resulting files: ''ALL.2of4intersection.20100804.chr22.geneSummary.txt'' and ''ALL.2of4intersection.20100804.chr22.sampleSummary.txt'' |
Make a new directory to store the images and [[#VCF|VCF]] files for each gene. | Make a new directory to store the images and [[#VCF|VCF]] files for each gene. | ||
- | $ mkdir ALL.2of4intersection.20100804 | + | $ mkdir ALL.2of4intersection.20100804.chr22 |
Generate an image for each gene with at least one annotated variant. | Generate an image for each gene with at least one annotated variant. | ||
- | $ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804 | + | $ [[#vcf2images|vcf2images]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22 |
Subset the [[#VCF|VCF]] file with the annotated variants by gene. | Subset the [[#VCF|VCF]] file with the annotated variants by gene. | ||
- | $ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804 | + | $ [[#vcfSubsetByGene|vcfSubsetByGene]] ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22 |
<br> | <br> | ||
Line 400: | Line 427: | ||
=== Setting up the web server === | === Setting up the web server === | ||
+ | |||
+ | Make a TAR ball of the relevant files: | ||
+ | |||
+ | * Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22) | ||
+ | * File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt) | ||
+ | * File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt) | ||
+ | * Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz) | ||
+ | * Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi) | ||
+ | |||
+ | $ tar -cvf ALL.2of4intersection.20100804.chr22.tar \ | ||
+ | ALL.2of4intersection.20100804.chr22 \ | ||
+ | ALL.2of4intersection.20100804.chr22.geneSummary.txt \ | ||
+ | ALL.2of4intersection.20100804.chr22.sampleSummary.txt \ | ||
+ | ALL.2of4intersection.20100804.chr22.vcf.gz \ | ||
+ | ALL.2of4intersection.20100804.chr22.vcf.gz.tbi | ||
+ | |||
+ | Copy the relevant files to the web server ('''WEB_DATA_DIR''' specified in the [http://info.gersteinlab.org/VAT/download#Installation_and_Configuration_of_VAT VAT configuration file]) | ||
+ | $ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR | ||
+ | |||
+ | Unpack the TAR ball on the web server | ||
+ | $ tar -xvf ALL.2of4intersection.20100804.chr22.tar | ||
+ | |||
+ | '''View the results''': [http://dynamic.gersteinlab.org/people/lh372/vat_cgi?mode=process&dataSet=ALL.2of4intersection.20100804.chr22&annotationSet=gencode3c&type=coding Link to web server] |
Latest revision as of 14:23, 16 June 2011
Data formats
VCF
The Variant Call Format (VCF) is a tab-delimited text file format to represent a number of different genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This format was developed as part of the 1000 Genomes Project. A detailed summary of this file format can be found here. The annotation information is captured as part of the INFO field using the VA (Variant Annotation) tag. The string with the variant information has the following format:
AlleleNumber:GeneName:GeneId:Strand:Type:FractionOfTranscriptsAffected:{List of transcripts}
All annotated variants use the above format to capture information about the gene. The format describing the list of affected transcripts depends on the variant class (SNP, indel, or SV) and the variant type as shown in the table below:
The allele number refers to the numbering of the alleles. By definition, the reference allele has zero as the allele number, whereas the alternate alleles are numbered starting at one (some variants have more than one alternate allele). The type refers to the type of variant. For SNPs, the types can take on the following values (generated by snpMapper): synonymous, nonsynonymous, prematureStop, removedStop, and spliceOverlap. For indels (generated by indelMapper), the types can take on the following values: spliceOverlap, startOverlap, endOverlap, insertionFS, insertionNFS, deletionFS, deletionNFS, where FS denotes 'frameshift' and NFS indicates 'non-frameshift'. The term spliceOverlap (for both SNPs and indels) refers to a genetic variant that overlaps with a splice site (either two nucleotides downstream of an exon or two nucleotides upstream of an exon).
Example 1: A SNP introduces a premature stop codon. This variant affects one out of five transcripts for this gene.
chr1 23112837 . A T . PASS AA=A;AC=7;AN=118;DP=168;SF=2;VA=1:EPHB2:ENSG00000133216:+:prematureStop:1/5:EPHB2-001:ENST00000400191:3165_3055_1019_K->*
Example 2: A SNP leads to a non-synonymous substitution. This variant affects two out of four transcripts for this gene.
chr1 1110357 . G A . PASS AA=G;AC=3;AN=118;DP=203;SF=2;VA=1:TTLL10:ENSG00000162571:+:nonsynonymous:2/4:TTLL10-001:ENST00000379288:1212_1187_396_R->H:TTLL10-202:ENST00000400931:1212_1187_396_R->H
Example 3: A SNP causes a non-synonymous substitution in one transcript and a splice overlap in another transcript of the same gene.
chr9 35819390 rs2381409 C T . PASS AA=N;AC=157;AN=240;DP=49;SF=0,1;VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:TMEM8B-202:ENST00000360192:2109_166_56_P->S,1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:TMEM8B-001:ENST00000450762:2106
Example 4: An indel with two alternate alleles. Each alternate allele leads to a non-frameshift deletion.
chr7 140118541 . TACAACAACA T,TACA . PASS HP=1;VA=1:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQQ->L,2:AC006344.1:ENSG00000236914:+:deletionNFS:1/1:AC006344.1-201:ENST00000434223:66_23_8_LQQ->L
Notice that multiple annotation entries are comma-separated. Multiple annotation entries arise when a variant causes different types of effects on different transcripts (Example 3) or if there are multiple alternate alleles (Example 4).
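As a minimal sketch of how the VA string described above could be read programmatically, the Python snippet below splits the INFO field on ';', locates the VA tag, and breaks each comma-separated entry into its six leading fields. The names VaEntry and parse_va are hypothetical and not part of VAT. The transcript portion is kept as a raw list of colon-separated fields, because its layout depends on the variant class and type, and the sketch assumes that commas occur only between annotation entries.

```python
from typing import List, NamedTuple


class VaEntry(NamedTuple):
    """One annotation entry of the VA tag (hypothetical helper type)."""
    allele_number: int
    gene_name: str
    gene_id: str
    strand: str
    variant_type: str
    transcripts_affected: int
    transcripts_total: int
    transcript_fields: List[str]  # raw fields; layout depends on the variant type


def parse_va(info: str) -> List[VaEntry]:
    """Return all VA entries found in a VCF INFO string (empty list if none)."""
    va_value = None
    for item in info.split(";"):
        if item.startswith("VA="):
            va_value = item[len("VA="):]
            break
    if va_value is None:
        return []

    entries = []
    # Assumption: commas only separate annotation entries.
    for raw in va_value.split(","):
        allele, name, gene_id, strand, vtype, fraction, rest = raw.split(":", 6)
        affected, total = fraction.split("/")
        entries.append(
            VaEntry(int(allele), name, gene_id, strand, vtype,
                    int(affected), int(total), rest.split(":"))
        )
    return entries


if __name__ == "__main__":
    # INFO field taken from Example 3 above.
    example = ("AA=N;AC=157;AN=240;DP=49;SF=0,1;"
               "VA=1:TMEM8B:ENSG00000137103:+:nonsynonymous:1/7:"
               "TMEM8B-202:ENST00000360192:2109_166_56_P->S,"
               "1:TMEM8B:ENSG00000137103:+:spliceOverlap:1/7:"
               "TMEM8B-001:ENST00000450762:2106")
    for entry in parse_va(example):
        print(entry.allele_number, entry.gene_name, entry.variant_type)
```

Keeping the affected and total transcript counts as two integers, rather than as a reduced fraction, preserves values such as 2/4 exactly as they appear in the file.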
VAT also enables the grouping of samples. For example, samples can be assigned to different sub-populations or they can be designated as cases or controls. This is done by modifying the header line using vcfModifyHeader. Specifically, the sample is prefixed by a group identifier using the ':' character as a delimiter.
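The group:sample convention can be illustrated with a short Python sketch that reads a modified header line and collects the samples of each group. The function name samples_by_group is hypothetical, and the sketch assumes a VCF with genotype columns, where the sample names start after the nine fixed columns (#CHROM through FORMAT). Apart from CEU:NA0705, the sample names used in the demo are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Assumption: the VCF carries genotypes, so samples follow the 9 fixed columns.
FIXED_COLUMNS = 9


def samples_by_group(header_line: str) -> Dict[str, List[str]]:
    """Map each group identifier to its samples; ungrouped samples use ''."""
    columns = header_line.rstrip("\n").split("\t")
    groups = defaultdict(list)
    for column in columns[FIXED_COLUMNS:]:
        group, separator, sample = column.partition(":")
        if not separator:
            # No ':' present: the header has not been modified for this sample.
            group, sample = "", column
        groups[group].append(sample)
    return dict(groups)


if __name__ == "__main__":
    header = "\t".join([
        "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
        "CEU:NA0705", "CEU:SAMPLE_B", "YRI:SAMPLE_C",  # last two are hypothetical
    ])
    print(samples_by_group(header))
```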
Interval
The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of libBIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1].
1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)
Note: For the purpose of VAT, the name field in the Interval file must contain four pieces of information delimited by the '|' symbol (geneId|transcriptId|geneName|transcriptName). Using the gencode2interval program ensures proper formatting.
Example file:
ENSG00000008513|ENST00000319914|ST3GAL1|ST3GAL1-201 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000395320|ST3GAL1|ST3GAL1-202 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008513|ENST00000399640|ST3GAL1|ST3GAL1-203 chr8 - 134472009 134488267 6 134472009,134474117,134475656,134477020,134478136,134487961 134472180,134474237,134475702,134477200,134478333,134488267
ENSG00000008516|ENST00000325800|MMP25|MMP25-201 chr16 + 3097544 3105947 4 3097544,3100009,3100254,3105830 3097548,3100145,3100546,3105947
ENSG00000008516|ENST00000336577|MMP25|MMP25-202 chr16 + 3096918 3109096 10 3096918,3097415,3100009,3100254,3107033,3107310,3107531,3108181,3108412,3108827 3097017,3097548,3100145,3100547,3107210,3107395,3107614,3108334,3108670,3109096
In this example, each interval (line) represents a transcript, while the sub-intervals denote exons. The geneId is utilized to determine if multiple transcripts belong to the same gene model.
Note: the coordinates in the Interval format are zero-based and the end coordinate is not included.
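To make the coordinate convention concrete, the following Python sketch parses one line of an Interval file and tests whether a VCF position falls inside a sub-interval. The names Interval, parse_interval_line, and position_in_sub_interval are hypothetical and not part of VAT or libBIOS. The sketch relies on VCF positions being one-based, so the position is shifted by one before it is compared with the zero-based, end-exclusive Interval coordinates.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """One transcript line of an Interval file (hypothetical helper type)."""
    gene_id: str
    transcript_id: str
    gene_name: str
    transcript_name: str
    chromosome: str
    strand: str
    start: int
    end: int
    sub_intervals: List[Tuple[int, int]]  # zero-based, end not included


def parse_interval_line(line: str) -> Interval:
    """Parse the eight tab-delimited columns of an Interval line."""
    name, chrom, strand, start, end, count, sub_starts, sub_ends = (
        line.rstrip("\n").split("\t")
    )
    # VAT expects geneId|transcriptId|geneName|transcriptName in the name field.
    gene_id, transcript_id, gene_name, transcript_name = name.split("|")
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if len(starts) != int(count) or len(ends) != int(count):
        raise ValueError("sub-interval count does not match the coordinate lists")
    return Interval(gene_id, transcript_id, gene_name, transcript_name,
                    chrom, strand, int(start), int(end), list(zip(starts, ends)))


def position_in_sub_interval(interval: Interval, vcf_pos: int) -> bool:
    """Check a one-based VCF position against the half-open sub-intervals."""
    pos0 = vcf_pos - 1  # convert to zero-based
    return any(s <= pos0 < e for s, e in interval.sub_intervals)


if __name__ == "__main__":
    # MMP25-201 from the example file above, with tabs between the columns.
    line = ("ENSG00000008516|ENST00000325800|MMP25|MMP25-201\tchr16\t+\t"
            "3097544\t3105947\t4\t3097544,3100009,3100254,3105830\t"
            "3097548,3100145,3100546,3105947")
    transcript = parse_interval_line(line)
    print(transcript.transcript_name, len(transcript.sub_intervals))
    print(position_in_sub_interval(transcript, 3097548))
```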
List of programs
VAT Core Modules
snpMapper
snpMapper is a program to annotate a set of SNPs in VCF format. The program determines the effect of a SNP on the coding potential (synonymous, nonsynonymous, prematureStop, removedStop, spliceOverlap) of each transcript of a gene.
Usage:
snpMapper <annotation.interval> <annotation.fa>
- Inputs: Takes a VCF input from STDIN
- Outputs: Outputs annotated SNPs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
- Required arguments
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
- Optional arguments
- None
indelMapper
indelMapper is a program to annotate a set of indels in VCF format. The program determines the effect of an indel on the coding potential (frameshift insertion, non-frameshift insertion, frameshift deletion, non-frameshift deletion, spliceOverlap, startOverlap, endOverlap) of each transcript of a gene.
Usage:
indelMapper <annotation.interval> <annotation.fa>
- Inputs: Takes a VCF input from STDIN
- Outputs: Outputs annotated indels in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
- Required arguments
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- annotation.fa - File with the transcript sequences in FASTA format for each entry specified in annotation.interval. This file is typically generated by the interval2sequences program using the 'exonic' mode.
- Optional arguments
- None
svMapper
svMapper is a program to annotate a set of SVs in VCF format. The program determines whether an SV overlaps with different transcript isoforms of a gene.
Usage:
svMapper <annotation.interval>
- Inputs: Takes a VCF input from STDIN
- Outputs: Outputs annotated SVs in VCF format. The annotation information is captured as part of the INFO field. For details refer to the VCF format specification.
- Required arguments
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- Optional arguments
- None
genericMapper
genericMapper is a program to annotate a number of different variants in VCF format. The program checks whether a variant overlaps with entries in the specified annotation set (it does not determine the effect on the coding potential).
Usage:
genericMapper <annotation.interval> <nameFeature>
- Inputs: Takes a VCF input from STDIN
- Outputs: Outputs the annotated variants in VCF format. The annotation information is captured as part of the INFO field.
- Required arguments
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. This can be a generic Interval.
- nameFeature - Specifies the type of the annotation feature (for example promoter regions). The name of the feature is included as part of the annotation information (in the INFO field) in the resulting VCF file.
- Optional arguments
- None
vcfSummary
vcfSummary is a program to aggregate annotated variants across genes and samples.
Usage:
vcfSummary <file.vcf.gz> <annotation.interval>
- Inputs: None
- Outputs: Generates two output files. The first file, named file.geneSummary.txt, contains the number of variants categorized by type for each gene. The second file, named file.sampleSummary.txt, summarizes the number of variants categorized by type for each sample.
- Required arguments
- file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using bgzip and indexed using the tabix program.
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- Optional arguments
- None
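The kind of per-gene aggregation described above can be approximated with standard text tools. The sketch below is not vcfSummary and does not reproduce its output format. It only counts variants per gene and type by reading the GeneName and Type fields of the VA tag. The records are made up, and the fraction and transcript-list fields in them are placeholders.

```shell
workdir=$(mktemp -d)

# Four made-up annotated records (tab-delimited, INFO is column 8).
{
  printf '22\t100\t.\tA\tG\t.\tPASS\tVA=1:GENE1:ID1:+:nonsynonymous:1/1:T1\n'
  printf '22\t150\t.\tC\tT\t.\tPASS\tVA=1:GENE1:ID1:+:synonymous:1/1:T1\n'
  printf '22\t300\t.\tG\tA\t.\tPASS\tVA=1:GENE2:ID2:-:nonsynonymous:1/1:T2\n'
  printf '22\t400\t.\tT\tC\t.\tPASS\tVA=1:GENE1:ID1:+:nonsynonymous:1/1:T1\n'
} > "$workdir/annotated.vcf"

gene_type_counts=$(awk -F '\t' '
    !/^#/ {
        n = split($8, tags, ";")                 # INFO tags are separated by ";"
        for (i = 1; i <= n; i++)
            if (tags[i] ~ /^VA=/) {
                split(substr(tags[i], 4), f, ":")  # f[2] = GeneName, f[5] = Type
                count[f[2] "\t" f[5]]++
            }
    }
    END { for (key in count) print key "\t" count[key] }
' "$workdir/annotated.vcf" | LC_ALL=C sort)

echo "$gene_type_counts"
rm -rf "$workdir"
```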
vcf2images
vcf2images is a program to generate an image for each gene to visualize the effect of the annotated variants.
Usage:
vcf2images <file.vcf.gz> <annotation.interval> <outputDir>
- Inputs: None
- Outputs: Generates an image in PNG format for each gene that has at least one annotated variant.
- Required arguments
- file.vcf.gz - VCF file with annotated variants (this can be a mixture of SNPs, indels, and SVs). This file must be compressed using bgzip and indexed using the tabix program.
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- outputDir - The output directory where the images are stored
- Optional arguments
- None
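Because the programs in this group expect a bgzip-compressed VCF file with a tabix index, a small pre-flight check can catch a missing index early. The helper name has_tabix_index is hypothetical and not part of VAT. It only tests that file.vcf.gz and file.vcf.gz.tbi both exist, and the files created here are empty placeholders.

```shell
# Succeeds only when both the compressed VCF and its tabix index exist.
has_tabix_index() {
    [ -f "$1" ] && [ -f "$1.tbi" ]
}

workdir=$(mktemp -d)
: > "$workdir/indexed.vcf.gz"        # empty placeholders, not real bgzip files
: > "$workdir/indexed.vcf.gz.tbi"
: > "$workdir/unindexed.vcf.gz"

report=$(for f in "$workdir/indexed.vcf.gz" "$workdir/unindexed.vcf.gz"; do
    if has_tabix_index "$f"; then
        echo "$(basename "$f"): ready"
    else
        echo "$(basename "$f"): run tabix -p vcf first"
    fi
done)

echo "$report"
rm -rf "$workdir"
```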
vcfSubsetByGene
vcfSubsetByGene is a program to subset a VCF file with annotated variants by gene.
Usage:
vcfSubsetByGene <file.vcf.gz> <annotation.interval> <outputDir>
- Inputs: None
- Outputs: Generates a VCF file for each gene that has at least one annotated variant.
- Required arguments
- file.vcf.gz - VCF file with annotated variants (this can be a mixture of indels and SNPs). This file must be compressed using bgzip and indexed using the tabix program.
- annotation.interval - Annotation file representing the genomic coordinates of the gene models in Interval format. Each line in this file represents a transcript. This file is typically generated using the gencode2interval program.
- outputDir - The output directory where VCF files are stored
- Optional arguments
- None
vcfModifyHeader
vcfModifyHeader is a program to modify the header line (part of the meta-lines) in a VCF file. Specifically, it assigns each sample to a group or population (these assignments are used by other programs including vcfSummary).
Usage:
vcfModifyHeader <oldHeader.vcf> <groups.txt>
- Inputs: None
- Outputs: Generates a VCF header file.
- Required arguments
- oldHeader.vcf - The meta lines of a VCF file. It can be obtained by using the following command:
grep '^#' file.vcf > file.header.vcf
- groups.txt - This tab-delimited file assigns each sample present in the VCF to a group/population. Here is a small sample file:
HG00629	CHS
HG00634	CHS
HG00635	CHS
HG00637	PUR
HG00638	PUR
HG00640	PUR
NA06984	CEU
NA06985	CEU
NA06986	CEU
NA06989	CEU
NA06994	CEU
- Optional arguments
- None
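Before running vcfModifyHeader it can be useful to confirm that every sample in the header has a group assignment. The following sketch is an assumption about how such a check could look, not something VAT provides. It relies on the standard VCF layout in which sample names start at column 10 of the #CHROM line, and the header and groups files are made up.

```shell
workdir=$(mktemp -d)

# Made-up groups file (sample<TAB>group) and header line with three samples.
printf 'HG00629\tCHS\nNA06984\tCEU\n' > "$workdir/groups.txt"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00629\tNA06984\tNA06985\n' \
    > "$workdir/file.header.vcf"

missing=$(awk -F '\t' '
    NR == FNR { seen[$1] = 1; next }     # first file: samples that have a group
    /^#CHROM/ {
        for (i = 10; i <= NF; i++)       # sample columns start at column 10
            if (!($i in seen)) print $i
    }
' "$workdir/groups.txt" "$workdir/file.header.vcf")

echo "samples without a group: ${missing:-none}"
rm -rf "$workdir"
```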
Auxiliary programs
gencode2interval
gencode2interval converts a GENCODE annotation file (in GTF format) to the Interval format.
Usage:
gencode2interval
- Inputs: Takes a GENCODE annotation file in GTF format from STDIN
- Outputs: Outputs the GENCODE annotation file in Interval format to STDOUT
- Required arguments
- None
- Optional arguments
- None
Note: To obtain the coding sequences of the elements with gene_type protein_coding and transcript_type protein_coding, the following commands should be used:
awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
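To see what the filter keeps, it can be run on a tiny hand-made GTF file. The three records below are invented for illustration (real GENCODE lines carry more attributes). Only the first one is a CDS element whose gene_type and transcript_type are both protein_coding.

```shell
workdir=$(mktemp -d)
gtf="$workdir/tiny.gtf"

{
  printf 'chr22\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; transcript_type "protein_coding";\n'
  printf 'chr22\tHAVANA\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; transcript_type "protein_coding";\n'
  printf 'chr22\tENSEMBL\tCDS\t300\t400\t.\t-\t0\tgene_id "G2"; transcript_id "T2"; gene_type "protein_coding"; transcript_type "nonsense_mediated_decay";\n'
} > "$gtf"

# Same three-stage filter as above: CDS features, then gene_type, then transcript_type.
kept=$(awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' "$gtf" \
    | grep 'gene_type "protein_coding"' \
    | grep 'transcript_type "protein_coding"')
kept_count=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')

echo "records kept: $kept_count"
rm -rf "$workdir"
```

The exon record is dropped by the awk stage and the second CDS record by the transcript_type stage.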
interval2sequences
interval2sequences is a program to retrieve genomic or exonic sequences for an annotation set in Interval format.
Usage:
interval2sequences <file.2bit> <file.annotation> <exonic|genomic>
- Inputs: None
- Outputs: Reports the extracted sequences in FASTA format
- Required arguments
- file.2bit - genome reference sequence in 2bit format
- file.annotation - annotation set in Interval format (each line represents one transcript)
- < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
- Optional arguments
- None
External programs
bgzip/tabix
Tabix is a generic tool that indexes position-sorted files in tab-delimited formats to facilitate fast retrieval. This tool was developed by Heng Li. For more information consult the tabix documentation page.
VCF tools
VCF tools is a suite of modules for manipulating VCF files. For more information consult the documentation page.
Example workflow
This workflow shows how the chromosome 22 SNP calls from Phase I of the 1000 Genomes Project were processed.
Prerequisites
Download the GENCODE annotation set (version 3c, hg19):
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz
Download the human genome (hg19) in 2bit format. This is used by interval2sequences to extract the genomic sequences for the entries specified in the annotation set:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
Download the SNP calls in VCF format, the corresponding tabix index, and a third file that assigns each sample to a population:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/20100804.ALL.panel
Extract variants on chromosome 22:
$ tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 22 | bgzip -c > ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz
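The tabix command above uses the index to pull out chromosome 22 and, because of -h, keeps the header lines. The effect can be illustrated on a small uncompressed file with awk. This is only a sketch of the result, not a replacement for tabix, and the records are made up.

```shell
workdir=$(mktemp -d)

{
  printf '##fileformat=VCFv4.0\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '21\t500\t.\tA\tG\t.\tPASS\t.\n'
  printf '22\t100\t.\tC\tT\t.\tPASS\t.\n'
  printf '22\t200\t.\tG\tA\t.\tPASS\t.\n'
} > "$workdir/all.vcf"

# Keep every header line plus the records whose first column is exactly "22".
awk -F '\t' '/^#/ || $1 == "22"' "$workdir/all.vcf" > "$workdir/chr22.vcf"

header_lines=$(grep -c '^#' "$workdir/chr22.vcf")
record_lines=$(grep -vc '^#' "$workdir/chr22.vcf")
echo "header lines: $header_lines, chr22 records: $record_lines"
rm -rf "$workdir"
```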
Preprocessing of the annotation file
Decompress the annotation file:
$ gunzip gencode.v3c.annotation.GRCh37.gtf.gz
Extract the coding sequence (CDS) elements where both the gene_type and transcript_type are protein_coding:
$ awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v3c.annotation.GRCh37.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf
Convert the GENCODE GTF file into Interval format:
$ gencode2interval < gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.gtf > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
Retrieve the exonic sequences for the transcripts specified in the annotation file:
$ interval2sequences hg19.2bit gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval exonic > gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa
Annotation of the SNPs
Annotate the variants using snpMapper
$ zcat ALL.2of4intersection.20100804.chr22.genotypes.vcf.gz | snpMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf
Modification of the VCF header line
Modify the VCF header line to assign individual samples to populations (groups). This is done by using the following syntax: group:sample (e.g. CEU:NA0705).
First get the old meta-data lines:
$ grep "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf
Store the annotated variants in a separate file:
$ grep -v "^#" ALL.2of4intersection.20100804.chr22.genotypes.annotated.vcf > ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf
Create the new meta-data lines:
$ vcfModifyHeader ALL.2of4intersection.20100804.chr22.genotypes.annotated.oldHeader.vcf 20100804.ALL.panel > ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf
Merge the new meta-data lines with the annotated variants and create a new file called ALL.2of4intersection.20100804.chr22.vcf:
$ cat ALL.2of4intersection.20100804.chr22.genotypes.annotated.newHeader.vcf ALL.2of4intersection.20100804.chr22.genotypes.annotated.variants.vcf > ALL.2of4intersection.20100804.chr22.vcf
Compress the newly created VCF file with the annotated variants:
$ bgzip ALL.2of4intersection.20100804.chr22.vcf
Index the newly created VCF file with the annotated variants:
$ tabix -p vcf ALL.2of4intersection.20100804.chr22.vcf.gz
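The split, modify, and merge steps of this section can be rehearsed on a tiny file before touching the real data. In the sketch below the awk step is only a stand-in for vcfModifyHeader: it is an assumption about what a group:sample header looks like, and the actual output of vcfModifyHeader may differ. The grep patterns are anchored with ^ so that only lines starting with # count as meta-data. All records and sample assignments are made up.

```shell
workdir=$(mktemp -d)
vcf="$workdir/annotated.vcf"

# Tiny made-up VCF: one meta line, the column header, two records.
{
  printf '##fileformat=VCFv4.0\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00629\tNA06984\n'
  printf '22\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0\n'
  printf '22\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|1\n'
} > "$vcf"
printf 'HG00629\tCHS\nNA06984\tCEU\n' > "$workdir/groups.txt"

# Split: lines starting with "#" are meta-data, everything else is a variant.
grep '^#' "$vcf" > "$workdir/oldHeader.vcf"
grep -v '^#' "$vcf" > "$workdir/variants.vcf"

# Stand-in for vcfModifyHeader (not the VAT code): prefix each sample with its group.
awk -F '\t' -v OFS='\t' '
    NR == FNR { group[$1] = $2; next }
    /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i in group) $i = group[$i] ":" $i }
    { print }
' "$workdir/groups.txt" "$workdir/oldHeader.vcf" > "$workdir/newHeader.vcf"

# Merge the new header with the untouched variant lines.
cat "$workdir/newHeader.vcf" "$workdir/variants.vcf" > "$workdir/merged.vcf"

sample_columns=$(grep '^#CHROM' "$workdir/merged.vcf" | cut -f 10-)
merged_lines=$(wc -l < "$workdir/merged.vcf" | tr -d ' ')
echo "$sample_columns"
rm -rf "$workdir"
```

The merged file has the same number of lines as the input, and only the sample names in the #CHROM line have changed.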
Generation of summaries and images
Generate gene and sample summaries for the annotated variants:
$ vcfSummary ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval
Resulting files: ALL.2of4intersection.20100804.chr22.geneSummary.txt and ALL.2of4intersection.20100804.chr22.sampleSummary.txt
Make a new directory to store the images and VCF files for each gene.
$ mkdir ALL.2of4intersection.20100804.chr22
Generate an image for each gene with at least one annotated variant.
$ vcf2images ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22
Subset the VCF file with the annotated variants by gene.
$ vcfSubsetByGene ALL.2of4intersection.20100804.chr22.vcf.gz gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval ./ALL.2of4intersection.20100804.chr22
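When these steps are scripted, the summary file names and the output directory used above can be derived from the name of the compressed VCF file. This is a convenience sketch using shell parameter expansion. The variable names are arbitrary, and the naming rule is inferred from the file names shown in this section.

```shell
vcf=ALL.2of4intersection.20100804.chr22.vcf.gz

# Strip the ".vcf.gz" suffix to obtain the common prefix.
prefix=${vcf%.vcf.gz}

gene_summary="$prefix.geneSummary.txt"
sample_summary="$prefix.sampleSummary.txt"
output_dir="./$prefix"

echo "$gene_summary"
echo "$sample_summary"
echo "$output_dir"
```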
Setting up the web server
Make a TAR ball of the relevant files:
- Directory with the images and the VCF files for each gene (ALL.2of4intersection.20100804.chr22)
- File with the gene summary (ALL.2of4intersection.20100804.chr22.geneSummary.txt)
- File with the sample summary (ALL.2of4intersection.20100804.chr22.sampleSummary.txt)
- Compressed VCF file with the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz)
- Index file of the annotated variants (ALL.2of4intersection.20100804.chr22.vcf.gz.tbi)
$ tar -cvf ALL.2of4intersection.20100804.chr22.tar \
    ALL.2of4intersection.20100804.chr22 \
    ALL.2of4intersection.20100804.chr22.geneSummary.txt \
    ALL.2of4intersection.20100804.chr22.sampleSummary.txt \
    ALL.2of4intersection.20100804.chr22.vcf.gz \
    ALL.2of4intersection.20100804.chr22.vcf.gz.tbi
Copy the relevant files to the web server (WEB_DATA_DIR specified in the VAT configuration file)
$ scp ALL.2of4intersection.20100804.chr22.tar user@webserver:/path/to/WEB_DATA_DIR
Unpack the TAR ball on the web server
$ tar -xvf ALL.2of4intersection.20100804.chr22.tar
View the results: Link to web server
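Before copying the TAR ball to the web server, its contents can be listed to confirm that the directory and the four files are all present. The sketch below rehearses this with empty placeholder files. The prefix demo.chr22 and the image name GENE1.png are made up.

```shell
workdir=$(mktemp -d)
prefix=demo.chr22

# Placeholder directory and files standing in for the real outputs.
mkdir "$workdir/$prefix"
: > "$workdir/$prefix/GENE1.png"
for suffix in geneSummary.txt sampleSummary.txt vcf.gz vcf.gz.tbi; do
    : > "$workdir/$prefix.$suffix"
done

# -C makes tar store paths relative to the working directory.
tar -cf "$workdir/$prefix.tar" -C "$workdir" \
    "$prefix" "$prefix.geneSummary.txt" "$prefix.sampleSummary.txt" \
    "$prefix.vcf.gz" "$prefix.vcf.gz.tbi"

# List the archive and count the distinct top-level entries.
top_level=$(tar -tf "$workdir/$prefix.tar" | sed 's|/.*||' | sort -u | wc -l | tr -d ' ')

echo "top-level entries in archive: $top_level"
rm -rf "$workdir"
```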