FunSVPT

Latest revision as of 01:41, 8 April 2014


Variants Prioritization

A. Dependencies

The following tools are required:

  • sed, awk, grep
  • bedtools (version bedtools-2.17.0)
  • tabix (version tabix-0.2.6 and up)
  • VAT (snpMapper and indelMapper modules) - a good installation guide for VAT can be found at http://ngsda.blogspot.com/2011/06/vat.html.

If you are only interested in non-coding variants, you do not need to install VAT, but remember to use the '-nc' option.

  • TFMpvalue-sc2pv (http://bioinfo.lifl.fr/TFM/TFMpvalue/)
  • bigWigAverageOverBed - used to retrieve GERP scores. Note that the GERP data file is ~7G. If you are not interested in GERP scores, the GERP file and bigWigAverageOverBed are not needed.

Only needed for differential gene expression analysis.

Required for parallel running.
Please make sure you have Perl 5 or later.

B. Tool installation

This is a Perl- and Linux/UNIX-based tool. At the command-line prompt, type the following commands. Their purpose is to add the path of funSVPT.pm to your Perl environment.

$ tar xvf funSVPT.v.0.1.tar
$ cd funsvpt-0.1/
$ cd funSVPT/
$ perl Makefile.PL
$ make 
$ make test
$ make install

If you do not have permission to modify the environment, open your ‘.bashrc’ file, add the following lines to the end of the file, and then run ‘source .bashrc’.

PERL5LIB=${PERL5LIB}:$path_of_the_tool/funsvpt-0.1/funSVPT/lib
export PERL5LIB
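
The two lines above append the library directory to PERL5LIB. As a minimal sketch (the function name `add_perl5lib` and the example path are hypothetical and not part of the tool), the same step can be wrapped so that an empty or unset PERL5LIB does not leave a stray leading ':' in the result:

```shell
# Hypothetical helper, not shipped with FunSVPT: append a directory to PERL5LIB.
# ${PERL5LIB:+$PERL5LIB:} expands to "$PERL5LIB:" only when PERL5LIB is
# non-empty, so no leading ':' is produced when the variable is unset.
add_perl5lib() {
    dir=$1
    PERL5LIB="${PERL5LIB:+$PERL5LIB:}$dir"
    export PERL5LIB
}

# Example call (the path is a placeholder for your own install location):
#   add_perl5lib "$HOME/tools/funsvpt-0.1/funSVPT/lib"
# Assuming the module is loadable as 'funSVPT', this should then exit quietly:
#   perl -MfunSVPT -e 1
```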

C. Pre-built Data Context

All of the data can be downloaded under ‘Downloads’ on the web server. If you would like to use the data, please download the files and put them under ‘funsvpt-0.1/data’.

D. Tool Usage

To display the usage of the tool, type ‘./run.sh’.

* Usage : ./run.sh -f file -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc -o path -g file -exp file 
          -cls file -exf   <rpkm/raw> -p int -cancer cancer_type -s score -uw -ua user_annotations_directory
       Options :
               	-f		[Required] User Input SNVs File
               	-inf	 	[Required] Input format - BED or VCF
               	-maf 		[Optional] Minor Allele Frequency Threshold to filter 1KG SNVs,default = 0 
               	-m		[Optional] 1 - Somatic Genome (default); 2 - Germline or Personal Genome
               	-outf	 	[Optional] Output format - BED or VCF,default is VCF
               	-nc		[Optional] Only do non-coding analysis, no need of VAT (variant annotation tool)
               	-o		[Optional] Output path, default is the directory 'out'
               	-g		[Optional] gene list, only output variants associated with selected genes. 	
               	-exp		[Optional] gene expression matrix
               	-cls		[Optional] class file for samples in gene expression matrix
               	-exf		[Optional] gene expression format - rpkm / raw
               	-p		[Optional] Number of genomes to parallel, default = 5
               	-cancer		[Optional] cancer type from recurrence database, default is all cancer types
               	-uw		[Optional] Use unweighted scoring scheme, default is weighted
               	-s		[Optional] Score threshold to call non-coding candidates, default = 1.5 
               			for weighted scoring & default = 5 for unweighted scoring
               	-ua		[Optional] Directory containing user annotations. Default is 'data/user_annotations'
       
       Multiple Genomes with Recurrent Output	
               	Option 1: Separate multiple files by ','
               	Example: ./run.sh -f file1,file2,file3,... -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> ...
               	Option 2: Use the 6th column of BED file to specify samples
               	Example: ./run.sh -f file -maf MAF -m <1/2> -inf bed -outf <bed/vcf> ...
               	
               	NOTE: Please make sure you have sufficient memory, at least 3G.

-maf : should be a number between 0 and 1.
-nc : when using this option, users don’t need to install VAT (variant annotation tool).
-exp, -cls, -exf : if used, they must be specified together.
-m : mode 2 is provided for germline or personal genomes. In this mode the mutated allele is compared with the ancestral allele, since the functional impact of a variant reflects the historical event in which the polymorphism was first introduced into the human population.
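
The constraints on -maf and on the three expression options can be checked before a long run is started. The following is only a sketch: the function names `valid_maf` and `expr_opts_consistent` are hypothetical and are not part of run.sh.

```shell
# Hypothetical pre-flight checks; these functions are not part of FunSVPT.

# Succeeds if $1 is a plain decimal number in the range [0, 1].
valid_maf() {
    printf '%s\n' "$1" |
        awk '/^[0-9]*\.?[0-9]+$/ && $1 >= 0 && $1 <= 1 { ok = 1 } END { exit !ok }'
}

# Succeeds if the values for -exp, -cls and -exf are all set or all empty.
expr_opts_consistent() {
    exp_file=$1
    cls_file=$2
    exf=$3
    n=0
    [ -n "$exp_file" ] && n=$((n + 1))
    [ -n "$cls_file" ] && n=$((n + 1))
    [ -n "$exf" ] && n=$((n + 1))
    [ "$n" -eq 0 ] || [ "$n" -eq 3 ]
}

# Example:
#   valid_maf 0.01 && expr_opts_consistent expr.txt class.txt rpkm \
#       && ./run.sh -f variants.bed -inf bed -maf 0.01 \
#                   -exp expr.txt -cls class.txt -exf rpkm
```

The file names in the example call are placeholders.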

E. Input files

  • User input file (-f): either BED or VCF format. For indels, please use “-” rather than any other symbol in the ‘allele’ columns to denote insertions or deletions. Indels are analyzed for BED-format input.
BED format. In addition to the three required BED fields, please prepare your files as follows (5 required fields, tab delimited; 
the 6th column is reserved for sample names, do not put other information there): 
chromosome, start position, end position, reference allele, and alternative allele.
       Chromosome - name of the chromosome (e.g. chr3, chrX)
       Start position - start coordinates of variants. (0-based)
       End position - end coordinates of variants. (end exclusive)
               e.g., chr1   0     100  spanning bases numbered 0-99
       Reference allele - germline allele of variants
       Alternative allele - mutated allele of variants
       
VCF format. The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited): 

#CHROM POS ID REF ALT  QUAL FILTER INFO
       
Recurrent analysis input format
       Option 1: separate files for each genome (BED or VCF). Use “-f file1,file2,file3”, separated by commas.
       Option 2: put all variants in one file (only for BED format, use the 6th column labeling sample names). Use “-f file”.
  • Gene list format (-g): If you are interested in a particular set of genes, you can put your genes in one file (one gene per row) and use “-g file” to analyze only variants in or associated with those genes. Please use gene symbols.
  • Gene expression format (-exp): Users can also upload a gene expression file, which the program uses to detect differentially expressed genes between cancer and benign samples and to highlight variants associated with these genes. The file should be a tab-delimited matrix in which the first column stores gene names (use gene symbols) and the first row stores sample names. The remaining fields are gene expression values, either RPKM or raw read counts.
       Gene	Sample1	Sample2	Sample3	Sample4	…
       A1BG	1	5	40	0	…
       A1CF	20	9	0	23	…
       …	…	…	…	…	…
  • Sample class format (-cls): In addition to the expression file, users need to upload a file annotating each sample as “cancer” or “benign” (these are the only two classes allowed). The number of samples in this file should equal the number in the expression data, and the sample names should match.
       Sample1	benign
       Sample2	cancer
       Sample3	cancer
       Sample4	benign
       …	…
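
Two of the input conventions above lend themselves to a quick check before running the tool. The sketch below is not part of FunSVPT and both function names are hypothetical. `vcf_snvs_to_bed` illustrates how a 1-based VCF position maps onto the 0-based, end-exclusive BED coordinates (start = POS - 1, end = POS) for single-base substitutions. It deliberately skips indels, because their BED representation uses the “-” convention. `samples_match` checks that the sample names in the header row of the expression matrix are the same as those in the first column of the class file.

```shell
# Hypothetical helpers; not part of FunSVPT.

# Read VCF on stdin; print single-base substitutions as 5-column BED:
# chromosome, 0-based start, exclusive end, reference allele, alternative allele.
vcf_snvs_to_bed() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }                                  # skip header lines
        $4 ~ /^[ACGTNacgtn]$/ && $5 ~ /^[ACGTNacgtn]$/ {
            print $1, $2 - 1, $2, $4, $5               # POS is 1-based
        }'
}

# Succeeds if the sample names in the first row of the expression matrix ($1)
# equal the sample names in the first column of the class file ($2).
samples_match() {
    expr_names=$(head -n 1 "$1" | tr '\t' '\n' | tail -n +2 | sort)
    cls_names=$(cut -f 1 "$2" | sort)
    [ "$expr_names" = "$cls_names" ]
}

# Example (file names are placeholders):
#   vcf_snvs_to_bed < sample.vcf > sample.bed
#   samples_match expr.txt class.txt || echo "sample names differ" >&2
```

With the coding variant shown in the sample output of this page, VCF position 36205042 on chr1 becomes the BED interval 36205041-36205042.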

F. Output files

Five output files will be generated:

  • ‘Output.format’: detailed results for all samples.
  • ‘Output.indel.format’: results for indels.
  • ‘Recur.Summary’: the recurrence results when multiple samples are analyzed.
  • ‘Candidates.Summary’: brief output of potential candidates, namely coding nonsynonymous/prematurestop variants, non-coding variants with a score >= 5 under the un-weighted scoring scheme or >= 1.5 under the weighted scoring scheme, and variants in or associated with known cancer genes.
  • ‘Error.log’: error information.

For the un-weighted scoring scheme, each feature is given a value of 1.

When gene expression files are provided, two additional files are produced: ‘DE.gene.txt’ stores the differentially expressed genes and ‘DE.pdf’ is the differential gene expression plot.
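
If a different non-coding cutoff is wanted after a run has finished, the VCF-format output can be re-filtered on the NCDS (non-coding score) key of its INFO column. This is a minimal sketch, assuming the INFO column is the 8th tab-delimited field and each record occupies a single line. The function name `filter_by_ncds` is hypothetical and not part of the tool.

```shell
# Hypothetical helper; not part of FunSVPT.
# Read VCF-format output on stdin and print the header lines plus every record
# whose NCDS value in the INFO column is >= the threshold given as $1.
filter_by_ncds() {
    awk -F '\t' -v thr="$1" '
        /^#/ { print; next }                     # keep header lines
        {
            n = split($8, kv, ";")               # INFO is key=value;key=value...
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^NCDS=/) {
                    score = substr(kv[i], 6)     # text after "NCDS="
                    if (score + 0 >= thr + 0) print
                    break
                }
        }'
}

# Example (file names are placeholders):
#   filter_by_ncds 2 < out/Output.vcf > noncoding.ge2.vcf
```

Records without an NCDS key, such as coding variants, are dropped by this filter.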

  • Sample BED format output
Header: 
chr     start   end     ref     alt     sample   gerp;cds;variant.annotation.cds;network.hub;gene.under.negative.selection;
ENCODE.annotated;hot.region;motif.analysis;sensitive;ultra.sensitive;ultra.conserved;target.gene[known_cancer_gene/
TF_regulating_known_cancer_gene,differential_expressed_in_cancer,actionable_gene];coding.score;noncoding.score;
recurrence.within.samples;recurrence.database

Coding variant:
chr1    36205041        36205042        C       A       PR2832  5.6;Yes;VA=1:CLSPN:ENSG00000092853.9:-:prematureStop:
4/4:CLSPN-001:ENST00000251195.5:3999_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4020_3232_1078_E->*:
CLSPN-003:ENST00000373220.3:3828_3040_1014_E->*:CLSPN-004:ENST00000520551.1:3861_3073_1025_E-
>*;PPI;Yes;.;.;.;.;.;.;CLSPN;5;.;.;.

Non-coding variant:
chr6    152304995       152304996       A       G       PR2832  2.63;No;.;ESR1:PHOS(0.276)PPI(0.995)REG(0.994);.;.;.;.;.;.;.;
ESR1(Intron)[TF_regulating_known_cancer_gene:H3F3A,MN1,PRCC,RARA,SLC34A2,TPM3][actionable];.;1.60983633568013;.;.
  • Sample VCF format output
Header: 
##fileformat=VCFv4.0
##INFO=<ID=OTHER,Number=.,Type=String, Description = "Other Information From Original File">
##INFO=<ID=SAMPLE,Number=.,Type=String,Description="Sample id">
##INFO=<ID=CDS,Number=.,Type=String,Description="Coding Variants or not">
##INFO=<ID=VA,Number=.,Type=String,Description="Coding Variant Annotation">
##INFO=<ID=HUB,Number=.,Type=String,Description="Network Hubs, PPI (protein protein interaction network), REG (regulatory network),  
PHOS (phosphorylation network)...">
##INFO=<ID=GNEG,Number=.,Type=String,Description="Gene Under Negative Selection">
##INFO=<ID=GERP,Number=.,Type=String,Description="Gerp Score">
##INFO=<ID=NCENC,Number=.,Type=String,Description="NonCoding ENCODE Annotation">
##INFO=<ID=HOT,Number=.,Type=String,Description="Highly Occupied Target Region">
##INFO=<ID=MOTIFBR,Number=.,Type=String,Description="Motif Breaking">
##INFO=<ID=MOTIFG,Number=.,Type=String,Description="Motif Gain">
##INFO=<ID=SEN,Number=.,Type=String,Description="In Sensitive Region">
##INFO=<ID=USEN,Number=.,Type=String,Description="In Ultra-Sensitive Region">
##INFO=<ID=UCONS,Number=.,Type=String,Description="In Ultra-Conserved Region">
##INFO=<ID=GENE,Number=.,Type=String,Description="Target Gene (For coding - directly affected genes ; For non-coding - promoter or  
distal regulatory module)">
##INFO=<ID=CANG,Number=.,Type=String,Description="Prior Gene Information, e.g.[cancer][TF_regulating_known_cancer_gene]
[up_regulated][actionable]...";
##INFO=<ID=CDSS,Number=.,Type=String,Description="Coding Score">
##INFO=<ID=NCDS,Number=.,Type=String,Description="NonCoding Score">
##INFO=<ID=RECUR,Number=.,Type=String,Description="Recurrent elements / variants">
##INFO=<ID=DBRECUR,Number=.,Type=String,Description="Recurrence database">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

Coding variant:
chr1    36205042        .       C       A       .       .       SAMPLE=PR2832;GERP=5.6;CDS=Yes;VA=1:CLSPN:ENSG00000092853.9:-
:prematureStop:4/4:CLSPN-001:ENST00000251195.5:3999_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4020_3232_1078_E-
>*:CLSPN-003:ENST00000373220.3:3828_3040_1014_E->*:CLSPN-004:ENST00000520551.1:3861_3073_1025_E-
>*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=5

Non-coding variant:
chr6    152304996       .       A       G       .       .       SAMPLE=PR2832;GERP=2.63;CDS=No;HUB=ESR1:PHOS(0.276)PPI(0.995)REG(0.994);
GENE=ESR1(Intron);CANG=ESR1[TF_regulating_known_cancer_gene:H3F3A,MN1,PRCC,RARA,SLC34A2,TPM3]
[actionable];NCDS=1.60983633568013
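The INFO column in the sample records above is a ';'-separated list of 'TAG=value' pairs. A minimal Python sketch for unpacking it follows; the function name `parse_info` is illustrative and not part of the tool, and the example string is an abbreviated copy of the non-coding record with the line wraps removed.

```python
def parse_info(info):
    """Split a VCF INFO string into a {tag: value} dictionary."""
    fields = {}
    for item in info.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only, so values keep any later '=' characters.
        key, sep, value = item.partition("=")
        # Tags without '=' are treated as flags (an assumption of this sketch).
        fields[key] = value if sep else True
    return fields


# Abbreviated from the non-coding sample record above.
info = (
    "SAMPLE=PR2832;GERP=2.63;CDS=No;"
    "HUB=ESR1:PHOS(0.276)PPI(0.995)REG(0.994);"
    "GENE=ESR1(Intron);NCDS=1.60983633568013"
)
record = parse_info(info)
print(record["GENE"], float(record["NCDS"]))
```

All values are kept as strings, since every tag in the header is declared with Type=String; numeric fields such as GERP or NCDS need an explicit conversion.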
  • Output description (VCF format as an example)
VA (variants annotation)
      This is the output produced by VAT (variant annotation tool) for coding variants. 
      Please refer to ‘http://vat.gersteinlab.org’ for documentation. 

NCENC (Non-coding ENCODE annotation)
      Example: ‘NCENC=TFP(CEBPB|chr5:139639150-139639496),TFP(STAT3|chr5:139638936-139640136),TFP(STAT3|chr5:139638976-
      139639553),TFP(STAT3|chr5:139638989-139639544),TFP(STAT3|chr5:139638999-139639716)’ 
      This is formatted as “category(element_name|chromosome:coordinates)” (0-based, end exclusive). 
             TFP - transcription factor binding peak. 
             TFM - transcription factor bound motifs in peak regions. 
             DHS - DNase I hypersensitive sites, with the number of cell lines (MCV, out of 125 cell lines in total). For cell-line information, please refer to ‘DHS cell lines’. 
             ncRNA - non-coding RNA 
             Pseudogene 
             Enhancer - chmm/segway (genome segmentation), drm (distal regulatory module) 
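      The “category(element_name|chromosome:coordinates)” layout can be unpacked with a regular expression. The following Python sketch assumes every entry follows exactly that pattern; the names `NCENC_ENTRY` and `parse_ncenc` are illustrative, and entries in any other shape are silently skipped.

```python
import re

# category(element_name|chromosome:start-end)
NCENC_ENTRY = re.compile(r"(\w+)\(([^|()]+)\|([^:()]+):(\d+)-(\d+)\)")


def parse_ncenc(value):
    """Return (category, element, chromosome, start, end) for each entry."""
    return [
        (category, element, chrom, int(start), int(end))
        for category, element, chrom, start, end in NCENC_ENTRY.findall(value)
    ]


# Shortened from the NCENC example above.
value = "TFP(CEBPB|chr5:139639150-139639496),TFP(STAT3|chr5:139638936-139640136)"
for entry in parse_ncenc(value):
    print(entry)
```

      The coordinates are returned as given in the field, that is 0-based with an exclusive end.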


HOT (highly occupied region)
      Example: ‘HOT=Helas3’
      If a variant occurs in HOT regions, the corresponding cell lines (5 in total) are shown. This annotation is from (Yip, et al., 2012). 

MOTIFBR (motif-breaking analysis)
      SNV Example: ‘MOTIFBR=MAX#Myc_known9_8mer#102248644#102248656#-#9#0.068966#0.931034’
      The variant causes a motif-breaking event. This field is delimited by ‘#’ and is defined as follows: 
      TF name # motif name # motif start # motif end # motif strand # mutation position # alternative allele frequency in PWM # reference allele frequency in PWM. (0-based, end exclusive)
      
      Indel Example: ‘MOTIFBR=TCF12#TCF12_disc5_8mer#115719379#115719390#+’
      This field is delimited by ‘#’ and is defined as follows: 
      TF name # motif name # motif start # motif end # motif strand. (0-based, end exclusive)
 
MOTIFG (motif-gaining analysis)
      SNV Example: ‘MOTIFG=GATA_known5#75658824#75658829#-#1#4.839#4.181’

      The variant causes a motif-gaining event. The field is delimited by ‘#’: motif name # motif start # motif end # motif strand # mutation position 
      # sequence score with alternative allele # sequence score with reference allele. (0-based, end exclusive)
      
      Indel example: ‘MOTIFG=Ets_known10#CGGAAA#6#+#5.743’

      The field is delimited by ‘#’: motif name # motif sequence discovered # motif length # motif strand # sequence score with alternative allele.
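      The four ‘#’-delimited layouts described for MOTIFBR and MOTIFG can be mapped onto named fields as in the Python sketch below. The field names (`tf`, `alt_freq`, and so on) are shorthand chosen for this example rather than names used by the tool, and the sketch handles one event at a time.

```python
# Field layouts taken from the MOTIFBR / MOTIFG descriptions above.
MOTIFBR_SNV = ("tf", "motif", "start", "end", "strand",
               "position", "alt_freq", "ref_freq")
MOTIFBR_INDEL = ("tf", "motif", "start", "end", "strand")
MOTIFG_SNV = ("motif", "start", "end", "strand",
              "position", "alt_score", "ref_score")
MOTIFG_INDEL = ("motif", "sequence", "length", "strand", "alt_score")


def parse_hash_fields(value, names):
    """Split one '#'-delimited event and label its parts with `names`."""
    parts = value.split("#")
    if len(parts) != len(names):
        raise ValueError(
            "expected %d fields, found %d in %r" % (len(names), len(parts), value)
        )
    return dict(zip(names, parts))


broken = parse_hash_fields(
    "MAX#Myc_known9_8mer#102248644#102248656#-#9#0.068966#0.931034",
    MOTIFBR_SNV,
)
gained = parse_hash_fields("Ets_known10#CGGAAA#6#+#5.743", MOTIFG_INDEL)
print(broken["tf"], broken["strand"], gained["sequence"])
```

      Values are left as strings; convert positions and scores with `int` or `float` as needed.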

GENE (target gene - for coding: directly affected genes; for non-coding: promoter or distal regulatory module)
      Example: ‘GENE=ARNT2(Enhancer),C15orf26(Intron),IL16(Enhancer)’
      For non-coding variants, each gene is annotated with an ‘intron’, ‘promoter’, ‘UTR’ or ‘Enhancer’ tag.

CANG (cancer related information)
      Example: ‘CANG=EGFR[actionable][cancer]’
      This field stores all gene-related information. Currently there are five possible tags:
             [cancer]: the gene has been annotated as a cancer gene.
             [TF_regulating_known_cancer_gene]: the gene is a transcription factor regulating known cancer genes. The regulated cancer genes are also shown. 
             [actionable]: the gene is potentially actionable (“druggable”). 
             [up_regulated]: the gene is up-regulated in cancers (shown when RNA-Seq gene expression data are provided).
             [down_regulated]: the gene is down-regulated in cancers (shown when RNA-Seq gene expression data are provided).  
      When users provide new gene lists, tags for these gene lists are also shown in this field. 

USER_ANNO (user annotations)
     Example: ‘USER_ANNO=REPEAT(FLAM_A|chr1:100544744-100544854)’ 
     This field stores all user provided annotations. 

RECUR (recurrent genes, regulatory elements and mutations within samples)
      Example: ‘RECUR=Pseudogene(ENST00000467115.1|chr1:568914-569121):PR1783(chr1:568941,chr1:569004*),PR2832(chr1:569004*)’
      When multiple genomes are analyzed, genes or regulatory elements that are altered in >= 2 samples are annotated as ‘gene/regulatory
      element name: recurrent samples (variants in corresponding samples)’. Variant positions are 1-based. Mutations at the same site are tagged with ‘*’. 
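      A single RECUR entry of the form shown in the example can be taken apart as in the Python sketch below. It assumes the element part ends with ‘)’ immediately before the ‘:’ that introduces the samples, as in the example; the function name `parse_recur_entry` is illustrative, and how several elements are combined in one field is not covered.

```python
import re

# sample_id(variant,variant,...)
SAMPLE = re.compile(r"([^(),]+)\(([^)]*)\)")


def parse_recur_entry(entry):
    """Return (element, {sample: [(position, same_site_flag), ...]})."""
    element, sep, samples = entry.partition("):")
    if not sep:
        raise ValueError("no '):' separator found in %r" % entry)
    element += ")"  # restore the parenthesis consumed by the separator
    per_sample = {}
    for sample, variants in SAMPLE.findall(samples):
        per_sample[sample] = [
            (variant.rstrip("*"), variant.endswith("*"))
            for variant in variants.split(",")
        ]
    return element, per_sample


entry = ("Pseudogene(ENST00000467115.1|chr1:568914-569121):"
         "PR1783(chr1:568941,chr1:569004*),PR2832(chr1:569004*)")
element, per_sample = parse_recur_entry(entry)
print(element)
print(per_sample)
```

      The boolean in each pair records whether the variant carried the ‘*’ same-site marker.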

DBRECUR (Recurrence database) 
      Example: ‘DBRECUR=Enhancer(chmm/segway|chr15:22517400-22521103):Lung_Adeno(Altered in 4/24(16.67%) samples.)|
      Prostate(Altered in 2/64(3.12%) samples.),Enhancer(drm|chr15:22517700-22521100):Lung_Adeno(Altered in 4/24(16.67%) samples.)|
      Prostate(Altered in 2/64(3.12%) samples.)’
      If genes, regulatory elements or mutations are observed in the recurrence database (currently including 570 cancer samples of 10
      types), the recurrence information is shown here, formatted as ‘recurrent element(name|coordinates):cancer type(recurrence information in this 
      cancer type)’. As in the example, cancer types for one element are separated by ‘|’ and entries for different elements by ‘,’.

Building data context

FunSVPT offers a flexible framework for users to incorporate their own data into the data context. All data files used in the current data context can be replaced with user-specific data. Detailed descriptions are given below. Scripts can be found under ‘Downloads’ of the web server.

  • Define novel sensitive/ultra-sensitive regions

We provide scripts for users to define novel conserved regions in human populations. The algorithm is described in (Khurana, et al., 2013). To define sensitive/ultra-sensitive regions, users need to prepare category files in BED format, each containing the coordinates of the regions in one category. For example, the BED file for the category ‘GATA1 binding sites’ contains all binding-site coordinates of the transcription factor GATA1. The scripts identify categories under strong human-specific negative selection and define them as sensitive/ultra-sensitive regions according to the selection pressure. Enrichment of rare variants is used as the criterion for measuring negative selection constraint.

‘0.define.proximal.distal.regions.pl’: this script splits categories into proximal and distal subsets, which can then be used as new categories.

To identify sensitive/ultra-sensitive regions entirely from scratch, use ‘1.Randomization.pl’ and ‘1.2.FDR.r’. ‘1.Randomization.pl’ uses a GSC (genome structure correction)-like method to generate null distributions of rare-variant enrichment for each category. ‘1.2.FDR.r’ calculates the FDR for the randomization and can also be used to select significant categories at a user-chosen FDR.

To identify novel sensitive/ultra-sensitive regions in addition to those defined in (Khurana, et al., 2013), use ‘2.sensitive.regions.delta.increment.pl’. This script is only applicable to a small number of categories (about 5).

Note: please prepare a polymorphism file that contains only non-coding variants.
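To make the ‘enrichment of rare variants’ criterion concrete, the Python sketch below computes the fraction of rare variants that fall inside the regions of one category. It is not the provided Perl/R pipeline: the randomization and FDR steps are omitted, the names `in_regions` and `rare_fraction` are hypothetical, and the allele-frequency cutoff of 0.005 is an illustrative assumption that should be replaced by the threshold used in your own analysis.

```python
def in_regions(chrom, pos, regions):
    """True if a 0-based position lies in any half-open BED interval."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def rare_fraction(variants, regions, cutoff=0.005):
    """Fraction of variants inside `regions` whose allele frequency is < cutoff.

    variants: iterable of (chromosome, 0-based position, allele frequency)
    regions:  list of (chromosome, start, end) taken from a category BED file
    cutoff:   illustrative rare-variant threshold (an assumption of this sketch)
    """
    inside = [af for chrom, pos, af in variants if in_regions(chrom, pos, regions)]
    if not inside:
        return None  # no variants overlap this category
    return sum(af < cutoff for af in inside) / len(inside)


# Made-up toy data, for illustration only.
regions = [("chr1", 100, 200)]
variants = [
    ("chr1", 150, 0.001),  # inside, rare
    ("chr1", 160, 0.200),  # inside, common
    ("chr1", 199, 0.003),  # inside, rare
    ("chr1", 200, 0.001),  # outside (end is exclusive)
    ("chr2", 150, 0.001),  # outside (different chromosome)
]
print(rare_fraction(variants, regions))
```

A category whose rare fraction is significantly higher than expected under the null distribution would be a candidate sensitive region; assessing that significance is the job of the randomization and FDR scripts described above.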

  • Process GENCODE GTF file

We provide ‘3.gencode.process.pl’ to process a GENCODE GTF file into the files needed for the data context. The script generates ‘promoter’, ‘cds’, ‘intron’ and ‘UTR’ region files, which are used in the variant prioritization step. The ‘cds’ file can also be used to filter polymorphisms to obtain non-coding variants. Please put all the generated GENCODE files under ‘data/gencode’. GENCODE version 16 is used in the current data context.
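As an illustration of the kind of conversion involved, the Python sketch below extracts CDS features from GTF lines and converts them to BED-style intervals. It is not ‘3.gencode.process.pl’ and makes no claim about that script's output format; the function name `gtf_cds_to_bed` and the example lines are made up. The one fixed point is the coordinate conversion: GTF is 1-based with an inclusive end, whereas BED is 0-based with an exclusive end.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gtf_cds_to_bed(gtf_lines):
    """Yield (chrom, start, end, gene_name) BED-style tuples for CDS features."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        chrom, start, end = cols[0], int(cols[3]), int(cols[4])
        match = GENE_NAME.search(cols[8])
        name = match.group(1) if match else "."
        # GTF start is 1-based inclusive; BED start is 0-based. The end is unchanged.
        yield (chrom, start - 1, end, name)


# Made-up GTF lines, for illustration only.
gtf = [
    "##description: made-up example\n",
    'chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id "G1"; gene_name "GENE1";\n',
    'chr1\tHAVANA\tCDS\t1000\t1200\t.\t+\t0\tgene_id "G1"; gene_name "GENE1";\n',
]
for interval in gtf_cds_to_bed(gtf):
    print("\t".join(str(field) for field in interval))
```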

  • Add new networks

The network data are stored under the ‘data/networks’ folder. FunSVPT automatically reads all files in this folder and uses the first ‘.’-separated field of each file name as the network name. For example, the file ‘PPI.degree’ is used as network ‘PPI’. To add a new network, simply put the network file into this folder, with the network name as the first field of its file name.

The files in this folder have two columns: ‘gene name’ and ‘centrality’. We provide ‘4.network.analysis.r’ for users to generate these files (with either degree or betweenness centrality) from tab-delimited network files. A tab-delimited network file is a two-column file listing interacting genes, one pair per row (‘gene A’ ‘gene B’).
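The file-naming rule and the two-column centrality layout can be illustrated with the Python sketch below. It is a stand-in for, not a copy of, ‘4.network.analysis.r’: the function names are hypothetical, only degree centrality is shown, and the network is treated as undirected with self-interactions and duplicate pairs ignored, which is an assumption of this sketch.

```python
import os
from collections import defaultdict


def network_name(path):
    """Network name = first '.'-separated field of the file name."""
    return os.path.basename(path).split(".")[0]


def degree_centrality(edge_lines):
    """Count distinct interaction partners per gene from 'geneA<TAB>geneB' rows."""
    neighbours = defaultdict(set)
    for line in edge_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        gene_a, gene_b = parts[0], parts[1]
        if gene_a == gene_b:
            continue  # ignore self-interactions (assumption)
        neighbours[gene_a].add(gene_b)
        neighbours[gene_b].add(gene_a)
    return {gene: len(partners) for gene, partners in neighbours.items()}


# Made-up edges; the duplicate pair B-A is counted once.
edges = ["A\tB\n", "A\tC\n", "B\tA\n"]
degrees = degree_centrality(edges)
print(network_name("data/networks/PPI.degree"))
for gene in sorted(degrees):
    print("%s\t%d" % (gene, degrees[gene]))
```

Writing the printed ‘gene name’ and ‘centrality’ columns to a file such as ‘MYNET.degree’ (a hypothetical name) would yield a network called ‘MYNET’ under the naming rule above.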

  • Identify potential target genes of regulatory elements

We have packaged the scripts and the current REMC data so that users can define novel associations. The scripts, written in C/C++, can be found under ‘Downloads’ of the web server. Please note that the data files are very large (about 40 GB).

  • Add new gene lists to annotate variants

The procedure is similar to ‘Add new networks’. Users simply put new files under the ‘data/gene_lists’ folder; the first ‘.’-separated field of the file name is used as the gene list name.

  • Add recurrent data for new cancer types

This is similar to ‘Add new networks’. Please put the files under ‘data/cancer_recurrence’ and use the first field of the file name as the cancer type name. Such a file (‘Recur.Summary’) is produced by running FunSVPT on cancer samples of a particular type.

  • Add user-specific annotation sets, such as epigenetic modifications

Please put the files under the directory ‘data/user_annotations’, or under a directory specified with the ‘-ua’ option. The first ‘.’-separated field of the file name is used as the annotation name. Please prepare your files in BED format and use the 4th column for additional information, if needed. We have placed repeat regions obtained from UCSC there as an example.
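The Python sketch below shows how a BED row with a 4th-column label could give rise to a string in the shape of the USER_ANNO example, ‘REPEAT(FLAM_A|chr1:100544744-100544854)’. It is only an illustration of the format, not the tool's implementation: the function name `annotate`, the use of ‘.’ for a missing 4th column, and the 0-based query position are all assumptions.

```python
def annotate(chrom, pos, annotation_name, bed_rows):
    """Build 'NAME(label|chrom:start-end)' strings for overlapping BED rows.

    pos is treated as 0-based and BED intervals as half-open (assumptions).
    """
    hits = []
    for row in bed_rows:
        row_chrom, start, end = row[0], int(row[1]), int(row[2])
        # The 4th column carries optional extra information.
        label = row[3] if len(row) > 3 else "."
        if row_chrom == chrom and start <= pos < end:
            hits.append(
                "%s(%s|%s:%d-%d)" % (annotation_name, label, row_chrom, start, end)
            )
    return ",".join(hits)


# One row modelled on the USER_ANNO example; the second row is made up.
rows = [
    ["chr1", "100544744", "100544854", "FLAM_A"],
    ["chr1", "200000000", "200000100"],
]
print("USER_ANNO=" + annotate("chr1", 100544800, "REPEAT", rows))
```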

  • All other files can also be replaced with user-specific data. Please refer to the files under ‘data/’ for the correct format.