FunSVPT
From GersteinInfo
Contents |
Variants Prioritization
A. Required Tools
The following tools are REQUIRED:
- sed, awk, grep
- bedtools (version bedtools-2.17.0)
- tabix (version tabix-0.2.6 and up)
- VAT (snpMapper Module) - A good installation guide for VAT can be found here.
If you are only interested in non-coding variants, you don't need to install VAT. But remember to use '-nc' option in Funseq
Retrieve GERP scores. Note that GERP data file is ~7G. If you are not interested in GERP scores, the GERP file and bigWigAverageOverBed are not needed.
Only needed for differential gene expression analysis.
Required for parallel running.
Please make sure you have Perl 5 and up.
B. Tool installation
This is a PERL- and Linux/UNIX-based tool. At the command-line prompt, type the following. The purpose is to write the path of funSVPT.pm to your environment.
$ tar xvf funSVPT.v.0.1.tar $ cd funsvpt-0.1/ $ cd funSVPT/ $ perl Makefile.PL $ make $ make test $ make install
If you don’t have the permission to modify the environment, open the ‘.bashrc’ file and put the following lines to the end of the file. Then ‘source .bashrc’.
PERL5LIB=${PERL5LIB}: $path_of_the_tool/funsvpt-0.1/funSVPT/lib export PERL5LIB
C. Pre-built Data Context
All of the data can be downloaded under ‘Downloads’ in the web server. If you would like to use the data, please download them and put them under ‘funsvpt-0.1/data’.
D. Tool Usage
To display the usage of tool, type ‘./run.sh’.
* Usage : ./run.sh -f file -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc -o path -g file -exp file -cls file -exf <rpkm/raw> -p int -cancer cancer_type -s score -uw Options : -f [Required] User Input SNVs File -inf [Required] Input format - BED or VCF -maf [Optional] Minor Allele Frequency Threshold to filter 1KG SNVs,default = 0 -m [Optional] 1 - Somatic Genome (default); 2 - Germline or Personal Genome -outf [Optional] Output format - BED or VCF,default is VCF -nc [Optional] Only do non-coding analysis, no need of VAT (variant annotation tool) -o [Optional] Output path, default is the directory 'out' -g [Optional] gene list, only output variants associated with selected genes. -exp [Optional] gene expression matrix -cls [Optional] class file for samples in gene expression matrix -exf [Optional] gene expression format - rpkm / raw -p [Optional] Number of genomes to parallel, default = 5 -cancer [Optional] cancer type from recurrence database, default is all of the cancer type -uw [Optional] Use unweighted scoring scheme, defalut is weighted -s [Optional] Score threshold to call non-coding candidates, default = 1.5 for weighted scoring & default = 5 for unweighted scoring Multiple Genomes with Recurrent Output Option 1: Separate multiple files by ',' Example: ./run.sh -f file1,file2,file3,... -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> ... Option 2: Use the 6th column of BED file to specify samples Example: ./run.sh -f file -maf MAF -m <1/2> -inf bed -outf <bed/vcf> ... NOTE: Please make sure you have sufficient memory, at least 3G.
E. Input files
- User input file (-f): could be either BED format or VCF format.
BED format. In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab delimited; the 6th column is reserved for sample names, do not put other information there): chromosome, start position, end position, reference allele, and alternative allele. Chromosome - name of the chromosome (e.g. chr3, chrX) Start position - start position of variants. (0-based) End position - ending position of variants. (end exclusive) e.g., chr1 0 100 spanning bases numbered 0-99 Reference allele - reference allele of variants Alternative allele - alternative allele of variants VCF forma. The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited): #CHROM POS ID REF ALT QUAL FILTER INFO Recurrent analysis input format Option 1: separate files for each genome (BED or VCF). Use “-f file1, file2, file3” separated by comma. Option 2: put all variants in one file (only for BED format, use the 6th column labeling sample names). Use “-f file”.
- Gene list format (-g): If you are interested in particular set of genes, you can put your genes in one file (one gene per row) and use “-g file” to instruct the program to only analyze variants in or linked to those genes. Please use gene symbols.
- Gene expression format (-exp): Users can also upload gene expression data for the program to detect differentially expressed genes between cancer and benign samples and highlight variants linked to these genes. The gene expression data should be prepared as a matrix with first column stores gene names (use gene symbols) and first row as sample names. Other fields are gene expression data either in rpkm or raw read counts. Tab delimited.
Gene Sample1 Sample2 Sample3 Sample4 … A1BG 1 5 40 0 … A1CF 20 9 0 23 … … … … … … …
- Sample class format (-cls): In addition to the expression data, users need to upload annotations of samples as “cancer” or “benign” (only two classes “cancer” or “benign”). The number of samples in this file should equal to that in expression data. And sample names should match.
Sample1 benign Sample2 cancer Sample3 cancer Sample4 benign … …
F. Output files
Four output files will be generated: ‘Output.format’, ‘Recur.Summary’, ‘Candidates.Summary’ and ‘Error.log’. Output.format: stores detail results from all samples; Recur.Summary: the recurrence analysis when having multiple genomes; Candidates.Summary: brief output of potential candidates (coding nonsynonymous/prematurestop mutations, non-coding mutations with score (>= 5 of un-weighted scoring scheme and >=1.5 for weighted scoring scheme) and mutations in or linked to known cancer genes); Error.log: error information. For un-weighted scoring scheme, each feature is given value 1.
When providing gene_expression data, two additional files will be produced – ‘DE.gene.txt’ is the differentially expression genes from RNA-Seq analysis and ‘DE.pdf ’is the differential gene expression plot.
- Sample BED format output
Header: chr start end ref alt sample gerp;cds;variant.annotation.cds;network.hub;gene.under.negative.selection; ENCODE.annotated;hot.region;motif.analysis;sensitive;ultra.sensitive;ultra.conserved;target.gene[known_cancer_gene/ TF_regulating_known_cancer_gene,differential_expressed_in_cancer,actionable_gene];coding.score;noncoding.score; recurrence.within.samples;recurrence.database Coding variant: chr1 36205041 36205042 C A PR2832 5.6;Yes;VA=1:CLSPN:ENSG00000092853.9:-:prematureStop: 4/4:CLSPN-001:ENST00000251195.5:3999_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4020_3232_1078_E->*: CLSPN-003:ENST00000373220.3:3828_3040_1014_E->*:CLSPN-004:ENST00000520551.1:3861_3073_1025_E- >*;PPI;Yes;.;.;.;.;.;.;CLSPN;5;.;.;. Non-coding variant: chr6 152304995 152304996 A G PR2832 2.63;No;.;ESR1:PHOS(0.276)PPI(0.995)REG(0.994);.;.;.;.;.;.;.; ESR1(Intron)[TF_regulating_known_cancer_gene:H3F3A,MN1,PRCC,RARA,SLC34A2,TPM3][actionable];.;1.60983633568013;.;.
- Sample VCF format output
Header: ##fileformat=VCFv4.0 ##INFO=<ID=OTHER,Number=.,Type=String, Description = "Other Information From Original File"> ##INFO=<ID=SAMPLE,Number=.,Type=String,Description="Sample id"> ##INFO=<ID=CDS,Number=.,Type=String,Description="Coding Variants or not"> ##INFO=<ID=VA,Number=.,Type=String,Description="Coding Variant Annotation"> ##INFO=<ID=HUB,Number=.,Type=String,Description="Network Hubs, PPI (protein protein interaction network), REG (regulatory network), PHOS (phosphorylation network)..."> ##INFO=<ID=GNEG,Number=.,Type=String,Description="Gene Under Negative Selection"> ##INFO=<ID=GERP,Number=.,Type=String,Description="Gerp Score"> ##INFO=<ID=NCENC,Number=.,Type=String,Description="NonCoding ENCODE Annotation"> ##INFO=<ID=HOT,Number=.,Type=String,Description="Highly Occupied Target Region"> ##INFO=<ID=MOTIFBR,Number=.,Type=String,Description="Motif Breaking"> ##INFO=<ID=MOTIFG,Number=.,Type=String,Description="Motif Gain"> ##INFO=<ID=SEN,Number=.,Type=String,Description="In Sensitive Region"> ##INFO=<ID=USEN,Number=.,Type=String,Description="In Ultra-Sensitive Region"> ##INFO=<ID=UCONS,Number=.,Type=String,Description="In Ultra-Conserved Region"> ##INFO=<ID=GENE,Number=.,Type=String,Description="Target Gene (For coding - directly affected genes ; For non-coding - promoter or distal regulatory module)"> ##INFO=<ID=CANG,Number=.,Type=String,Description="Prior Gene Information, e.g.[cancer][TF_regulating_known_cancer_gene] [up_regulated][actionable]..."; ##INFO=<ID=CDSS,Number=.,Type=String,Description="FunSEQ Coding Score"> ##INFO=<ID=NCDS,Number=.,Type=String,Description="FunSEQ NonCoding Score"> ##INFO=<ID=RECUR,Number=.,Type=String,Description="Recurrent elements / variants"> ##INFO=<ID=DBRECUR,Number=.,Type=String,Description="Recurrence database"> #CHROM POS ID REF ALT QUAL FILTER INFO Coding variant: chr1 36205042 . C A . . SAMPLE=PR2832;GERP=5.6;CDS=Yes;VA=1:CLSPN:ENSG00000092853.9:- :prematureStop:4/4:CLSPN-001:ENST00000251195.5:3999_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4020_3232_1078_E- >*:CLSPN-003:ENST00000373220.3:3828_3040_1014_E->*:CLSPN-004:ENST00000520551.1:3861_3073_1025_E- >*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=5 Non-coding variant: chr6 152304996 . A G . . SAMPLE=PR2832;GERP=2.63;CDS=No;HUB=ESR1:PHOS(0.276)PPI(0.995)REG(0.994); GENE=ESR1(Intron);CANG=ESR1[TF_regulating_known_cancer_gene:H3F3A,MN1,PRCC,RARA,SLC34A2,TPM3] [actionable];NCDS=1.60983633568013
- Output description (VCF format as an example)
VA (variants annotation) This is the output produced from VAT (variant annotation tool) for coding variations. Please refer to ‘http://vat.gersteinlab.org’ for documentations. NCENC (Non-coding ENCODE annotation) Example: ‘NCENC=TFP(CEBPB|chr5:139639150-139639496),TFP(STAT3|chr5:139638936-139640136),TFP(STAT3|chr5:139638976- 139639553),TFP(STAT3|chr5:139638989-139639544),TFP(STAT3|chr5:139638999-139639716)’ This is formatted as “category(element_name|chromosome:position)” (0-based, end exclusive). TFP - transcription factor binding peak. TFM - transcription factor bound motifs in peak regions. DHS - DNase1 hypersensitive sites, with number of cell lines (MCV, total 125 cell lines). ncRNA - non-coding RNA Pseudogene Enhancer - chmm/segway (genome segmentation), drm (distal regulatory module) HOT (highly occupied region) Example: ‘HOT=Helas3’ If a mutation occurs in HOT regions, the corresponding cell lines (5 in total) are shown. This annotation is from (Yip, et al., 2012). MOTIFBR (motif-breaking analysis) Example: ‘MOTIFBR=MAX#Myc_known9_8mer#102248644#102248656#-#9#0.068966#0.931034’ The variant causes a motif-breaking event. This field is a hash-delimited tag, defined as follows: TF name # motif name # motif start # motif end # motif strand # mutation position # alternative allele frequency in PWM # reference allele frequency in PWM . (0-based, end exclusive) MOTIFG (motif-gaining analysis) Example: ‘MOTIFG=GATA_known5#75658824#75658829#-#1#4.839#4.181’ The variant causes a motif-gaining event. Hash-delimited: motif name # motif start # motif end # motif strand # mutation position # motif score with alternative allele # motif score with reference allele. (0-based, end exclusive) GENE (target gene - for coding: directly affected genes; for non-coding: promoter or distal regulatory module) Example: ‘GENE=ARNT2(Enhancer),C15orf26(Intron),IL16(Enhancer)’ For noncoding variants, ‘intron’, ‘promoter’, ‘UTR’ and ‘Enhancer’ tags are annotated. CANG (cancer related information) Example: ‘CANG=EGFR[actionable][cancer]’ This field stores all the gene related information. Currently there are five possible tags: [cancer]: the gene have been annotated as an cancer gene. [TF_regulating_known_cancer_gene]: the gene is a transcription factor regulating known cancer genes. The regulated cancer genes are also shown. [actionable]: the gene is potentially actionable (“druggable”). [up_regulated]: the gene is up-regulated in cancers, when providing RNA-Seq gene expression data. [down_regulated]: the gene is down-regulated in cancers, when providing RNA-Seq gene expression data. When users provide new gene lists, tags about these gene lists will be shown in this field. RECUR (recurrent genes, regulatory elements and mutations within samples) Example: ‘RECUR=Pseudogene(ENST00000467115.1|chr1:568914-569121):PR1783(chr1:568941,chr1:569004*),PR2832(chr1:569004*)’ When analyzing multiple genomes, if genes or regulatory elements are recurrent in >= 2 samples, it is annotated as ‘gene/regulatory element name: recurrent samples (mutations in corresponding samples (position is 1-based))’. If it is a same site mutation, ‘*’ is tagged. DBRECUR (Recurrence databse) Example: ‘DBRECUR=Enhancer(chmm/segway|chr15:22517400-22521103):Lung_Adeno(Altered in 4/24(16.67%) samples.)| Prostate(Altered in 2/64(3.12%) samples.),Enhancer(drm|chr15:22517700-22521100):Lung_Adeno(Altered in 4/24(16.67%) samples.)| Prostate(Altered in 2/64(3.12%) samples.)’ If genes, regulatory elements or mutations are observed in the recurrence database (currently including 570 cancer samples of 10 types), the recurrence information is shown here. ‘recurrent element(name|coordinates):cancer type(recurrence information in this cancer type)’. Recurrence information is separated by ‘,’.
Building data context
Usage
Usage : ./funseq -f file -maf maf -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc Options : -f user input SNVs file -maf Minor Allele Frequency (MAF) threshold to filter 1KG phaseI SNVs (value 0 ~ 1) -m 1 - somatic Genome; 2 - germline or personal Genome -inf input format - BED or VCF -outf output format - BED or VCF -nc [Optional] Only do non-coding analysis.
Default : -maf 0 -m 1 -outf vcf
Input
FunSEQ takes BED or VCF files as input
1. BED format
In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab-delimited):
chrom chromStart chromEnd Reference.allele Alterative.allele ...
* chrom - The name of the chromosome (e.g. chr3, chrY).
* chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0. * chromEnd - The ending position of the feature in the chromosome. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. * Reference.allele - The reference allele of SNVs * Alternative.allele - The alternative allele of SNVs.
2. VCF format
The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited):
CHROM POS ID REF ALT QUAL FILTER INFO
Output
FunSEQ can produce either BED format or VCF format files.
An example of the VCF annotation of a coding variant:
chr1 36205042 . C A . . OTHER=MAF(1kg-phase1)=0;CDS=Yes;VA=1:CLSPN:ENSG00000092853.8:-:prematureStop:4/5:CLSPN-001: \ ENST00000251195.5:3996_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4017_3232_1078_E->*:CLSPN-003:ENST00000373220.3:3825_3040_1014_E->*:CLSPN-004:ENST00000520551.1: \ 3858_3073_1025_E->*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4
- OTHER field contains other original information other than the 5 required ones (chrom, chromStart, chromEnd, reference, alternative). When input file is less than 3,000 lines, OTHER also contains the MAF (minor allele frequency) of SNVs in 1KG Phase1 data.
An example of the VCF annotation of a non-coding variant:
chr5 85913480 . T C . . OTHER=MAF(1kg-phase1)=0;CDS=NO;HUB=REG;NCENC=TFP(ETS1),TFP(ELF1),TFP(GATA2),TFP(POU2F2), \ TFP(TBP),TFP(SRF),TFP(ELK4),TPM(TAF1),TFP(STAT3),TFP(GATA3),TFP(SIX5),TFP(YY1),TPM(TBP),TFP(CHD2),TFP(MYC),TFP(IRF1),DHS(MCV-2),TFP(TAF1),TFP(GATA1), \ TFP(ZEB1),TFP(SETDB1),TFP(ZNF143),TFP(NFKB1),TFP(MAX),TFP(GABPA),Enhancer(chromHmm),TFP(STAT1); \ MOTIFBR=85913478#85913493#+#TATA_known1_8mer#TAF1,85913478#85913493#+#TATA_known1_8mer#TBP;GENE=COX7C(promoter);NCDS=4
- NCENC (Non-coding ENCODE annotation) field.
TFP -transcription factor binding peak.
TFM - transcription factor motifs in peak regions.
DHS - DNase1 hypersensitive sites, with number of cell lines (MCV- , total 125 cell lines) information (R.E. Thurman et al., The accessible chromatin landscape of the human genome. Nature 489,75, Sep 2012).
ncRNA - non-coding RNA
Pseudogene
Enhancer - chromHmm (genome segmentation), drm (distal regulatory module)
- MOTIFBR field.
This field is a hash-delimited tag, defined as follows:
TF name # motif name # motif start # motif end # motif strand # mutation position # alternative allele frequency in PFM # reference allele frequency in PFM
An example: " TAF1#TATA_known1_8mer#85913478#85913493#+#3#0.02#0.4 "
- NCRECUR field.
Please be aware of large TF peak and chromHMM regions. Because of the low resolution issues, recurrent information may not indicate functional importance.