FunSVPT

From GersteinInfo

Revision as of 23:14, 5 January 2014 by Public (Talk | contribs)



Variants Prioritization

A. Required Tools

The following tools are REQUIRED:

  • sed, awk, grep
  • bedtools (version bedtools-2.17.0)
  • tabix (version tabix-0.2.6 and up)
  • VAT (snpMapper Module) - A good installation guide for VAT can be found here.

If you are only interested in non-coding variants, you don't need to install VAT, but remember to use the '-nc' option in Funseq.

Retrieve GERP scores. Note that the GERP data file is ~7 GB. If you are not interested in GERP scores, neither the GERP file nor bigWigAverageOverBed is needed.

Only needed for differential gene expression analysis.

Required for running in parallel.
Please make sure you have Perl 5 or later.
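
Before installing, it can help to confirm that the command-line tools listed above are on your PATH. The following is a minimal sketch, not part of the FunSVPT distribution: the function name check_tools is made up for illustration, and VAT is left out because the name of its executable is not given on this page. It does not check tool versions.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the list above (VAT omitted, see the note before this block).
check_tools sed awk grep bedtools tabix perl \
    || echo "install the missing tools before running FunSVPT"
```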

B. Tool installation

This is a Perl-based tool for Linux/UNIX. At the command-line prompt, type the following commands. Their purpose is to add the path of funSVPT.pm to your environment.

$ tar xvf funSVPT.v.0.1.tar
$ cd funsvpt-0.1/
$ cd funSVPT/
$ perl Makefile.PL
$ make 
$ make test
$ make install

If you don’t have permission to modify the environment, open your ‘.bashrc’ file and add the following lines to the end of the file. Then run ‘source .bashrc’.

PERL5LIB=${PERL5LIB}:$path_of_the_tool/funsvpt-0.1/funSVPT/lib
export PERL5LIB

C. Required Data Files

All of the data can be downloaded under ‘Downloads’ on the web server. If you would like to use the data, please download the files and put them under ‘funsvpt-0.1/data’.

D. Tool Usage

To display the tool's usage, type ‘./run.sh’.

* Usage : ./run.sh -f file -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc -o path -g file -exp file 
          -cls file -exf   <rpkm/raw> -p int -cancer cancer_type -s score -uw
       Options :
               	-f		[Required] User Input SNVs File
               	-inf	 	[Required] Input format - BED or VCF
               	-maf 		[Optional] Minor Allele Frequency Threshold to filter 1KG SNVs,default = 0 
               	-m		[Optional] 1 - Somatic Genome (default); 2 - Germline or Personal Genome
               	-outf	 	[Optional] Output format - BED or VCF,default is VCF
               	-nc		[Optional] Only do non-coding analysis, no need of VAT (variant annotation tool)
               	-o		[Optional] Output path, default is the directory 'out'
               	-g		[Optional] gene list, only output variants associated with selected genes. 	
               	-exp		[Optional] gene expression matrix
               	-cls		[Optional] class file for samples in gene expression matrix
               	-exf		[Optional] gene expression format - rpkm / raw
               	-p		[Optional] Number of genomes to process in parallel, default = 5
               	-cancer		[Optional] cancer type from recurrence database, default is all cancer types
               	-uw		[Optional] Use unweighted scoring scheme, default is weighted
               	-s		[Optional] Score threshold to call non-coding candidates, default = 1.5 
               			for weighted scoring & default = 5 for unweighted scoring
       
       Multiple Genomes with Recurrent Output	
               	Option 1: Separate multiple files by ','
               	Example: ./run.sh -f file1,file2,file3,... -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> ...
               	Option 2: Use the 6th column of BED file to specify samples
               	Example: ./run.sh -f file -maf MAF -m <1/2> -inf bed -outf <bed/vcf> ...
               	
               	NOTE: Please make sure you have sufficient memory, at least 3G.
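
For Option 2, per-sample BED files can be merged into a single file whose 6th column names the sample. The sketch below is one way to do this with awk. It assumes each input file holds exactly the five required columns and that the sample name can be taken from the file name; the function name tag_samples and the file names in the usage comment are hypothetical.

```shell
# Merge several 5-column BED files into one stream, appending the sample
# name (file name without the ".bed" suffix) as the 6th column.
tag_samples() {
    for f in "$@"; do
        sample=$(basename "$f" .bed)
        awk -v s="$sample" 'BEGIN { FS = OFS = "\t" }
             { print $1, $2, $3, $4, $5, s }' "$f"
    done
}

# Hypothetical usage:
#   tag_samples tumor1.bed tumor2.bed > merged.bed
#   ./run.sh -f merged.bed -inf bed -outf vcf
```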

Input

FunSEQ takes BED or VCF files as input
1. BED format
In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab-delimited):
chrom chromStart chromEnd Reference.allele Alternative.allele ...

* chrom - The name of the chromosome (e.g. chr3, chrY). 
* chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
* chromEnd - The ending position of the feature in the chromosome. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
* Reference.allele - The reference allele of SNVs.
* Alternative.allele - The alternative allele of SNVs.

2. VCF format
The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited):
CHROM POS ID REF ALT QUAL FILTER INFO
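
The two input formats differ in their coordinates: VCF POS is 1-based, while BED chromStart is 0-based and chromEnd is exclusive. For a single-base SNV this means chromStart = POS - 1 and chromEnd = POS. The following awk sketch converts VCF records to the five-column BED layout described above. It assumes SNVs only (single-base REF and ALT) and does not rewrite chromosome names; the function name vcf_to_bed is made up for illustration.

```shell
# Read VCF on stdin, write 5-column BED (chrom, start, end, ref, alt) on stdout.
vcf_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }   # skip VCF header lines
         { print $1, $2 - 1, $2, $4, $5 }'
}

# Example: the coding variant shown in the Output section.
printf 'chr1\t36205042\t.\tC\tA\t.\t.\t.\n' | vcf_to_bed
```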

Output

FunSEQ can produce either BED format or VCF format files.

An example of the VCF annotation of a coding variant:

chr1    36205042        .       C       A       .       .       OTHER=MAF(1kg-phase1)=0;CDS=Yes;VA=1:CLSPN:ENSG00000092853.8:-:prematureStop:4/5:CLSPN-001:  \
ENST00000251195.5:3996_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4017_3232_1078_E->*:CLSPN-003:ENST00000373220.3:3825_3040_1014_E->*:CLSPN-004:ENST00000520551.1: \
3858_3073_1025_E->*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4 
  • The OTHER field contains any original information beyond the 5 required fields (chrom, chromStart, chromEnd, reference, alternative). When the input file has fewer than 3,000 lines, OTHER also contains the MAF (minor allele frequency) of SNVs in 1KG Phase1 data.
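
The annotation is a semicolon-separated list of KEY=value pairs, so individual fields can be pulled out with standard tools. The sketch below is one way to do this; the function name info_get is hypothetical, and the sample string is a shortened version of the example above with the long VA value left out. Only the first '=' is treated as the separator, because values such as the one in OTHER contain '=' themselves.

```shell
# Print the value of one key from a semicolon-separated annotation string.
# Usage: info_get KEY ANNOTATION_STRING
info_get() {
    printf '%s\n' "$2" | tr ';' '\n' |
        awk -v k="$1" 'index($0, k "=") == 1 { print substr($0, length(k) + 2) }'
}

info="OTHER=MAF(1kg-phase1)=0;CDS=Yes;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4"
info_get GENE "$info"    # prints CLSPN
info_get OTHER "$info"   # prints MAF(1kg-phase1)=0
```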


An example of the VCF annotation of a non-coding variant:

chr5    85913480        .       T       C       .       .       OTHER=MAF(1kg-phase1)=0;CDS=NO;HUB=REG;NCENC=TFP(ETS1),TFP(ELF1),TFP(GATA2),TFP(POU2F2),  \
TFP(TBP),TFP(SRF),TFP(ELK4),TPM(TAF1),TFP(STAT3),TFP(GATA3),TFP(SIX5),TFP(YY1),TPM(TBP),TFP(CHD2),TFP(MYC),TFP(IRF1),DHS(MCV-2),TFP(TAF1),TFP(GATA1),  \
TFP(ZEB1),TFP(SETDB1),TFP(ZNF143),TFP(NFKB1),TFP(MAX),TFP(GABPA),Enhancer(chromHmm),TFP(STAT1);  \
MOTIFBR=85913478#85913493#+#TATA_known1_8mer#TAF1,85913478#85913493#+#TATA_known1_8mer#TBP;GENE=COX7C(promoter);NCDS=4 
  • NCENC (Non-coding ENCODE annotation) field.

TFP - transcription factor binding peak.
TFM - transcription factor motifs in peak regions.
DHS - DNase I hypersensitive sites, with the number of cell lines (MCV- , out of 125 cell lines in total) (R.E. Thurman et al., The accessible chromatin landscape of the human genome. Nature 489,75, Sep 2012).
ncRNA - non-coding RNA
Pseudogene
Enhancer - chromHmm (genome segmentation), drm (distal regulatory module)
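
An NCENC value is a comma-separated list of entries of the form TYPE(detail). To get a quick overview of a variant, the entries can be tallied by type. The following is a minimal sketch under that assumption; the function name ncenc_counts is made up for illustration, and the input in the usage line is a short excerpt rather than a full NCENC value.

```shell
# Count NCENC entries by annotation type (the text before the first "(").
# Usage: ncenc_counts NCENC_VALUE
ncenc_counts() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed 's/(.*$//' |
        awk 'NF { count[$0]++ } END { for (k in count) print k, count[k] }' |
        LC_ALL=C sort
}

ncenc_counts "TFP(ETS1),TFP(ELF1),DHS(MCV-2),Enhancer(chromHmm)"
```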

  • MOTIFBR field.

This field is a hash-delimited tag, defined as follows:
TF name # motif name # motif start # motif end # motif strand # mutation position # alternative allele frequency in PFM # reference allele frequency in PFM
An example: " TAF1#TATA_known1_8mer#85913478#85913493#+#3#0.02#0.4 "
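
The eight-field layout defined above can be split on '#' to label each part. The sketch below does this with awk, also splitting on ',' in case several motif entries are listed together. The function name motifbr_fields and the output labels are hypothetical. Entries that do not have exactly eight fields are skipped rather than guessed at.

```shell
# Label the parts of each 8-field MOTIFBR entry.
# Usage: motifbr_fields MOTIFBR_VALUE
motifbr_fields() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F'#' 'NF == 8 {
            printf "tf=%s motif=%s start=%s end=%s strand=%s pos=%s alt_freq=%s ref_freq=%s\n",
                   $1, $2, $3, $4, $5, $6, $7, $8
        }'
}

motifbr_fields "TAF1#TATA_known1_8mer#85913478#85913493#+#3#0.02#0.4"
```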

  • NCRECUR field.

Interpret recurrence within large TF peaks and chromHMM regions with caution. Because these annotations have low resolution, recurrence within them may not indicate functional importance.

Building data context

Usage

Usage : ./funseq -f file -maf maf -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc
       Options :
               	-f              user input SNVs file
               	-maf            Minor Allele Frequency (MAF) threshold to filter 1KG phaseI SNVs (value 0 ~ 1)
               	-m              1 - somatic Genome; 2 - germline or personal Genome
               	-inf            input format - BED or VCF
               	-outf           output format - BED or VCF
                -nc              [Optional] Only do non-coding analysis. 
       Default : -maf 0 -m 1 -outf vcf

