Funseq

From GersteinInfo

Latest revision as of 21:42, 17 October 2013

Installation

A. Required Tools

The following tools are REQUIRED for FunSeq:
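1) Bedtools (http://code.google.com/p/bedtools/downloads/list)
2) Tabix (http://sourceforge.net/projects/samtools/files/tabix/)
3) VAT (http://vat.gersteinlab.org/index.php) - A good installation guide for VAT can be found at http://ngsda.blogspot.com/2011/06/vat.html. VAT is only needed for coding analysis; when FunSeq is run with the '-nc' option, VAT does not need to be installed.
4) TFMpvalue-sc2pv (http://bioinfo.lifl.fr/TFM/TFMpvalue/)
5) bigWigAverageOverBed (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/)
6) R (http://www.r-project.org) - Only needed for differential gene expression analysis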
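
Before installing FunSeq, it can help to confirm that the command-line tools listed above are visible on your PATH. The following is a minimal Python sketch, not part of FunSeq. The executable names are assumptions based on the tool names in the list (Rscript is assumed to stand in for R, and VAT is left out because its executables are not named here); adjust them to match your installation.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["bedtools", "tabix", "TFMpvalue-sc2pv", "bigWigAverageOverBed"]
# R is only needed for differential gene expression analysis.
OPTIONAL_TOOLS = ["Rscript"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print("missing required tool:", name)
    for name in missing_tools(OPTIONAL_TOOLS):
        print("missing optional tool:", name)
```

If the script prints nothing, every listed executable was found.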

B. PERL Requirement

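1) Please make sure you have Perl 5 or later. The latest Perl can be downloaded from http://www.perl.org/.
2) Install the package Parallel::ForkManager (used for running in parallel). The Perl library can be found at http://search.cpan.org/~szabgab/Parallel-ForkManager-1.03/lib/Parallel/ForkManager.pm.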

C. FunSeq tool installation

FunSeq is a PERL- and Linux/UNIX-based tool. At the command-line prompt, enter the following:

$ cd FUNSEQ/
$ perl Makefile.PL
$ make
$ make test
$ make install

D. Required Data Files

Please download all the following data files from ' http://funseq.gersteinlab.org/data/ ' and put them in a new folder ' $path/funseq-0.2/data/ ':

1. 1kg.phase1.snp.bed.gz (bed format)
Contents : 1000 Genomes Phase I data with minor allele frequency in bed format. 

Columns : chromosome, start position (0-based), end position, MAF (minor allele frequency).
Purpose : to filter out input SNVs based on user-defined allele-frequency threshold.

2. All_hg19_RS.bw
Contents : binary file containing base-wise GERP scores. Downloaded from http://hgdownload.cse.ucsc.edu/gbdb/hg19/bbi/All_hg19_RS.bw
* Note : This file is ~7 GB. If you do not need GERP scores for your variants, you do not need to download it.

3. HOT_region.bed (bed format)
Contents : highly occupied regions (Yip, et al., 2012)
Columns : chromosome, start position, end position, cell line info
Purpose : to examine whether variants occur in highly occupied (HOT) regions.

4. ENCODE.annotation.gz (bed format)
Contents : compiled annotation files from ENCODE, GENCODE v7 and others, including DNase I hypersensitive sites, transcription factor binding peaks, pseudogenes, non-coding RNAs, and enhancer regions (chromHMM, Segway and distal regulatory modules (Yip, et al., 2012)).
Columns : chromosome, start position, end position, annotation.
Purpose : to annotate SNVs in ENCODE regions.

5. ENCODE.tf.bound.union.bed (bed format)
Contents : transcription factor (TF) binding motifs under peak regions.
Columns : chromosome, start position, end position, motif name, , strand, TF name 

Purpose : used for motif breaking analysis

6. gencode.v7.cds.bed (bed format)
Contents : extracted CDS information from GENCODE v7. 

Columns : chromosome, start position, end position
Purpose : locate coding SNVs.

7. gencode.v7.promoter.bed (bed format)
Contents : promoter regions, defined as -2.5kb from transcription start site (TSS)
Columns : chromosome, start position, end position, gene.
Purpose : to associate promoter SNVs with genes

8. gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.interval
Purpose : used by variant annotation tool (VAT). 


9. gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.fa
Purpose : used by variant annotation tool (VAT). 


10. drm.gene.bed (bed format)

Contents : distal regulatory module linked to genes. 

Columns : chromosome, start position, end position, gene, p-value, cell-lines
Purpose : to associate enhancer SNVs with genes 


11. motif.PFM
Contents : position frequency matrix (PFM) for ENCODE TFs.

Purpose : used for motif breaking and gain of motif calculation

12. PPI.hubs.txt
Purpose : defines hub genes in the protein-protein interaction network

13. REG.hubs.txt
Purpose : defines hub genes in the regulatory network

14. GENE.strong_selection.txt
Purpose : genes under strong negative selection (fraction of rare SNVs among non-synonymous variants). 


15. human_ancestor_GRCh37_e59.fa

Contents : human ancestral alleles for hg19 (GRCh37).
Purpose : for motif breaking calculation in personal or germline genomes.
* Note : for somatic analysis, this file is not needed.


16. human_g1k_v37.fasta
Contents : human reference genome
Purpose : for gain-of-motif analysis

17. sensitive.nc.bed (bed format)

Contents : coordinates of sensitive/ultra-sensitive regions.
Purpose : to find SNVs in sensitive/ultra-sensitive regions. 


18. ultra.conserved.hg19.bed
Contents : ultra-conserved region in (Bejerano, et al., 2004).

19. motif.score.cut
Contents : pre-calculated PWM scores corresponding to 4e-8.
Purpose : to speed up the gain-of-motif analysis

20. regulatory.network
Contents : human regulatory network from (Gerstein, et al., 2012)

21. cancer.genes
Contents : cancer genes from Cancer Gene Census (Futreal, et al., 2004)

22. actionable.gene
Contents : actionable genes from (Wagle, et al., 2012)
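
Because the list above is long, a quick check that every file has reached the data folder can save a failed run. The following is a minimal Python sketch, not part of FunSeq. The file names are copied from the list above, while the function name and the idea of checking by file name only are assumptions made for illustration.

```python
import os

# File names copied from the list of required data files above.
DATA_FILES = [
    "1kg.phase1.snp.bed.gz",
    "All_hg19_RS.bw",  # only needed if you want GERP scores
    "HOT_region.bed",
    "ENCODE.annotation.gz",
    "ENCODE.tf.bound.union.bed",
    "gencode.v7.cds.bed",
    "gencode.v7.promoter.bed",
    "gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.interval",
    "gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.fa",
    "drm.gene.bed",
    "motif.PFM",
    "PPI.hubs.txt",
    "REG.hubs.txt",
    "GENE.strong_selection.txt",
    "human_ancestor_GRCh37_e59.fa",  # not needed for somatic analysis
    "human_g1k_v37.fasta",
    "sensitive.nc.bed",
    "ultra.conserved.hg19.bed",
    "motif.score.cut",
    "regulatory.network",
    "cancer.genes",
    "actionable.gene",
]


def missing_data_files(data_dir, names=DATA_FILES):
    """Return the listed file names that do not exist inside data_dir."""
    return [n for n in names if not os.path.exists(os.path.join(data_dir, n))]


if __name__ == "__main__":
    import sys

    # Pass the path to your data folder as the first argument.
    for name in missing_data_files(sys.argv[1] if len(sys.argv) > 1 else "data"):
        print("missing data file:", name)
```

The check only looks at file names; it does not verify that a download completed or that a file is intact.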

Running FunSeq

Usage

  • Usage: ./funseq -f file -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc -o path -g file -exp file -cls file -exf <rpkm/raw>
Options :
        -f      User Input SNVs File
        -maf    Minor Allele Frequency Threshold to filter 1KG SNVs
        -m      1 - Somatic Genome; 2 - Germline or Personal Genome
        -inf    input format - BED or VCF
        -outf   output format - BED or VCF
        -nc     [Optional] Only do non-coding analysis, no need of VAT (variant annotation tool)
        -o      [Optional] Output path, default is the directory 'out'
        -g      [Optional] gene list, only output variants associated with selected genes.
        -exp    [Optional] gene expression matrix
        -cls    [Optional] class file for samples in gene expression matrix
        -exf    [Optional] gene expression format - rpkm / raw
Default Options: -maf 0 -m 1 -outf vcf -o out
  • Multiple Genomes with Recurrent Output
Option 1: Separate multiple files by ','
Example: ./funseq -f file1,file2,file3,... -maf MAF -m <1/2> -inf <bed/vcf> -outf <bed/vcf>…
Option 2: Use the 6th column of BED file to specify samples
Example: ./funseq -f file -maf MAF -m <1/2> -inf bed -outf <bed/vcf> …
  • NOTE: Please make sure you have sufficient memory (at least 3 GB).
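
When FunSeq is called from a script, assembling the command from named parameters helps avoid mistyped options. The following is a minimal Python sketch, not part of FunSeq. It uses only options documented above; the function name build_funseq_command and its parameter names are hypothetical, and the optional -g, -exp, -cls and -exf options are left out for brevity.

```python
def build_funseq_command(input_files, in_format, maf=0, mode=1,
                         out_format="vcf", non_coding_only=False, out_dir=None):
    """Assemble a FunSeq command line as a list of arguments.

    input_files is a list of paths; several files are joined with ','
    as described under "Multiple Genomes with Recurrent Output".
    """
    if mode not in (1, 2):
        raise ValueError("mode must be 1 (somatic) or 2 (germline or personal)")
    if in_format not in ("bed", "vcf") or out_format not in ("bed", "vcf"):
        raise ValueError("formats must be 'bed' or 'vcf'")
    if not input_files:
        raise ValueError("at least one input file is required")

    cmd = ["./funseq",
           "-f", ",".join(input_files),
           "-maf", str(maf),
           "-m", str(mode),
           "-inf", in_format,
           "-outf", out_format]
    if non_coding_only:
        cmd.append("-nc")          # skip coding analysis, so VAT is not needed
    if out_dir is not None:
        cmd += ["-o", out_dir]     # FunSeq defaults to the directory 'out'
    return cmd


if __name__ == "__main__":
    print(" ".join(build_funseq_command(["sample1.bed", "sample2.bed"], "bed", maf=0.01)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains the funseq executable.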

Input

FunSeq takes BED or VCF files as input.
1. BED format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab-delimited):
chrom chromStart chromEnd Reference.allele Alternative.allele ...

  1. chrom - The name of the chromosome (e.g. chr3, chrY).
  2. chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome. The chromEnd base is not included in the display of the feature.
    For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. Reference.allele - The reference allele of SNVs
  5. Alternative.allele - The alternative allele of SNVs.
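
The five-field layout described above can be checked before running FunSeq. The following is a minimal Python sketch, not part of FunSeq. The function name parse_bed_snv is hypothetical, and the check that an SNV spans exactly one base (chromEnd = chromStart + 1) is an assumption derived from the 0-based, end-exclusive coordinates described above; FunSeq itself may be more permissive.

```python
def parse_bed_snv(line):
    """Parse one tab-delimited BED line with the 5 required fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-delimited fields")
    chrom, start, end, ref, alt = fields[:5]
    start, end = int(start), int(end)
    # Assumption: an SNV covers a single base in 0-based, end-exclusive coordinates.
    if end - start != 1:
        raise ValueError("an SNV should span exactly one base")
    return {
        "chrom": chrom,
        "chromStart": start,
        "chromEnd": end,
        "ref": ref,
        "alt": alt,
        "extra": fields[5:],  # any further columns, e.g. a sample name in column 6
    }


if __name__ == "__main__":
    print(parse_bed_snv("chr1\t36205041\t36205042\tC\tA"))
```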

2. VCF format (http://www.1000genomes.org/node/101)
The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited):

#CHROM POS ID REF ALT QUAL FILTER INFO
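
VCF positions are 1-based, while the BED input described above is 0-based with an exclusive end, so converting an SNV between the two formats shifts the start coordinate by one. The following is a minimal Python sketch of that conversion, not part of FunSeq; the function name vcf_line_to_bed is hypothetical, and the sketch handles single-base variants only.

```python
def vcf_line_to_bed(line):
    """Convert one VCF data line for an SNV into the 5-field BED layout.

    Returns None for header lines, which start with '#'.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least CHROM, POS, ID, REF and ALT")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    pos = int(pos)
    # 1-based POS becomes chromStart = POS - 1 and chromEnd = POS.
    return "\t".join([chrom, str(pos - 1), str(pos), ref, alt])


if __name__ == "__main__":
    print(vcf_line_to_bed("chr1\t36205042\t.\tC\tA\t.\t.\t."))
```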

Output

You can download a sample of the output VCF from http://funseq.gersteinlab.org/PR2832.FunSEQ.vcf.

FunSeq can produce either BED format or VCF format files.

An example of the VCF annotation of a coding variant:

chr1    36205042        .       C       A       .       .       OTHER=MAF(1kg-phase1)=0;CDS=Yes;VA=1:CLSPN:ENSG00000092853.8:-:prematureStop:4/5:CLSPN-001:  \
ENST00000251195.5:3996_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4017_3232_1078_E->*:CLSPN-003:ENST00000373220.3:3825_3040_1014_E->*:CLSPN-004:ENST00000520551.1: \
3858_3073_1025_E->*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4 
  • The OTHER field holds any information from the input file beyond the 5 required fields (chrom, chromStart, chromEnd, reference, alternative). When the input file has fewer than 3,000 lines, OTHER also contains the MAF (minor allele frequency) of the SNV in the 1KG Phase 1 data.
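
The annotation column in the example above is a list of KEY=value pairs separated by ';'. The following is a minimal Python sketch for splitting it into a dictionary, not part of FunSeq. The function name parse_info is hypothetical, and the sketch assumes that values never contain ';', which may not hold if the OTHER field carries semicolons from the original input. The line continuations ('\') shown in the example must be removed before parsing.

```python
def parse_info(info):
    """Split a ';'-separated annotation string into a {key: value} dict.

    Only the first '=' separates key from value, so
    'OTHER=MAF(1kg-phase1)=0' yields the value 'MAF(1kg-phase1)=0'.
    Keys without '=' are stored with the value True.
    """
    result = {}
    for item in info.split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


if __name__ == "__main__":
    example = "OTHER=MAF(1kg-phase1)=0;CDS=Yes;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4"
    for key, value in parse_info(example).items():
        print(key, value)
```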


An example of the VCF annotation of a non-coding variant:

chr5    85913480        .       T       C       .       .       OTHER=MAF(1kg-phase1)=0;CDS=NO;HUB=REG;NCENC=TFP(ETS1),TFP(ELF1),TFP(GATA2),TFP(POU2F2),  \
TFP(TBP),TFP(SRF),TFP(ELK4),TPM(TAF1),TFP(STAT3),TFP(GATA3),TFP(SIX5),TFP(YY1),TPM(TBP),TFP(CHD2),TFP(MYC),TFP(IRF1),DHS(MCV-2),TFP(TAF1),TFP(GATA1),  \
TFP(ZEB1),TFP(SETDB1),TFP(ZNF143),TFP(NFKB1),TFP(MAX),TFP(GABPA),Enhancer(chromHmm),TFP(STAT1);  \
MOTIFBR=85913478#85913493#+#TATA_known1_8mer#TAF1,85913478#85913493#+#TATA_known1_8mer#TBP;GENE=COX7C(promoter);NCDS=4 
  • NCENC (Non-coding ENCODE annotation) field.

TFP - transcription factor binding peak.
TFM - transcription factor motifs in peak regions.
DHS - DNase I hypersensitive sites, with the number of cell lines (given after "MCV-"; 125 cell lines in total) (R.E. Thurman et al., The accessible chromatin landscape of the human genome. Nature 489, 75, Sep 2012).
ncRNA - non-coding RNA
Pseudogene
Enhancer - chromHmm (genome segmentation), drm (distal regulatory module)
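
In the example above, each NCENC entry has the form TYPE(detail), and entries are separated by ','. The following is a minimal Python sketch that splits an NCENC value into (type, detail) pairs, not part of FunSeq. The function name parse_ncenc is hypothetical, and the sketch assumes that details never contain ','. Entries without parentheses are returned with a detail of None.

```python
import re

# TYPE(detail), e.g. TFP(ETS1) or DHS(MCV-2)
_ENTRY = re.compile(r"^([^()]+)\((.*)\)$")


def parse_ncenc(value):
    """Split an NCENC value into a list of (type, detail) tuples."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        match = _ENTRY.match(item)
        if match:
            entries.append((match.group(1), match.group(2)))
        else:
            entries.append((item, None))
    return entries


if __name__ == "__main__":
    for kind, detail in parse_ncenc("TFP(ETS1),DHS(MCV-2),Enhancer(chromHmm)"):
        print(kind, detail)
```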

  • MOTIFBR field.

This field is a hash-delimited tag, defined as follows:
motif start # motif end # motif strand # motif name # transcription factor name
An example: " 85913478#85913493#+#TATA_known1_8mer#TAF1 "
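
The hash-delimited layout defined above maps directly onto a five-field record. The following is a minimal Python sketch for reading a MOTIFBR value, not part of FunSeq. The MotifBreak type and the parse_motifbr function are hypothetical names, and multiple records are assumed to be separated by ',' as in the non-coding example above.

```python
from typing import NamedTuple


class MotifBreak(NamedTuple):
    start: int    # motif start
    end: int      # motif end
    strand: str   # motif strand
    motif: str    # motif name
    tf: str       # transcription factor name


def parse_motifbr(value):
    """Parse a MOTIFBR value into a list of MotifBreak records."""
    records = []
    for item in value.split(","):
        parts = item.strip().split("#")
        if len(parts) != 5:
            raise ValueError("expected 5 '#'-separated fields, got: " + item)
        start, end, strand, motif, tf = parts
        records.append(MotifBreak(int(start), int(end), strand, motif, tf))
    return records


if __name__ == "__main__":
    value = ("85913478#85913493#+#TATA_known1_8mer#TAF1,"
             "85913478#85913493#+#TATA_known1_8mer#TBP")
    for record in parse_motifbr(value):
        print(record)
```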

  • NCRECUR field.

Interpret recurrence in large TF peak and chromHMM regions with caution. Because these annotations have low resolution, recurrence within them may not indicate functional importance.
