FunSeq
From GersteinInfo
Latest revision as of 19:11, 13 January 2014
Installation
A. Required Tools
The following tools are REQUIRED for FunSeq:
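1) Bedtools ( http://code.google.com/p/bedtools/downloads/list )
2) Tabix ( http://sourceforge.net/projects/samtools/files/tabix/ )
3) VAT (snpMapper module) ( http://vat.gersteinlab.org/index.php ). A good installation guide for VAT can be found at http://ngsda.blogspot.com/2011/06/vat.html .
If you are only interested in non-coding variants, you do not need to install VAT, but remember to pass the '-nc' option to FunSeq.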
B. PERL Requirement
1) Please make sure you have Perl 5 or later. The latest PERL can be downloaded here.
2) Install the Parallel::ForkManager package (used for running jobs in parallel). The PERL library can be found here.
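Before installing FunSeq, it can help to confirm that the tools from sections A and B are reachable on your PATH. The following is a minimal Python sketch and is not part of FunSeq. The executable names are assumptions: "perl", "bedtools" and "tabix" are the usual command names, while "snpMapper" is assumed for the VAT module and may be named or located differently on your system.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED = ["perl", "bedtools", "tabix"]
CODING_ONLY = ["snpMapper"]  # assumed name of the VAT module binary


def missing_tools(non_coding_only=False, which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    tools = list(REQUIRED)
    if not non_coding_only:
        tools += CODING_ONLY  # VAT is only needed when coding variants are analysed
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(non_coding_only=False)
    print("missing:", ", ".join(missing) if missing else "none")
```

Pass non_coding_only=True if you plan to run FunSeq with '-nc', in which case VAT is not checked.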
C. FunSeq tool installation
FunSeq is a PERL- and Linux/UNIX-based tool. At the command-line prompt, enter the following:
$ cd FUNSEQ/
$ perl Makefile.PL
$ make
$ make test
$ make install
D. Required Data Files
Please download all the following data files from ' http://funseq.gersteinlab.org/data/ ' and put them in a new folder ' $path/funseq-0.1/data/ ':
1. 1kg.phase1.snp.bed.gz (bed format)
Contents : all 1KG phaseI SNPs in bed format.
Columns : chromosome , SNVs start position (0-based), SNVs end position, MAF (minor allele frequency)
Purpose : to filter out 1KG SNVs based on allele frequencies.
2. ENCODE.annotation.gz (bed format)
Contents : compiled annotation files from ENCODE, Gencode v7 and others, includes DHS, TF peak, Pseudogene, ncRNA, enhancers
Columns : chromosome , annotation start position (0-based), annotation end position, annotation name.
Purpose : to find SNVs in annotated regions.
3. ENCODE.tf.bound.union.bed (bed format)
Contents : transcription factor (TF) motifs in ENCODE TF peaks.
Columns : chromosome, start position (0-based), end position, motif name, , strand, TF name
Purpose : used for motif breaking analysis
4. gencode7.cds.bed (bed format)
Contents : extracted CDS information from Gencode7.
Columns : chromosome, start position, end position
Purpose : to find SNVs in CDS region
5. gencode.v7.promoter.bed (bed format)
Contents : compiled promoter regions, -2.5kb from transcription start site (TSS)
Columns : chromosome, start, end, gene, whether the gene is a hub in protein-protein interaction network (PPI) or regulatory network (REG).
Purpose : to associate promoter SNVs with genes
6. gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.interval
Purpose : For variant annotation tool (VAT); Gencode v7.
7. gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.fa
Purpose : For Variant Annotation Tool (VAT); Gencode v7.
8. DRM_transcript_pairs_modify
Contents : distal regulatory module with gene information.
Purpose : to associate enhancer SNVs with genes
9. motif.PWM
Contents : PWMs
Purpose : used for motif breaking calculation
10. PPI.hubs.txt
Purpose : defined hub genes in protein-protein interaction network
11. REG.hubs.txt
Purpose : defined hub genes in regulatory network
12. GENE.strong_selection.txt
Purpose : genes under strong negative selection (fraction of rare SNVs among non-synonymous variants).
13. human_ancestor_GRCh37_e59.fa
Contents : human ancestral alleles in hg19 (GRCh37).
Purpose : for motif breaking calculation in personal or germ-line genome.
* Note : for somatic analysis, these files are not needed.
14. sensitive.nc.bed
Contents : coordinates of sensitive/ultra-sensitive regions.
Purpose : to find SNVs in sensitive/ultra-sensitive regions.
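A missing data file is an easy mistake to make with this many downloads. The sketch below, in Python and not part of FunSeq, lists which of the files named above are absent from a data folder. The file names are copied from this list and may not match the download site exactly. It also assumes that the note under file 13 refers to that file only, so the ancestral genome is skipped for somatic analysis.

```python
from pathlib import Path

# File names copied from the list above.
DATA_FILES = [
    "1kg.phase1.snp.bed.gz",
    "ENCODE.annotation.gz",
    "ENCODE.tf.bound.union.bed",
    "gencode7.cds.bed",
    "gencode.v7.promoter.bed",
    "gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.interval",
    "gencode.v7.annotation.GRCh37.cds.gtpc.ttpc.fa",
    "DRM_transcript_pairs_modify",
    "motif.PWM",
    "PPI.hubs.txt",
    "REG.hubs.txt",
    "GENE.strong_selection.txt",
    "human_ancestor_GRCh37_e59.fa",
    "sensitive.nc.bed",
]

# Assumption: only this file is covered by the "not needed for somatic analysis" note.
GERMLINE_ONLY = {"human_ancestor_GRCh37_e59.fa"}


def missing_data_files(data_dir, somatic=True):
    """Return the expected data files that are not present in data_dir."""
    data_dir = Path(data_dir)
    wanted = [f for f in DATA_FILES if not (somatic and f in GERMLINE_ONLY)]
    return [f for f in wanted if not (data_dir / f).is_file()]
```

For example, missing_data_files("funseq-0.1/data", somatic=False) returns an empty list once all 14 files are in place.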
Running FunSeq
Usage
Usage : ./funseq -f file -maf maf -m <1/2> -inf <bed/vcf> -outf <bed/vcf> -nc
Options :
    -f      user input SNVs file
    -maf    Minor Allele Frequency (MAF) threshold to filter 1KG phaseI SNVs (value 0 ~ 1)
    -m      1 - somatic Genome; 2 - germline or personal Genome
    -inf    input format - BED or VCF
    -outf   output format - BED or VCF
    -nc     [Optional] Only do non-coding analysis.
Default : -maf 0 -m 1 -outf vcf
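When FunSeq is called from a script, it may be convenient to assemble and validate the arguments first. The helper below is a hypothetical Python sketch, not something FunSeq provides. It uses only the options listed in the usage text, applies the stated defaults, and treats -inf as required because no default is given for it.

```python
def funseq_command(snv_file, inf, maf=0, mode=1, outf="vcf", non_coding_only=False):
    """Build the argument list for ./funseq from the documented options."""
    if not 0 <= maf <= 1:
        raise ValueError("-maf must be between 0 and 1")
    if mode not in (1, 2):
        raise ValueError("-m must be 1 (somatic) or 2 (germline or personal)")
    for flag, fmt in (("-inf", inf), ("-outf", outf)):
        if fmt not in ("bed", "vcf"):
            raise ValueError(flag + " must be 'bed' or 'vcf'")
    cmd = ["./funseq", "-f", str(snv_file), "-maf", str(maf), "-m", str(mode),
           "-inf", inf, "-outf", outf]
    if non_coding_only:
        cmd.append("-nc")  # skip coding analysis, so VAT is not needed
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains the funseq script.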
Input
FunSeq takes BED or VCF files as input.
1. BED format
In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab-delimited):
chrom chromStart chromEnd Reference.allele Alternative.allele ...
* chrom - The name of the chromosome (e.g. chr3, chrY).
* chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
* chromEnd - The ending position of the feature in the chromosome. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
* Reference.allele - The reference allele of SNVs.
* Alternative.allele - The alternative allele of SNVs.
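The field layout above can be checked line by line before running FunSeq. This Python sketch is not part of FunSeq and rests on one assumption: since the input describes SNVs and chromEnd is exclusive, each record should span exactly one base (chromEnd = chromStart + 1). FunSeq itself may be more lenient.

```python
def parse_bed_snv(line):
    """Parse one tab-delimited BED line holding the 5 required fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-delimited fields")
    chrom, start, end, ref, alt = fields[:5]
    start, end = int(start), int(end)
    # Assumption: an SNV covers one base in 0-based, half-open coordinates.
    if start < 0 or end != start + 1:
        raise ValueError("an SNV should satisfy chromEnd = chromStart + 1")
    return {
        "chrom": chrom,
        "chromStart": start,
        "chromEnd": end,
        "ref": ref,
        "alt": alt,
        "other": fields[5:],  # any extra columns are kept untouched
    }
```

A ValueError on any line points to a coordinate or column problem in the input file.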
2. VCF format
The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited):
CHROM POS ID REF ALT QUAL FILTER INFO
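The two input formats differ in their coordinates: VCF POS is 1-based, while BED chromStart is 0-based with an exclusive chromEnd. The Python sketch below shows the conversion for single-base records. It is an illustration, not FunSeq's own code, and it simply skips records whose REF or ALT is longer than one base.

```python
def vcf_snvs_to_bed(lines):
    """Yield 5-field BED rows (lists of strings) for single-base VCF records."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            raise ValueError("expected at least CHROM, POS, ID, REF, ALT")
        chrom, pos, _id, ref, alt = fields[:5]
        if len(ref) != 1 or len(alt) != 1:
            continue  # not a simple SNV; left out of this sketch
        pos = int(pos)  # VCF POS is 1-based
        yield [chrom, str(pos - 1), str(pos), ref, alt]
```

Joining each row with tabs gives lines in the BED layout described in the previous section.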
Output
FunSeq can produce output files in either BED or VCF format.
An example of the VCF annotation of a coding variant:
chr1 36205042 . C A . . OTHER=MAF(1kg-phase1)=0;CDS=Yes;VA=1:CLSPN:ENSG00000092853.8:-:prematureStop:4/5:CLSPN-001: \ ENST00000251195.5:3996_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4017_3232_1078_E->*:CLSPN-003:ENST00000373220.3:3825_3040_1014_E->*:CLSPN-004:ENST00000520551.1: \ 3858_3073_1025_E->*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4
- The OTHER field holds any information from the input file beyond the 5 required fields (chrom, chromStart, chromEnd, reference, alternative). When the input file has fewer than 3,000 lines, OTHER also contains the MAF (minor allele frequency) of the SNV in 1KG Phase1 data.
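The annotation column in the example above is a semicolon-separated list of KEY=VALUE tags. A minimal Python sketch for splitting it into a dictionary follows. It assumes that tag values contain no semicolons of their own, which may not hold if OTHER carries an original INFO string with several entries.

```python
def parse_info(info):
    """Split a FunSeq annotation string into a {key: value} dictionary."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue
        # Split on the first "=" only, so "OTHER=MAF(1kg-phase1)=0" keeps its value whole.
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # tags without "=" are stored as flags
    return result


# Abbreviated from the coding example above (VA tag omitted for brevity).
tags = parse_info("OTHER=MAF(1kg-phase1)=0;CDS=Yes;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=4")
```

Here tags["GENE"] is "CLSPN" and tags["OTHER"] is "MAF(1kg-phase1)=0".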
An example of the VCF annotation of a non-coding variant:
chr5 85913480 . T C . . OTHER=MAF(1kg-phase1)=0;CDS=NO;HUB=REG;NCENC=TFP(ETS1),TFP(ELF1),TFP(GATA2),TFP(POU2F2), \ TFP(TBP),TFP(SRF),TFP(ELK4),TPM(TAF1),TFP(STAT3),TFP(GATA3),TFP(SIX5),TFP(YY1),TPM(TBP),TFP(CHD2),TFP(MYC),TFP(IRF1),DHS(MCV-2),TFP(TAF1),TFP(GATA1), \ TFP(ZEB1),TFP(SETDB1),TFP(ZNF143),TFP(NFKB1),TFP(MAX),TFP(GABPA),Enhancer(chromHmm),TFP(STAT1); \ MOTIFBR=85913478#85913493#+#TATA_known1_8mer#TAF1,85913478#85913493#+#TATA_known1_8mer#TBP;GENE=COX7C(promoter);NCDS=4
- NCENC (Non-coding ENCODE annotation) field.
TFP - transcription factor binding peak.
TFM - transcription factor motifs in peak regions.
DHS - DNase I hypersensitive sites, with the number of cell lines (MCV- , out of 125 cell lines in total) (R.E. Thurman et al., The accessible chromatin landscape of the human genome. Nature 489, 75, Sep 2012). For cell-line information, please refer to the DHS cell lines list ( http://archive.gersteinlab.org/yaofu/DHS/ ).
ncRNA - non-coding RNA
Pseudogene
Enhancer - chromHmm (genome segmentation), drm (distal regulatory module)
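An NCENC value is a comma-separated list of entries of the form TYPE(detail), as in the non-coding example above. The Python sketch below groups the details by annotation type. It assumes that details contain no commas, and it stores entries without parentheses (which may or may not occur for Pseudogene or ncRNA) with a detail of None.

```python
import re
from collections import defaultdict

# TYPE(detail), e.g. "TFP(ETS1)" or "DHS(MCV-2)"
_ENTRY = re.compile(r"^([^()]+)\((.*)\)$")


def parse_ncenc(value):
    """Group NCENC entries by annotation type."""
    grouped = defaultdict(list)
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = _ENTRY.match(entry)
        if match:
            grouped[match.group(1)].append(match.group(2))
        else:
            grouped[entry].append(None)  # entry without a parenthesised detail
    return dict(grouped)
```

For instance, parse_ncenc("TFP(ETS1),TFP(ELF1),DHS(MCV-2)") groups the two TF names under "TFP" and "MCV-2" under "DHS".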
- MOTIFBR field.
This field is a hash-delimited tag, defined as follows:
TF name # motif name # motif start # motif end # motif strand # mutation position # alternative allele frequency in PFM # reference allele frequency in PFM
An example: " TAF1#TATA_known1_8mer#85913478#85913493#+#3#0.02#0.4 "
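The eight-part layout defined above maps naturally onto named fields. The following Python sketch parses a MOTIFBR value into records. It assumes that multiple entries are separated by commas, as in the non-coding example, and it rejects entries that do not have exactly eight parts, which includes the shorter five-part entries shown in that example.

```python
from typing import NamedTuple


class MotifBreak(NamedTuple):
    tf_name: str
    motif_name: str
    motif_start: int
    motif_end: int
    strand: str
    mutation_position: int
    alt_freq: float  # alternative allele frequency in PFM
    ref_freq: float  # reference allele frequency in PFM


def parse_motifbr(value):
    """Parse a MOTIFBR value into a list of MotifBreak records."""
    records = []
    for entry in value.split(","):
        parts = entry.strip().split("#")
        if len(parts) != 8:
            raise ValueError("expected 8 hash-delimited parts, got %d" % len(parts))
        tf, motif, start, end, strand, pos, alt, ref = parts
        records.append(MotifBreak(tf, motif, int(start), int(end), strand,
                                  int(pos), float(alt), float(ref)))
    return records
```

Comparing alt_freq with ref_freq for each record shows how strongly the alternative allele is disfavoured at the mutated motif position.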
- NCRECUR field.
Please be cautious with large TF peaks and chromHMM regions. Because these regions have low resolution, recurrence within them may not indicate functional importance.