GRAM
From GersteinInfo
Latest revision as of 03:00, 2 May 2019
Variant Prioritization
A. Dependencies
The following tools are required:
- sed, awk, grep
- DeepBind (version deepbind-v0.11)
- bedtools (version bedtools-2.17.0)
- tabix (version tabix-0.2.6 and up)
- R (required packages: randomForest, glmnet, reshape2, gplots)
B. Tool Download
This is a Linux/UNIX-based tool. At the command-line prompt, type the following.
$ git clone https://github.com/gersteinlab/gram.git
C. Configuration
The pipeline script grammar.sh must be configured before first use. Set the following variables as instructed:
genome: path to the genome sequence file (e.g. hg19.fa); chromosome names must carry the "chr" prefix.
It can be downloaded from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gencode: path to the GENCODE annotation of CDS regions (e.g. gencode.v19.cds.bed, available from the GRAM GitHub repository: https://github.com/gersteinlab/gram)
dpath: path to the DeepBind model; a copy is available from the GRAM GitHub repository: https://github.com/gersteinlab/gram
path_funseq: path to the FunSeq whole-genome score file, which can be downloaded from funseq3.gersteinlab.org
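The "chr" prefix requirement on the genome file can be checked before running the pipeline. The following is a minimal sketch, not part of GRAM itself; the function name is hypothetical and it only inspects FASTA header lines.

```python
from typing import Iterable, List


def chromosomes_missing_chr_prefix(fasta_lines: Iterable[str]) -> List[str]:
    """Return FASTA sequence names that do not start with 'chr'.

    Hypothetical helper for illustration; not part of the GRAM code base.
    """
    missing = []
    for line in fasta_lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if not name.startswith("chr"):
                missing.append(name)
    return missing


example = [">chr1", "ACGT", ">2 some description", "GGCC"]
print(chromosomes_missing_chr_prefix(example))  # ['2']
```

An empty list means every sequence name carries the prefix. For a real genome file, pass the open file handle instead of the example list.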
D. Tool Usage
$ cd gram/
To display the usage of the tool, type ‘./grammar.sh -h’.
* Usage : ./grammar -i bed -e exp
Options :
-i [Required] User Input SNVs file (BED format: chr st ed ref mut sample-id rsid)
-e [Required] User Input gene expression matrix
NOTE: Please make sure you have sufficient memory, at least 3G.
-i : Required format: chr st ed ref mut sample-id rsid
-e : The rows correspond to genes and the columns correspond to samples. Sample ids need to match those in the variant BED file.
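Because the sample ids in the two inputs need to match, it may help to compare them before a run. The sketch below assumes both files are tab delimited, that the sample id is the 6th BED column, and that the first header cell of the matrix is the gene-name label; the function name is hypothetical and not part of GRAM.

```python
from typing import Iterable, Set


def unmatched_sample_ids(bed_lines: Iterable[str], matrix_header: str) -> Set[str]:
    """Return sample ids found in the variant BED but absent from the matrix header.

    Hypothetical helper for illustration; not part of the GRAM code base.
    """
    # The first header cell labels the gene column; the rest are sample ids.
    matrix_samples = set(matrix_header.rstrip("\n").split("\t")[1:])
    bed_samples = set()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6:
            bed_samples.add(fields[5])  # column 6 holds the sample id
    return bed_samples - matrix_samples


bed = [
    "chr2\t242382863\t242382864\tC\tT\tPatient-1\trs3771570",
    "chr2\t242382863\t242382864\tC\tT\tPatient-2\trs3771570",
]
header = "Gene\tPatient-1\tPatient-3"
print(unmatched_sample_ids(bed, header))  # {'Patient-2'}
```

An empty set means every sample in the variant file has a column in the expression matrix.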
E. Pipeline
Pipeline diagram: http://funseq3.gersteinlab.org/GRAMMAR.pipeline.png
The pipeline GRAMMAR runs as follows:
- Keep only non-coding SNVs, using the provided CDS annotation to exclude coding variants (indels are also removed)
- Extract sequences from the provided genome
- Run DeepBind on the extracted sequences
- Run GRAM with expression data and DeepBind output
- Extract FunSeq scores of the variants
- Print and visualize the results
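The first step (dropping indels and variants that fall in CDS regions) can be illustrated as follows. This is a sketch of the idea only, under the assumption that a variant counts as coding when its interval overlaps any CDS interval on the same chromosome; the source does not show how grammar.sh actually performs this step, and all names below are hypothetical.

```python
from typing import List, Tuple

Variant = Tuple[str, int, int, str, str]  # (chrom, start, end, ref, mut)
Interval = Tuple[str, int, int]           # (chrom, start, end), 0-based, end exclusive

BASES = {"A", "C", "G", "T"}


def is_snv(ref: str, mut: str) -> bool:
    """True when both alleles are single bases, i.e. the variant is not an indel."""
    return ref.upper() in BASES and mut.upper() in BASES


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Half-open interval overlap test."""
    return a_start < b_end and b_start < a_end


def keep_noncoding_snvs(variants: List[Variant], cds: List[Interval]) -> List[Variant]:
    """Keep SNVs that do not overlap any CDS interval (simple linear scan)."""
    kept = []
    for chrom, start, end, ref, mut in variants:
        if not is_snv(ref, mut):
            continue  # indels are removed
        in_cds = any(c == chrom and overlaps(start, end, s, e) for c, s, e in cds)
        if not in_cds:
            kept.append((chrom, start, end, ref, mut))
    return kept


cds = [("chr1", 100, 200)]
variants = [
    ("chr1", 150, 151, "A", "G"),   # inside the CDS interval: dropped
    ("chr1", 300, 301, "C", "T"),   # non-coding SNV: kept
    ("chr1", 400, 402, "AT", "A"),  # indel: dropped
]
print(keep_noncoding_snvs(variants, cds))  # [('chr1', 300, 301, 'C', 'T')]
```

The linear scan is only suitable for small inputs; a dedicated interval tool is the better choice for whole-genome annotation files.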
Sample report:
GRAMMAR: parameter parsing done
GRAMMAR: input variant bed file: snptest.bed
GRAMMAR: input gene expression file: gexpr.test.bed
GRAMMAR: checking dependency done
GRAMMAR: checking genome fasta and gencode annotation configs done
--------------------------------------------------
GRAMMAR: Files checking done
==================================================
GRAMMAR: Filtering non-coding SNVs only. InDels will also be removed.
--------------------------------------------------
GRAMMAR: SNP filtering done
==================================================
GRAMMAR: Running Deepbind on the selected genomic regions..
GRAMMAR: Preparing input sequences for DeepBind score calculation. Please make sure you have added a correct DeepBind path and also put the parameter file db/params in the correct folder.
GRAMMAR: Running Deepbind now. It may take long time if your SNV input list is very long
GRAMMAR: cd gram/deepbind
GRAMMAR: cd grammar.test
GRAMMAR: checking DeepBind output...
--------------------------------------------------
GRAMMAR: Deepbind score prediction done
==================================================
Step 2: Running GRAM predictor.
GRAMMAR: cd gram
GRAMMAR: Rscript gram/gram.predict.r grammar.test ref.db.out mut.db.out nc.var.bed snp.4deepbind.bed gexpr.test.bed GRAMMAR/model.rdata snptest.bed.out/gram.score.txt
[1] 20242 102
[1] "After filting, only 102 of 199 sample left for the prediction"
GRAMMAR: Prediction results have been saved in snptest.bed.out
--------------------------------------------------
GRAMMAR: the prediction of GRAM score done
==================================================
GRAMMAR: Visualizing results.
GRAMMAR: Getting Funseq score for the variants.
--------------------------------------------------
GRAMMAR: Results visualization done
==================================================
GRAMMAR: your job is done. please go to snptest.bed.out to find your results.
==================================================
F. Input files
- User input SNV file (-i): BED format
In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab delimited; the 6th column is reserved for sample names, do not put other information there):
chromosome, start position, end position, reference allele, alternative allele, sample id, rsid.
Chromosome - name of the chromosome (e.g. chr3, chrX)
Start position - start coordinate of the variant (0-based)
End position - end coordinate of the variant (end exclusive), e.g. chr1 0 100 spans bases numbered 0-99
Reference allele - germline allele of the variant
Alternative allele - mutated allele of the variant
Sample id - the sample id, specifying the input sample or cell line (e.g. "Patient-1", "GM12878")
RSID - the id of the variant (e.g. rs9347341)
e.g.
chr2 242382863 242382864 C T Patient-1 rs3771570
chr2 242382863 242382864 C T Patient-2 rs3771570
chr6 117210051 117210052 T C Patient-3 rs339331
… … … … … … …
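A row of the SNV file can be sanity-checked against the format described above. The sketch below assumes all seven listed columns are present and tab delimited, and uses the 0-based, end-exclusive convention to require end > start; the function name is hypothetical and not part of GRAM.

```python
from typing import List


def check_snv_row(line: str) -> List[str]:
    """Return a list of format problems found in one line of the SNV BED file.

    Hypothetical helper for illustration; not part of the GRAM code base.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        return [f"expected 7 tab-delimited fields, found {len(fields)}"]

    chrom, start, end, ref, mut, sample_id, rsid = fields
    problems = []
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be non-negative integers")
    elif int(end) <= int(start):
        # 0-based start with exclusive end means a valid interval has end > start.
        problems.append("end must be greater than start (0-based, end exclusive)")
    if not sample_id:
        problems.append("column 6 (sample id) is empty")
    return problems


print(check_snv_row("chr2\t242382863\t242382864\tC\tT\tPatient-1\trs3771570"))  # []
print(check_snv_row("chr2\t10\t10\tC\tT\tPatient-1\trs1"))
# ['end must be greater than start (0-based, end exclusive)']
```

An empty list means the row passed these checks; it does not guarantee that the alleles or coordinates are biologically correct.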
- User input expression matrix (-e)
The gene expression file should be prepared as a tab-delimited matrix whose first column stores gene names (use gene symbols) and whose first row stores sample names. The remaining fields are gene expression values, either RPKM or raw read counts.
e.g.
Gene Sample1 Sample2 Sample3 Sample4 …
A1BG 1 5 40 0 …
A1CF 20 9 0 23 …
… … … … … …
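The matrix layout described above can be read with a few lines of code. This is a minimal sketch assuming a tab-delimited file with one header row and numeric values in every other cell; the function name is hypothetical and not part of GRAM.

```python
from typing import Dict, Iterable, List, Tuple


def read_expression_matrix(lines: Iterable[str]) -> Tuple[List[str], Dict[str, List[float]]]:
    """Parse a gene-by-sample matrix into (sample names, {gene: values}).

    Hypothetical helper for illustration; not part of the GRAM code base.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    samples = header[1:]  # the first cell labels the gene column
    matrix: Dict[str, List[float]] = {}
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        gene, values = fields[0], fields[1:]
        if len(values) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, found {len(values)}")
        matrix[gene] = [float(v) for v in values]
    return samples, matrix


example = [
    "Gene\tSample1\tSample2\tSample3\tSample4",
    "A1BG\t1\t5\t40\t0",
    "A1CF\t20\t9\t0\t23",
]
samples, matrix = read_expression_matrix(example)
print(samples)         # ['Sample1', 'Sample2', 'Sample3', 'Sample4']
print(matrix["A1CF"])  # [20.0, 9.0, 0.0, 23.0]
```

The length check catches rows that lost or gained a column, which is a common side effect of spaces being used in place of tabs.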
G. Output files
Five output files will be generated: ‘Output.format’, ‘Output.indel.format’, ‘Recur.Summary’, ‘Candidates.Summary’ and ‘Error.log’.
- Output.format: detailed results for all samples
- Output.indel.format: results for indels
- Recur.Summary: recurrence results when multiple samples are provided
- Candidates.Summary: brief output of potential candidates (coding nonsynonymous/premature-stop variants; non-coding variants with score >= 5 under the un-weighted scoring scheme and >= 1.5 under the weighted scoring scheme; and variants in or associated with known cancer genes)
- Error.log: error information
Under the un-weighted scoring scheme, each feature is given a value of 1.
When gene expression files are provided, two additional files are produced: ‘DE.gene.txt’ stores differentially expressed genes and ‘DE.pdf’ is the differential gene expression plot.
- Sample GRAM output
chr2 242382863 242382864 C T Patient-1 rs3771570 0.554 0.452 -0.590695386674223 0.32 0.474 0.25859476288468
chr2 242382863 242382864 C T Patient-2 rs3771570 0.554 0.452 -0.590695386674223 0.32 0.474 0.25859476288468
chr2 242382863 242382864 C T Patient-3 rs3771570 0.554 0.452 -0.590695386674223 0.418 0.474 0.428287551353096
Columns:
1: (chrome) name of the chromosome
2: (start) start coordinates of variants. (0-based)
3: (end) end coordinates of variants. (end exclusive)
4: (ref) reference allele of variants
5: (mut) mutant allele of variants
6: (sampleid) the ID of the sample
7: (snpid) the ID of the SNV
8: (ref.enhAct) general regulatory activity of the reference allele
9: (alt.enhAct) general regulatory activity of the mutant allele
10: (logodds) logodds calculated from reference and mutant allele regulatory activity
11: (expr.modifier) cell type modifier score predicted from TF expression
12: (binding.modifier) cell type modifier score predicted from TF binding
13: (gram.prob) predicted GRAM score
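The 13 columns above can be attached to each output row for downstream use. The sketch below assumes the output fields are whitespace separated with no header line (the source does not state the delimiter); the function name is hypothetical and not part of GRAM.

```python
from typing import Dict, List, Union

# Column names as listed in the output description above.
COLUMNS = [
    "chrome", "start", "end", "ref", "mut", "sampleid", "snpid",
    "ref.enhAct", "alt.enhAct", "logodds",
    "expr.modifier", "binding.modifier", "gram.prob",
]

Row = Dict[str, Union[str, int, float]]


def parse_gram_row(line: str) -> Row:
    """Turn one line of GRAM output into a dict keyed by column name.

    Hypothetical helper for illustration; not part of the GRAM code base.
    """
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, found {len(fields)}")
    row: Row = dict(zip(COLUMNS, fields))
    for name in ("start", "end"):
        row[name] = int(row[name])
    for name in COLUMNS[7:]:  # columns 8-13 are numeric scores
        row[name] = float(row[name])
    return row


lines = [
    "chr2 242382863 242382864 C T Patient-1 rs3771570 0.554 0.452 -0.590695386674223 0.32 0.474 0.25859476288468",
    "chr2 242382863 242382864 C T Patient-3 rs3771570 0.554 0.452 -0.590695386674223 0.418 0.474 0.428287551353096",
]
rows: List[Row] = [parse_gram_row(line) for line in lines]
# Order rows by predicted GRAM score, highest first.
rows.sort(key=lambda r: r["gram.prob"], reverse=True)
print(rows[0]["sampleid"], rows[0]["gram.prob"])  # Patient-3 0.428287551353096
```

Keeping the column names identical to the documented labels makes it easy to cross-reference a parsed row with the column list.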
Contact
shaoke DOT lou AT yale DOT edu