Variant Prioritization

A. Dependencies

The following tools are required:

sed, awk, grep
DeepBind (version deepbind-v0.11)
bedtools (version bedtools-2.17.0)
tabix (version tabix-0.2.6 and up)
R (required packages: randomForest, glmnet, reshape2, gplots)
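
Before the first run it can help to confirm that the command-line dependencies are on the PATH. The following is a minimal sketch and is not part of GRAM: the helper name missing_tools is made up for illustration, and the executable names deepbind and Rscript in the example call are assumptions about how DeepBind and R are installed on your system.

```shell
# Print the name of every listed command that cannot be found on the PATH.
# Hypothetical helper, not shipped with GRAM.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (deepbind and Rscript are assumed executable names):
# missing_tools sed awk grep bedtools tabix Rscript deepbind
```

An empty result means every listed command was found. The check does not verify tool versions or the R packages.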

B. Tool Download

GRAM is a Linux/UNIX-based tool. To unpack the archive, type the following at the command-line prompt.

$ tar xvf gram.tar

C. Configuration

The pipeline script grammar.sh must be configured before first use. Please fill in the values of the variables below as instructed:

       genome: the path for genomic sequences (e.g. hg19.fa)
       gencode: the path for gencode annotation of cds regions (e.g. gencode.v19.cds.bed)
       dpath: the path for DeepBind model
       path_funseq: the path for FunSeq

D. Tool Usage

$ cd gram/

To display the tool's usage, type ‘./grammar.sh -h’.

* Usage : ./grammar.sh -i bed -e exp -d dpath -f fpath -o path
       Options :
               	-i		[Required] User Input SNVs file (BED format: chr st ed ref mut sample-id rsid)
               	-e	 	[Required] User Input gene expression matrix
               	-o	 	[Required] Output path
            
               	
               	NOTE: Please make sure you have sufficient memory, at least 3G.

-i: Required format: chr st ed ref mut sample-id rsid
-e: Rows correspond to genes and columns correspond to samples. Sample ids must match those in the variant BED file.

E. Input files

User input SNV file (-i): BED format

In addition to the three required BED fields, please prepare your file as follows (5 required fields, tab delimited; the 6th column is reserved for sample names, do not put other information there):

chromosome, start position, end position, reference allele, alternative allele, sample id, rsid.
       Chromosome - name of the chromosome (e.g. chr3, chrX)
       Start position - start coordinates of variants. (0-based)
       End position - end coordinates of variants. (end exclusive)
               e.g., chr1   0     100  spanning bases numbered 0-99
       Reference allele - germline allele of variants
       Alternative allele - mutated allele of variants
       Sample id - the sample id, specifying the input sample or cell line (e.g. "Patient-1", "GM12878")
       RSID -  the id for the variant (e.g. rs9347341)
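
Many variant callers report positions in 1-based coordinates, whereas the fields above are 0-based with an exclusive end. For a single-base variant at 1-based position p, that means start = p - 1 and end = p. The helper below is a minimal sketch of that arithmetic; snv_to_bed is a hypothetical name and is not part of GRAM.

```shell
# Convert one single-base variant, given as
#   chrom  pos(1-based)  ref  alt  sample-id  rsid
# into a 7-column, tab-delimited BED line (0-based start, exclusive end).
# Hypothetical helper for illustration only.
snv_to_bed() {
    printf '%s\t%d\t%d\t%s\t%s\t%s\t%s\n' \
        "$1" "$(( $2 - 1 ))" "$2" "$3" "$4" "$5" "$6"
}

# Example call:
# snv_to_bed chr2 242382864 C T Patient-1 rs3771570
```

The helper handles single-base substitutions only; it makes no attempt to place insertions or deletions.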

e.g.

       chr2	242382863	242382864	C	T	Patient-1	rs3771570
       chr2	242382863	242382864	C	T	Patient-2	rs3771570
       chr6	117210051	117210052	T	C	Patient-3	rs339331
       …	…		…		…	…	…		…
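
Stray spaces or a missing column in the SNV file are easy to overlook. The sketch below flags lines that break the layout described above. It is not part of GRAM: check_snv_bed is a hypothetical helper, and it assumes that all seven columns are present, that chromosome names carry the chr prefix, and that the file has no header line.

```shell
# Print one message per problem found in a tab-delimited SNV BED file.
# No output means no problem was detected. Hypothetical helper.
check_snv_bed() {
    awk -F '\t' '
        NF != 7 {
            printf "line %d: expected 7 fields, found %d\n", NR, NF
            next
        }
        $1 !~ /^chr/ {
            printf "line %d: chromosome %s lacks the chr prefix\n", NR, $1
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: start and end must be non-negative integers\n", NR
            next
        }
        $3 + 0 <= $2 + 0 {
            printf "line %d: end (%s) must be greater than start (%s)\n", NR, $3, $2
        }
    ' "$1"
}

# Example call (variants.bed is a placeholder file name):
# check_snv_bed variants.bed
```

Because the integer test is strict, a coordinate followed by a trailing space is reported as invalid.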

User input expression matrix (-e): Prepare the gene expression file as a tab-delimited matrix in which the first column stores gene names (use gene symbols) and the first row stores sample names. All other fields are gene expression values, either in RPKM or as raw read counts.

e.g.

       Gene	Sample1	Sample2	Sample3	Sample4	…
       A1BG	1	5	40	0	…
       A1CF	20	9	0	23	…
       …	…	…	…	…	…
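
Since sample ids in the expression matrix need to match those in the SNV file, a cross-check of the two inputs can save a failed run. The sketch below lists every sample id found in column 6 of the BED file that is missing from the header row of the matrix. It is not part of GRAM; unmatched_samples is a hypothetical helper and both files are assumed to be tab delimited.

```shell
# Usage: unmatched_samples <snv.bed> <expression_matrix>
# Prints each BED sample id (column 6) that is absent from the first row
# of the expression matrix, once per id. Hypothetical helper.
unmatched_samples() {
    awk -F '\t' -v header="$(head -n 1 "$2")" '
        BEGIN {
            # Column 1 of the header is the gene-name label, so start at 2.
            n = split(header, h, "\t")
            for (i = 2; i <= n; i++) known[h[i]] = 1
        }
        !($6 in known) && !seen[$6]++ { print $6 }
    ' "$1"
}

# Example call (both file names are placeholders):
# unmatched_samples variants.bed expression.txt
```

An empty result means every sample id in the BED file has a matching column in the matrix.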

F. Output files

Five output files will be generated:

       Output.format: detailed results for all samples
       Output.indel.format: results for indels
       Recur.Summary: the recurrence result when multiple samples are provided
       Candidates.Summary: brief output of potential candidates (coding nonsynonymous/premature-stop variants, non-coding variants with score >= 5 under the un-weighted scoring scheme and >= 1.5 under the weighted scoring scheme, and variants in or associated with known cancer genes)
       Error.log: error information

In the un-weighted scoring scheme, each feature is given the value 1.

When gene expression files are provided, two additional files will be produced: ‘DE.gene.txt’ stores differentially expressed genes, and ‘DE.pdf’ is the differential gene expression plot.

Sample GRAM output

      chr2 242382863 242382864 C T Patient-1  rs3771570 0.554 0.452 -0.590695386674223  0.32  0.474 0.25859476288468
      chr2 242382863 242382864 C T Patient-2  rs3771570 0.554 0.452 -0.590695386674223  0.32  0.474 0.25859476288468
      chr2 242382863 242382864 C T Patient-3  rs3771570 0.554 0.452 -0.590695386674223  0.418 0.474 0.428287551353096

Columns:

       1: (chrome) name of the chromosome
       2: (start) start coordinates of variants. (0-based)
       3: (end) end coordinates of variants. (end exclusive)
       4: (ref) reference allele of variants
       5: (mut) mutant allele of variants
       6: (sampleid) the ID of the sample
       7: (snpid) the ID of the SNV
       8: (ref.enhAct) general regulatory activity of the reference allele
       9: (alt.enhAct) general regulatory activity of the mutant allele
       10: (logodds) log odds calculated from reference and mutant allele regulatory activity
       11: (expr.modifier) cell type modifier score predicted from TF expression
       12: (binding.modifier) cell type modifier score predicted from TF binding
       13: (gram.prob) predicted GRAM score
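
The formula behind column 10 is not given here. The sample rows above are, however, consistent with a base-2 log odds ratio computed from columns 8 and 9, that is log2[(alt / (1 - alt)) / (ref / (1 - ref))]. Treat this as an inference from the sample values, not as a documented definition. The sketch below recomputes the value so that it can be compared with column 10; recompute_logodds is a hypothetical helper, and whitespace-separated columns are assumed.

```shell
# Usage: recompute_logodds <gram_output_file>
# Prints, tab delimited: sample id, SNV id, reported logodds (column 10),
# and the value recomputed as a base-2 log odds ratio of columns 9 and 8.
# The formula is an assumption inferred from the sample output.
recompute_logodds() {
    awk '
        # Skip rows where the odds are undefined (activity not in (0, 1)).
        $8 <= 0 || $8 >= 1 || $9 <= 0 || $9 >= 1 { next }
        {
            ref = $8; alt = $9
            lo = (log(alt / (1 - alt)) - log(ref / (1 - ref))) / log(2)
            printf "%s\t%s\t%s\t%.6f\n", $6, $7, $10, lo
        }
    ' "$1"
}

# Example call (Output.format is the results file named above; its exact
# column layout is assumed to match the sample shown):
# recompute_logodds Output.format
```

If the last two printed columns disagree for your data, the assumed formula does not apply and column 10 should be taken as reported.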

Contact

shaoke DOT lou AT yale DOT edu
