From GersteinInfo

Revision as of 17:37, 29 April 2019 by Public (Talk | contribs)

Variant Prioritization

A. Dependencies

The following tools are required:

  • sed, awk, grep
  • DeepBind (version deepbind-v0.11)
  • bedtools (version bedtools-2.17.0)
  • tabix (version tabix-0.2.6 and up)
  • R
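Before running the tool, it can help to confirm that the command-line dependencies are on the PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, and the check does not verify the versions listed above. DeepBind is left out because its location is passed with -d.

```shell
#!/bin/sh
# Print the name of every listed command that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of GRAM.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call for the tools named in section A:
#   missing_tools sed awk grep bedtools tabix R
```

An empty result means every listed command was found.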

B. Tool Download

This is a Linux/UNIX-based tool. At the command-line prompt, type the following.

$ tar xvf gram.tar

C. Pre-built Data Context

All of the data can be downloaded under ‘Downloads’ in the web server.

D. Tool Usage

$ cd gram/

To display the tool's usage, type ‘./ -h’.

* Usage : ./ -i bed -e exp -d dpath -f fpath -o path
       Options :
               	-i		[Required] User Input SNVs file (BED format: chr st ed ref mut sample-id rsid)
               	-e	 	[Required] User Input gene expression matrix
               	-d 		[Required] The path for DeepBind model
               	-f		[Required] The path for FunSeq
               	-o	 	[Required] Output path
               	NOTE: Please make sure you have sufficient memory, at least 3G.

-i: Required format: chr st ed ref mut sample-id rsid
-e: Rows correspond to genes and columns correspond to samples. Sample ids must match those in the variant BED file.

E. Input files

  • User input SNV file (-i): BED format

In addition to the three required BED fields, please prepare your files as follows (5 required fields, tab delimited; the 6th column is reserved for sample names, do not put other information there):
chromosome, start position, end position, reference allele, alternative allele, sample id, rsid.
       Chromosome - name of the chromosome (e.g. chr3, chrX)
       Start position - start coordinates of variants. (0-based)
       End position - end coordinates of variants. (end exclusive)
               e.g., chr1   0     100  spanning bases numbered 0-99
       Reference allele - germline allele of variants
       Alternative allele - mutated allele of variants
       Sample id - the sample id, specifying the input sample or cell line (e.g. "Patient-1", "GM12878")
       RSID -  the id for the variant (e.g. rs9347341)


       chr2	242382863	242382864	C	T	Patient-1	rs3771570
       chr2	242382863	242382864	C	T	Patient-2	rs3771570
       chr6	117210051	117210052	T	C	Patient-3	rs339331
       …	…	…	…	…	…	…
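The layout described above can be checked before a run. The following is a minimal sketch, assuming a 7-column, tab-delimited file as shown in the usage line; `check_snv_bed` is a hypothetical helper and is not part of GRAM. It also catches stray spaces around a coordinate, since such a field is no longer a plain integer.

```shell
#!/bin/sh
# Report lines of a SNV BED file that break the layout described above:
# 7 tab-delimited fields, integer coordinates, and start < end
# (0-based start, exclusive end). Prints nothing when the file is clean.
check_snv_bed() {
    awk -F '\t' '
        NF != 7 {
            print "line " NR ": expected 7 fields, found " NF
            next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            print "line " NR ": non-integer coordinate"
            next
        }
        ($2 + 0) >= ($3 + 0) {
            print "line " NR ": start must be smaller than end"
        }
    ' "$1"
}
```

Usage: `check_snv_bed variants.bed`, where `variants.bed` is a placeholder file name.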

  • User input expression matrix (-e): The gene expression file should be prepared as a tab-delimited matrix in which the first column stores gene names (use gene symbols) and the first row stores sample names. All other fields are gene expression values, either in RPKM or as raw read counts.


       Gene	Sample1	Sample2	Sample3	Sample4	…
       A1BG	1	5	40	0	…
       A1CF	20	9	0	23	…
       …	…	…	…	…	…
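Because sample ids in the expression matrix need to match those in the SNV file, a quick consistency check may save a failed run. The following is a sketch under the assumption that both files are tab delimited as described; `unmatched_samples` is a hypothetical helper name.

```shell
#!/bin/sh
# List sample ids that occur in column 6 of the SNV BED file but are absent
# from the header row of the expression matrix. Each id is printed once.
# Usage: unmatched_samples <snv.bed> <expression_matrix>
unmatched_samples() {
    bed=$1
    matrix=$2
    # Only the first row of the matrix is needed, so avoid reading the rest.
    header=$(head -n 1 "$matrix")
    awk -F '\t' -v header="$header" '
        BEGIN {
            n = split(header, h, "\t")
            # Column 1 of the header labels the gene column, so start at 2.
            for (i = 2; i <= n; i++) known[h[i]] = 1
        }
        !($6 in known) && !seen[$6]++ { print $6 }
    ' "$bed"
}
```

An empty result means every sample in the SNV file has a column in the matrix.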

F. Output files

Five output files will be generated:

       Output.format - detailed results for all samples
       Output.indel.format - results for indels
       Recur.Summary - the recurrence result when there are multiple samples
       Candidates.Summary - brief output of potential candidates: coding nonsynonymous/premature-stop variants, non-coding variants with a high score (>= 5 for the un-weighted scoring scheme and >= 1.5 for the weighted scoring scheme), and variants in or associated with known cancer genes
       Error.log - error information

For the un-weighted scoring scheme, each feature is given the value 1.

When provided with gene expression files, two additional files will be produced: ‘DE.gene.txt’ stores differentially expressed genes and ‘DE.pdf’ is the differential gene expression plot.

  • Sample GRAM output
       chr6 29897016 29897017 C A Patient-1 6:29897017:C:A_rs7767188 0.58  0.494 -0.500289915458444  0.42  0.439 0.415735691763279
       chr6 29897016 29897017 C A Patient-2 6:29897017:C:A_rs7767188 0.58  0.494 -0.500289915458444  0.377 0.439 0.337216556180445
       chr6 29915765 29915766 C A Patient-3 6:29915766:C:A_rs7767188 0.296 0.286 -0.0699306742423629 0.351 0.666 0.325006854130755
       chr6 29915765 29915766 C A Patient-4 6:29915766:C:A_rs7767188 0.296 0.286 -0.0699306742423629 0.35  0.666 0.323297953713459


       1: (chr) name of the chromosome
       2: (st) start coordinates of variants. (0-based)
       3: (ed) end coordinates of variants. (end exclusive)
       4: (ref) reference allele of variants
       5: (mut) mutant allele of variants
       6: (sampleid) the ID of the sample
       7: (snpid) the ID of the SNV
       8: (ref.enhAct) general regulatory activity of the reference allele
       9: (alt.enhAct) general regulatory activity of the mutant allele
       10: (logodds) predicted cell type modifier score
       11: (rank.las)
       12: (db.las)
       13: (score) predicted GRAM score
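
The column layout above makes it straightforward to rank variants by their GRAM score. The following is a minimal sketch, assuming whitespace-delimited output with the 13 columns listed; `rank_by_score` is a hypothetical helper and the choice of columns to print is only an illustration.

```shell
#!/bin/sh
# Print score (column 13), sample id (column 6) and SNV id (column 7),
# tab separated, sorted by score from highest to lowest.
rank_by_score() {
    awk '{ print $13 "\t" $6 "\t" $7 }' "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr
}
```

Usage: `rank_by_score gram_output.txt | head`, where `gram_output.txt` is a placeholder for the GRAM output file.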

G. Building data context

We offer a flexible framework for users to incorporate their own data into the data context. All the data files used in the current data context can be replaced with user-specific data. A detailed description follows. Scripts can be found under ‘Downloads’ of the web server.

  • Define novel sensitive/ultra-sensitive regions

We provide scripts for users to define novel conserved regions in human populations. The algorithm is described in (Khurana, et al., 2013). To define sensitive/ultra-sensitive regions, users need to prepare category files in BED format. Each BED file contains the region coordinates of one category. For example, the BED file for the category ‘GATA1 binding sites’ has all the binding coordinates of transcription factor GATA1. The scripts identify categories under strong human-specific negative selection and define those categories as sensitive/ultra-sensitive regions based on the selection pressure. We use enrichment of rare variants as the criterion to measure negative selection constraints.

‘’: we provide this script for users to split categories into proximal or distal subsets. The proximal or distal subsets can be used as new categories.

Scripts used to identify sensitive/ultra-sensitive regions entirely from scratch: ‘’ and ‘1.2.FDR.r’. ‘’ uses a GSC (genome structure correction)-like method to generate null distributions for the enrichment of rare variants in each category. ‘1.2.FDR.r’ calculates the FDR for the randomization. This script can also be used to select significant categories based on a user-selected FDR.

Scripts used to identify novel sensitive/ultra-sensitive regions, in addition to those defined in (Khurana, et al., 2013): ‘’. This script is only applicable to a small number of categories (about 5).

Note: please prepare your polymorphisms file with only non-coding variants.

  • Process GENCODE GTF file

We provide ‘’ to process the GENCODE GTF file and obtain the files needed for the data context. The script generates ‘promoter’, ‘cds’, ‘intron’ and ‘UTR’ region files, which are used by the variant prioritization step. The ‘cds’ file can also be used to filter polymorphisms to obtain non-coding variants. Please put all the generated GENCODE files under ‘data/gencode’. GENCODE version 16 is used in the current data context.
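
To illustrate the kind of conversion involved, the sketch below extracts CDS features from a GTF file as BED intervals. It is not the provided script, only a minimal example of the coordinate handling; `gtf_cds_to_bed` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert the CDS features of a GTF file to 3-column BED intervals.
# GTF coordinates are 1-based with an inclusive end; BED is 0-based with an
# exclusive end, so only the start is shifted by one.
gtf_cds_to_bed() {
    awk -F '\t' '
        BEGIN { OFS = "\t" }
        !/^#/ && $3 == "CDS" { print $1, ($4 - 1), $5 }
    ' "$1"
}
```

A resulting file could then be used to drop coding variants, for example with `bedtools intersect -v -a polymorphisms.bed -b cds.bed`, where both file names are placeholders.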

  • Add new networks

The network data are stored under the ‘data/networks’ folder. The tool automatically reads all the files in the folder and uses the first field of each file name, separated by ‘.’, as the network name. For example, the ‘’ file will be used as network ‘PPI’. To add new networks, simply put the network files into this folder and use the first field to denote the network name.
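
The naming rule can be previewed from the shell. The following is a small sketch; `network_name` is a hypothetical helper and `MYNET.example.txt` is a made-up file name used only for illustration.

```shell
#!/bin/sh
# Derive the name the tool would assign to a file: the first '.'-separated
# field of the file name, with any leading directories removed.
network_name() {
    basename "$1" | cut -d . -f 1
}

# Example: network_name data/networks/MYNET.example.txt   prints MYNET
```

The same rule is described for the gene list, cancer recurrence and user annotation folders.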

The files under the folder have two columns, ‘gene name’ and ‘centrality’. We provide ‘’ for users to generate these files (either degree or betweenness centrality) from tab-delimited network files. Tab-delimited network files are two-column files listing the interacting genes (each row holds ‘gene A’ and ‘gene B’).
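
As an illustration of the expected input and output, the sketch below computes degree centrality from a two-column edge list. It is not the provided script; `degree_centrality` is a hypothetical helper, and treating the network as undirected, counting duplicate edges once and ignoring self-loops are assumptions made here. Betweenness centrality is not covered.

```shell
#!/bin/sh
# Degree centrality from a tab-delimited edge list ("gene A", "gene B").
# Output: gene name and degree, tab separated, highest degree first.
degree_centrality() {
    awk -F '\t' '
        BEGIN { OFS = "\t" }
        NF >= 2 && $1 != $2 {
            # Store each undirected edge once, whichever way it was written.
            key = ($1 "" < $2 "") ? ($1 SUBSEP $2) : ($2 SUBSEP $1)
            if (!(key in edge)) {
                edge[key] = 1
                degree[$1]++
                degree[$2]++
            }
        }
        END { for (gene in degree) print gene, degree[gene] }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Usage: `degree_centrality network.txt > data/networks/MYNET.degree.txt`, where both file names are placeholders.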

  • Identify potential target genes of regulatory elements

We have packaged the scripts and the current REMC data so that users can define novel associations. Scripts can be found under ‘Downloads’ of the web server. The scripts are written in C/C++. Please note that the data files are large (about 40G).

  • Add new gene lists to annotate variants

The procedure is similar to ‘Add new networks’. Users can simply put new files under the ‘data/gene_lists’ folder and use the first field separated by ‘.’ as the gene list name.

  • Add recurrent data for new cancer types

This is similar to ‘Add new networks’. Please put files under ‘data/cancer_recurrence’ and use the first field as the cancer type name. This file can be produced by running FunSeq2 (file ‘Recur.Summary’ produced by the tool) on cancer samples of a particular type.

  • Add user-specific annotation sets, such as epigenetic modifications

Please put files under the directory ‘data/user_annotations’ or under a specific directory given with the option (-ua). The first field separated by ‘.’ will be used as the annotation name. Please prepare your files in BED format and use the 4th column for additional information, if needed. We have placed repeat regions obtained from UCSC there as an example.

  • All other files can be replaced with user-specific data. Please refer to the files under ‘data/’ to format them correctly.


shaoke DOT lou AT yale DOT edu
