GRAM

From GersteinInfo

(Difference between revisions)
Jump to: navigation, search
(D. Tool Usage)
(E. Pipeline)
 
(23 intermediate revisions not shown)
Line 1: Line 1:
-
_TOC__
 
-
 
=Variant Prioritization=
=Variant Prioritization=
Line 9: Line 7:
* [http://code.google.com/p/bedtools/downloads/list bedtools] (version bedtools-2.17.0) <br>
* [http://code.google.com/p/bedtools/downloads/list bedtools] (version bedtools-2.17.0) <br>
* [http://sourceforge.net/projects/samtools/files/tabix/ tabix] (version tabix-0.2.6 and up) <br>
* [http://sourceforge.net/projects/samtools/files/tabix/ tabix] (version tabix-0.2.6 and up) <br>
-
* [http://www.r-project.org/ R] <br>
+
* [http://www.r-project.org/ R] (require packages: andomForest, glmnet, reshape2, gplots) <br>
==B. Tool Download==
==B. Tool Download==
This is a  Linux/UNIX-based tool. At the command-line prompt, type the following.  
This is a  Linux/UNIX-based tool. At the command-line prompt, type the following.  
-
  $ tar xvf gram.tar
+
  $ git clone https://github.com/gersteinlab/gram.git
-
==C. Pre-built Data Context==
+
==C. Configuration==
-
All of the data can be downloaded under ‘Downloads’ in the web server.  
+
The pipeline grammar.sh should be configured prior to the first use. Please fill in the value of the below variables as instructed:
 +
 
 +
        genome: the path for genomic sequences (e.g. hg19.fa), the chromosome name is with "chr" prefix. <br>
 +
        You can download from UCSC ftp: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz <br><br>
 +
        gencode: the path for gencode annotation of cds regions (e.g. gencode.v19.cds.bed is available from GRAM github: https://github.com/gersteinlab/gram) <br><br>
 +
        dpath: the path for DeepBind model, there is a copy from GRAM github:https://github.com/gersteinlab/gram <br><br>
 +
        path_funseq: the path for FunSeq whole genome score file, which can be downloaded from funseq3.gersteinlab.org
==D. Tool Usage==
==D. Tool Usage==
Line 24: Line 28:
To display the usage of tool, type ‘./grammar.sh -h’. <br>
To display the usage of tool, type ‘./grammar.sh -h’. <br>
-
  * Usage : ./grammar.sh -i bed -e exp -d dpath -f fpath -o path
+
  * Usage : ./grammar -i bed -e exp  
-
         Options :
+
         Options :<br>
-
                -i [Required] User Input SNVs file (BED format: chr st ed ref mut sample-id rsid)
+
        -i [Required] User Input SNVs file (BED format: chr st ed ref mut sample-id rsid)
-
                -e [Required] User Input gene expression matrix
+
        -e [Required] User Input gene expression matrix
-
                -d [Required] The path for DeepBind model
+
-
                -f [Required] The path for FunSeq
+
-
                -o [Required] Output path
+
              
              
-
               
+
        NOTE: Please make sure you have sufficient memory, at least 3G.
-
                NOTE: Please make sure you have sufficient memory, at least 3G.
+
-i : Required format: chr st ed ref mut sample-id rsid<br>
-i : Required format: chr st ed ref mut sample-id rsid<br>
-e: The rows correspond to genes and columns correspond to samples. Sample ids need to match with those in the variant bed file. <br>
-e: The rows correspond to genes and columns correspond to samples. Sample ids need to match with those in the variant bed file. <br>
-
==E. Input files==
+
==E. Pipeline==
-
* User input file (-f): could be either BED or VCF format. For indels, please use “-” instead of other symbols in ‘allele’ columns for insertions or deletions. Indels will be analyzed for BED format.
+
 
 +
The pipeline:
 +
[[File::http://funseq3.gersteinlab.org/GRAMMAR.pipeline.png]]
 +
 
 +
The pipeline GRAMMAR runs as follows:
 +
* Filter non-coding SNPs based on provided CDS annotation
 +
* Extract sequences from the provided genome
 +
* Run DeepBind on the extracted sequences
 +
* Run GRAM with expression data and DeepBind output
 +
* Extract Funseq scores of the variants
 +
* Print and visualize the results
 +
 
 +
 
 +
Sample report:
 +
  GRAMMAR: parameter parsing done
 +
  GRAMMAR: input variant bed file: snptest.bed
 +
  GRAMMAR: input gene expression file: gexpr.test.bed
 +
  GRAMMAR: checking dependency done
 +
  GRAMMAR: checking genome fasta and gencode annotation configs done
 +
  --------------------------------------------------
 +
  GRAMMAR: Files checking done
 +
  ==================================================
 +
  GRAMMAR: Filtering non-coding SNVs only. InDels will also be removed.
 +
  --------------------------------------------------
 +
  GRAMMAR: SNP filtering done
 +
  ==================================================
 +
  GRAMMAR: Running Deepbind on the selected genomic regions..
 +
  GRAMMAR: Preparing input sequences for DeepBind score calculation. Please make sure you have added a correct DeepBind path and also put the parameter file db/params in the correct folder.
 +
  GRAMMAR: Running Deepbind now. It may take long time if your SNV input list is very long
 +
  GRAMMAR: cd gram/deepbind
 +
  GRAMMAR: cd grammar.test
 +
  GRAMMAR: checking DeepBind output...
 +
  --------------------------------------------------
 +
  GRAMMAR: Deepbind score prediction done
 +
  ==================================================
 +
  Step 2: Running GRAM predictor.
 +
  GRAMMAR: cd gram
 +
  GRAMMAR: Rscript gram/gram.predict.r grammar.test ref.db.out mut.db.out nc.var.bed snp.4deepbind.bed gexpr.test.bed GRAMMAR/model.rdata snptest.bed.out/gram.score.txt<br>
 +
  [1] 20242  102
 +
  [1] "After filting, only 102 of 199 sample left for the prediction"<br>
 +
  GRAMMAR: Prediction results have been saved in snptest.bed.out<br>
 +
  --------------------------------------------------
 +
  GRAMMAR: the prediction of GRAM score  done
 +
  ==================================================
 +
  GRAMMAR: Visualizing results.
 +
  GRAMMAR: Getting Funseq score for the variants.<br>
 +
  --------------------------------------------------
 +
  GRAMMAR: Results visualization done
 +
  ==================================================
 +
  GRAMMAR: your job is done. please go to snptest.bed.out to find your results.
 +
  ==================================================
 +
 
 +
==F. Input files==
 +
* User input SNV file (-i): BED format  
   
   
-
BED format. In addition to the three required BED fields, please prepare your files as following (5 required fields, tab delimited;  
+
In addition to the three required BED fields, please prepare your files as following (5 required fields, tab delimited);  
  the 6th column is reserved for sample names, do not put other information there):  
  the 6th column is reserved for sample names, do not put other information there):  
-
  chromosome, start position, end position, reference allele, and alternative allele.
+
  chromosome, start position, end position, reference allele, alternative allele, sample id, rsid.
         Chromosome - name of the chromosome (e.g. chr3, chrX)
         Chromosome - name of the chromosome (e.g. chr3, chrX)
         Start position - start coordinates of variants. (0-based)
         Start position - start coordinates of variants. (0-based)
Line 50: Line 103:
         Reference allele - germlime allele of variants
         Reference allele - germlime allele of variants
         Alternative allele - mutated allele of variants
         Alternative allele - mutated allele of variants
-
          
+
         Sample id - the sample id, specifying the input sample or cell line (e.g. "Patient-1", "GM12878")
-
VCF format. The header line names the 8 fixed, mandatory columns. These columns are as follows (tab-delimited): 

+
         RSID - the id for the variant (e.g. rs9347341)
-
#CHROM POS ID REF ALT  QUAL FILTER INFO
+
-
          
+
-
  Recurrent analysis input format
+
-
        Option 1: separated files for each genome (BED or VCF). Use “-f file1, file2, file3” separated by comma.
+
-
        Option 2: put all variants in one file (only for BED format, use the 6th column labeling sample names). Use “-f file”.
+
-
* Gene list format (-g): If you are interested in particular set of genes, you can put your genes in one file (one gene per row) and use “-g file” to only analyze variants in or associated with those genes. Please use gene symbols.
+
e.g.
 +
        chr2 242382863 242382864 C T Patient-1 rs3771570
 +
        chr2 242382863 242382864 C T Patient-2 rs3771570
 +
        chr6 117210051 117210052 T C Patient-3 rs339331
 +
        … … … … … … …
-
* Gene expression format (-exp): Users can also upload gene expression file for the program to detect differentially expressed genes between cancer and benign samples and highlight variants associated with these genes. The gene expression file should be prepared as a matrix with first column stores gene names (use gene symbols) and first row as sample names. Other fields are gene expression data either in RPKM or raw read counts format. Tab delimited.  
+
 
-
+
* User input expression matrix (-e)
 +
The gene expression file should be prepared as a matrix with first column stores gene names (use gene symbols) and first row as sample names. Other fields are gene expression data either in RPKM or raw read counts format. Tab delimited.  
 +
 
 +
e.g.
         Gene Sample1 Sample2 Sample3 Sample4 …
         Gene Sample1 Sample2 Sample3 Sample4 …
         A1BG 1 5 40 0 …
         A1BG 1 5 40 0 …
Line 67: Line 122:
         … … … … … …
         … … … … … …
-
* Sample class format (-cls): In addition to the expression file, users need to upload a file with samples annotated as “cancer” or “benign” (only two classes “cancer” or “benign”). The number of samples in this file should be equal to that in expression data. And sample names should match.
+
==G. Output files==
-
+
-
        Sample1 benign
+
-
        Sample2 cancer
+
-
        Sample3 cancer
+
-
        Sample4 benign
+
-
        … …
+
-
 
+
-
==F. Output files==
+
Five output files will be generated: ‘Output.format’, ‘Output.indel.format’, ‘Recur.Summary’, ‘Candidates.Summary’ and ‘Error.log’. Output.format: stores detailed results for all samples; Output.indel.format: contains results for indels; Recur.Summary: the recurrence result when having multiple samples; Candidates.Summary: brief output of potential candidates (coding nonsynonymous/prematurestop variants, non-coding variants with score (>= 5 of un-weighted scoring scheme and >=1.5 for weighted scoring scheme) and variants in or associated with known cancer genes); Error.log: error information. For un-weighted scoring scheme, each feature is given value 1.  
Five output files will be generated: ‘Output.format’, ‘Output.indel.format’, ‘Recur.Summary’, ‘Candidates.Summary’ and ‘Error.log’. Output.format: stores detailed results for all samples; Output.indel.format: contains results for indels; Recur.Summary: the recurrence result when having multiple samples; Candidates.Summary: brief output of potential candidates (coding nonsynonymous/prematurestop variants, non-coding variants with score (>= 5 of un-weighted scoring scheme and >=1.5 for weighted scoring scheme) and variants in or associated with known cancer genes); Error.log: error information. For un-weighted scoring scheme, each feature is given value 1.  
When provided with gene expression files, two additional files will be produced – ‘DE.gene.txt’ stores differentially expressed genes and ‘DE.pdf ’is the differential gene expression plot.  
When provided with gene expression files, two additional files will be produced – ‘DE.gene.txt’ stores differentially expressed genes and ‘DE.pdf ’is the differential gene expression plot.  
-
* Sample BED format output
+
* Sample GRAM output
-
Header:
+
-
chr    start  end    ref    alt    sample  gerp;cds;variant.annotation.cds;network.hub;gene.under.negative.selection;
+
-
ENCODE.annotated;hot.region;motif.analysis;sensitive;ultra.sensitive;ultra.conserved;target.gene[known_cancer_gene/
+
-
TF_regulating_known_cancer_gene,differential_expressed_in_cancer,actionable_gene];coding.score;noncoding.score;
+
-
recurrence.within.samples;recurrence.database
+
-
+
-
Coding variant:
+
-
chr1    36205041        36205042        C      A      PR2832  5.6;Yes;VA=1:CLSPN:ENSG00000092853.9:-:prematureStop:
+
-
4/4:CLSPN-001:ENST00000251195.5:3999_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4020_3232_1078_E->*:
+
-
CLSPN-003:ENST00000373220.3:3828_3040_1014_E->*:CLSPN-004:ENST00000520551.1:3861_3073_1025_E-
+
-
>*;PPI;Yes;.;.;.;.;.;.;CLSPN;5;.;.;.
+
-
+
-
Non-coding variant:
+
-
chr6    152304995      152304996      A      G      PR2832  2.63;No;.;ESR1:PHOS(0.276)PPI(0.995)REG(0.994);.;.;.;.;.;.;.;
+
-
ESR1(Intron)[TF_regulating_known_cancer_gene:H3F3A,MN1,PRCC,RARA,SLC34A2,TPM3][actionable];.;1.60983633568013;.;.
+
-
 
+
-
* Sample VCF format output
+
-
Header:
+
-
##fileformat=VCFv4.0
+
-
##INFO=<ID=OTHER,Number=.,Type=String, Description = "Other Information From Original File">
+
-
##INFO=<ID=SAMPLE,Number=.,Type=String,Description="Sample id">
+
-
##INFO=<ID=CDS,Number=.,Type=String,Description="Coding Variants or not">
+
-
##INFO=<ID=VA,Number=.,Type=String,Description="Coding Variant Annotation">
+
-
##INFO=<ID=HUB,Number=.,Type=String,Description="Network Hubs, PPI (protein protein interaction network), REG (regulatory network), 
+
-
PHOS (phosphorylation network)...">
+
-
##INFO=<ID=GNEG,Number=.,Type=String,Description="Gene Under Negative Selection">
+
-
##INFO=<ID=GERP,Number=.,Type=String,Description="Gerp Score">
+
-
##INFO=<ID=NCENC,Number=.,Type=String,Description="NonCoding ENCODE Annotation">
+
-
##INFO=<ID=HOT,Number=.,Type=String,Description="Highly Occupied Target Region">
+
-
##INFO=<ID=MOTIFBR,Number=.,Type=String,Description="Motif Breaking">
+
-
##INFO=<ID=MOTIFG,Number=.,Type=String,Description="Motif Gain">
+
-
##INFO=<ID=SEN,Number=.,Type=String,Description="In Sensitive Region">
+
-
##INFO=<ID=USEN,Number=.,Type=String,Description="In Ultra-Sensitive Region">
+
-
##INFO=<ID=UCONS,Number=.,Type=String,Description="In Ultra-Conserved Region">
+
-
##INFO=<ID=GENE,Number=.,Type=String,Description="Target Gene (For coding - directly affected genes ; For non-coding - promoter or 
+
-
distal regulatory module)">
+
-
##INFO=<ID=CANG,Number=.,Type=String,Description="Prior Gene Information, e.g.[cancer][TF_regulating_known_cancer_gene]
+
-
[up_regulated][actionable]...";
+
-
##INFO=<ID=CDSS,Number=.,Type=String,Description="Coding Score">
+
-
##INFO=<ID=NCDS,Number=.,Type=String,Description="NonCoding Score">
+
-
##INFO=<ID=RECUR,Number=.,Type=String,Description="Recurrent elements / variants">
+
-
##INFO=<ID=DBRECUR,Number=.,Type=String,Description="Recurrence database">
+
-
#CHROM  POS    ID      REF    ALT    QUAL    FILTER  INFO
+
-
+
-
Coding variant:
+
-
chr1    36205042        .      C      A      .      .      SAMPLE=PR2832;GERP=5.6;CDS=Yes;VA=1:CLSPN:ENSG00000092853.9:-
+
-
:prematureStop:4/4:CLSPN-001:ENST00000251195.5:3999_3232_1078_E->*:CLSPN-005:ENST00000318121.3:4020_3232_1078_E-
+
-
>*:CLSPN-003:ENST00000373220.3:3828_3040_1014_E->*:CLSPN-004:ENST00000520551.1:3861_3073_1025_E-
+
-
>*;HUB=PPI;GNEG=Yes;GENE=CLSPN;CDSS=5
+
-
+
-
Non-coding variant:
+
-
chr6    152304996      .      A      G      .      .      SAMPLE=PR2832;GERP=2.63;CDS=No;HUB=ESR1:PHOS(0.276)PPI(0.995)REG(0.994);
+
-
GENE=ESR1(Intron);CANG=ESR1[TF_regulating_known_cancer_gene:H3F3A,MN1,PRCC,RARA,SLC34A2,TPM3]
+
-
[actionable];NCDS=1.60983633568013
+
-
 
+
-
* Output description (VCF format as an example)
+
-
VA (variants annotation)
+
-
      This is the output produced from VAT (variant annotation tool) for coding variations.
+
-
      Please refer to ‘http://vat.gersteinlab.org’ for documentations.
+
-
+
-
NCENC (Non-coding ENCODE annotation)
+
-
      Example: ‘NCENC=TFP(CEBPB|chr5:139639150-139639496),TFP(STAT3|chr5:139638936-139640136),TFP(STAT3|chr5:139638976-
+
-
      139639553),TFP(STAT3|chr5:139638989-139639544),TFP(STAT3|chr5:139638999-139639716)’
+
-
      This is formatted as “category(element_name|chromosome:coordinates)” (0-based, end exclusive).
+
-
              TFP - transcription factor binding peak.
+
-
              TFM - transcription factor bound motifs in peak regions.
+
-
              DHS - DNase1 hypersensitive sites, with number of cell lines (MCV, total 125 cell lines). For cell-line info, please refer to [http://archive.gersteinlab.org/yaofu/DHS/  DHS cell lines]
+
-
              ncRNA - non-coding RNA 

+
-
              Pseudogene
+
-
              Enhancer - chmm/segway (genome segmentation), drm (distal regulatory module) 

+
-
+
-
HOT (transcription factor highly occupied region)
+
-
      Example: ‘HOT=Helas3’
+
-
      If a variant occurs in HOT regions, the corresponding cell lines (5 in total) are shown. This annotation is from (Yip, et al., 2012).
+
-
+
-
MOTIFBR (motif-breaking analysis)
+
-
      SNV Example: ‘MOTIFBR=MAX#Myc_known9_8mer#102248644#102248656#-#9#0.068966#0.931034’
+
-
      The variant causes a motif-breaking event. This field is a hash tag delimited, defined as follows: 
TF name # motif name # motif start #
+
-
      motif end # motif strand # mutation position # alternative allele frequency in PWM # reference allele frequency in PWM
. (0-based, end exclusive)   
+
-
     
+
-
      Indel Example: ‘MOTIFBR=TCF12#TCF12_disc5_8mer#115719379#115719390#+’
+
-
      This field is a hash tag delimited, defined as follows: 
TF name # motif name # motif start # motif end # motif strand. (0-based, end exclusive)
+
-
 
+
-
MOTIFG (motif-gaining analysis)
+
-
      SNV Example: ‘MOTIFG=GATA_known5#75658824#75658829#-#1#4.839#4.181’

+
-
      The variant causes a motif-gaining event. Hash tag delimited: motif name # motif start # motif end # motif strand # mutation position
+
-
      # sequence score with alternative allele # sequence score with reference allele. (0-based, end exclusive)
+
-
     
+
-
      Indel example: ‘MOTIFG=Ets_known10#CGGAAA#6#+#5.743’

+
-
      Hash tag delimited: motif name # motif sequence discovered # motif length # motif strand # sequence score with alternative allele.
+
-
+
-
GENE (target gene - for coding: directly affected genes; for non-coding: promoter or distal regulatory module)
+
-
      Example: ‘GENE=ARNT2(Enhancer),C15orf26(Intron),IL16(Enhancer)’
+
-
      For noncoding variants, ‘intron’, ‘promoter’, ‘UTR’, ‘Distal’ and ‘Medial’ tags are annotated.
+
-
      For ‘Distal’ and ‘Medial’ tags, the corresponding association score (with histone modifications) is also shown.
+
-
      ‘Distal’ means that the regulatory element is >10kb away from TSS, whereas ‘Medial’ means within 10kb.
+
-
+
-
CANG (cancer related information)
+
-
      Example: ‘CANG=EGFR[actionable][cancer]’
+
-
      This field stores all the gene related information. Currently there are five possible tags:
+
-
              [cancer]: the gene have been annotated as an cancer gene.
+
-
              [TF_regulating_known_cancer_gene]: the gene is a transcription factor regulating known cancer genes. The regulated cancer genes are also shown.
+
-
              [actionable]: the gene is potentially actionable (“druggable”).
+
-
              [up_regulated]: the gene is up-regulated in cancers, when providing RNA-Seq gene expression data.
+
-
              [down_regulated]: the gene is down-regulated in cancers, when providing RNA-Seq gene expression data. 
+
-
      When users provide new gene lists, tags about these gene lists will be shown in this field.
+
-
+
-
USER_ANNO (user annotations)
+
-
      Example: ‘USER_ANNO=REPEAT(FLAM_A|chr1:100544744-100544854)’
+
-
      This field stores all user provided annotations.
+
-
+
-
RECUR (recurrent genes, regulatory elements and mutations within samples)
+
-
      Example: ‘RECUR=Pseudogene(ENST00000467115.1|chr1:568914-569121):PR1783(chr1:568941,chr1:569004*),PR2832(chr1:569004*)’
+
-
      When analyzing multiple genomes, if genes or regulatory elements are shown in >= 2 samples, they are annotated as ‘gene/regulatory
+
-
      element name: recurrent samples (variants in corresponding samples (position is 1-based))’.  If it is a same site mutation, ‘*’ is tagged.
+
-
+
-
DBRECUR (Recurrence databse)
+
-
      Example: ‘DBRECUR=Enhancer(chmm/segway|chr15:22517400-22521103):Lung_Adeno(Altered in 4/24(16.67%) samples.)|
+
-
      Prostate(Altered in 2/64(3.12%) samples.),Enhancer(drm|chr15:22517700-22521100):Lung_Adeno(Altered in 4/24(16.67%) samples.)|
+
-
      Prostate(Altered in 2/64(3.12%) samples.)’
+
-
      If genes, regulatory elements or mutations are observed in the recurrence database (currently including 570 samples of 10 cancer
+
-
      types and COSMIC), the recurrence information is shown here. ‘recurrent element(name|coordinates):cancer type(recurrence information in this
+
-
      cancer type)’. Recurrence information is separated by ‘,’.
+
-
 
+
-
=Docker Image Usage=
+
-
Please download Docker at https://www.docker.com/
+
-
* Export docker image to your computer:
+
-
$ docker load -i funseq2-docker-image.tar
+
-
* Run docker container
+
-
$ ./funseq2-docker.sh
+
-
 
+
-
=Building data context=
+
-
We offer a flexible framework for users to incorporate their own data into the data context. All the data files used in current data context can be replaced with user-specific data. Below is the detailed description. Scripts can be found under ‘Downloads’ of the web server.
+
-
 
+
-
* Define novel sensitive/ultra-sensitive regions
+
-
We provide scripts for users to define novel conserved regions in human populations. The algorithm is described in (Khurana, et al., 2013). To define sensitive/ultra-sensitive regions, users need to prepare category files in BED format. The BED files contain the region coordinates under particular categories. For example, the BED file for category - ‘GATA1 binding sites’ – has all the binding coordinates of transcription factor GATA1. Scripts will identify categories under strong human-specific negative selection and define those categories as sensitive/ultra-sensitive regions based on the selection pressure. We use the criteria – enrichment of rare variants – to measure negative selection constraints. 
+
-
 
+
-
‘0.define.proximal.distal.regions.pl’ . We provide this script for users to split categories into proximal or distal subsets. The proximal or distal subsets can be used as new categories.
+
-
 
+
-
Scripts used to identify sensitive/ultra-sensitive regions totally from scratch –‘1.Randomization.pl’ and ‘1.2.FDR.r’. ‘1.Randomization.pl’ uses GSC (genome structure correction) like method to generate null distributions for enrichment of rare variants for each category.  ‘1.2.FDR.r’ calculates FDR for the randomization. This script can also be used to generate significant categories based on user-selected FDR.
+
-
 
+
-
Scripts used to identify novel sensitive/ultra-sensitive regions, in addition to those defined in (Khurana, et al., 2013) – ‘2.sensitive.regions.delta.increment.pl’. This script is only applicable to small number of categories ~ 5.
+
-
 
+
-
Note: please prepare your polymorphisms file with only non-coding variants.
+
-
 
+
-
* Process GENCODE GTF file
+
-
We provide ‘3.gencode.process.pl’ to process GENCODE GTF file to obtain necessary files for data context. The script will generate ‘promoter’, ‘cds’, ‘intron’ and ‘UTR’ region files, which are used by the variant prioritization step. The ‘cds’ file could also be used to filter polymorphisms to obtain non-coding variants. Please put all the generated GENCODE files under ‘data/gencode’. GENCODE version 16 is used in the current data context.
+
-
 
+
-
* Add new networks
+
-
The networks data used are under ‘data/networks’ folder. The tool will automatically read all the files in the folder and use the first field separated by ‘.’ as the network name. For example, ‘PPI.degree’ file will be used as network ‘PPI’. So to add new networks, simply put the network files into this folder and use the first field to denote the network name.
+
-
 
+
-
The files under the folder have two columns, ‘gene name’ and ‘centrality’. We provide ‘4.network.analysis.r’ for users to generate these files (either degree or betweenness centrality) from tab-delimited network files. Tab-delimited network files are two-columns files showing the interacting genes (for each row, ‘gene A’ ‘gene B’).
+
-
 
+
-
* Identify potential target genes of regulatory elements
+
-
We pack the scripts and current REMC data for users to define novel associations. Scripts can be found under ‘Downloads’ of the web server. The scripts are written in C/C++. Please note that the data files are huge ~ 40G.
+
-
 
+
-
* Add new gene lists to annotate variants
+
-
The procedure is similar to ‘Add new networks’. Users can just put new files under ‘data/gene_lists’ folder and use the first field separated by ‘.’ as the gene list name.
+
-
 
+
-
* Add recurrent data for new cancer types
+
-
This is similar to ‘Add new networks’. Please put files under ‘data/cancer_recurrence’ and use the first field as the cancer type name. This file can be produced by running FunSeq2 (file ‘Recur.Summary’ produced by the tool) on cancer samples of a particular type.
+
-
* Add user-specific annotation sets, such as epigenetic modifications
+
      chr2 242382863 242382864 C T Patient-1  rs3771570 0.554 0.452 -0.590695386674223  0.32  0.474 0.25859476288468
-
Please put files under directory ‘data/user_annotations’ or specific directory with option (-ua). The first field separated by ‘.’ will be used as annotation name. Please prepare your files in BED format and use the 4th column for additional information, if needed. We have placed repeat regions obtained from UCSC there as an example.
+
      chr2 242382863 242382864 C T Patient-2  rs3771570 0.554 0.452 -0.590695386674223  0.32  0.474 0.25859476288468
 +
      chr2 242382863 242382864 C T Patient-3  rs3771570 0.554 0.452 -0.590695386674223  0.418 0.474 0.428287551353096
-
* All of other files can be replaced with user-specific data. Please refer to the files under ‘data/’ to correctly format them.
+
Columns:
 +
        1: (chrome) name of the chromosome
 +
        2: (start) start coordinates of variants. (0-based)
 +
        3: (end) end coordinates of variants. (end exclusive)
 +
        4: (ref) reference allele of variants
 +
        5: (mut) mutant allele of variants
 +
        6: (sampleid) the ID of the sample
 +
        7: (snpid) the ID of the SNV
 +
        8: (ref.enhAct) general regulatory activity of the reference allele
 +
        9: (alt.enhAct) general regulatory activity of the mutant allele
 +
        10: (logodds) logodds calculated from reference and mutant allele regulatory activity
 +
        11: (expr.modifier) cell type modifier score predicted from TF expression
 +
        12: (binding.modifier) cell type modifier score predicted from TF binding
 +
        13: (gram.prob) predicted GRAM score
=Contact=
=Contact=
shaoke DOT lou AT yale DOT edu
shaoke DOT lou AT yale DOT edu

Latest revision as of 03:00, 2 May 2019

Contents

Variant Prioritization

A. Dependencies

The following tools are required:

  • sed, awk, grep
  • DeepBind (version deepbind-v0.11)
  • bedtools (version bedtools-2.17.0)
  • tabix (version tabix-0.2.6 and up)
  • R (require packages: andomForest, glmnet, reshape2, gplots)

B. Tool Download

This is a Linux/UNIX-based tool. At the command-line prompt, type the following.

$ git clone https://github.com/gersteinlab/gram.git

C. Configuration

The pipeline grammar.sh should be configured prior to the first use. Please fill in the value of the below variables as instructed:

       genome: the path for genomic sequences (e.g. hg19.fa), the chromosome name is with "chr" prefix. 
You can download from UCSC ftp: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

gencode: the path for gencode annotation of cds regions (e.g. gencode.v19.cds.bed is available from GRAM github: https://github.com/gersteinlab/gram)

dpath: the path for DeepBind model, there is a copy from GRAM github:https://github.com/gersteinlab/gram

path_funseq: the path for FunSeq whole genome score file, which can be downloaded from funseq3.gersteinlab.org

D. Tool Usage

$ cd gram/


To display the usage of tool, type ‘./grammar.sh -h’.

* Usage : ./grammar -i bed -e exp 
       Options :
-i [Required] User Input SNVs file (BED format: chr st ed ref mut sample-id rsid) -e [Required] User Input gene expression matrix NOTE: Please make sure you have sufficient memory, at least 3G.

-i : Required format: chr st ed ref mut sample-id rsid
-e: The rows correspond to genes and columns correspond to samples. Sample ids need to match with those in the variant bed file.

E. Pipeline

The pipeline: [[File::http://funseq3.gersteinlab.org/GRAMMAR.pipeline.png]]

The pipeline GRAMMAR runs as follows:

  • Filter non-coding SNPs based on provided CDS annotation
  • Extract sequences from the provided genome
  • Run DeepBind on the extracted sequences
  • Run GRAM with expression data and DeepBind output
  • Extract Funseq scores of the variants
  • Print and visualize the results


Sample report:

 GRAMMAR: parameter parsing done
 GRAMMAR: input variant bed file: snptest.bed
 GRAMMAR: input gene expression file: gexpr.test.bed
 GRAMMAR: checking dependency done
 GRAMMAR: checking genome fasta and gencode annotation configs done
 --------------------------------------------------
 GRAMMAR: Files checking done
 ==================================================
 GRAMMAR: Filtering non-coding SNVs only. InDels will also be removed.
 --------------------------------------------------
 GRAMMAR: SNP filtering done
 ==================================================
 GRAMMAR: Running Deepbind on the selected genomic regions..
 GRAMMAR: Preparing input sequences for DeepBind score calculation. Please make sure you have added a correct DeepBind path and also put the parameter file db/params in the correct folder.
 GRAMMAR: Running Deepbind now. It may take long time if your SNV input list is very long
 GRAMMAR: cd gram/deepbind
 GRAMMAR: cd grammar.test
 GRAMMAR: checking DeepBind output...
 --------------------------------------------------
 GRAMMAR: Deepbind score prediction done
 ==================================================
 Step 2: Running GRAM predictor.
 GRAMMAR: cd gram
 GRAMMAR: Rscript gram/gram.predict.r grammar.test ref.db.out mut.db.out nc.var.bed snp.4deepbind.bed gexpr.test.bed GRAMMAR/model.rdata snptest.bed.out/gram.score.txt
[1] 20242 102 [1] "After filting, only 102 of 199 sample left for the prediction"
GRAMMAR: Prediction results have been saved in snptest.bed.out
-------------------------------------------------- GRAMMAR: the prediction of GRAM score done ================================================== GRAMMAR: Visualizing results. GRAMMAR: Getting Funseq score for the variants.
-------------------------------------------------- GRAMMAR: Results visualization done ================================================== GRAMMAR: your job is done. please go to snptest.bed.out to find your results. ==================================================

F. Input files

  • User input SNV file (-i): BED format

In addition to the three required BED fields, please prepare your files as following (5 required fields, tab delimited);

the 6th column is reserved for sample names, do not put other information there): 
chromosome, start position, end position, reference allele, alternative allele, sample id, rsid.
       Chromosome - name of the chromosome (e.g. chr3, chrX)
       Start position - start coordinates of variants. (0-based)
       End position - end coordinates of variants. (end exclusive)
               e.g., chr1   0     100  spanning bases numbered 0-99
       Reference allele - germlime allele of variants
       Alternative allele - mutated allele of variants
       Sample id - the sample id, specifying the input sample or cell line (e.g. "Patient-1", "GM12878")
       RSID -  the id for the variant (e.g. rs9347341)

e.g.

       chr2	242382863	242382864	C	T	Patient-1	rs3771570
       chr2 	242382863	242382864	C	T	Patient-2	rs3771570
       chr6	117210051 	117210052	T	C	Patient-3	rs339331
       …	…		…		…	…	…		…


  • User input expression matrix (-e)

The gene expression file should be prepared as a matrix with first column stores gene names (use gene symbols) and first row as sample names. Other fields are gene expression data either in RPKM or raw read counts format. Tab delimited.

e.g.

       Gene	Sample1	Sample2	Sample3	Sample4	…
       A1BG	1	5	40	0	…
       A1CF	20	9	0	23	…
       …	…	…	…	…	…

G. Output files

Five output files will be generated: ‘Output.format’, ‘Output.indel.format’, ‘Recur.Summary’, ‘Candidates.Summary’ and ‘Error.log’. Output.format: stores detailed results for all samples; Output.indel.format: contains results for indels; Recur.Summary: the recurrence result when having multiple samples; Candidates.Summary: brief output of potential candidates (coding nonsynonymous/prematurestop variants, non-coding variants with score (>= 5 of un-weighted scoring scheme and >=1.5 for weighted scoring scheme) and variants in or associated with known cancer genes); Error.log: error information. For un-weighted scoring scheme, each feature is given value 1.

When provided with gene expression files, two additional files will be produced – ‘DE.gene.txt’ stores differentially expressed genes and ‘DE.pdf ’is the differential gene expression plot.

  • Sample GRAM output
      chr2 242382863 242382864 C T Patient-1  rs3771570 0.554 0.452 -0.590695386674223  0.32  0.474 0.25859476288468
      chr2 242382863 242382864 C T Patient-2  rs3771570 0.554 0.452 -0.590695386674223  0.32  0.474 0.25859476288468
      chr2 242382863 242382864 C T Patient-3  rs3771570 0.554 0.452 -0.590695386674223  0.418 0.474 0.428287551353096

Columns:

       1: (chrome) name of the chromosome
       2: (start) start coordinates of variants. (0-based)
       3: (end) end coordinates of variants. (end exclusive)
       4: (ref) reference allele of variants
       5: (mut) mutant allele of variants
       6: (sampleid) the ID of the sample
       7: (snpid) the ID of the SNV
       8: (ref.enhAct) general regulatory activity of the reference allele
       9: (alt.enhAct) general regulatory activity of the mutant allele
       10: (logodds) logodds calculated from reference and mutant allele regulatory activity
       11: (expr.modifier) cell type modifier score predicted from TF expression
       12: (binding.modifier) cell type modifier score predicted from TF binding
       13: (gram.prob) predicted GRAM score

Contact

shaoke DOT lou AT yale DOT edu

Personal tools