How to execute FusionSeq
From GersteinInfo
Revision as of 02:38, 21 August 2010
Here we provide an example of running a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and converted to MRF (data.mrf.gz). This data format is generated by RSEQtool, a suite for RNA-Seq data analysis, which includes conversion tools for the output of several alignment tools, including the SAM format. For data generated by the Genome Analyzer II and aligned with Eland, the program export2mrf provided with FusionSeq can be used to perform the conversion. All programs print a brief explanation of their usage when run without parameters. A typical analysis session may look like the following:
$ geneFusions data 4 < data.mrf > data.1.gfr 2> data.1.log
$ (gfrAbnormalInsertSizeFilter 0.01 < data.1.gfr | gfrPCRFilter 3 | gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal | gfrBlackListFilter blackList.txt | gfrLargeScaleHomologyFilter | gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log
The first command creates the initial list of candidate fusion transcripts. The parameter "data" is the prefix used to generate the IDs of the candidates, namely data_00001, data_00002, etc., and "4" is the minimum number of PE reads needed to call a candidate. The program reads the MRF data from standard input, saves the output to data.1.gfr, and writes logging information to data.1.log. This can be considered the most comprehensive list of fusion candidates.
The second command executes all the filters to generate a high-confidence list of candidates. The order of the filters affects only the computation time, with one exception: gfrAddInfo needs to be executed prior to gfrAnnotationConsistencyFilter and gfrBlackListFilter, because they require gene symbols and descriptions. Moreover, since gfrSmallScaleHomologyFilter is the most computationally intensive filter, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr file as well as logging information. The filters are described in the supplemental material.
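The ordering constraint on gfrAddInfo can be checked before a long run is started. The following is only a sketch: check_filter_order is a hypothetical helper, not a FusionSeq program, and it looks only at the filter names passed to it in pipeline order.

```shell
#!/bin/sh
# check_filter_order is a hypothetical helper, not part of FusionSeq.
# Arguments: the filter names in the order they appear in the pipeline.
# Returns 1 if a filter that needs gene symbols and descriptions
# is listed before gfrAddInfo, and 0 otherwise.
check_filter_order() {
    seen_addinfo=no
    for filter in "$@"; do
        case "$filter" in
            gfrAddInfo)
                seen_addinfo=yes
                ;;
            gfrAnnotationConsistencyFilter|gfrBlackListFilter)
                if [ "$seen_addinfo" = no ]; then
                    printf '%s must come after gfrAddInfo\n' "$filter" >&2
                    return 1
                fi
                ;;
        esac
    done
    return 0
}
```

For example, `check_filter_order gfrPCRFilter gfrAddInfo gfrBlackListFilter` succeeds, while `check_filter_order gfrBlackListFilter gfrAddInfo` reports the problem and fails.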
Once the final list is generated, gfrConfidenceValues computes the various scores described in the manuscript.
If the auxiliary modules are installed, the fourth command generates the corresponding files and prints the gfr file to standard output. The files need to be placed in the directory structure described in Installing CGIs.
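Since the session chains many programs, it may be worth confirming that they are all on the PATH before starting. A minimal sketch, assuming a POSIX-like shell; missing_tools is a hypothetical helper and not part of FusionSeq:

```shell
#!/bin/sh
# missing_tools is a hypothetical helper, not part of FusionSeq.
# It prints, one per line, every program name given as an argument
# that cannot be found on the PATH. It prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling `missing_tools geneFusions gfrAddInfo gfrConfidenceValues gfr2images` lists whichever of these programs the shell cannot locate.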
Junction-Sequence identifier
The junction-sequence identifier uses the high-confidence gfr file and looks for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently, we exploit a parallel computing architecture. Typically, one would execute:
$ gfr2bpJunctions data.confidence.gfr 40 200
where 40 corresponds to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, a tile size of 40 ensures that at least 10 nucleotides of each read are mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate, split into several files, each containing at most 2M entries. It also creates two additional files: the first (data_joblist1.txt) contains the instructions to index the library and align the reads to the files, and the second (data_joblist2.txt) contains the instructions to aggregate the results of the alignment. How to run those jobs depends on the architecture of the user's cluster.
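The relation between read length and tile size in the example above is simple arithmetic. A read spanning a junction can place at most one tile's worth of nucleotides on one side, so the remainder falls on the other side. The sketch below only illustrates that calculation; min_junction_overlap is a hypothetical helper, not a FusionSeq program.

```shell
#!/bin/sh
# min_junction_overlap is a hypothetical helper for illustration only.
# Usage: min_junction_overlap READ_LENGTH TILE_SIZE
# A read covering a junction between two tiles of TILE_SIZE nucleotides
# places at most TILE_SIZE nucleotides on one tile, so at least
# READ_LENGTH - TILE_SIZE nucleotides land on the other tile.
# If the read is no longer than a tile, it can fit inside a single tile,
# so the guaranteed overlap is 0.
min_junction_overlap() {
    read_length=$1
    tile_size=$2
    overlap=$(( read_length - tile_size ))
    if [ "$overlap" -lt 0 ]; then
        overlap=0
    fi
    echo "$overlap"
}

min_junction_overlap 50 40   # prints 10
```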
After all the jobs have been executed, a breakpoint file is generated for each candidate (e.g. data_00002.bp). This file needs to be validated and filtered. Typically, one would run:
$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp
$ bp2alignment data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
$ bp2wig data_00002_AB.filtered.bp
The first line checks that the junctions do not correspond to any other location on the genome and then filters the results according to a KS test or a simple heuristic, depending on the number of reads aligned to the junction. Note: validateBpJunctions expects a directory named hg18_nh, containing the indexed human reference genome, under the genome directory specified by BOWTIE_INDEXES in the geneFusionConfig.h file. bp2alignment creates a text representation of the reads aligned to the junction, and bp2wig generates the wig files that can be displayed by the UCSC Genome Browser, showing the location of the breakpoints on each gene and the number of supporting reads.
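With many candidates, the validation and filtering step can be wrapped in a loop. The following is a sketch under stated assumptions: all breakpoint files sit in one directory and end in .bp, the function name filter_all_bp is hypothetical, and the bpFilter arguments are simply copied from the example above rather than tuned.

```shell
#!/bin/sh
# filter_all_bp is a hypothetical wrapper, not part of FusionSeq.
# Usage: filter_all_bp [DIRECTORY]   (defaults to the current directory)
# For every NAME.bp in DIRECTORY it writes NAME.filtered.bp next to it.
filter_all_bp() {
    dir=${1:-.}
    for bp in "$dir"/*.bp; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -e "$bp" ] || continue
        # Skip the outputs of a previous run.
        case "$bp" in
            *.filtered.bp) continue ;;
        esac
        # bpFilter arguments are copied verbatim from the example above.
        validateBpJunctions < "$bp" | bpFilter 4 4 100 0.01 30 > "${bp%.bp}.filtered.bp"
    done
}
```

Calling `filter_all_bp .` in the directory holding the breakpoint files would produce, for instance, data_00002_AB.filtered.bp from data_00002_AB.bp.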
