How to execute FusionSeq

From GersteinInfo

Revision as of 01:09, 1 November 2010 by Asboner (Talk | contribs)
Jump to: navigation, search
FusionSeq main web page
User documentation main

Here we provide an example about running a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and transformed into MRF (data.mrf.gz). This data format is generated by RSEQtool, a suite for RNA-Seq data analysis. There you wil find conversion tools for some of the alignment tools, including SAM. For data generated by the Genome Analyzer II and aligned with Eland, the program export2mrf provided with FusionSeq can be used to perform the conversion. All programs provide a brief explanation of their usage when run without parameters. A typical analysis session may look like the following:

$ geneFusions data 4 < data.mrf > data.1.gfr 2> data.1.log
$ (gfrAbnormalInsertSizeFilter 0.01 < data.1.gfr | gfrPCRFilter 3 | gfrProximityFilter 1000 | gfrAddInfo | 
  gfrAnnotationConsistencyFilter ribosomal | gfrBlackListFilter blackList.txt | gfrLargeScaleHomologyFilter | 
  gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log

The first command (geneFusions) will create the first list of candidate fusion transcripts. The parameters "data" and "4" correspond to the prefix (data) that will be used to generate the IDs of the candidates, namely data_00001, data_00002, etc. "4" is the minimum number of PE reads needed to call a candidate. The program reads data.mrf.gz from standard input and save the output to data.1.gfr as well as logging information on data.1.log. Note that all programs will generate some logging information in this format:

WARNING: <>_<type.of.information>: <value>

Hence we recommend to always capture the LOG information in a file by redirecting the STDERR with '2>'.

The output in data.1.gfr can be considered the most comprehensive list of fusion candidates. The second command executes all the filters to generate a high-confidence list of candidates. The order of the filters affects only the computation time of the processing. However, gfrAddInfo needs to be executed prior to gfrAnnotationConsistencyFilter or gfrBlackListFilter, because they require gene symbols and descriptions. Moreover, since gfrSmallScaleHomologyFilter is the most computationally intensive, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr fils as well as logging information. The description of the filters is described in the supplemental material.

Once the final list is generated, gfrConfidenceValues computes the various scores described in the manuscript. Please see the documentation for gfrConfidenceValues for a description of the required parameters and files.

If the auxilliary modules are installed, then, the fourth command generates the corresponding files and print the gfr file to standar output. The files needs to be properly located into the directory structure described in Installing CGIs.


Junction-Sequence identifier

The junction-sequence identifier uses the high-confidence gfr file and look for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently we exploit a parallel computing architecture. Typically, one would execute:

$ gfr2bpJunctions data.confidence.gfr 40 200 

where 40 correspond to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, having 40 as tile size ensures that at least 10 nucleotides of the reads will be mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate, split in several files each one containing at most 2M entries. It also creates two additional files: one (data_jobList1.txt) including the instructions to index the library and align the reads to the files, and the second (data_jobList2.txt) including the instructions to aggregate the results of the alignment. How to run those jobs depends on the user architecture of the cluster. It is worth noting that data_jobList1.txt expects a compressed file with all the single end reads called: data_allReads.txt.gz. The auxiliary program export2mrf generates this file automatically. Otherwise, the user should provide a file with all the reads, one sequence per line.

After all the jobs are executed, for each candidate a breakpoint file is generated (e.g. data_00002.bp). This needs to be validated and filtered. Typically, one would run:

$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp 
$ bp2alignment data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
$ bp2wig data_00002_AB.filtered.bp 

The first line checks that the junctions do not correspond to any other location on the genome and then filter the results according to a KS test or a simple heuristic depending on the number of reads aligned to the junction. Note: validateBpJunction expects a directory named hg18_nh under the genome directory specified by BOWTIE_INDEXES containing the index human reference genome (this can be changed in the geneFusionConfig.h file). bp2alignment creates a text representation of the aligned reads to the junction and bp2wig generates the wig files that can be displayed by the UCSC Genome Browser showing the location of the breakpoints on each gene and the number of supporting reads.

Personal tools