How to execute FusionSeq

From GersteinInfo

(Difference between revisions)
Jump to: navigation, search
m (Junction-Sequence Identifier module)
 
(13 intermediate revisions not shown)
Line 1: Line 1:
{{FusionSeqHeader}}
{{FusionSeqHeader}}
-
Here we provide an example about running a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and transformed into [[RSEQtools#Mapped_Read_Format (MRF)|MRF]] (data.mrf.gz). This data format is generated by [http://rseqtools.gersteinlab.org/ RSEQtool], a suite for RNA-Seq data analysis. There you wil find conversion tools for some of the alignment tools, including SAM.
+
Here we provide an example of a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and transformed into [[RSEQtools#Mapped_Read_Format (MRF)|MRF]] (data.mrf). This data format is generated by [http://rseqtools.gersteinlab.org/ RSEQtool], a suite for RNA-Seq data analysis. There you could find conversion tools for some of the most common alignment tools, including SAM. Note that FusionSeq '''requires''' the MRF file to include the sequences of the reads.
 +
 
For data generated by the Genome Analyzer II and aligned with Eland, the program [[FusionSeq_List_of_programs#export2mrf|export2mrf]] provided with FusionSeq can be used to perform the conversion. All programs provide a brief explanation of their usage when run without parameters. A typical analysis session may look like the following:
For data generated by the Genome Analyzer II and aligned with Eland, the program [[FusionSeq_List_of_programs#export2mrf|export2mrf]] provided with FusionSeq can be used to perform the conversion. All programs provide a brief explanation of their usage when run without parameters. A typical analysis session may look like the following:
<pre>
<pre>
-
$ geneFusions data 4 < data.mrf > data.1.gfr 2> data.1.log
+
$ geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log
-
$ (gfrAbnormalInsertSizeFilter 0.01 < data.1.gfr | gfrPCRFilter 3 | gfrProximityFilter 1000 | gfrAddInfo |  
+
 
-
  gfrAnnotationConsistencyFilter ribosomal | gfrBlackListFilter blackList.txt | gfrLargeScaleHomologyFilter |  
+
$ (gfrMitochondrialFilter < data.1.gfr | gfrRepeatMaskerFilter repeatMasker.interval 5 | gfrCountPairTypes | gfrExpressionConsistencyFilter  |
-
  gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
+
  gfrAbnormalInsertSizeFilter 0.01 |  gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal |
 +
  gfrAnnotationConsistencyFilter pseudogenes | gfrBlackListFilter blackList.txt | gfrLargeScaleHomologyFilter |  
 +
  gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
 +
 
$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
 +
 +
$ gfr2bpJunctions data.confidence.gfr 40 200
 +
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log
 +
 +
$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp
 +
 +
$ bp2alignment < data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
 +
 +
$ bp2wig data_00002_AB.filtered.bp
</pre>
</pre>
-
The first command ([[FusionSeq_List of programs#geneFusions|geneFusions]]) will create the first list of candidate fusion transcripts. The parameters "data" and "4" correspond to the prefix (data) that will be used to generate the IDs of the candidates, namely data_00001, data_00002, etc. "4" is the minimum number of PE reads needed to call a candidate. The program reads data.mrf.gz from standard input and save the output to data.1.gfr as well as logging information on data.1.log. Note that all programs will generate some logging information in this format:
+
==Fusion detection module==
 +
The first command ([[FusionSeq_List of programs#geneFusions|geneFusions]]) will create the first list of candidate fusion transcripts:
 +
$ geneFusions data 4 < data.mrf > data.1.gfr 2> data.1.gfr.log
 +
The parameter "data" corresponds to the prefix that will be used to generate the IDs of the candidates, namely data_00001, data_00002, etc., whereas "4" is the minimum number of inter-transcript PE reads needed to call a candidate. The program reads data.mrf from standard input and save the output to data.1.gfr as well as logging information on data.1.log. Note that all programs will generate some logging information in this format:
  WARNING: <program.name>_<type.of.information>: <value>
  WARNING: <program.name>_<type.of.information>: <value>
Hence we recommend to always capture the LOG information in a file by redirecting the STDERR with '2>'.
Hence we recommend to always capture the LOG information in a file by redirecting the STDERR with '2>'.
-
The output in data.1.gfr can be considered the most comprehensive list of fusion candidates. The second command executes all the filters to generate a high-confidence list of candidates. The order of the filters affects only the computation time of the processing. However, [[FusionSeq_List of programs#gfrAddInfo|gfrAddInfo]] needs to be executed prior to [[FusionSeq_List of programs#gfrAnnotationConsistencyFilter|gfrAnnotationConsistencyFilter]] or [[FusionSeq_List of programs#gfrBlacklistFilter|gfrBlackListFilter]], because they require gene symbols and descriptions. Moreover, since [[FusionSeq_List of programs#gfrSmallScaleHomologyFilter|gfrSmallScaleHomologyFilter]] is the most computationally intensive, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr fils as well as logging information. The description of the filters is described in the supplemental material.
+
The content of data.1.gfr is the most comprehensive list of fusion candidates.
-
Once the final list is generated, [[FusionSeq_List of programs#gfrConfidenceValues|gfrConfidenceValues]] computes the various scores described in the manuscript. Please see the documentation for [[FusionSeq_List of programs#gfrConfidenceValues|gfrConfidenceValues]] for a description of the required parameters and files.
+
==Filtration cascade module==
 +
The second command:
 +
<pre>
 +
$ (gfrMitochondrialFilter < data.1.gfr | gfrRepeatMaskerFilter repeatMasker.interval 5 | gfrCountPairTypes | gfrExpressionConsistencyFilter  |
 +
  gfrAbnormalInsertSizeFilter 0.01 |  gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo |  gfrAnnotationConsistencyFilter ribosomal |
 +
  gfrAnnotationConsistencyFilter pseudogenes | gfrBlackListFilter blackList.txt |  gfrLargeScaleHomologyFilter |
 +
  gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
 +
</pre>
 +
runs several filters to remove potential artifacts and generate a high-confidence list of fusion candidates. The order of the filters affects only the computation time of the processing. However, [[FusionSeq_List of programs#gfrAddInfo|gfrAddInfo]] needs to be executed prior to [[FusionSeq_List of programs#gfrAnnotationConsistencyFilter|gfrAnnotationConsistencyFilter]] or [[FusionSeq_List of programs#gfrBlacklistFilter|gfrBlackListFilter]], because they require gene symbols and descriptions.  
-
If the auxilliary modules are installed, then, the fourth command generates the corresponding files and print the gfr file to standar output. The files needs to be properly located into the directory structure described in Installing CGIs.
+
Moreover, since [[FusionSeq_List of programs#gfrSmallScaleHomologyFilter|gfrSmallScaleHomologyFilter]] is the most computationally intensive, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr fils as well as logging information. The description of the filters is described in the supplemental material.
<center>[[#top|Top]]</center>
<center>[[#top|Top]]</center>
-
==Junction-Sequence identifier==
+
===Scoring the candidates===
 +
Once the final list is generated, [[FusionSeq_List of programs#gfrConfidenceValues|gfrConfidenceValues]] computes the various scores described in the manuscript.
 +
$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
 +
Please see the documentation for [[FusionSeq_List of programs#gfrConfidenceValues|gfrConfidenceValues]] for a description of the required parameters and files.
 +
 
 +
====Auxilliary data====
 +
If the auxilliary modules are installed, then, the fourth command generates the corresponding files and print the gfr file to standard output.
 +
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log
 +
The files needs to be properly located into the directory structure described in Installing CGIs. Also, note that [[#gfrCountPairTypes|gfrCountPairTypes]] should be executed before [[#gfr2images|gfr2images]].
 +
 
 +
<center>[[#top|Top]]</center>
 +
 
 +
==Junction-Sequence Identifier module==
The junction-sequence identifier uses the high-confidence gfr file and look for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently we exploit a parallel computing architecture. Typically, one would execute:
The junction-sequence identifier uses the high-confidence gfr file and look for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently we exploit a parallel computing architecture. Typically, one would execute:
-
  $ gfr2bpJunctions data.confidence.gfr 40 200  
+
  $ gfr2bpJunctions data.confidence.gfr 40 200 1.0
-
where 40 correspond to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, having 40 as tile size ensures that at least 10 nucleotides of the reads will be mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate, split in several files each one containing at most 2M entries. It also creates two additional files: one (data_jobList1.txt) including the instructions to index the library and align the reads to the files, and the second (data_jobList2.txt) including the instructions to aggregate the results of the alignment. How to run those jobs depends on the user architecture of the cluster. It is worth noting that data_jobList1.txt expects a compressed file with all the single end reads called: data_allReads.txt.gz. The auxiliary program [[FusionSeq_List of programs#export2mrf|export2mrf]] generates this file automatically. Otherwise, the user should provide a file with all the reads, one sequence per line.
+
where 40 correspond to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, having 40 as tile size ensures that at least 10 nucleotides of the reads will be mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate with DASPER>1.0, split in several files each one containing at most 2M entries. It also creates two additional files: one (data_jobList1.txt) including the instructions to index the library and align the reads to the files, and the second (data_jobList2.txt) including the instructions to aggregate the results of the alignment. How to run those jobs depends on the user architecture of the cluster. It is worth noting that data_jobList1.txt expects a compressed file with all the single end reads called: data_allReads.txt.gz. The auxiliary program [[FusionSeq_List of programs#export2mrf|export2mrf]] generates this file automatically from the output files (export) of the Illumina platform. Otherwise, the user should provide a file with all the reads, one sequence per line. One way to do this from the fastq data is:
 +
cat data_1.fastq data_2.fastq | gawk '$1 ~/^@/ { getline; print }' - > data_allReads.txt
 +
where data_1.fastq data_2.fastq are the fastq data of the two ends.
After all the jobs are executed, for each candidate a breakpoint file is generated (e.g. data_00002.bp). This needs to be validated and filtered. Typically, one would run:
After all the jobs are executed, for each candidate a breakpoint file is generated (e.g. data_00002.bp). This needs to be validated and filtered. Typically, one would run:
<pre>
<pre>
$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp  
$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp  
-
$ bp2alignment data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
+
 
 +
$ bp2alignment < data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
 +
 
$ bp2wig data_00002_AB.filtered.bp  
$ bp2wig data_00002_AB.filtered.bp  
</pre>
</pre>
-
The first line checks that the junctions do not correspond to any other location on the genome and then filter the results according to a KS test or a simple heuristic depending on the number of reads aligned to the junction. '''Note''': [[FusionSeq_List of programs#validateBpJunction|validateBpJunction]] expects a directory named hg18_nh under the genome directory specified by BOWTIE_INDEXES containing the index human reference genome (this can be changed in the geneFusionConfig.h file). bp2alignment creates a text representation of the aligned reads to the junction and [[FusionSeq_List of programs#bp2wig|bp2wig]] generates the wig files that can be displayed by the UCSC Genome Browser showing the location of the breakpoints on each gene and the number of supporting reads.
+
The first line checks that the junctions do not correspond to any other location on the genome and then filter the results according to a KS test or a simple heuristic depending on the number of reads aligned to the junction. '''Note''': [[FusionSeq_List of programs#validateBpJunction|validateBpJunction]] expects a directory named hg18_nh under the genome directory specified by BOWTIE_INDEXES containing the index human reference genome (this can be changed in the geneFusionConfig.h file). [[FusionSeq_List of programs#bp2alignment|bp2alignment]] creates a text representation of the aligned reads to the junction and [[FusionSeq_List of programs#bp2wig|bp2wig]] generates the wig files that can be displayed by the UCSC Genome Browser showing the location of the breakpoints on each gene and the number of supporting reads.
<center>[[#top|Top]]</center>
<center>[[#top|Top]]</center>

Latest revision as of 17:59, 5 April 2011

FusionSeq main web page
User documentation main

Here we provide an example of a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and transformed into MRF (data.mrf). This data format is generated by RSEQtool, a suite for RNA-Seq data analysis. There you could find conversion tools for some of the most common alignment tools, including SAM. Note that FusionSeq requires the MRF file to include the sequences of the reads.

For data generated by the Genome Analyzer II and aligned with Eland, the program export2mrf provided with FusionSeq can be used to perform the conversion. All programs provide a brief explanation of their usage when run without parameters. A typical analysis session may look like the following:

$ geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log

$ (gfrMitochondrialFilter < data.1.gfr | gfrRepeatMaskerFilter repeatMasker.interval 5 | gfrCountPairTypes | gfrExpressionConsistencyFilter  | 
   gfrAbnormalInsertSizeFilter 0.01 |   gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo |  gfrAnnotationConsistencyFilter ribosomal | 
   gfrAnnotationConsistencyFilter pseudogenes | gfrBlackListFilter blackList.txt |  gfrLargeScaleHomologyFilter | 
   gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log

$ gfrConfidenceValues data < data.gfr > data.confidence.gfr

$ gfr2bpJunctions data.confidence.gfr 40 200 

$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log

$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp 

$ bp2alignment < data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt

$ bp2wig data_00002_AB.filtered.bp 

Contents

[hide]

Fusion detection module

The first command (geneFusions) will create the first list of candidate fusion transcripts:

$ geneFusions data 4 < data.mrf > data.1.gfr 2> data.1.gfr.log

The parameter "data" corresponds to the prefix that will be used to generate the IDs of the candidates, namely data_00001, data_00002, etc., whereas "4" is the minimum number of inter-transcript PE reads needed to call a candidate. The program reads data.mrf from standard input and save the output to data.1.gfr as well as logging information on data.1.log. Note that all programs will generate some logging information in this format:

WARNING: <program.name>_<type.of.information>: <value>

Hence we recommend to always capture the LOG information in a file by redirecting the STDERR with '2>'.

The content of data.1.gfr is the most comprehensive list of fusion candidates.

Filtration cascade module

The second command:

$ (gfrMitochondrialFilter < data.1.gfr | gfrRepeatMaskerFilter repeatMasker.interval 5 | gfrCountPairTypes | gfrExpressionConsistencyFilter  | 
   gfrAbnormalInsertSizeFilter 0.01 |  gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo |  gfrAnnotationConsistencyFilter ribosomal | 
   gfrAnnotationConsistencyFilter pseudogenes | gfrBlackListFilter blackList.txt |  gfrLargeScaleHomologyFilter | 
   gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log

runs several filters to remove potential artifacts and generate a high-confidence list of fusion candidates. The order of the filters affects only the computation time of the processing. However, gfrAddInfo needs to be executed prior to gfrAnnotationConsistencyFilter or gfrBlackListFilter, because they require gene symbols and descriptions.

Moreover, since gfrSmallScaleHomologyFilter is the most computationally intensive, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr fils as well as logging information. The description of the filters is described in the supplemental material.

Top

Scoring the candidates

Once the final list is generated, gfrConfidenceValues computes the various scores described in the manuscript.

$ gfrConfidenceValues data < data.gfr > data.confidence.gfr

Please see the documentation for gfrConfidenceValues for a description of the required parameters and files.

Auxilliary data

If the auxilliary modules are installed, then, the fourth command generates the corresponding files and print the gfr file to standard output.

$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log

The files needs to be properly located into the directory structure described in Installing CGIs. Also, note that gfrCountPairTypes should be executed before gfr2images.

Top

Junction-Sequence Identifier module

The junction-sequence identifier uses the high-confidence gfr file and look for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently we exploit a parallel computing architecture. Typically, one would execute:

$ gfr2bpJunctions data.confidence.gfr 40 200 1.0

where 40 correspond to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, having 40 as tile size ensures that at least 10 nucleotides of the reads will be mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate with DASPER>1.0, split in several files each one containing at most 2M entries. It also creates two additional files: one (data_jobList1.txt) including the instructions to index the library and align the reads to the files, and the second (data_jobList2.txt) including the instructions to aggregate the results of the alignment. How to run those jobs depends on the user architecture of the cluster. It is worth noting that data_jobList1.txt expects a compressed file with all the single end reads called: data_allReads.txt.gz. The auxiliary program export2mrf generates this file automatically from the output files (export) of the Illumina platform. Otherwise, the user should provide a file with all the reads, one sequence per line. One way to do this from the fastq data is:

cat data_1.fastq data_2.fastq | gawk '$1 ~/^@/ { getline; print }' - > data_allReads.txt

where data_1.fastq data_2.fastq are the fastq data of the two ends.

After all the jobs are executed, for each candidate a breakpoint file is generated (e.g. data_00002.bp). This needs to be validated and filtered. Typically, one would run:

$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp 

$ bp2alignment < data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt

$ bp2wig data_00002_AB.filtered.bp 

The first line checks that the junctions do not correspond to any other location on the genome and then filter the results according to a KS test or a simple heuristic depending on the number of reads aligned to the junction. Note: validateBpJunction expects a directory named hg18_nh under the genome directory specified by BOWTIE_INDEXES containing the index human reference genome (this can be changed in the geneFusionConfig.h file). bp2alignment creates a text representation of the aligned reads to the junction and bp2wig generates the wig files that can be displayed by the UCSC Genome Browser showing the location of the breakpoints on each gene and the number of supporting reads.

Top
Personal tools