FusionSeq

From GersteinInfo

(Difference between revisions)
Jump to: navigation, search
m (Installing and configuring of FusionSeq)
 
(26 intermediate revisions not shown)
Line 1: Line 1:
-
=Introduction=
+
<center>[http://rnaseq.gersteinlab.org/fusionseq/ FusionSeq main web page]</center>
-
This page provides the source code for FusionSeq. Please note that these tools were tested on a multi-node cluster of computing nodes with Linux Red Hat as operating system and PBS as scheduler system. FusionSeq programs are written in C and should likely compile to most Unix/Linux platforms. We used the gcc complier (version 3.4.6 20060404) to compile the source code.  
+
This document provides the information for downloading, installing, compile and run FusionSeq. Please note that these tools were tested on a multi-node cluster of computing nodes with Linux Red Hat as operating system and PBS as scheduler system. FusionSeq programs are written in C and should likely compile to most Unix/Linux platforms. We used the gcc complier (version 3.4.6 20060404) to compile the source code.  
-
However, this is not a plug-and-play program, but it requires the user to compile, install and run a set of programs. Please read the requirements before [[#Download|downloading]] FusionSeq.
+
However, this is not a plug-and-play program, but it requires the user to compile, install and run a set of programs. Please read the [[Requirements|requirements]] '''before''' [[#Download|downloading]] FusionSeq.
-
<pre>
 
-
Important Note
 
-
==============
 
-
THIS PACKAGE (FusionSeq) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION,
 
-
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 
-
</pre>
 
-
Also, consult the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage.
+
'''IMPORTANT''': Starting from version 0.7.0, we adopted a different approach to include the configuration file that allows the users to simply download the binary files for their platform. However, we still provide the source code, which now uses the more standard autoconf/automake tools, thus simplifying the installation of FusionSeq. Please note that at the time of writing (May 7th, 2011), version 0.7.0 is still in alpha. Any feedback will be very much appreciated.
-
==Software Requirements==
 
-
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Moreover, since its modularity, different programs would need specific libraries. Moreover, some data set are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools that one would need to run the analysis as we do in our lab. The modules should be installed in the listed order.
 
-
===Alignment tools===
+
If you have any questions, please check the [[FusionSeq_FAQ|FAQ]] or send an email to [mailto:fusionseq-faq@gersteinlab.org fusionseq-faq@gersteinlab.org]
-
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
+
-
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
+
-
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBit2fa is also downloaded from the blat package and part of the PATH.
+
-
===Scientific and bioinformatics libraries===
 
-
* [http://www.gnu.org/software/gsl/ GNU Scientific Library]
 
-
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]
 
-
===Drawing tools===
+
==[[FusionSeq Requirements]]==
-
* [http://www.libgd.org/Main_Page GD library]
+
List of required programs and data. Please read this section '''before''' downloading FusionSeq.
-
===Data analysis===
+
==[[FusionSeq Download|Download]]==
-
* [http://root.cern.ch/drupal/ ROOT]
+
Links to FusionSeq source code and data sets.
-
==Data Requirements==
+
==[[Installation and Configuration of FusionSeq]]==
 +
Instructions to install and configure FusionSeq.
-
Here is the list of required data for a comprehensive use of FusionSeq tools.  
+
==[[How to execute FusionSeq]]==
 +
An example workflow of FusionSeq.
-
===External===
+
==[[FusionSeq_List of programs|List of programs]]==
-
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
+
Description of all the FusionSeq programs.
-
The human genome needs to be properly indexed to be used by bowtie. Please see the instruction of bowtie for performing this operation. Indicatevely, you would need to run something like:
+
-
$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/
+
-
where hg18_nh.fa corresponds to the concatenation of all human chromosomes from chromFa.zip without the different haplotypes and "random" stuff.
+
-
===Provided===
+
==[[FusionSeq_Test Datasets|Test Datasets]]==
-
The following data sets, bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/tarballs/FusionSeq_Annotation_Data_1.0.tar.gz here].
+
A few datasets to test FusionSeq installation.
-
* knownGeneAnnotation.txt
+
-
* knownGeneAnnotationTranscriptCompositeModel.txt
+
-
* knownGeneAnnotationTranscriptCompositeModel.fa
+
-
* kgXref.txt
+
-
* knownToTreefam.txt
+
-
The composite model needs to be indexed by bowtie:
+
==[http://dynamic.gersteinlab.org/people/asboner/FusionSeq/geneFusions_cgi Demo]==
-
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
+
You can see some of the results of FusionSeq as described in the paper. Use the sample IDs reported in Table 1, e.g. 106_T, 1700_D.
-
Please make sure that the correct filenames are used.
+
==[[FusionSeq_FAQ|Frequently Asked Questions (FAQ)]]==
 +
Solutions to common problems. If your issue is not described in the [[FusionSeq_FAQ|FAQ]], please send an email to [mailto:fusionseq-faq@gersteinlab.org fusionseq-faq@gersteinlab.org]
-
=Download=
+
==[[FusionSeq Gallery|Gallery]]==
-
The latest version is: [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_0.5.0.tar.gz FusionSeq ver. 0.5.0].
+
Some figures about FusionSeq.
-
Previous versions can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/oldversions/ here].
+
==[[FusionSeq_Papers|In the news et al.]]==
-
 
+
News or scientific papers referring to FusionSeq
-
=Installation and configuration=
+
-
In order to install FusionSeq the external packages need to be installed first. Please, follow the instruction provided by the single packages. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of useful general definitions for manipulating strings, arrays, and parser and more related to bioinformatic analysis. It requires the GSL library, which, in most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
+
-
<pre>
+
-
$ cd path2gslSource/
+
-
$ ./configure --prefix=path2installation
+
-
$ make
+
-
$ make install
+
-
</pre>
+
-
 
+
-
If a 64bit system is used, add CFLAGS=-m64 in the ./configure command. Similarly, the GD libraries can be installed in most systems with:
+
-
<pre>
+
-
$ cd path2gdSource
+
-
$ ./configure --prefix=path2installation --with-jpeg=/path2jpegLib/
+
-
$ make
+
-
$ make install
+
-
</pre>
+
-
 
+
-
although, they are NOT required for the core analysis. If you want to use them, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by gfr2images in order to create a figure depicting wich regions of two genes are connected by PE-reads. See [[#Installation and configuration of FusionSeq|Installation and configuration of FusionSeq]] for setting the appropriate environmental variables.
+
-
 
+
-
==Installing and configuring BIOS==
+
-
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
+
-
<pre>
+
-
$ export BIOINFOCONFDIR=/pathToBios/bios/conf
+
-
$ export BIOINFOGSLDIR=/pathTo/gsl-1.12
+
-
$ cd /pathToBios/bios
+
-
$ make
+
-
$ make prod
+
-
</pre>
+
-
 
+
-
Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.
+
-
 
+
-
==Installing and configuring ROOT==
+
-
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
+
-
<pre>
+
-
$ export ROOTSYS=/path2ROOT/
+
-
$ export PATH=$ROOTSYS/bin:$PATH
+
-
</pre>
+
-
 
+
-
==Installing and configuring FusionSeq==
+
-
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].
+
-
 
+
-
To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify to locations of the annotation files and other required tools:
+
-
<pre>
+
-
geneFusionConfig.h:
+
-
 
+
-
// --------------------------------- This section is required --------------------------------- // Location of the bowtie indexes
+
-
#define BOWTIE_INDEXES "/path2bowtieIndexes"
+
-
// Location of the bowtie indexes
+
-
// #define BLAT_TWO_BIT_TO_FA "path2blat/twoBit2fa"
+
-
// Location and filename for the reference genome in 2bit format (to be used by blat)
+
-
#define BLAT_DATA_DIR "path2blat/DATA/blat"
+
-
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"
+
-
// Location of the hg18 genome and name of the composite transcript model sequence file. Note that the file should be in the same directory
+
-
#define GENOME_DIR "path2genome/"
+
-
#define KNOWN_GENE_TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
+
-
// location of the annotation files
+
-
#define ANNOTATION_DIR "/path2annotationFiles"
+
-
// knownGeneAnnotation filename
+
-
#define KNOWN_GENE_ANNOTATION_FILENAME "knownGeneAnnotation.txt"
+
-
// conversion of knownGenes to gene symbols, description, etc.
+
-
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
+
-
// conversion of knownGenes to TreeFam
+
-
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"
+
-
// file name of the composite model file
+
-
#define KNOWN_GENE_TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"
+
-
 
+
-
// ----------------------- This section is optional: visualization tools -------------------------
+
-
// URL of the cgi directory on the web server
+
-
#define WEB_URL_CGI "http://cgiURL"
+
-
// location of the data directory on the web server, as seen from the web server
+
-
#define WEB_DATA_DIR "/path2data"
+
-
// URL of the data directory on the web server
+
-
#define WEB_DATA_LINK "http://dataURL"
+
-
// Number of nucleotides flanking the region (for UCSC Genome Browser)
+
-
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
+
-
// URL of the public website (non cgi)
+
-
#define WEB_PUB_DIR "http://publicURL"
+
-
// Location of the structural data for Circos
+
-
#define WEB_SDATA_DIR "/path2structuralDataCircos"
+
-
// Location of Circos installation
+
-
#define WEB_CIRCOS_DIR "/path2circos"
+
-
</pre>
+
-
 
+
-
Once the configuration file is ready, the core modules can be compiled and installed. For compiling the auxilliary modules, the Makefile needs to be properly edited (see Auxiliary modules). For the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
+
-
<pre>
+
-
$ make // for the core analysis elements
+
-
$ make all // for the core analysis elements as well as the additional programs and the visualization tools
+
-
$ make cgi // for compiling the visualization/summary tools (see Installing CGIs)
+
-
$ make deploy// for installing the visualization/summary tools to the web server
+
-
</pre>
+
-
 
+
-
==Auxiliary modules==
+
-
These modules generate a set of useful data files for interpreting and visualizing the results. For example, gfr2gff generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; gfr2fasta generates two fasta files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, but gfr2images. This tool creates a schematic for each candidate showing what are the exons connected by paired-end reads. It uses graphic libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example on how to edit the Makefile.
+
-
<pre>
+
-
GDDIR = -L/path2gd/gd-2.0.35/
+
-
GDINC = -I$(GDDIR)/include
+
-
GDLIB = -L$(GDDIR)/lib
+
-
PNGLIB = -L/usr/lib64
+
-
JPEGLIB = -L/usr/X11/lib
+
-
ZLIB = -L/usr/lib
+
-
FREETYPELIB = -L/usr/lib64
+
-
</pre>
+
-
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.
+
-
 
+
-
==Installing CGIs==
+
-
The visualization tools are rather useful to display the results of the analysis, although they are completely independent from the analysis itself. These tools require a web server able to interpret CGI programs. We tested our tools on an Apache Web server. First, a set of variables need to be specfied, describing the locations of the different tools:
+
-
<pre>
+
-
$ export FUSIONSEQWEBSERVER=web_server_name
+
-
$ export FUSIONSEQWEBUSER=webuserID
+
-
$ export FUSIONSEQCGIDIR=/path2cgiDir
+
-
$ export FUSIONSEQCGITARGET="-B target_architecture" // optional: to be specified only if using the CGIs on a machine with a different architecture than the core programs. On the webserver with the CGIs, excecute gcc -dumpmachine to get the target_architecture
+
-
</pre>
+
-
The CGI programs assumes a certain directory structure in your data directory (WEB_DATA_DIR):
+
-
<pre>
+
-
./ALIGNMENTS
+
-
./BED
+
-
./FASTA
+
-
./GFF
+
-
./IMAGES
+
-
./WIGS
+
-
</pre>
+
-
BED, FASTA, GFF, and IMAGES contain the data generated by gfr2bed, gfr2fasta, gfr2gff and gfr2images, respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely bp2alignment and bp2wigs. The user is required to ensure that these directories have the expected files.
+
-
 
+
-
One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.
+
-
 
+
-
There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
+
-
The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
+
-
The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
+
-
The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.
+
-
 
+
-
=How to execute FusionSeq=
+
-
Here we provide an example about running a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and transformed into MRF (data.mrf.gz). For data generated by the Genome Analyzer II and aligned with Eland, the program export2mrf can be used to perform the conversion. All programs, when run without parameters will give an brief explanation of their usage. A typical analysis session looks like the following:
+
-
<pre>
+
-
$ geneFusions data 4 < data.mrf.gz > data.1.gfr 2> data.1.log
+
-
$ (gfrPvalueFilter 0.01 < data.1.gfr | gfrPCRFilter 3 | gfrProximityFilter 1000 | gfrAddInfo | gfrNameFilter ribosomal | gfrBlackListFilter blackList.txt | gfrTreeFamFilter | gfrHomologyFilter) > data.gfr 2> data.log
+
-
$ gfrConfidenceValues num_mapped_reads < data.gfr > data.confidence.gfr
+
-
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log
+
-
</pre>
+
-
 
+
-
The first command will create the first list of candidate fusion transcripts. The parameters "data" and "4" correspond to the prefix (data) that will be used to generate the IDs of the candidates, namely data_00001, data_00002, etc. "4" is the minimum number of PE reads needed to call a candidate. The program reads data.mrf.gz from standard input and save the output to data.1.gfr as well as logging information on data.1.log. This can be considered the most comprehensive list of fusion candidates. The second command executes all the filters to generate a high-confidence list of candidates. The order of the filters affects only the computation time of the processing. However, gfrAddInfo needs to be execute prior to gfrNameFilter or gfrBlacklistFilter, because they require gene symbols and descriptions. Moreover, since gfrHomologyFilter is the most computationally intensive, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr fils as well as logging information. The description of the filters is described in the supplemental material.
+
-
 
+
-
Once the final list is generated, gfrConfidenceValues computes the various scores described in the manuscript.
+
-
 
+
-
If the auxilliary modules are installed, then, the fourth command generates the corresponding files and print the gfr file to standar output. The files needs to be properly located into the directory structure described in Installing CGIs.
+
-
 
+
-
==Junction-Sequence identifier==
+
-
The junction-sequence identifier uses the high-confidence gfr file and look for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently we exploit a parallel computing architecture. Typically, one would execute:
+
-
$ gfr2bpJunctions data.confidence.gfr 40 200
+
-
where 40 correspond to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, having 40 as tile size ensures that at least 10 nucleotides of the reads will be mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate, split in several files each one containing at most 2M entries. It also creates two additional files: one (data_joblist1.txt) including the instructions to index the library and align the reads to the files, and the second (data_joblist2.txt) including the instructions to aggregate the results of the alignment. How to run those jobs depends on the user architecture of the cluster.
+
-
 
+
-
After all the jobs are executed, for each candidate a breakpoint file is generated (e.g. data_00002.bp). This needs to be validated and filtered. Typically, one would run:
+
-
<pre>
+
-
$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp
+
-
$ bp2alignment data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
+
-
$ bp2wig data_00002_AB.filtered.bp
+
-
</pre>
+
-
The first line checks that the junctions do not correspond to any other location on the genome and then filter the results according to a KS test or a simple heuristic depending on the number of reads aligned to the junction. '''Note''': validateBpJunction expects a directory named hg18_nh under the genome directory specified by BOWTIE_INDEXES containing the index human reference genome. in the geneFusionConfig.h file. bp2alignment creates a text representation of the aligned reads to the junction and bp2wig generates the wig files that can be displayed by the UCSC Genome Browser showing the location of the breakpoints on each gene and the number of supporting reads.
+
-
 
+
-
==Troubleshooting==
+
-
Here are some common issues when installing FusionSeq and the associated libraries:
+
-
 
+
-
* libraries compiled for different architectures:
+
-
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
+
-
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
+
-
:: This usually occurs when compiling the optional program gfr2images which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see Auxilliary modules).
+

Latest revision as of 11:43, 7 May 2011

FusionSeq main web page

This document provides the information for downloading, installing, compile and run FusionSeq. Please note that these tools were tested on a multi-node cluster of computing nodes with Linux Red Hat as operating system and PBS as scheduler system. FusionSeq programs are written in C and should likely compile to most Unix/Linux platforms. We used the gcc complier (version 3.4.6 20060404) to compile the source code. However, this is not a plug-and-play program, but it requires the user to compile, install and run a set of programs. Please read the requirements before downloading FusionSeq.


IMPORTANT: Starting from version 0.7.0, we adopted a different approach to include the configuration file that allows the users to simply download the binary files for their platform. However, we still provide the source code, which now uses the more standard autoconf/automake tools, thus simplifying the installation of FusionSeq. Please note that at the time of writing (May 7th, 2011), version 0.7.0 is still in alpha. Any feedback will be very much appreciated.


If you have any questions, please check the FAQ or send an email to fusionseq-faq@gersteinlab.org


Contents

FusionSeq Requirements

List of required programs and data. Please read this section before downloading FusionSeq.

Download

Links to FusionSeq source code and data sets.

Installation and Configuration of FusionSeq

Instructions to install and configure FusionSeq.

How to execute FusionSeq

An example workflow of FusionSeq.

List of programs

Description of all the FusionSeq programs.

Test Datasets

A few datasets to test FusionSeq installation.

Demo

You can see some of the results of FusionSeq as described in the paper. Use the sample IDs reported in Table 1, e.g. 106_T, 1700_D.

Frequently Asked Questions (FAQ)

Solutions to common problems. If your issue is not described in the FAQ, please send an email to fusionseq-faq@gersteinlab.org

Gallery

Some figures about FusionSeq.

In the news et al.

News or scientific papers referring to FusionSeq

Personal tools