Installation and Configuration of FusionSeq
Installing GSL and GD libraries
To install FusionSeq, these external packages need to be installed first (see Requirements). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of BIOS. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the GNU Scientific Library website):
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the GD library can be installed on most systems with:
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by gfr2images to create a schematic illustration depicting the connected regions of the two genes. See Installing and configuring FusionSeq for setting the appropriate environment variables.
Note: we used gsl-1.14 and gd-2.0.35.
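To compare an existing installation against these versions, one could query the helper scripts that GSL and GD 2.0.x normally install next to the libraries. This is only a sketch: report_version is a hypothetical helper name, and it assumes that gsl-config and gdlib-config ended up on the PATH (for example because /path/to/installation/bin was added to it).

```shell
# Hypothetical helper: print the version reported by a *-config script,
# or say that the script cannot be found on the PATH.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $("$1" --version)"
    else
        echo "$1: not found on PATH"
    fi
}

report_version gsl-config     # installed by GSL
report_version gdlib-config   # installed by GD 2.0.x (assumption: present in your build)
```

If either script is reported as missing, the library may still be installed; it may simply live under a prefix whose bin/ directory is not on the PATH.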
Installing and configuring libbios, libmrf, or BIOS
Please refer to version 0.6.1 for instructions on the stable version.
(versions 0.7.0 and later)
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
Similarly, for libmrf, one would run:
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
Note: if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
If one doesn't want to list all relevant directories, a convenient approach is to create local include and lib directories and use symbolic links to the relevant files. For example:
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
Hence, one could simply define:
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
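The symbolic-link approach above repeats the same steps for every package, so it could be wrapped in a small function. The following is a minimal sketch, not part of FusionSeq: link_prefix is a hypothetical name, and it assumes each package was installed under a prefix that contains include/ and lib/ subdirectories. The prefix should be given as an absolute path so that the links stay valid.

```shell
# Hypothetical helper: link everything under <prefix>/include and <prefix>/lib
# into <dest>/include and <dest>/lib.
# Usage: link_prefix /absolute/path/to/prefix /absolute/path/to/dest
link_prefix() {
    prefix=$1
    dest=$2
    mkdir -p "$dest/include" "$dest/lib"
    for sub in include lib; do
        for f in "$prefix/$sub"/*; do
            # Skip the unexpanded pattern when the directory is empty or missing
            [ -e "$f" ] || continue
            ln -sf "$f" "$dest/$sub/"
        done
    done
}

# Example calls (placeholder paths, as elsewhere on this page):
#   link_prefix /path/to/libbios "$HOME/fusionseq"
#   link_prefix /path/to/libmrf  "$HOME/fusionseq"
#   link_prefix /path/to/gsl     "$HOME/fusionseq"
#   link_prefix /path/to/gd      "$HOME/fusionseq"
```

Using ln -sf means the function can be run again after a package is reinstalled without failing on links that already exist.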
(versions up to 0.6.1)
To install BIOS, a few variables need to be set before compiling the library. Here is an example of the procedure in a bash shell:
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
Please refer to BIOS documentation for additional information.
Installing and configuring ROOT
To install ROOT, please follow the instructions on the website. You may also include the GSL library when compiling ROOT, but it is not a requirement. NOTE: for Ubuntu users, the detailed instructions to install ROOT can be found here.
(versions 0.7.0 and later)
If ROOT is installed in the default folder, it will generate a subfolder 'root' for both the include and lib files. In the case of a non-standard location for ROOT, however, this doesn't occur. Hence, an approach similar to the one above can be adopted to properly link ROOT files for FusionSeq.
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
Also, for some versions of ROOT, one may get the following error:
[...]/root/include/Rtypes.h:35:67: error: snprintf.h: No such file or directory
[...]/root/include/Rtypes.h:36:68: error: strlcpy.h: No such file or directory
This is because ROOT provides its own copy of the header files. One workaround is thus to create symbolic links:
$ cd ~/fusionseq/include
$ ln -s root/snprintf.h .
$ ln -s root/strlcpy.h .
This should solve it.
(versions up to 0.6.1)
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
Installing and configuring FusionSeq
Please refer to version 0.6.1 for instructions on the stable version.
(versions 0.7.0 and later)
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way as in previous versions. IMPORTANT: the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
Below is an example of the configuration file.
.fusionseqrc:
// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"
// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"
// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"
// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"
// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"
# Used for gfrRibosomalFilter
MAX_FRACTION_HOMOLOGOUS=0.05
MAX_OVERLAP_ALLOWED=0.75
# Used for gfr2bpJunctions
MAX_NUMBER_OF_JUNCTION_PER_FILE=2000000
// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
NB: use absolute paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
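Because shell variables in .fusionseqrc are easy to overlook, a quick check before running the programs may help. The sketch below is an illustration only: check_fusionseqrc is a hypothetical helper, not a FusionSeq tool, and it simply flags KEY=VALUE lines whose value contains a '$' or '~' character. It reads the file given as its argument, or the one named by FUSIONSEQ_CONFPATH if no argument is given.

```shell
# Hypothetical helper: verify that the configuration file exists and that no
# KEY=VALUE line relies on shell variables ($HOME, ...) or on ~.
check_fusionseqrc() {
    conf=${1:-${FUSIONSEQ_CONFPATH:-}}
    if [ ! -f "$conf" ]; then
        echo "configuration file not found: $conf"
        return 1
    fi
    # grep prints the offending lines with their line numbers
    if grep -nE '^[A-Z0-9_]+=.*[$~]' "$conf"; then
        echo "the lines above appear to use shell variables or ~; use absolute paths"
        return 1
    fi
    echo "no shell variables found in $conf"
}

# Example call:
#   check_fusionseqrc /path/2/home/.fusionseqrc
```

The check is deliberately coarse: it does not verify that the listed paths exist, only that they are written out literally.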
(versions up to 0.6.1)
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read Installing CGIs.
Before starting with the installation of FusionSeq, please read the Requirements section to make sure all data sets and external tools are available.
To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
geneFusionConfig.h:
// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"
// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"
// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"
// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"
// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"
// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
NB: use absolute paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.
Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see Auxiliary modules). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see Installing CGIs). Once the configuration file is set up, the compilation just requires:
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
Auxiliary modules
These modules generate a set of useful data files for interpreting and visualizing the results. For example, gfr2gff generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; gfr2fasta generates two fasta files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of gfr2images. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphic libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.
Installing CGIs
The visualization tools are rather useful to display the results of the analysis, although they are completely independent from the analysis itself. These tools require a web server able to interpret CGI programs. We tested our tools on an Apache Web server. First, a set of variables needs to be specified, describing the locations of the different tools:
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
$ export FUSIONSEQCGITARGET="-b target_architecture"
FUSIONSEQCGITARGET is optional: it needs to be specified only if the CGIs are used on a machine with a different architecture than the core programs. On the web server with the CGIs, execute gcc -dumpmachine to get the target_architecture.
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
BED, FASTA, GFF, and IMAGES contain the data generated by gfr2bed, gfr2fasta, gfr2gff and gfr2images, respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely bp2alignment and bp2wig. The user is required to ensure that these directories contain the expected files.
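The expected layout can be prepared in one step before the individual tools are run. The following is a minimal sketch under the assumption that the data directory is writable by the current user; make_web_data_dirs is a hypothetical helper name, and the argument should be the same path that is configured as WEB_DATA_DIR.

```shell
# Hypothetical helper: create the six subdirectories the CGIs expect
# under the data directory given as the first argument.
make_web_data_dirs() {
    base=$1
    for d in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        mkdir -p "$base/$d"
    done
}

# Example call (placeholder path, matching WEB_DATA_DIR in the configuration):
#   make_web_data_dirs /path/to/data
```

This only creates empty directories; populating them with the output of gfr2bed, gfr2fasta, gfr2gff, gfr2images, bp2alignment and bp2wig remains the user's responsibility.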
One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the Circos website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.
There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded here. In the Web Files tarball, the following files are included:
- The /web folder contains the CSS style sheets, images, and a distribution of the jQuery and jQuery UI JavaScript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
- The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
- The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.
Troubleshooting
Here are some common issues when installing FusionSeq and the associated libraries:
- libraries compiled for different architectures:
  - Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64-bit machine, use the flag CFLAGS=-m64 in the configure command.
- /usr/bin/ld: cannot find -lpng (or -ljpeg)
  - This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see Auxiliary modules).
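To find out which directory to put in PNGLIB or JPEGLIB, one could search a few candidate locations for the library files. This is a sketch only: find_lib is a hypothetical helper, the candidate directories in the example are common Linux locations rather than a complete list, and it looks only for lib<name>.so* and lib<name>.a (other platforms use other suffixes).

```shell
# Hypothetical helper: print the first directory that contains lib<name>.so*
# or lib<name>.a; return 1 if none of the given directories does.
# Usage: find_lib <name> <dir> [<dir> ...]
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        for f in "$dir"/lib"$name".so* "$dir"/lib"$name".a; do
            if [ -e "$f" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    return 1
}

# Example calls:
#   find_lib png  /usr/lib64 /usr/lib /usr/local/lib
#   find_lib jpeg /usr/lib64 /usr/lib /usr/X11/lib
```

The directory printed for each library is the one to pass with -L in the corresponding Makefile variable.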