From GersteinInfo

Revision as of 12:44, 19 August 2010 by Asboner (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Introduction

This page provides the source code for FusionSeq. Please note that these tools were tested on a multi-node cluster of computing nodes with Linux Red Hat as operating system and PBS as scheduler system. FusionSeq programs are written in C and should likely compile to most Unix/Linux platforms. We used the gcc complier (version 3.4.6 20060404) to compile the source code. However, this is not a plug-and-play program, but it requires the user to compile, install and run a set of programs. Please read the requirements before downloading FusionSeq.

Software Requirements

FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Moreover, since its modularity, different programs would need specific libraries. Moreover, some data set are also required for the analysis (see Data Requirements). Here we describe the complete set of tools that one would need to run the analysis as we do in our lab. The modules should be installed in the listed order.

Alignment tools

bowtie (64bit)
Blat (source) (binaries)

Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBit2fa is also downloaded from the blat package and part of the PATH.

Scientific and bioinformatics libraries

Drawing tools

GD library

Data analysis

ROOT

Data Requirements

Here is the list of required data for a comprehensive use of FusionSeq tools.

External

Homo Sapiens Reference genome (hg18): the user should download both chromFa.zip and hg18.2bit.

The human genome needs to be properly indexed to be used by bowtie. Please see the instruction of bowtie for performing this operation. Indicatevely, you would need to run something like:

$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/

where hg18_nh.fa corresponds to the concatenation of all human chromosomes from chromFa.zip without the different haplotypes and "random" stuff.

Provided

The following data sets, bundled in a tarball, can be downloaded here.

knownGeneAnnotation.txt
knownGeneAnnotationTranscriptCompositeModel.txt
knownGeneAnnotationTranscriptCompositeModel.fa
kgXref.txt
knownToTreefam.txt

The composite model needs to be indexed by bowtie:

$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel

Please make sure that the correct filenames are used.

Download

Installation and Configuration of FusionSeq

How to execute FusionSeq

Here we provide an example about running a simple analysis with FusionSeq. We assume that PE sequencing data have been generated and transformed into MRF (data.mrf.gz). For data generated by the Genome Analyzer II and aligned with Eland, the program export2mrf can be used to perform the conversion. All programs, when run without parameters will give an brief explanation of their usage. A typical analysis session looks like the following:

$ geneFusions data 4 < data.mrf.gz > data.1.gfr 2> data.1.log
$ (gfrPvalueFilter 0.01 < data.1.gfr | gfrPCRFilter 3 | gfrProximityFilter 1000 | gfrAddInfo | gfrNameFilter ribosomal | gfrBlackListFilter blackList.txt | gfrTreeFamFilter | gfrHomologyFilter) > data.gfr 2> data.log
$ gfrConfidenceValues num_mapped_reads < data.gfr > data.confidence.gfr
$ (gfr2images < data.confidence.gfr | gfr2bed | gfr2fasta | gfr2gff) 2> data.aux.log

The first command will create the first list of candidate fusion transcripts. The parameters "data" and "4" correspond to the prefix (data) that will be used to generate the IDs of the candidates, namely data_00001, data_00002, etc. "4" is the minimum number of PE reads needed to call a candidate. The program reads data.mrf.gz from standard input and save the output to data.1.gfr as well as logging information on data.1.log. This can be considered the most comprehensive list of fusion candidates. The second command executes all the filters to generate a high-confidence list of candidates. The order of the filters affects only the computation time of the processing. However, gfrAddInfo needs to be execute prior to gfrNameFilter or gfrBlacklistFilter, because they require gene symbols and descriptions. Moreover, since gfrHomologyFilter is the most computationally intensive, it is probably better to run it toward the end of the pipeline. Note that each filter outputs the filtered gfr fils as well as logging information. The description of the filters is described in the supplemental material.

Once the final list is generated, gfrConfidenceValues computes the various scores described in the manuscript.

If the auxilliary modules are installed, then, the fourth command generates the corresponding files and print the gfr file to standar output. The files needs to be properly located into the directory structure described in Installing CGIs.

Junction-Sequence identifier

The junction-sequence identifier uses the high-confidence gfr file and look for the sequence of the junction for each candidate. This is the most computationally expensive part of FusionSeq. In order to run it efficiently we exploit a parallel computing architecture. Typically, one would execute:

$ gfr2bpJunctions data.confidence.gfr 40 200

where 40 correspond to the tile size and 200 is the number of nucleotides flanking the exons. For example, with 50-mers, having 40 as tile size ensures that at least 10 nucleotides of the reads will be mapped to either of the tiles. This program generates the fusion junction library in fasta format for each candidate, split in several files each one containing at most 2M entries. It also creates two additional files: one (data_joblist1.txt) including the instructions to index the library and align the reads to the files, and the second (data_joblist2.txt) including the instructions to aggregate the results of the alignment. How to run those jobs depends on the user architecture of the cluster.

After all the jobs are executed, for each candidate a breakpoint file is generated (e.g. data_00002.bp). This needs to be validated and filtered. Typically, one would run:

$ validateBpJunctions < data_00002_AB.bp | bpFilter 4 4 100 0.01 30 > data_00002_AB.filtered.bp 
$ bp2alignment data_00002_AB.filtered.bp > data_00002_breakPointAlignments.txt
$ bp2wig data_00002_AB.filtered.bp

The first line checks that the junctions do not correspond to any other location on the genome and then filter the results according to a KS test or a simple heuristic depending on the number of reads aligned to the junction. Note: validateBpJunction expects a directory named hg18_nh under the genome directory specified by BOWTIE_INDEXES containing the index human reference genome. in the geneFusionConfig.h file. bp2alignment creates a text representation of the aligned reads to the junction and bp2wig generates the wig files that can be displayed by the UCSC Genome Browser showing the location of the breakpoints on each gene and the number of supporting reads.

Troubleshooting

Here are some common issues when installing FusionSeq and the associated libraries:

libraries compiled for different architectures:

Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.

/usr/bin/ld: cannot find -lpng (or -ljpeg)

This usually occurs when compiling the optional program gfr2images which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see Auxilliary modules).

FusionSeq