GersteinInfo - User contributions [en]

RSEQtools

2013-04-22T14:08:49Z

Asboner: /* sam2mrf */

<center>[http://archive.gersteinlab.org/proj/rnaseq/rseqtools '''RSEQtools Main Page''']</center>

__TOC__

== Introduction ==

The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify that person, raising significant privacy concerns. To address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information while still allowing many functional genomics studies to be carried out. We have developed a suite of tools that uses this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. In addition to the anonymization it affords, this format also facilitates the decoupling of the alignment of reads from downstream analyses.

 

== Citation ==

Lukas Habegger*, Andrea Sboner*, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein. '''RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries'''. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq643?ijkey=GSZOpzLAEOqJtJ4&keytype=ref Bioinformatics] 2010,
doi: 10.1093/bioinformatics/btq643

 

== Overview ==

The following sections provide documentation for the modules that are part of RSEQtools (http://rseqtools.gersteinlab.org/). This documentation is intended for end-users and can also be found at http://info.gersteinlab.org/RSEQtools.
'''RSEQtools is implemented in C''' and uses a general C library called BIOS.

The full '''documentation for developers''' can be found here:
* RSEQtools: http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/
* BIOS: http://archive.gersteinlab.org/proj/rnaseq/doc/bios/

 

== Data formats ==

<center>[[#top|Top]]</center>
=== Mapped Read Format (MRF) ===

The Mapped Read Format (MRF) is a flat-file format that consists of '''three''' components. It is closely associated with the software components of [http://rseqtools.gersteinlab.org RSEQtools].

1. Comment lines. Comment lines are optional and start with a '#' character.
2. Header line. The header line is required and specifies the type of each column.
3. Mapped reads. Each read (single-end or paired-end) is represented by one line.

'''Required''' column:

* AlignmentBlocks, each alignment block must contain the following attributes: TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd

''Optional'' columns:

* Sequence
* QualityScores
* QueryId

Example file:

<pre>
# Comments
# Required field: Blocks [TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd]
# Optional fields: Sequence,QualityScores,QueryId
AlignmentBlocks
chr1:+:2001:2050:1:50
chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50
chr2:-:3001:3051:1:51|chr11:+:4001:4051:1:51
chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51
contigA:+:5001:5200:1:200,contigB:-:1200:1400:200:400
</pre>

Notes:

* Paired-end reads are separated by ‘|’
* Alignment blocks are separated by ‘,’
* Features of a block are separated by ‘:’
* Columns are tab-delimited
* Columns can be arranged in any order
* Coordinates are '''one-based''' and '''closed (inclusive)'''
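The separator rules above can be illustrated with a small parser. This is only a sketch in Python, not part of RSEQtools (which is implemented in C); the names <code>Block</code> and <code>parse_alignment_blocks</code> are hypothetical, and target names are assumed not to contain ':' characters.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One alignment block; names are illustrative, not RSEQtools identifiers."""
    target_name: str
    strand: str
    target_start: int  # one-based, inclusive
    target_end: int    # one-based, inclusive
    query_start: int
    query_end: int


def parse_alignment_blocks(field: str) -> List[List[Block]]:
    """Split an AlignmentBlocks field into mates, each a list of blocks."""
    mates = []
    for mate in field.split("|"):        # '|' separates paired-end mates
        blocks = []
        for block in mate.split(","):    # ',' separates alignment blocks
            # ':' separates the six features of a block
            name, strand, t_start, t_end, q_start, q_end = block.split(":")
            blocks.append(Block(name, strand, int(t_start), int(t_end),
                                int(q_start), int(q_end)))
        mates.append(blocks)
    return mates


pair = parse_alignment_blocks(
    "chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51")
print(len(pair), len(pair[0]), pair[1][0].target_name)
```

Applied to the fourth read of the example file, the sketch yields two mates, the first of which has two alignment blocks (a spliced alignment).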
 

'''Use MRF for confidential data'''

It is straightforward to use MRF to separate the confidential information, i.e. the sequences, from the alignment data. The MRF file can be split into two files: one file can include ''AlignmentBlocks'' and ''QueryId'', whereas a second file can contain ''Sequence'' and ''QueryId''. From a practical viewpoint it is also easy to create these two files.

Assuming we have the columns AlignmentBlocks, Sequence, and QueryId as columns 1, 2, and 3, respectively:
$ cut -f1,3 file.mrf > alignments.mrf
$ cut -f2,3 file.mrf > sequences.mrf

* alignments.mrf would contain the alignment data and the query ID; it can be freely shared since it does not include confidential information;
* sequences.mrf would contain the sequence data, which could potentially be used to identify an individual and thus may be subject to more stringent rules.
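Because MRF columns can be arranged in any order, a split that looks up columns by the names in the header line may be more robust than fixed column positions. The following Python sketch illustrates the idea; the function name <code>split_mrf</code> is hypothetical, and passing comment lines through unchanged is an assumption of this sketch.

```python
import io


def split_mrf(lines, keep):
    """Yield MRF rows restricted to the columns named in `keep`.

    Comment lines are passed through unchanged; the first non-comment
    line is taken to be the header line.
    """
    index = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if index is None:
            # Header line: locate the requested columns by name.
            index = [fields.index(name) for name in keep]
        yield "\t".join(fields[i] for i in index)


mrf = ("# demo\n"
       "AlignmentBlocks\tSequence\tQueryId\n"
       "chr1:+:2001:2004:1:4\tACGT\tread1\n")
alignments = list(split_mrf(io.StringIO(mrf), ["AlignmentBlocks", "QueryId"]))
sequences = list(split_mrf(io.StringIO(mrf), ["Sequence", "QueryId"]))
print("\n".join(alignments))
```

Here <code>alignments</code> holds the shareable part and <code>sequences</code> the confidential part; the shared ''QueryId'' column allows the two files to be joined again later.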
 
<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of '''eight''' tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of [http://homes.gersteinlab.org/people/lh372/SOFT/bios/index.html BIOS]. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

Example file:

uc001aaw.1 chr1 + 357521 358460 1 357521 358460
uc001aax.1 chr1 + 410068 411702 3 410068,410854,411258 410159,411121,411702
uc001aay.1 chr1 - 552622 554252 3 552622,553203,554161 553066,553466,554252
uc001aaz.1 chr1 + 556324 557910 1 556324 557910
uc001aba.1 chr1 + 558011 558705 1 558011 558705

In this example the intervals represent transcripts, while the sub-intervals denote exons.

Note: the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
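As an illustration of the eight columns and the zero-based, end-exclusive convention, the following Python sketch parses one Interval line and computes its exonic length. The names <code>Interval</code>, <code>parse_interval</code> and <code>exonic_length</code> are hypothetical and are not the BIOS intervalFind API.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    strand: str
    start: int  # zero-based
    end: int    # not included
    sub_intervals: List[Tuple[int, int]]


def parse_interval(line: str) -> Interval:
    """Parse one tab-delimited line of the Interval format."""
    (name, chrom, strand, start, end,
     count, sub_starts, sub_ends) = line.rstrip("\n").split("\t")
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval count does not match columns 7 and 8")
    return Interval(name, chrom, strand, int(start), int(end),
                    list(zip(starts, ends)))


def exonic_length(interval: Interval) -> int:
    """With end-exclusive coordinates, a length is simply end - start."""
    return sum(end - start for start, end in interval.sub_intervals)


line = ("uc001aax.1\tchr1\t+\t410068\t411702\t3\t"
        "410068,410854,411258\t410159,411121,411702")
transcript = parse_interval(line)
print(transcript.name, exonic_length(transcript))
```

Note that no "+1" correction is needed when computing lengths, in contrast to the one-based, inclusive coordinates used in MRF.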

 

<center>[[#top|Top]]</center>

=== BED ===

The BED format is used to represent contiguous genomic regions. It consists of '''three''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End

Example file:

chr1 1000 5000
chr3 500 600
chrX 4000 4250

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC]

Note: the coordinates in the BED format are '''zero-based''' and the '''end coordinate is not included'''.

 

<center>[[#top|Top]]</center>

=== BedGraph ===

The BedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state. It consists of '''four''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End
4. Value

Example file:

chr1 1000 5000 12.3
chr3 500 600 3
chrX 4000 4250 54

Full documentation can be found at [ftp://hgdownload.cse.ucsc.edu/apache/htdocs-rr/goldenPath/help/bedgraph.html UCSC]

Note: the coordinates in the BedGraph format are '''zero-based''' and the '''end coordinate is not included'''.

 

<center>[[#top|Top]]</center>

=== WIG ===

The WIG format is used to represent dense and continuous genomic data. There are two options for formatting wiggle data: '''variableStep''' and '''fixedStep'''.

In the context of [http://rseqtools.gersteinlab.org/ RSEQtools], the variable step formatting is used and only positions with '''non-zero values''' are represented.

Example file:

track type=wiggle_0 name="test_chr22"
variableStep chrom=chr22 span=1
17535712 1.67
17535713 1.67
17535714 1.67
17535715 1.67
17535716 1.67

Full documentation can be found at [http://genome.ucsc.edu/goldenPath/help/wiggle.html UCSC]

Note: the coordinates in the WIG format are '''zero-based'''.

 

<center>[[#top|Top]]</center>

=== GFF ===

The GFF format is used to describe genes and other features. It consists of '''nine''' tab-delimited columns.

1. Name
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Group

Example file:

browser hide all
track name="chr11" visibility=2
chr11 MRF feature 46772115 46772161 . - . TG5
chr11 MRF feature 46772668 46772695 . - . TG5
chr11 MRF feature 118521207 118521252 . + . TG21
chr11 MRF feature 118526315 118526343 . + . TG21

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format3 UCSC]

Note: the coordinates in the GFF format are '''one-based''' and the '''end coordinate is included'''.

 

<center>[[#top|Top]]</center>
=== PSL ===

The PSL format represents alignments from the BLAT alignment program.

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format2 UCSC]

 

<center>[[#top|Top]]</center>
=== SAM ===

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Full documentation can be found at [http://samtools.sourceforge.net/ SAMtools]

 

 

== List of programs ==

This is the documentation for the end-users. The full documentation for developers can be found [http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/files.html here].

=== Format conversion utilities ===

The following programs convert the output from various alignment programs into [[#MRF|MRF]].

<center>[[#top|Top]]</center>
==== bowtie2mrf ====

bowtie2mrf converts read alignments from Bowtie into [[#MRF|MRF]].

'''Usage''':

bowtie2mrf <genomic|junctions|paired> [-sequence] [-qualityScores] [-IDs]

* Inputs: Takes [http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output Bowtie output] from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** genomic - convert single-end reads that were aligned against a genomic reference sequence using Bowtie
** junctions - convert single-end reads that were aligned against a splice junction library (generated by '''createSpliceJunctionLibrary''') using Bowtie
** paired - convert paired-end reads that were aligned using Bowtie
* ''Optional arguments''
** sequence - include the read sequence in the [[#MRF|MRF]] output
** qualityScores - include the quality scores of the read in the [[#MRF|MRF]] output
** IDs - include the read IDs in the [[#MRF|MRF]] output
 
'''Note''': bowtie2mrf assumes that bowtie was run using the default option for the -B parameter
-B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary. Otherwise, bowtie2mrf will not be able to convert the splice junction coordinates correctly.

 
<center>[[#top|Top]]</center>

==== psl2mrf ====

psl2mrf converts read alignments from BLAT into [[#MRF|MRF]].

'''Usage''':

psl2mrf

* Inputs: Takes BLAT alignments in [[#PSL|PSL]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== singleExport2mrf ====

singleExport2mrf converts single-end read alignments from ELAND (export file) into [[#MRF|MRF]].

'''Usage''':

singleExport2mrf

* Inputs: Takes ELAND single-end alignments in export format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary and that its file name included 'splice' or 'junction'. Otherwise, singleExport2mrf will not be able to convert the splice junction coordinates correctly.
 
The output includes sequences and quality scores. If one wants the alignment only:
singleExport2mrf < file.export.txt | cut -f1 > file.mrf
 

<center>[[#top|Top]]</center>

==== mrfSorter ====

mrfSorter sorts each MRF line according to its coordinates, i.e. the left-most coordinate is reported first, regardless of strand.

'''Usage''':

mrfSorter

* Inputs: Takes [[#MRF|MRF]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Example: <pre>mrfSorter < file.mrf > file.sorted.mrf</pre>

 

==== sam2mrf ====

sam2mrf converts [[#SAM|SAM]] format into [[#MRF|MRF]].

'''Usage''':

sam2mrf

* Inputs: Takes [[#SAM|SAM]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Please note that for paired-end data, sam2mrf requires the mate pairs to be on subsequent lines. You may want to sort the [[#SAM|SAM]] file first. Please note that you would need to run mrfSorter after sam2mrf to make a valid MRF file.

Example: <pre>samtools sort file.sam file.sorted; sam2mrf < file.sorted.sam | mrfSorter > file.mrf </pre>

 

=== Genome annotation tools ===

The following tools are helpful in manipulating annotation files.

<center>[[#top|Top]]</center>
==== createSpliceJunctionLibrary ====

This program is used to create a splice junction library from an annotation set. It creates all pair-wise splice junctions within a transcript.

'''Usage''':

createSpliceJunctionLibrary <file.2bit> <file.annotation> <sizeExonOverlap>

* Inputs: None
* Outputs: Reports the splice junctions in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** sizeExonOverlap - defines the number of nucleotides included from each exon
* ''Optional arguments''
** None

Example output:

>chr1|12162|12612|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGG
>chr1|12162|13220|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG
>chr1|12656|13220|65
CAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG

The identifier for each splice junction consists of '''four''' items:

1. Chromosome
2. Start position (with respect to the "+", zero-based) of the splice junction within the first exon
3. Start position (with respect to the "+", zero-based) of the splice junction within the second exon
4. Size of the exon overlap

'''Note''': Internally the program uses ''twoBitToFa'' (part of BLAT package). Thus, the executable must be in the PATH.

 

<center>[[#top|Top]]</center>

==== mergeTranscripts ====

Module to merge a set of transcripts from the same gene.

Obtain unique exons from various transcript isoforms based on:
# longest isoform
# composite model (union of the exons from the different transcript isoforms)
# intersection (intersection of the exons of the different transcript isoforms)
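The three merging strategies can be sketched on exon lists as follows. This Python example is only an illustration under assumptions: exons are zero-based, end-exclusive (start, end) pairs as in the Interval format, the "longest" isoform is taken to be the one with the largest exonic length, and the helper names <code>union</code> and <code>intersect</code> are hypothetical rather than taken from mergeTranscripts.

```python
def union(exons):
    """Composite model: merge overlapping or touching exons."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(exon) for exon in merged]


def intersect(a, b):
    """Intersection of two sorted, non-overlapping exon lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance the list whose current exon ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


isoform1 = [(100, 200), (300, 400)]
isoform2 = [(150, 250), (300, 350)]

longest = max([isoform1, isoform2],
              key=lambda exons: sum(end - start for start, end in exons))
composite = union(isoform1 + isoform2)
shared = intersect(isoform1, isoform2)
print(longest, composite, shared)
```

For more than two isoforms, the intersection can be obtained by applying <code>intersect</code> repeatedly across the isoforms of a gene.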

'''Usage''':

mergeTranscripts <knownIsoforms.txt> <file.annotation> <longestIsoform|compositeModel|intersection>

* Inputs: None
* Outputs: Reports a new annotation set of merged transcripts in [[#Interval|Interval]] format
* ''Required arguments''
** knownIsoforms.txt - file that determines which transcript isoforms belong together (see format below)
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < longestIsoform | compositeModel | intersection > - determines how transcript isoforms are selected/merged (see the three strategies listed above)
* ''Optional arguments''
** None

The file knownIsoforms.txt should have two columns (tab-delimited) and no header:

1. ID (int). Transcripts with the same id belong to the same gene.
2. Name of the transcript.

Example:

1 uc009vip.1
1 uc001aaa.2
2 uc009vis.1
2 uc001aae.2
2 uc009viu.1
2 uc009vit.1

 

<center>[[#top|Top]]</center>
==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== Gene expression analysis ===

<center>[[#top|Top]]</center>
==== mrfQuantifier ====

Module to calculate expression values (RPKM). Given a set of mapped reads in MRF and an annotation set (representing exons, transcripts, or gene models) mrfQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

mrfQuantifier <file.annotation> <singleOverlap|multipleOverlap>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
** < singleOverlap | multipleOverlap > - singleOverlap: reads that overlap with multiple annotated features are ignored; multipleOverlap: reads that overlap with multiple annotated features are counted multiple times.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure below). Therefore, the normalization is also done at the nucleotide level.

[[Image:mrfQuantifier.png|thumb|1000px|center|Determining overlaps between annotation entries and reads]]
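The nucleotide-level normalization described in the note can be written out as a short calculation. The Python sketch below is an interpretation of the description above, not the mrfQuantifier source: it assumes that read blocks and exons have already been converted to the same zero-based, end-exclusive coordinates, and the function names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of shared nucleotides between two end-exclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def expression_value(read_blocks, exons, total_mapped_nucleotides):
    """Overlapping nucleotides per kb of annotation per million mapped nucleotides."""
    covered = sum(overlap(b_start, b_end, e_start, e_end)
                  for b_start, b_end in read_blocks
                  for e_start, e_end in exons)
    annotation_length = sum(end - start for start, end in exons)
    return covered * 1e9 / (total_mapped_nucleotides * annotation_length)


exons = [(1000, 2000)]                     # a 1 kb annotation entry
read_blocks = [(980, 1030), (1500, 1550)]  # the first read overlaps only partially
value = expression_value(read_blocks, exons, total_mapped_nucleotides=2_000_000)
print(value)
```

In this example only 30 of the 50 nucleotides of the first read are counted, so 80 nucleotides in total contribute to the expression value.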

 

<center>[[#top|Top]]</center>

==== bgrQuantifier ====

Module to calculate expression values (RPKM) from a signal track in bedGraph (bgr) format. Given a signal track and an annotation set (representing exons, transcripts, or gene models) bgrQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

bgrQuantifier <file.annotation>

* Inputs: Takes [[#BedGraph|BedGraph]] file from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure in the [[#mrfQuantifier|mrfQuantifier]] section). Therefore, the normalization is also done at the nucleotide level.

 

=== Visualization tools ===

The following programs are useful for converting [[#MRF|MRF]] into data formats that can be viewed in a genome browser.

<center>[[#top|Top]]</center>
==== mrf2wig ====

Generates a signal track ([[#WIG|WIG]]) of mapped reads from an [[#MRF|MRF]] file. By default, the values in the WIG file are normalized by the total number of mapped reads per million.
Only positions with non-zero values are reported.
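The following Python sketch shows one way such a signal could be computed for a single chromosome. It is an illustration only: the function name <code>signal_track</code> is hypothetical, reads are given as lists of one-based, inclusive (start, end) blocks, and it is an assumption that every input read counts towards the normalization factor.

```python
from collections import Counter


def signal_track(reads, normalize=True):
    """Per-position read coverage; positions with zero coverage are absent."""
    counts = Counter()
    for blocks in reads:
        for start, end in blocks:
            for position in range(start, end + 1):  # inclusive end
                counts[position] += 1
    if not normalize:
        return dict(counts)
    total_reads = len(reads)
    # Scale to counts per million mapped reads.
    return {position: count * 1e6 / total_reads
            for position, count in counts.items()}


reads = [[(1, 3)], [(2, 4)]]
raw = signal_track(reads, normalize=False)
normalized = signal_track(reads)
print(sorted(raw.items()))
```

Because only covered positions are stored, the result maps naturally onto the variableStep representation in which positions with a value of zero are omitted.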

'''Usage''':

mrf2wig <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#WIG|WIG]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.wig
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized

 

<center>[[#top|Top]]</center>

==== mrf2gff ====

Generates a [[#GFF|GFF]] file of mapped splice junction reads from an [[#MRF|MRF]] file.

'''Usage''':
mrf2gff <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#GFF|GFF]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.gff
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrf2bgr ====

Module to convert [[#MRF|MRF]] to [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph]. Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph], where the counts are normalized by the total number of mapped reads per million.

'''Usage''':

mrf2bgr <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.bgr
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized.

 

=== Segmentation of mapped reads ===

<center>[[#top|Top]]</center>
==== wigSegmenter ====

Module to segment a [[#WIG|WIG]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

wigSegmenter <wigPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#WIG|WIG]] input files.
* ''Required arguments''
** wigPrefix - prefix used to generate the [[#WIG|WIG]] files using [[#mrf2wig|mrf2wig]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None
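How the three parameters interact can be sketched as follows. This Python example is one reading of the maxGap-minRun idea based on the parameter descriptions above, not the wigSegmenter implementation: positions at or above the threshold are joined when they are separated by at most <code>max_gap</code> sub-threshold positions, and only segments of at least <code>min_run</code> positions are kept. The function name <code>segment</code> and the dictionary input are assumptions.

```python
def segment(signal, threshold, max_gap, min_run):
    """Return (start, end) segments, end not included, from a position->value map."""
    hits = sorted(pos for pos, value in signal.items() if value >= threshold)
    segments = []
    start = previous = None
    for pos in hits:
        if start is None:
            start = previous = pos
        elif pos - previous - 1 <= max_gap:
            previous = pos            # gap is small enough: extend the segment
        else:
            if previous - start + 1 >= min_run:
                segments.append((start, previous + 1))
            start = previous = pos    # gap too large: open a new segment
    if start is not None and previous - start + 1 >= min_run:
        segments.append((start, previous + 1))
    return segments


signal = {10: 5, 11: 5, 12: 5, 15: 5, 16: 5, 30: 5, 40: 1}
tars = segment(signal, threshold=2, max_gap=2, min_run=5)
print(tars)
```

In this example the two-position gap at 13 and 14 is bridged, whereas the isolated position 30 is too short to be reported and position 40 falls below the threshold.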

 

<center>[[#top|Top]]</center>
==== bgrSegmenter ====

Module to segment a [[#BedGraph|BedGraph]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

bgrSegmenter <bgrPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#BedGraph|BedGraph]] input file.
* ''Required arguments''
** bgrPrefix - prefix used to generate the [[#BedGraph|BedGraph]] files using [[#mrf2bgr|mrf2bgr]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None

 

=== Annotation statistics tools ===

The following modules are useful for calculating annotation statistics given a set of mapped reads.

<center>[[#top|Top]]</center>
==== mrfAnnotationCoverage ====

Module to calculate annotation coverage. Sample a set of mapped reads and determine the fraction of transcripts (specified in file.annotation) that have at least <coverageFactor>-times uniform coverage.

'''Usage''':

mrfAnnotationCoverage <file.annotation> <numTotalReads> <numReadsToSample> <coverageFactor>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Reports the fraction of transcripts that have at least <coverageFactor>-times uniform coverage to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** numTotalReads - total number of reads in the [[#MRF|MRF]] input file
** numReadsToSample - number of reads to sample from the [[#MRF|MRF]] input file
** coverageFactor - minimum level of uniform coverage required across a transcript
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfMappingBias ====

Module to calculate mapping bias for a given annotation set. Aggregates mapped reads that overlap with transcripts (specified in file.annotation) and outputs the counts over a standardized transcript (divided into 100 equally sized bins), where 0 represents the 5' end of the transcript and 1 denotes the 3' end of the transcript. This analysis is done in a strand-specific way.
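The mapping of a position onto the standardized transcript can be sketched as below. This Python example is a simplified illustration, not the mrfMappingBias code: it uses the genomic span of the transcript (zero-based, end not included) and ignores introns, and the function name <code>bin_index</code> is hypothetical.

```python
def bin_index(position, tx_start, tx_end, strand, num_bins=100):
    """Bin of a zero-based position, counted from the 5' end of the transcript."""
    length = tx_end - tx_start
    offset = position - tx_start
    if strand == "-":
        # On the minus strand the 5' end is at the right-most coordinate.
        offset = length - 1 - offset
    return offset * num_bins // length


print(bin_index(1000, 1000, 2000, "+"),
      bin_index(1500, 1000, 2000, "+"),
      bin_index(1999, 1000, 2000, "-"))
```

Counting the reads that fall into each bin across all transcripts then gives the aggregated profile from the 5' end (bin 0) to the 3' end (bin 99).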

'''Usage''':

mrfMappingBias <file.annotation>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the number of mapped reads for each bin of the standardized transcript to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
* ''Optional arguments''
** None

 

=== MRF selection utilities ===

The following utilities are helpful to select subsets of an [[#MRF|MRF]] file. It should be noted that these utilities operate on ''existing'' MRF files.

<center>[[#top|Top]]</center>
==== mrfSampler ====

Randomly select a subset of [[#MRF|MRF]] entries.

'''Usage''':

mrfSampler <proportionOfReadsToSample>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** proportionOfReadsToSample - fraction of reads to sample (on average)
* ''Optional arguments''
** None
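Sampling a fraction of reads "on average" suggests an independent random decision for each read. The Python sketch below shows this approach under that assumption; it is not the mrfSampler source, and the function name <code>sample_mrf</code> is hypothetical.

```python
import io
import random


def sample_mrf(lines, proportion, rng):
    """Keep comment and header lines; keep each read with probability `proportion`."""
    header_seen = False
    for line in lines:
        if line.startswith("#"):
            yield line
        elif not header_seen:
            header_seen = True   # first non-comment line is the header
            yield line
        elif rng.random() < proportion:
            yield line


mrf = ("# demo\n"
       "AlignmentBlocks\n"
       "chr1:+:2001:2050:1:50\n"
       "chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50\n")
rng = random.Random(0)  # fixed seed for a reproducible illustration
subset = list(sample_mrf(io.StringIO(mrf), 0.5, rng))
print("".join(subset), end="")
```

With this scheme the number of selected reads varies from run to run, and only its expected value equals the requested proportion of the input.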

 

<center>[[#top|Top]]</center>
==== mrfSelectRegion ====

Select reads that overlap with a specified genomic region.

'''Usage''':

mrfSelectRegion <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** targetName:targetStart-targetEnd - region of interest
* ''Optional arguments''
** None
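The overlap test behind such a selection can be sketched in Python as follows. This is an assumption-laden illustration rather than the mrfSelectRegion implementation: a read is considered to overlap if any of its alignment blocks overlaps the region, both the region and the blocks are treated as one-based and inclusive, and the function names are hypothetical.

```python
def parse_region(region):
    """Split 'targetName:targetStart-targetEnd' into its three parts."""
    name, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return name, int(start), int(end)


def read_overlaps(alignment_blocks, region):
    """True if any alignment block of the read overlaps the region."""
    name, start, end = parse_region(region)
    for mate in alignment_blocks.split("|"):
        for block in mate.split(","):
            t_name, _strand, t_start, t_end = block.split(":")[:4]
            if t_name == name and int(t_start) <= end and int(t_end) >= start:
                return True
    return False


spliced = "chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50"
print(read_overlaps(spliced, "chr1:3020-3100"),
      read_overlaps(spliced, "chr1:2026-3000"))
```

Under this block-level definition, a spliced read is not selected for a region that lies entirely within its intron; whether the actual tool behaves the same way is not stated in this documentation.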

 

<center>[[#top|Top]]</center>
==== mrfSelectSpliced ====

Select reads that span a splice junction.

'''Usage''':

mrfSelectSpliced

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 

'''Note''': The purpose of mrfSelectSpliced is to extract mapped reads that align to a splice junction from an '''existing MRF file'''. It is important to note that this utility is not used to convert the output of a specific mapping program.

 

<center>[[#top|Top]]</center>
==== mrfSubsetByTargetName ====

Split up an [[#MRF|MRF]] file by chromosome.

'''Usage''':

mrfSubsetByTargetName <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs a separate [[#MRF|MRF]] file for each chromosome using the following naming convention: <prefix>_chrXXX.mrf.
* ''Required arguments''
** prefix - prefix used for the output files.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfSelectAnnotated ====

Module to select a subset of reads that overlap with a specified annotation set.

'''Usage''':

mrfSelectAnnotated <file.annotation> <include|exclude>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (one transcript per line).
** < include | exclude > - include: report reads that overlap with ''exonic'' regions of the annotation set; exclude: report reads that do not overlap with ''exonic'' regions of the annotation set
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrfRegionCount ====

Module to count the total number of reads in a specified region.

'''Usage''':

mrfRegionCount <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the total number of reads in a specified region to STDOUT.
* ''Required arguments''
** targetName:targetStart-targetEnd, specifies the region of interest
* ''Optional arguments''
** None

 

=== Auxiliary utilities ===

This section includes various data format conversion utilities.

<center>[[#top|Top]]</center>
==== bed2interval ====

Utility to convert [[#BED|BED]] format into [[#Interval|Interval]] format.

'''Usage''':

bed2interval

* Inputs: Takes data in [[#BED|BED]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2bed ====

Utility to convert [[#Interval|Interval]] format into [[#BED|BED]] format.

'''Usage''':

interval2bed

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#BED|BED]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2gff ====

Utility to convert [[#Interval|Interval]] format into [[#GFF|GFF]] format.

'''Usage''':

interval2gff <trackName>

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#GFF|GFF]] format to STDOUT
* ''Required arguments''
** trackName - track name used in the [[#GFF|GFF]] file
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== gff2interval ====

Utility to convert [[#GFF|GFF]] format into [[#Interval|Interval]] format.

'''Usage''':

gff2interval

* Inputs: Takes data in [[#GFF|GFF]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== export2fastq ====

Module to generate FASTQ sequences from an ELAND export file.

'''Usage''':

export2fastq

* Inputs: Takes an ELAND export file from STDIN
* Outputs: Reports the extracted sequences in FASTQ format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrf2sam ====

Module to convert [[#MRF|MRF]] to [[#SAM|SAM]].

'''Usage''':

mrf2sam

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#SAM|SAM]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

RSEQtools

2013-04-22T13:55:47Z

Asboner: /* sam2mrf */

<center>[http://archive.gersteinlab.org/proj/rnaseq/rseqtools '''RSEQtools Main Page''']</center>

__TOC__

== Introduction ==

The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify that person, raising significant privacy concerns. In order to address these issues we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confi-dential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools that uses this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. In addition to the anonymization afforded by this format it also facilitates the decoupling of the alignment of reads from downstream analyses.

 

== Citation ==

Lukas Habegger*, Andrea Sboner*, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein. '''RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries'''. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq643?ijkey=GSZOpzLAEOqJtJ4&keytype=ref Bioinformatics] 2010,
doi: 10.1093/bioinformatics/btq643

 

== Overview ==

The following sections provide documentation for the modules that are part of RSEQtools (http://rseqtools.gersteinlab.org/). This documentation is intended for end-users and can also be found at http://info.gersteinlab.org/RSEQtools.
'''RSEQtools is implemented in C''' and uses a general C library called BIOS.

The full '''documentation for developers''' can be found here:
* RSEQtools: http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/
* BIOS: http://archive.gersteinlab.org/proj/rnaseq/doc/bios/

 

== Data formats ==

<center>[[#top|Top]]</center>
=== Mapped Read Format (MRF) ===

The Mapped Read Format (MRF) is a flat-file format consisting of '''three''' components. It is closely associated with the software components of [http://rseqtools.gersteinlab.org RSEQtools].

1. Comment lines. Comment lines are optional and start with a '#' character.
2. Header line. The header line is required and specifies the type of each column.
3. Mapped reads. Each read (single-end or paired-end) is represented by one line.

'''Required''' column:

* AlignmentBlocks, each alignment block must contain the following attributes: TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd

''Optional'' columns:

* Sequence
* QualityScores
* QueryId

Example file:

<pre>
# Comments
# Required field: Blocks [TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd]
# Optional fields: Sequence,QualityScores,QueryId
AlignmentBlocks
chr1:+:2001:2050:1:50
chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50
chr2:-:3001:3051:1:51|chr11:+:4001:4051:1:51
chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51
contigA:+:5001:5200:1:200,contigB:-:1200:1400:200:400
</pre>

Notes:

* Paired-end reads are separated by ‘|’
* Alignment blocks are separated by ‘,’
* Features of a block are separated by ‘:’
* Columns are tab-delimited
* Columns can be arranged in any order
* Coordinates are '''one-based''' and '''closed (inclusive)'''
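
The separators listed above can be illustrated with a minimal parsing sketch. It is not part of RSEQtools (which is implemented in C); the names <code>Block</code> and <code>parse_alignment_blocks</code> are hypothetical and only follow the field order given for the AlignmentBlocks column.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    target_name: str
    strand: str
    target_start: int  # one-based, inclusive
    target_end: int    # one-based, inclusive
    query_start: int
    query_end: int


def parse_alignment_blocks(field: str) -> List[List[Block]]:
    """Parse one AlignmentBlocks field into reads, each a list of blocks."""
    reads = []
    for read in field.split("|"):        # '|' separates the two ends of a pair
        blocks = []
        for block in read.split(","):    # ',' separates alignment blocks
            name, strand, t_start, t_end, q_start, q_end = block.split(":")
            blocks.append(Block(name, strand, int(t_start), int(t_end),
                                int(q_start), int(q_end)))
        reads.append(blocks)
    return reads
```

A single-end read yields one inner list; a paired-end read yields two, one per end.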
 

'''Use MRF for confidential data'''

It is straightforward to use MRF to separate the confidential information, i.e. the sequences, from the alignment data. The MRF file can be split into two files: one file includes ''AlignmentBlocks'' and ''QueryId'', whereas the second file contains ''Sequence'' and ''QueryId''. From a practical viewpoint it is also easy to create these two files.

Assuming we have the columns AlignmentBlocks, Sequence, and QueryID as column 1, 2, and 3, respectively:
<pre>
$ cut -f1,3 file.mrf > alignments.mrf
$ cut -f2,3 file.mrf > sequences.mrf
</pre>

* alignments.mrf would contain the alignment data and the query ID; it can be freely shared since it does not include confidential information;
* sequences.mrf would contain the sequence data, which could potentially be used to identify an individual and thus may be subject to more stringent rules.
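
Because MRF columns can be arranged in any order, fixed column numbers may not match every file. The following is a minimal sketch of the same split done by column name. The function <code>split_mrf</code> is hypothetical, and the sketch assumes that the first non-comment line is the header and that the column names are spelled as in the list of optional columns.

```python
import io
from typing import List


def split_mrf(mrf_text: str, keep: List[str]) -> str:
    """Return MRF text that keeps only the columns named in `keep`."""
    out = io.StringIO()
    indices = None
    for line in mrf_text.splitlines():
        if line.startswith("#"):         # comment lines pass through unchanged
            out.write(line + "\n")
            continue
        fields = line.split("\t")
        if indices is None:              # first non-comment line is the header
            indices = [fields.index(name) for name in keep]
        out.write("\t".join(fields[i] for i in indices) + "\n")
    return out.getvalue()
```

Calling it once with <code>["AlignmentBlocks", "QueryId"]</code> and once with <code>["Sequence", "QueryId"]</code> gives the two files described above.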
 
<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of '''eight''' tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of [http://homes.gersteinlab.org/people/lh372/SOFT/bios/index.html BIOS]. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

Example file:

<pre>
uc001aaw.1 chr1 + 357521 358460 1 357521 358460
uc001aax.1 chr1 + 410068 411702 3 410068,410854,411258 410159,411121,411702
uc001aay.1 chr1 - 552622 554252 3 552622,553203,554161 553066,553466,554252
uc001aaz.1 chr1 + 556324 557910 1 556324 557910
uc001aba.1 chr1 + 558011 558705 1 558011 558705
</pre>

In this example the intervals represent transcripts, while the sub-intervals denote exons.

Note: the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
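
As an illustration of the eight columns, here is a minimal sketch that reads one Interval line. It is independent of the BIOS intervalFind module; <code>Interval</code>, <code>parse_interval</code>, and <code>exonic_length</code> are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    strand: str
    start: int                              # zero-based
    end: int                                # not included
    sub_intervals: List[Tuple[int, int]]    # (start, end) pairs, e.g. exons


def parse_interval(line: str) -> Interval:
    """Parse one tab-delimited Interval line."""
    name, chrom, strand, start, end, count, starts, ends = line.rstrip("\n").split("\t")
    sub_starts = [int(s) for s in starts.split(",") if s]
    sub_ends = [int(e) for e in ends.split(",") if e]
    if not (len(sub_starts) == len(sub_ends) == int(count)):
        raise ValueError("sub-interval count does not match columns 7 and 8")
    return Interval(name, chrom, strand, int(start), int(end),
                    list(zip(sub_starts, sub_ends)))


def exonic_length(interval: Interval) -> int:
    # With zero-based starts and excluded ends, each length is end - start.
    return sum(end - start for start, end in interval.sub_intervals)
```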

 

<center>[[#top|Top]]</center>

=== BED ===

The BED format is used to represent contiguous genomic regions. It consists of '''three''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End

Example file:

<pre>
chr1 1000 5000
chr3 500 600
chrX 4000 4250
</pre>

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC]

Note: the coordinates in the BED format are '''zero-based''' and the '''end coordinate is not included'''.
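
Since MRF coordinates are one-based and closed while BED coordinates are zero-based with an excluded end, converting between the two only shifts the start. A minimal sketch, with hypothetical helper names:

```python
def mrf_block_to_bed(target_name: str, target_start: int, target_end: int):
    """One-based, inclusive (MRF) -> zero-based, end-exclusive (BED)."""
    return (target_name, target_start - 1, target_end)


def bed_to_one_based(chrom: str, start: int, end: int):
    """Zero-based, end-exclusive (BED) -> one-based, inclusive."""
    return (chrom, start + 1, end)
```

In both systems the region covers the same number of nucleotides: <code>end - start</code> in BED and <code>end - start + 1</code> in MRF.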

 

<center>[[#top|Top]]</center>

=== BedGraph ===

The BedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state. It consists of '''four''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End
4. Value

Example file:

<pre>
chr1 1000 5000 12.3
chr3 500 600 3
chrX 4000 4250 54
</pre>

Full documentation can be found at [ftp://hgdownload.cse.ucsc.edu/apache/htdocs-rr/goldenPath/help/bedgraph.html UCSC]

Note: the coordinates in the BedGraph format are '''zero-based''' and the '''end coordinate is not included'''.

 

<center>[[#top|Top]]</center>

=== WIG ===

The WIG format is used to represent dense and continuous genomic data. There are two options for formatting wiggle data: '''variableStep''' and '''fixedStep'''.

In the context of [http://rseqtools.gersteinlab.org/ RSEQtools], the variable step formatting is used and only positions with '''non-zero values''' are represented.

Example file:

<pre>
track type=wiggle_0 name="test_chr22"
variableStep chrom=chr22 span=1
17535712 1.67
17535713 1.67
17535714 1.67
17535715 1.67
17535716 1.67
</pre>

Full documentation can be found at [http://genome.ucsc.edu/goldenPath/help/wiggle.html UCSC]

Note: the coordinates in the WIG format are '''one-based'''.

 

<center>[[#top|Top]]</center>

=== GFF ===

The GFF format is used to describe genes and other features. It consists of '''nine''' tab-delimited columns.

1. Name
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Group

Example file:

<pre>
browser hide all
track name="chr11" visibility=2
chr11 MRF feature 46772115 46772161 . - . TG5
chr11 MRF feature 46772668 46772695 . - . TG5
chr11 MRF feature 118521207 118521252 . + . TG21
chr11 MRF feature 118526315 118526343 . + . TG21
</pre>

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format3 UCSC]

Note: the coordinates in the GFF format are '''one-based''' and the '''end coordinate is included'''.

 

<center>[[#top|Top]]</center>
=== PSL ===

The PSL format represents alignments from the BLAT alignment program.

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format2 UCSC]

 

<center>[[#top|Top]]</center>
=== SAM ===

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Full documentation can be found at [http://samtools.sourceforge.net/ SAMtools]

 

 

== List of programs ==

This is the documentation for the end-users. The full documentation for developers can be found [http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/files.html here].

=== Format conversion utilities ===

The following programs convert the output from various alignment programs into [[#MRF|MRF]].

<center>[[#top|Top]]</center>
==== bowtie2mrf ====

bowtie2mrf converts read alignments from Bowtie into [[#MRF|MRF]].

'''Usage''':

bowtie2mrf <genomic|junctions|paired> [-sequence] [-qualityScores] [-IDs]

* Inputs: Takes [http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output Bowtie output] from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** genomic - convert single-end reads that were aligned against a genomic reference sequence using Bowtie
** junctions - convert single-end reads that were aligned against a splice junction library (generated by '''createSpliceJunctionLibrary''') using Bowtie
** paired - convert paired-end reads that were aligned using Bowtie
* ''Optional arguments''
** sequence - include the read sequence in the [[#MRF|MRF]] output
** qualityScores - include the quality scores of the read in the [[#MRF|MRF]] output
** IDs - include the read IDs in the [[#MRF|MRF]] output
 
'''Note''': bowtie2mrf assumes that bowtie was run using the default option for the -B parameter
-B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary. Otherwise, bowtie2mrf will not be able to convert the splice junction coordinates correctly.

 
<center>[[#top|Top]]</center>

==== psl2mrf ====

psl2mrf converts read alignments from BLAT into [[#MRF|MRF]].

'''Usage''':

psl2mrf

* Inputs: Takes BLAT alignments in [[#PSL|PSL]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== singleExport2mrf ====

singleExport2mrf converts single-end read alignments from ELAND (export file) into [[#MRF|MRF]].

'''Usage''':

singleExport2mrf

* Inputs: Takes ELAND single-end alignments in export format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary and that its file name included 'splice' or 'junction'. Otherwise, singleExport2mrf will not be able to convert the splice junction coordinates correctly.
 
The output includes sequences and quality scores. If one wants the alignment only:
singleExport2mrf < file.export.txt | cut -f1 > file.mrf
 

<center>[[#top|Top]]</center>

==== sam2mrf ====

sam2mrf converts [[#SAM|SAM]] format into [[#MRF|MRF]].

'''Usage''':

sam2mrf

* Inputs: Takes [[#SAM|SAM]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

For paired-end data, sam2mrf requires the mate pairs to be on consecutive lines, so you may want to sort the [[#SAM|SAM]] file by read name first. You also need to run mrfSorter after sam2mrf to obtain a valid MRF file.

Example (assuming samtools 1.x, where <code>-n</code> sorts by read name): <pre>samtools sort -n -O sam -o file.sorted.sam file.sam; sam2mrf < file.sorted.sam | mrfSorter > file.mrf</pre>

 

=== Genome annotation tools ===

The following tools are helpful in manipulating annotation files.

<center>[[#top|Top]]</center>
==== createSpliceJunctionLibrary ====

This program is used to create a splice junction library from an annotation set. It creates all pair-wise splice junctions within a transcript.

'''Usage''':

createSpliceJunctionLibrary <file.2bit> <file.annotation> <sizeExonOverlap>

* Inputs: None
* Outputs: Reports the splice junctions in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** sizeExonOverlap - defines the number of nucleotides included from each exon
* ''Optional arguments''
** None

Example output:

>chr1|12162|12612|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGG
>chr1|12162|13220|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG
>chr1|12656|13220|65
CAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG

The identifier for each splice junction consists of '''four''' items:

1. Chromosome
2. Start position (with respect to the "+" strand, zero-based) of the splice junction within the first exon
3. Start position (with respect to the "+" strand, zero-based) of the splice junction within the second exon
4. Size of the exon overlap

'''Note''': Internally the program uses ''twoBitToFa'' (part of BLAT package). Thus, the executable must be in the PATH.
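
The pair-wise enumeration and the four-part identifier can be sketched as follows. This is not the program's source code: <code>junction_ids</code> is a hypothetical name, the exon coordinates in the usage note are chosen for illustration so that the identifiers match the example output above, and exons shorter than the overlap are not handled.

```python
from itertools import combinations
from typing import List, Tuple


def junction_ids(chrom: str, exons: List[Tuple[int, int]], overlap: int) -> List[str]:
    """Build identifiers for all pair-wise junctions within one transcript.

    exons: (start, end) pairs, zero-based with excluded end, sorted by start.
    """
    ids = []
    for (_, first_end), (second_start, _) in combinations(exons, 2):
        # The junction takes the last `overlap` nucleotides of the first exon
        # and the first `overlap` nucleotides of the second exon.
        ids.append(f"{chrom}|{first_end - overlap}|{second_start}|{overlap}")
    return ids
```

For three exons at (11873, 12227), (12612, 12721), and (13220, 14409) with an overlap of 65, this yields three identifiers, one per exon pair.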

 

<center>[[#top|Top]]</center>

==== mergeTranscripts ====

Module to merge a set of transcripts from the same gene.

Obtain unique exons from various transcript isoforms based on:
# longest isoform
# composite model (union of the exons from the different transcript isoforms)
# intersection (intersection of the exons of the different transcript isoforms)

'''Usage''':

mergeTranscripts <knownIsoforms.txt> <file.annotation> <longestIsoform|compositeModel|intersection>

* Inputs: None
* Outputs: Reports a new annotation set of merged transcripts in [[#Interval|Interval]] format
* ''Required arguments''
** knownIsoforms.txt - file that determines which transcript isoforms belong together (see format below)
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < longestIsoform | compositeModel | intersection > - determines how transcript isoforms are selected/merged (see the three strategies listed above)
* ''Optional arguments''
** None

The file knownIsoforms.txt should have two columns (tab-delimited) and no header:

1. ID (int). Transcripts with the same id belong to the same gene.
2. Name of the transcript.

Example:

<pre>
1 uc009vip.1
1 uc001aaa.2
2 uc009vis.1
2 uc001aae.2
2 uc009viu.1
2 uc009vit.1
</pre>
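
The three merging strategies can be sketched on plain lists of exons. This is an illustrative reading of the descriptions above, not the mergeTranscripts source. All function names are hypothetical, exons are (start, end) pairs with an excluded end, and "longest" is assumed to mean the largest total exonic length.

```python
from typing import List, Tuple

Exons = List[Tuple[int, int]]


def longest_isoform(isoforms: List[Exons]) -> Exons:
    # Assumption: length is measured as the sum of exon lengths.
    return max(isoforms, key=lambda exons: sum(e - s for s, e in exons))


def composite_model(isoforms: List[Exons]) -> Exons:
    """Union of the exons of all isoforms."""
    merged: Exons = []
    for start, end in sorted(e for exons in isoforms for e in exons):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def _intersect_two(a: Exons, b: Exons) -> Exons:
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        if a[i][1] < b[j][1]:            # advance the list whose exon ends first
            i += 1
        else:
            j += 1
    return result


def intersection(isoforms: List[Exons]) -> Exons:
    """Regions that are exonic in every isoform."""
    result = sorted(isoforms[0])
    for exons in isoforms[1:]:
        result = _intersect_two(result, sorted(exons))
    return result
```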

 

<center>[[#top|Top]]</center>
==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== Gene expression analysis ===

<center>[[#top|Top]]</center>
==== mrfQuantifier ====

Module to calculate expression values (RPKM). Given a set of mapped reads in MRF and an annotation set (representing exons, transcripts, or gene models) mrfQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

mrfQuantifier <file.annotation> <singleOverlap|multipleOverlap>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
** < singleOverlap | multipleOverlap > - singleOverlap: reads that overlap with multiple annotated features are ignored; multipleOverlap: reads that overlap with multiple annotated features are counted multiple times.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure below). Therefore, the normalization is also done at the nucleotide level.

[[Image:mrfQuantifier.png|thumb|1000px|center|Determining overlaps between annotation entries and reads]]
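
The nucleotide-level counting and normalization described above can be written out as a small sketch. The function names are hypothetical, and the formula is a direct reading of the description (overlapping nucleotides, per million mapped nucleotides, per kb of annotation length) rather than a copy of the mrfQuantifier source. Both ranges are assumed to have been converted to one-based, inclusive coordinates beforehand.

```python
def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of positions shared by two one-based, inclusive ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def expression_value(overlapping_nt: int, total_mapped_nt: int,
                     annotation_length: int) -> float:
    """Overlapping nucleotides per million mapped nucleotides per kb."""
    per_million = total_mapped_nt / 1e6
    per_kb = annotation_length / 1e3
    return overlapping_nt / per_million / per_kb
```

For example, a read mapped to 2001-2050 and an exon at 2026-2100 share 25 nucleotides, so only those 25 contribute to the count for that exon.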

 

<center>[[#top|Top]]</center>

==== bgrQuantifier ====

Module to calculate expression values (RPKM) from a signal track in bedGraph (bgr) format. Given a signal track and an annotation set (representing exons, transcripts, or gene models) bgrQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

bgrQuantifier <file.annotation>

* Inputs: Takes [[#BedGraph|BedGraph]] file from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure in the [[#mrfQuantifier|mrfQuantifier]] section). Therefore, the normalization is also done at the nucleotide level.

 

=== Visualization tools ===

The following programs are useful for converting [[#MRF|MRF]] into data formats that can be viewed in a genome browser.

<center>[[#top|Top]]</center>
==== mrf2wig ====

Generates a signal track ([[#WIG|WIG]]) of mapped reads from an [[#MRF|MRF]] file. By default, the values in the WIG file are normalized by the total number of mapped reads per million.
Only positions with non-zero values are reported.

'''Usage''':

mrf2wig <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#WIG|WIG]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.wig
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized
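
A minimal sketch of how such a signal track could be computed for one chromosome. It is an illustration under stated assumptions, not the mrf2wig implementation: the function names are hypothetical, blocks are given as one-based inclusive (start, end) pairs, and the positions are written out unchanged.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def signal_track(blocks: Iterable[Tuple[int, int]], num_mapped_reads: int,
                 normalize: bool = True) -> Dict[int, float]:
    """Per-position coverage; positions without coverage are not stored."""
    coverage: Counter = Counter()
    for start, end in blocks:
        for pos in range(start, end + 1):
            coverage[pos] += 1
    # Normalization: counts per million mapped reads.
    factor = num_mapped_reads / 1e6 if normalize else 1.0
    return {pos: count / factor for pos, count in sorted(coverage.items())}


def to_wig(chrom: str, track: Dict[int, float]) -> str:
    """Format the track as variableStep lines with span=1."""
    lines = [f"variableStep chrom={chrom} span=1"]
    lines += [f"{pos} {value:.2f}" for pos, value in track.items()]
    return "\n".join(lines)
```

With 600,000 mapped reads, a position covered by a single read gets the value 1 / 0.6, which is written as 1.67.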

 

<center>[[#top|Top]]</center>

==== mrf2gff ====

Generates a [[#GFF|GFF]] file of mapped splice junction reads from an [[#MRF|MRF]] file.

'''Usage''':

mrf2gff <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#GFF|GFF]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.gff
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrf2bgr ====

Module to convert [[#MRF|MRF]] to [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph]. Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph], where the counts are normalized by the total number of mapped reads per million.

'''Usage''':

mrf2bgr <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.bgr
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized.

 

=== Segmentation of mapped reads ===

<center>[[#top|Top]]</center>
==== wigSegmenter ====

Module to segment a [[#WIG|WIG]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

wigSegmenter <wigPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#WIG|WIG]] input files.
* ''Required arguments''
** wigPrefix - prefix used to generate the [[#WIG|WIG]] files using [[#mrf2wig|mrf2wig]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None
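
The roles of threshold, maxGap, and minRun can be illustrated with a minimal sketch of a maxGap-minRun segmentation. It is not the wigSegmenter source: <code>segment</code> is a hypothetical name, the signal is a sparse mapping from position to value, and a position is assumed to qualify when its value is at least the threshold.

```python
from typing import Dict, List, Tuple


def segment(signal: Dict[int, float], threshold: float,
            max_gap: int, min_run: int) -> List[Tuple[int, int]]:
    """Return segments as (start, end) pairs, both positions included."""
    segments = []
    start = prev = None
    for pos in sorted(p for p, v in signal.items() if v >= threshold):
        if start is None:
            start = prev = pos
        elif pos - prev - 1 <= max_gap:   # short sub-threshold gap: extend
            prev = pos
        else:                             # gap too long: close the segment
            if prev - start + 1 >= min_run:
                segments.append((start, prev))
            start = prev = pos
    if start is not None and prev - start + 1 >= min_run:
        segments.append((start, prev))
    return segments
```

Segments shorter than minRun are dropped, and gaps longer than maxGap split the signal into separate TARs.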

 

<center>[[#top|Top]]</center>
==== bgrSegmenter ====

Module to segment a [[#BedGraph|BedGraph]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

bgrSegmenter <bgrPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#BedGraph|BedGraph]] input file.
* ''Required arguments''
** bgrPrefix - prefix used to generate the [[#BedGraph|BedGraph]] files using [[#mrf2bgr|mrf2bgr]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None

 

=== Annotation statistics tools ===

The following modules are useful for calculating annotation statistics given a set of mapped reads.

<center>[[#top|Top]]</center>
==== mrfAnnotationCoverage ====

Module to calculate annotation coverage. Sample a set of mapped reads and determine the fraction of transcripts (specified in file.annotation) that have at least <coverageFactor>-times uniform coverage.

'''Usage''':

mrfAnnotationCoverage <file.annotation> <numTotalReads> <numReadsToSample> <coverageFactor>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Reports the fraction of transcripts that have at least <coverageFactor>-times uniform coverage to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** numTotalReads - total number of reads in the [[#MRF|MRF]] input file
** numReadsToSample - number of reads to sample from the [[#MRF|MRF]] input file
** coverageFactor - minimum level of uniform coverage required across a transcript
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfMappingBias ====

Module to calculate mapping bias for a given annotation set. Aggregates mapped reads that overlap with transcripts (specified in file.annotation) and outputs the counts over a standardized transcript (divided into 100 equally sized bins), where 0 represents the 5' end of the transcript and 1 denotes the 3' end. This analysis is done in a strand-specific way.

'''Usage''':

mrfMappingBias <file.annotation>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the number of mapped reads for each bin of the standardized transcript to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
* ''Optional arguments''
** None
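
The idea of a standardized transcript can be sketched as a mapping from a position to one of 100 bins. This is a simplified illustration, not the mrfMappingBias source: <code>bin_index</code> is a hypothetical name, and the transcript is treated as a single contiguous region (introns are ignored).

```python
def bin_index(position: int, transcript_start: int, transcript_end: int,
              strand: str, num_bins: int = 100) -> int:
    """Map a position to a bin; bin 0 is the 5' end, the last bin the 3' end.

    position and transcript_start are zero-based; transcript_end is excluded.
    """
    length = transcript_end - transcript_start
    offset = position - transcript_start
    if strand == "-":                    # 5' end is at the higher coordinate
        offset = length - 1 - offset
    return min(offset * num_bins // length, num_bins - 1)
```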

 

=== MRF selection utilities ===

The following utilities are helpful to select subsets of an [[#MRF|MRF]] file. It should be noted that these utilities operate on ''existing'' MRF files.

<center>[[#top|Top]]</center>
==== mrfSampler ====

Randomly select a subset of [[#MRF|MRF]] entries.

'''Usage''':

mrfSampler <proportionOfReadsToSample>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** proportionOfReadsToSample - fraction of reads to sample (on average)
* ''Optional arguments''
** None
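
Sampling a fraction "on average" can be done by keeping each read independently with the given probability. A minimal sketch, assuming that comment lines and the header line are always passed through; <code>sample_mrf</code> and the <code>seed</code> parameter are hypothetical and not options of mrfSampler.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_mrf(lines: Iterable[str], proportion: float,
               seed: Optional[int] = None) -> Iterator[str]:
    """Yield comments and the header, plus each read with probability `proportion`."""
    rng = random.Random(seed)
    header_seen = False
    for line in lines:
        if line.startswith("#"):
            yield line                   # comment line
        elif not header_seen:
            header_seen = True
            yield line                   # header line
        elif rng.random() < proportion:
            yield line                   # sampled read
```

Because each read is kept independently, the number of reads in the output varies from run to run around the requested fraction.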

 

<center>[[#top|Top]]</center>
==== mrfSelectRegion ====

Select reads that overlap with a specified genomic region.

'''Usage''':

mrfSelectRegion <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** targetName:targetStart-targetEnd - region of interest
* ''Optional arguments''
** None
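
A minimal sketch of the region test for a single alignment block. The function names are hypothetical, and the sketch assumes that the region uses the same one-based, inclusive coordinates as MRF.

```python
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'targetName:targetStart-targetEnd' into its three parts."""
    target_name, coords = region.rsplit(":", 1)
    start, end = coords.split("-")
    return target_name, int(start), int(end)


def block_overlaps(block: str, region: str) -> bool:
    """True if an MRF alignment block shares at least one position with the region."""
    name, start, end = parse_region(region)
    b_name, _strand, b_start, b_end = block.split(":")[:4]
    return b_name == name and int(b_start) <= end and int(b_end) >= start
```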

 

<center>[[#top|Top]]</center>
==== mrfSelectSpliced ====

Select reads that span a splice junction.

'''Usage''':

mrfSelectSpliced

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 

'''Note''': The purpose of mrfSelectSpliced is to extract mapped reads that align to a splice junction from an '''existing MRF file'''. It is important to note that this utility is not used to convert the output of a specific mapping program.

 

<center>[[#top|Top]]</center>
==== mrfSubsetByTargetName ====

Split up an [[#MRF|MRF]] file by chromosome.

'''Usage''':

mrfSubsetByTargetName <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs a separate [[#MRF|MRF]] file for each chromosome using the following naming convention: <prefix>_chrXXX.mrf.
* ''Required arguments''
** prefix - prefix used for the output files.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfSelectAnnotated ====

Module to select a subset of reads that overlap with a specified annotation set.

'''Usage''':

mrfSelectAnnotated <file.annotation> <include|exclude>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (one transcript per line).
** < include | exclude > - include: report reads that overlap with ''exonic'' regions of the annotation set; exclude: report reads that do not overlap with ''exonic'' regions of the annotation set
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrfRegionCount ====

Module to count the total number of reads in a specified region.

'''Usage''':

mrfRegionCount <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the total number of reads in a specified region to STDOUT.
* ''Required arguments''
** targetName:targetStart-targetEnd - region of interest
* ''Optional arguments''
** None

 

=== Auxiliary utilities ===

This section includes various data format conversion utilities.

<center>[[#top|Top]]</center>
==== bed2interval ====

Utility to convert [[#BED|BED]] format into [[#Interval|Interval]] format.

'''Usage''':

bed2interval

* Inputs: Takes data in [[#BED|BED]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2bed ====

Utility to convert [[#Interval|Interval]] format into [[#BED|BED]] format.

'''Usage''':

interval2bed

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#BED|BED]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2gff ====

Utility to convert [[#Interval|Interval]] format into [[#GFF|GFF]] format.

'''Usage''':

interval2gff <trackName>

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#GFF|GFF]] format to STDOUT
* ''Required arguments''
** trackName - track name used in the [[#GFF|GFF]] file
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== gff2interval ====

Utility to convert [[#GFF|GFF]] format into [[#Interval|Interval]] format.

'''Usage''':

gff2interval

* Inputs: Takes data in [[#GFF|GFF]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== export2fastq ====

Module to generate FASTQ sequences from an ELAND export file.

'''Usage''':

export2fastq

* Inputs: Takes an ELAND export file from STDIN
* Outputs: Reports the extracted sequences in FASTQ format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrf2sam ====

Module to convert [[#MRF|MRF]] to [[#SAM|SAM]].

'''Usage''':

mrf2sam

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#SAM|SAM]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

FusionSeq Papers

2012-03-08T19:46:39Z

Asboner: /* 2011 */

{{FusionSeqHeader}}

Here is a list of papers citing:
* Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. '''FusionSeq: a modular framework for finding gene fusions by analyzing Paired-End RNA-Sequencing data.''' ''Genome Biol'' 21 Oct. 2010; '''11''':R104 doi:[http://dx.doi.org/10.1186/gb-2010-11-10-r104 10.1186/gb-2010-11-10-r104]

=News=
* Nov 2010: [http://www.genomeweb.com/sequencing/transcriptome-sequencing-25-prostate-cancer-tumors-ids-novel-gene-fusions GenomeWeb -- InSequence]
* July/Aug 2010: [http://www.genomeweb.com/arrays/moving-away-microarrays?page=show GenomeWeb -- GenomeTechnology]

=Scientific Papers=
==2012==
* Pierron, Gaëlle, Franck Tirode, Carlo Lucchesi, Stéphanie Reynaud, Stelly Ballet, Sarah Cohen-Gogo, Virginie Perrin, Jean-Michel Coindre, and Olivier Delattre. '''A New Subtype of Bone Sarcoma Defined by BCOR-CCNB3 Gene Fusion.''' [http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.1107.html Nature Genetics] (March 4, 2012).

* Barbieri, Christopher E, Francesca Demichelis, and Mark A Rubin. '''Molecular Genetics of Prostate Cancer: Emerging Appreciation of Genetic Complexity.''' Histopathology 60, no. 1 (January 1, 2012): 187–198.

* Chng, Kern Rei, Shin Chet Chuah, and Edwin Cheung eds. '''Stem Cells and Human Diseases.''' 175–196 [http://www.springerlink.com/content/g101v26672v450uk/abstract/ Springer Netherlands 2012].

==2011==
* Stein, Lincoln D. '''An Introduction to the Informatics of ‘Next‐Generation’ Sequencing.''' In [http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1101s36/abstract Current Protocols in Bioinformatics]. John Wiley & Sons, Inc.

* Abate, F., A. Acquaviva, E. Ficarra, G. Paciello, E. Macii, A. Ferrarini, M. Delledonne, S. Soverini, and G. Martinelli. '''A Novel Framework for Chimeric Transcript Detection Based on Accurate Gene Fusion Model.''' 34–41. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6112352&tag=1 IEEE, 2011]

* Salzman, Julia, Robert J. Marinelli, Peter L. Wang, Ann E. Green, Julie S. Nielsen, Brad H. Nelson, Charles W. Drescher, and Patrick O. Brown. '''ESRRA-C11orf20 Is a Recurrent Gene Fusion in Serous Ovarian Carcinoma.''' PLoS Biol 9, no. 9 (2011): e1001156.

* Asmann, Yan W, Asif Hossain, Brian M Necela, Sumit Middha, Krishna R Kalari, Zhifu Sun, High-Seng Chai, et al. '''A Novel Bioinformatics Pipeline for Identification and Characterization of Fusion Transcripts in Breast Cancer and Normal Cell Lines.''' Nucleic Acids Research 39, no. 15 (August 1, 2011): e100–e100.

* Chen, William L., Burton A. Leland, Joseph L. Durant, David L. Grier, Bradley D. Christie, James G. Nourse, and Keith T. Taylor. '''Self-Contained Sequence Representation: Bridging the Gap Between Bioinformatics and Cheminformatics.''' J. Chem. Inf. Model. 51, no. 9 (2011): 2186–2208.

* Kim, Daehwan, and Steven L Salzberg. '''TopHat-Fusion: An Algorithm for Discovery of Novel Fusion Transcripts.''' Genome Biology 12 (2011): R72.

* McPherson, Andrew, Fereydoun Hormozdiari, Abdalnasser Zayed, Ryan Giuliany, Gavin Ha, Mark G. F. Sun, Malachi Griffith, et al. '''deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data.''' PLoS Comput Biol 7, no. 5 (May 19, 2011): e1001138.

* Ge, Huanying, Kejun Liu, Todd Juan, Fang Fang, Matthew Newman, and Wolfgang Hoeck. '''FusionMap: Detecting Fusion Genes from Next-generation Sequencing Data at Base-pair Resolution.''' [http://bioinformatics.oxfordjournals.org/content/early/2011/05/18/bioinformatics.btr310.abstract Bioinformatics] (May 18, 2011).

* Li, Yang, Jeremy Chien, David I. Smith, and Jian Ma. '''FusionHunter: Identifying Fusion Transcripts in Cancer Using Paired-end RNA-seq.''' [http://bioinformatics.oxfordjournals.org/content/early/2011/05/05/bioinformatics.btr265.abstract Bioinformatics] (May 5, 2011).

* McPherson, Andrew, Chunxiao Wu, Iman Hajirasouliha, Fereydoun Hormozdiari, Faraz Hach, Anna Lapuk, Stanislav Volik, Sohrab Shah, Colin Collins, and S. Cenk Sahinalp. '''Comrad: a novel algorithmic framework for the integrated analysis of RNA-Seq and WGSS data.''' [http://bioinformatics.oxfordjournals.org/content/early/2011/04/09/bioinformatics.btr184.abstract. Bioinformatics] (April 9, 2011).

* Nacu, Serban, Wenlin Yuan, Zhengyan Kan, Deepali Bhatt, Celina Rivers, Jeremy Stinson, Brock Peters, et al. '''Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples.''' [http://www.biomedcentral.com/1755-8794/4/11 BMC Medical Genomics] 4, no. 1 (2011): 11.

* Pflueger D, Terry S, Sboner A, Habegger L, Esgueva R, Lin P, Svensson MA, Kitabayashi N, Moss BJ, MacDonald TY, Cao X, Barrette T, Tewari AK, Chee MS, Chinnaiyan AM, Rickman DS, Demichelis F, Gerstein MB, Rubin MA. '''Discovery of non-ETS gene fusions in human prostate cancer using next-generation RNA sequencing''' [http://www.genome.org/cgi/doi/10.1101/gr.110684.110 Genome Res] 2011; Vol. 21:56-67; Published in Advance October 29, 2010. doi:[http://www.genome.org/cgi/doi/10.1101/gr.110684.110 10.1101/gr.110684.110]

FusionSeq Papers

2012-03-08T19:41:48Z

Asboner: /* Scientific Papers */

{{FusionSeqHeader}}

Here is a list of papers citing:
* Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. '''FusionSeq: a modular framework for finding gene fusions by analyzing Paired-End RNA-Sequencing data.''' ''Genome Biol'' 21 Oct. 2010; '''11''':R104 doi:[http://dx.doi.org/10.1186/gb-2010-11-10-r104 10.1186/gb-2010-11-10-r104]

=News=
* Nov 2010: [http://www.genomeweb.com/sequencing/transcriptome-sequencing-25-prostate-cancer-tumors-ids-novel-gene-fusions GenomeWeb -- InSequence]
* July/Aug 2010: [http://www.genomeweb.com/arrays/moving-away-microarrays?page=show GenomeWeb -- GenomeTechnology]

=Scientific Papers=
==2012==
* Pierron, Gaëlle, Franck Tirode, Carlo Lucchesi, Stéphanie Reynaud, Stelly Ballet, Sarah Cohen-Gogo, Virginie Perrin, Jean-Michel Coindre, and Olivier Delattre. '''A New Subtype of Bone Sarcoma Defined by BCOR-CCNB3 Gene Fusion.''' [http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.1107.html Nature Genetics] (March 4, 2012).

* Barbieri, Christopher E, Francesca Demichelis, and Mark A Rubin. '''Molecular Genetics of Prostate Cancer: Emerging Appreciation of Genetic Complexity.''' Histopathology 60, no. 1 (January 1, 2012): 187–198.

* Chng, Kern Rei, Shin Chet Chuah, and Edwin Cheung eds. '''Stem Cells and Human Diseases.''' 175–196 [http://www.springerlink.com/content/g101v26672v450uk/abstract/ Springer Netherlands 2012].

==2011==
* Stein, Lincoln D. '''An Introduction to the Informatics of ‘Next‐Generation’ Sequencing.''' In [http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1101s36/abstract Current Protocols in Bioinformatics]. John Wiley & Sons, Inc.

* Abate, F., A. Acquaviva, E. Ficarra, G. Paciello, E. Macii, A. Ferrarini, M. Delledonne, S. Soverini, and G. Martinelli. '''A Novel Framework for Chimeric Transcript Detection Based on Accurate Gene Fusion Model.''' 34–41. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6112352&tag=1 IEEE, 2011]

* Asmann, Yan W, Asif Hossain, Brian M Necela, Sumit Middha, Krishna R Kalari, Zhifu Sun, High-Seng Chai, et al. '''A Novel Bioinformatics Pipeline for Identification and Characterization of Fusion Transcripts in Breast Cancer and Normal Cell Lines.''' Nucleic Acids Research 39, no. 15 (August 1, 2011): e100–e100.

* Kim, Daehwan, and Steven L Salzberg. '''TopHat-Fusion: An Algorithm for Discovery of Novel Fusion Transcripts.''' Genome Biology 12 (2011): R72.

* McPherson, Andrew, Fereydoun Hormozdiari, Abdalnasser Zayed, Ryan Giuliany, Gavin Ha, Mark G. F. Sun, Malachi Griffith, et al. '''deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data.''' PLoS Comput Biol 7, no. 5 (May 19, 2011): e1001138.

* Ge, Huanying, Kejun Liu, Todd Juan, Fang Fang, Matthew Newman, and Wolfgang Hoeck. '''FusionMap: Detecting Fusion Genes from Next-generation Sequencing Data at Base-pair Resolution.''' [http://bioinformatics.oxfordjournals.org/content/early/2011/05/18/bioinformatics.btr310.abstract Bioinformatics] (May 18, 2011).

* Li, Yang, Jeremy Chien, David I. Smith, and Jian Ma. '''FusionHunter: Identifying Fusion Transcripts in Cancer Using Paired-end RNA-seq.''' [http://bioinformatics.oxfordjournals.org/content/early/2011/05/05/bioinformatics.btr265.abstract Bioinformatics] (May 5, 2011).

* McPherson, Andrew, Chunxiao Wu, Iman Hajirasouliha, Fereydoun Hormozdiari, Faraz Hach, Anna Lapuk, Stanislav Volik, Sohrab Shah, Colin Collins, and S. Cenk Sahinalp. '''Comrad: a novel algorithmic framework for the integrated analysis of RNA-Seq and WGSS data.''' [http://bioinformatics.oxfordjournals.org/content/early/2011/04/09/bioinformatics.btr184.abstract. Bioinformatics] (April 9, 2011).

* Nacu, Serban, Wenlin Yuan, Zhengyan Kan, Deepali Bhatt, Celina Rivers, Jeremy Stinson, Brock Peters, et al. '''Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples.''' [http://www.biomedcentral.com/1755-8794/4/11 BMC Medical Genomics] 4, no. 1 (2011): 11.

* Pflueger D, Terry S, Sboner A, Habegger L, Esgueva R, Lin P, Svensson MA, Kitabayashi N, Moss BJ, MacDonald TY, Cao X, Barrette T, Tewari AK, Chee MS, Chinnaiyan AM, Rickman DS, Demichelis F, Gerstein MB, Rubin MA. '''Discovery of non-ETS gene fusions in human prostate cancer using next-generation RNA sequencing''' [http://www.genome.org/cgi/doi/10.1101/gr.110684.110 Genome Res] 2011; Vol. 21:56-67; Published in Advance October 29, 2010. doi:[http://www.genome.org/cgi/doi/10.1101/gr.110684.110 10.1101/gr.110684.110]

Installation and Configuration of FusionSeq

2012-02-09T18:01:49Z

Asboner: /* (versions 0.7.0 and later) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings, arrays, parsers, and other utilities related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

=====(versions 0.7.0 and later) =====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>
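The symbolic-link approach above can be wrapped in a small helper so that each installation prefix is linked with one call. This is only a sketch: the function name <code>link_prefix</code> is hypothetical, and the demonstration uses throwaway directories in place of the real libbios, libmrf, GSL, and GD prefixes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: symlink PREFIX/include/* and PREFIX/lib/* into
# DEST/include and DEST/lib, creating the destination directories if needed.
link_prefix() {
    local dest=$1 prefix=$2 sub f
    for sub in include lib; do
        mkdir -p "$dest/$sub"
        for f in "$prefix/$sub"/*; do
            [ -e "$f" ] || continue   # skip when the glob matches nothing
            ln -sfn "$f" "$dest/$sub/$(basename "$f")"
        done
    done
}

# Demonstration with throwaway directories standing in for a real prefix.
demo=$(mktemp -d)
mkdir -p "$demo/libbios/include" "$demo/libbios/lib"
touch "$demo/libbios/include/example.h" "$demo/libbios/lib/libexample.a"

link_prefix "$demo/fusionseq" "$demo/libbios"

# A single -I and a single -L entry now cover everything that was linked.
export CPPFLAGS="-I$demo/fusionseq/include"
export LDFLAGS="-L$demo/fusionseq/lib"
echo "CPPFLAGS=$CPPFLAGS"
echo "LDFLAGS=$LDFLAGS"
```

In real use, one would call <code>link_prefix ~/fusionseq /path/to/libbios</code> once per package and then export the two variables before running ./configure.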

=====(versions up to 0.6.1)=====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS], a few variables need to be set before compiling the library. Here is an example of the procedure in a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include the GSL library when compiling ROOT, but it is not a requirement. '''NOTE''': for Ubuntu users, detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(versions 0.7.0 and later)=====
If ROOT is installed in the default folder, it creates a 'root' subfolder for both the include and lib files. With a non-standard location for [http://root.cern.ch/drupal/ ROOT], however, this does not happen. Hence, an approach similar to the one above can be adopted to properly link [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

Also, for some versions of ROOT, one may get the following error:
<pre>
[...]/root/include/Rtypes.h:35:67: error: snprintf.h: No such file or directory
[...]/root/include/Rtypes.h:36:68: error: strlcpy.h: No such file or directory
</pre>
This happens because ROOT provides its own copies of these header files, which then sit in the 'root' subfolder rather than directly in the include directory. One workaround is thus to create symbolic links:
<pre>
$ cd ~/fusionseq/include
$ ln -s root/snprintf.h .
$ ln -s root/strlcpy.h .
</pre>
This should resolve the error.

=====(versions up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>
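After setting these variables, a quick check can confirm that ROOT is reachable from the shell. The sketch below relies on the <code>root-config</code> script that ROOT installs in its bin directory; the function name <code>root_status</code> is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether root-config is reachable through PATH.
root_status() {
    if command -v root-config >/dev/null 2>&1; then
        echo "found ROOT $(root-config --version) (headers: $(root-config --incdir))"
    else
        echo "root-config not found; check ROOTSYS and PATH"
    fi
}

root_status
```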

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file, much as the configuration file was edited in previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
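
Because a missing or mistyped FUSIONSEQ_CONFPATH is easy to overlook, a short check such as the following may help before running any program. It is a sketch only; the function name <code>check_confpath</code> is hypothetical and not part of FusionSeq.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if FUSIONSEQ_CONFPATH names a readable file.
check_confpath() {
    if [ -z "${FUSIONSEQ_CONFPATH:-}" ]; then
        echo "FUSIONSEQ_CONFPATH is not set" >&2
        return 1
    fi
    if [ ! -r "$FUSIONSEQ_CONFPATH" ]; then
        echo "cannot read $FUSIONSEQ_CONFPATH" >&2
        return 1
    fi
    echo "using configuration file $FUSIONSEQ_CONFPATH"
}

check_confpath || echo "fix the variable before running FusionSeq programs"
```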

Below is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"
# Used for gfrRibosomalFilter
MAX_FRACTION_HOMOLOGOUS=0.05
MAX_OVERLAP_ALLOWED=0.75

# Used for gfr2bpJunctions
MAX_NUMBER_OF_JUNCTION_PER_FILE=2000000

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
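The absolute-path rule can be checked mechanically. The sketch below scans a KEY=VALUE file and prints every line whose value contains a <code>$</code> or starts with <code>~</code> or <code>.</code>; the function name <code>find_non_absolute</code> and the heuristic itself are assumptions for illustration, not part of FusionSeq, and the demo file is deliberately incomplete.

```shell
#!/usr/bin/env bash
# Hypothetical check: print KEY=VALUE lines whose value contains "$" or starts
# with "~" or "."; return non-zero if any such line is found.
find_non_absolute() {
    if grep -nE '^[A-Z0-9_]+="?(~|\.|[^"]*\$)' "$1"; then
        return 1
    fi
    return 0
}

# Demo file with one offending entry (not a complete configuration).
rc=$(mktemp)
cat > "$rc" <<'EOF'
// demo file, not a complete configuration
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
ANNOTATION_DIR="$HOME/annotationFiles"
MAX_FRACTION_HOMOLOGOUS=0.05
EOF

if find_non_absolute "$rc"; then
    echo "all values look absolute"
else
    echo "the lines listed above should use absolute paths"
fi
```

Comment lines and numeric values such as 0.05 are not flagged, since the pattern only matches values that begin with <code>~</code> or <code>.</code> or that contain a <code>$</code>.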

=====(versions up to 0.6.1)=====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed in the UCSC Genome Browser to show the location of the paired reads and the connection between them; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
$ export FUSIONSEQCGITARGET="-b target_architecture"
# FUSIONSEQCGITARGET is optional: specify it only if the CGIs run on a machine with a different architecture
# than the core programs. On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
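The directory layout described above can be created and inspected with a few lines of shell. This is a sketch: the function names <code>make_web_dirs</code> and <code>report_empty</code> are hypothetical, and a temporary directory stands in for the real WEB_DATA_DIR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the sub-directories the CGIs look for under DIR.
make_web_dirs() {
    local base=$1 d
    for d in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        mkdir -p "$base/$d"
    done
}

# Hypothetical helper: list sub-directories that are missing or still empty.
report_empty() {
    local base=$1 d
    for d in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        if [ -z "$(ls -A "$base/$d" 2>/dev/null)" ]; then
            echo "empty: $base/$d"
        fi
    done
}

# Demonstration with a temporary directory standing in for WEB_DATA_DIR.
web_data_dir=$(mktemp -d)
make_web_dirs "$web_data_dir"
report_empty "$web_data_dir"
```

Running <code>report_empty</code> again after the gfr2* and bp2* programs have written their output shows which directories still need to be filled.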

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64-bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).
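When the linker cannot find -lpng or -ljpeg, it helps to know which directories actually hold those libraries before editing the Makefile. The sketch below searches a list of candidate directories; the function name <code>find_lib_dirs</code> is hypothetical and the directory list is an assumption that should be adjusted to the local system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each given directory that contains lib<NAME>.*
find_lib_dirs() {
    local name=$1 dir f
    shift
    for dir in "$@"; do
        for f in "$dir"/lib"$name".*; do
            if [ -e "$f" ]; then
                echo "$dir"
                break   # one hit per directory is enough
            fi
        done
    done
}

# Candidate directories are assumptions; adjust them to the local system.
for lib in png jpeg z freetype; do
    dirs=$(find_lib_dirs "$lib" /usr/lib /usr/lib64 /usr/local/lib /usr/lib/x86_64-linux-gnu | tr '\n' ' ')
    echo "$lib: ${dirs:-not found in the candidate directories}"
done
```

The directories reported for each library are the values to use after <code>-L</code> in the corresponding Makefile variable.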

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2012-01-11T11:42:12Z

Asboner: /* (versions 0.7.0 and later) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings, arrays, parsers, and other utilities related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

=====(versions 0.7.0 and later) =====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

=====(versions up to 0.6.1)=====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS], a few variables need to be set before compiling the library. Here is an example of the procedure in a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include the GSL library when compiling ROOT, but it is not a requirement. '''NOTE''': for Ubuntu users, detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(versions 0.7.0 and later)=====
If ROOT is installed in the default folder, it creates a 'root' subfolder for both the include and lib files. With a non-standard location for [http://root.cern.ch/drupal/ ROOT], however, this does not happen. Hence, an approach similar to the one above can be adopted to properly link [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

Also, for some versions of ROOT, one may get the following error:
<pre>
[...]/root/include/Rtypes.h:35:67: error: snprintf.h: No such file or directory
[...]/root/include/Rtypes.h:36:68: error: strlcpy.h: No such file or directory
</pre>
This happens because ROOT provides its own copies of these header files, which then sit in the 'root' subfolder rather than directly in the include directory. One workaround is thus to create symbolic links:
<pre>
$ cd ~/fusionseq/include
$ ln -s root/snprintf.h .
$ ln -s root/strlcpy.h .
</pre>
This should resolve the error.

=====(versions up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file, much as the configuration file was edited in previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
 $ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Below is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"
# Used for gfrRibosomalFilter
MAX_FRACTION_HOMOLOGOUS=0.05
MAX_OVERLAP_ALLOWED=0.75

# Used for gfr2bpJunctions
MAX_NUMBER_OF_JUNCTION_PER_FILE=2000000

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
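
The absolute-path rule can be checked before running any FusionSeq program. The following Python sketch is not part of FusionSeq; the function name <code>find_nonabsolute_values</code> is hypothetical. It assumes the KEY=VALUE layout and the <code>//</code> and <code>#</code> comment prefixes shown in the example above, and it flags any value that contains a <code>$</code> or starts with <code>~</code>.

```python
def find_nonabsolute_values(config_text):
    """Return (key, value) pairs whose value contains '$' or starts with '~'.

    Hypothetical helper: assumes KEY=VALUE lines and '//' or '#' comments,
    as in the .fusionseqrc example.
    """
    suspicious = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comment lines, and lines that are not KEY=VALUE.
        if not line or line.startswith("//") or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip().strip('"')
        if "$" in value or value.startswith("~"):
            suspicious.append((key.strip(), value))
    return suspicious


example = '''
// Location of the bowtie indexes
BOWTIE_INDEXES="$HOME/bowtie/Indexes"
BOWTIE_GENOME="hg18_nh"
# Used for gfrRibosomalFilter
MAX_FRACTION_HOMOLOGOUS=0.05
'''
print(find_nonabsolute_values(example))
```

Only the first entry of the example is reported, because it relies on <code>$HOME</code>.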

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionsConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make // for the core analysis elements
$ make all // for the core analysis elements as well as the auxiliary programs
$ make cgi // for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy // for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
$ export FUSIONSEQCGITARGET="-b target_architecture" // optional: to be specified only if using the CGIs on a machine with
a different architecture than the core programs. On the webserver with the CGIs, execute gcc -dumpmachine to get the target_architecture.
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

RSEQtools

2011-11-03T12:55:30Z

Asboner: /* singleExport2mrf */

<center>[http://archive.gersteinlab.org/proj/rnaseq/rseqtools '''RSEQtools Main Page''']</center>

__TOC__

== Introduction ==

The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify that person, raising significant privacy concerns. In order to address these issues we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools that uses this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. In addition to the anonymization afforded by this format, it also facilitates the decoupling of the alignment of reads from downstream analyses.

 

== Citation ==

Lukas Habegger*, Andrea Sboner*, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein. '''RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries'''. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq643?ijkey=GSZOpzLAEOqJtJ4&keytype=ref Bioinformatics] 2010,
doi: 10.1093/bioinformatics/btq643

 

== Overview ==

The following sections provide documentation for the modules that are part of RSEQtools (http://rseqtools.gersteinlab.org/). This documentation is intended for end users and can also be found at http://info.gersteinlab.org/RSEQtools.
'''RSEQtools is implemented in C''' and uses a general C library called BIOS.

The full '''documentation for developers''' can be found here:
* RSEQtools: http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/
* BIOS: http://archive.gersteinlab.org/proj/rnaseq/doc/bios/

 

== Data formats ==

<center>[[#top|Top]]</center>
=== Mapped Read Format (MRF) ===

The Mapped Read Format (MRF) flat file consists of '''three''' components. The format is closely associated with the software components of [http://rseqtools.gersteinlab.org RSEQtools].

1. Comment lines. Comment lines are optional and start with a '#' character.
2. Header line. The header line is required and specifies the type of each column.
3. Mapped reads. Each read (single-end or paired-end) is represented by one line.

'''Required''' column:

* AlignmentBlocks, each alignment block must contain the following attributes: TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd

''Optional'' columns:

* Sequence
* QualityScores
* QueryId

Example file:

<pre>
# Comments
# Required field: Blocks [TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd]
# Optional fields: Sequence,QualityScores,QueryId
AlignmentBlocks
chr1:+:2001:2050:1:50
chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50
chr2:-:3001:3051:1:51|chr11:+:4001:4051:1:51
chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51
contigA:+:5001:5200:1:200,contigB:-:1200:1400:200:400
</pre>

Notes:

* Paired-end reads are separated by ‘|’
* Alignment blocks are separated by ‘,’
* Features of a block are separated by ‘:’
* Columns are tab-delimited
* Columns can be arranged in any order
* Coordinates are '''one-based''' and '''closed (inclusive)'''
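
The separators listed above can be applied in sequence to break an AlignmentBlocks field into its parts. The following Python sketch is only an illustration and is not taken from RSEQtools; the names <code>Block</code> and <code>parse_alignment_blocks</code> are hypothetical, and the code assumes that target names do not themselves contain ':', ',' or '|'.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    target: str
    strand: str
    target_start: int  # one-based, inclusive
    target_end: int    # one-based, inclusive
    query_start: int
    query_end: int


def parse_alignment_blocks(field: str) -> List[List[Block]]:
    """Split an AlignmentBlocks field into read ends, each a list of blocks."""
    ends = []
    for end in field.split("|"):        # '|' separates the ends of a paired-end read
        blocks = []
        for block in end.split(","):    # ',' separates alignment blocks
            target, strand, t_start, t_end, q_start, q_end = block.split(":")
            blocks.append(Block(target, strand, int(t_start), int(t_end),
                                int(q_start), int(q_end)))
        ends.append(blocks)
    return ends


ends = parse_alignment_blocks(
    "chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51")
# Coordinates are closed, so the length of a block is end - start + 1.
aligned = [sum(b.target_end - b.target_start + 1 for b in end) for end in ends]
print(aligned)
```

For the paired-end example line used here, both ends cover 51 nucleotides; the first end is split into two blocks of 30 and 21 nucleotides.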
 

'''Use MRF for confidential data'''

It is straightforward to use MRF to separate the confidential information, i.e. the sequences, from the alignment data. The MRF file can be split into two files: one file can include ''AlignmentBlocks'' and ''QueryId'', whereas a second file would contain ''Sequence'' and ''QueryId''. From a practical viewpoint it is also easy to create these two files.

Assuming we have the columns AlignmentBlocks, Sequence, and QueryId as columns 1, 2, and 3, respectively:
$ cut -f1,3 file.mrf > alignments.mrf
$ cut -f2,3 file.mrf > sequences.mrf

* alignments.mrf would contain the alignment data and the query ID; it can be freely shared since it does not include confidential information;
* sequences.mrf would contain the sequence data and the query ID; since the sequences could potentially be used to identify an individual, this file may be subject to more stringent rules.
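
Since MRF columns can be arranged in any order, selecting columns by name is more robust than selecting them by position. The following Python sketch illustrates one way to do this; it is not part of RSEQtools, the function name <code>select_columns</code> is hypothetical, and it assumes that the first non-comment line is the header line. Comment lines are dropped from the output.

```python
import io


def select_columns(lines, keep):
    """Yield tab-delimited lines restricted to the columns named in `keep`."""
    indexes = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # comment lines are optional and are not copied
        fields = line.split("\t")
        if indexes is None:
            # The first non-comment line is the header; look up column positions.
            indexes = [fields.index(name) for name in keep]
        yield "\t".join(fields[i] for i in indexes)


mrf = ("# Comments\n"
       "AlignmentBlocks\tSequence\tQueryId\n"
       "chr1:+:2001:2050:1:50\tACGT\tread1\n")
alignments = list(select_columns(io.StringIO(mrf), ["AlignmentBlocks", "QueryId"]))
sequences = list(select_columns(io.StringIO(mrf), ["Sequence", "QueryId"]))
print(alignments)
print(sequences)
```

The sequence <code>ACGT</code> and the ID <code>read1</code> are made-up placeholders.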
 
<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of '''eight''' tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of [http://homes.gersteinlab.org/people/lh372/SOFT/bios/index.html BIOS]. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+" strand)
5. Interval end (with respect to the "+" strand)
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+" strand, comma-delimited)
8. Sub-interval ends (with respect to the "+" strand, comma-delimited)

Example file:

uc001aaw.1 chr1 + 357521 358460 1 357521 358460
uc001aax.1 chr1 + 410068 411702 3 410068,410854,411258 410159,411121,411702
uc001aay.1 chr1 - 552622 554252 3 552622,553203,554161 553066,553466,554252
uc001aaz.1 chr1 + 556324 557910 1 556324 557910
uc001aba.1 chr1 + 558011 558705 1 558011 558705

In this example the intervals represent transcripts, while the sub-intervals denote exons.

Note: the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
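
With zero-based coordinates and an excluded end, the length of an interval is simply end minus start. The following Python sketch parses one Interval line and sums the exon lengths; it is an illustration only, and the function names <code>parse_interval</code> and <code>exonic_length</code> are hypothetical.

```python
def parse_interval(line):
    """Parse one tab-delimited Interval line into a dictionary."""
    (name, chrom, strand, start, end,
     count, sub_starts, sub_ends) = line.rstrip("\n").split("\t")
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval count does not match columns 7 and 8")
    return {
        "name": name, "chrom": chrom, "strand": strand,
        "start": int(start), "end": int(end),
        "sub_intervals": list(zip(starts, ends)),
    }


def exonic_length(interval):
    # Zero-based, end-exclusive coordinates: length is end - start.
    return sum(end - start for start, end in interval["sub_intervals"])


line = "\t".join(["uc001aax.1", "chr1", "+", "410068", "411702", "3",
                  "410068,410854,411258", "410159,411121,411702"])
transcript = parse_interval(line)
print(exonic_length(transcript))  # 91 + 267 + 444 = 802
```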

 

<center>[[#top|Top]]</center>

=== BED ===

The BED format is used to represent contiguous genomic regions. It consists of '''three''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End

Example file:

chr1 1000 5000
chr3 500 600
chrX 4000 4250

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC]

Note: the coordinates in the BED format are '''zero-based''' and the '''end coordinate is not included'''.
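
Because MRF coordinates are one-based and closed while BED coordinates are zero-based with an excluded end, converting between the two only changes the start coordinate. A minimal Python sketch follows; the function names are hypothetical and not part of RSEQtools.

```python
def mrf_block_to_bed(target, target_start, target_end):
    """One-based closed (MRF) -> zero-based, end-exclusive (BED)."""
    return (target, target_start - 1, target_end)


def bed_to_mrf_block(chrom, start, end):
    """Zero-based, end-exclusive (BED) -> one-based closed (MRF)."""
    return (chrom, start + 1, end)


bed = mrf_block_to_bed("chr1", 2001, 2050)
print(bed)                     # the same 50 nucleotides in BED coordinates
print(bed_to_mrf_block(*bed))  # round trip back to MRF coordinates
```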

 

<center>[[#top|Top]]</center>

=== BedGraph ===

The BedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state. It consists of '''four''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End
4. Value

Example file:

chr1 1000 5000 12.3
chr3 500 600 3
chrX 4000 4250 54

Full documentation can be found at [ftp://hgdownload.cse.ucsc.edu/apache/htdocs-rr/goldenPath/help/bedgraph.html UCSC]

Note: the coordinates in the BedGraph format are '''zero-based''' and the '''end coordinate is not included'''.

 

<center>[[#top|Top]]</center>

=== WIG ===

The WIG format is used to represent dense and continuous genomic data. There are two options for formatting wiggle data: '''variableStep''' and '''fixedStep'''.

In the context of [http://rseqtools.gersteinlab.org/ RSEQtools], the variable step formatting is used and only positions with '''non-zero values''' are represented.

Example file:

track type=wiggle_0 name="test_chr22"
variableStep chrom=chr22 span=1
17535712 1.67
17535713 1.67
17535714 1.67
17535715 1.67
17535716 1.67

Full documentation can be found at [http://genome.ucsc.edu/goldenPath/help/wiggle.html UCSC]

Note: according to the UCSC specification, positions in variableStep and fixedStep WIG data are '''one-based'''.

 

<center>[[#top|Top]]</center>

=== GFF ===

The GFF format is used to describe genes and other features. It consists of '''nine''' tab-delimited columns.

1. Name
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Group

Example file:

browser hide all
track name="chr11" visibility=2
chr11 MRF feature 46772115 46772161 . - . TG5
chr11 MRF feature 46772668 46772695 . - . TG5
chr11 MRF feature 118521207 118521252 . + . TG21
chr11 MRF feature 118526315 118526343 . + . TG21

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format3 UCSC]

Note: the coordinates in the GFF format are '''one-based''' and the '''end coordinate is included'''.

 

<center>[[#top|Top]]</center>
=== PSL ===

The PSL format represents alignments from the BLAT alignment program.

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format2 UCSC]

 

<center>[[#top|Top]]</center>
=== SAM ===

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Full documentation can be found at [http://samtools.sourceforge.net/ SAMtools]

 

 

== List of programs ==

This is the documentation for the end-users. The full documentation for developers can be found [http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/files.html here].

=== Format conversion utilities ===

The following programs convert the output from various alignment programs into [[#MRF|MRF]].

<center>[[#top|Top]]</center>
==== bowtie2mrf ====

bowtie2mrf converts read alignments from Bowtie into [[#MRF|MRF]].

'''Usage''':

bowtie2mrf <genomic|junctions|paired> [-sequence] [-qualityScores] [-IDs]

* Inputs: Takes [http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output Bowtie output] from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** genomic - convert single-end reads that were aligned against a genomic reference sequence using Bowtie
** junctions - convert single-end reads that were aligned against a splice junction library (generated by '''createSpliceJunctionLibrary''') using Bowtie
** paired - convert paired-end reads that were aligned using Bowtie
* ''Optional arguments''
** sequence - include the read sequence in the [[#MRF|MRF]] output
** qualityScores - include the quality scores of the read in the [[#MRF|MRF]] output
** IDs - include the read IDs in the [[#MRF|MRF]] output
 
'''Note''': bowtie2mrf assumes that bowtie was run using the default option for the -B parameter
-B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary. Otherwise, bowtie2mrf will not be able to convert the splice junction coordinates correctly.

 
<center>[[#top|Top]]</center>

==== psl2mrf ====

psl2mrf converts read alignments from BLAT into [[#MRF|MRF]].

'''Usage''':

psl2mrf

* Inputs: Takes BLAT alignments in [[#PSL|PSL]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== singleExport2mrf ====

singleExport2mrf converts single-end read alignments from ELAND (export file) into [[#MRF|MRF]].

'''Usage''':

singleExport2mrf

* Inputs: Takes ELAND single-end alignments in export format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary and that its file name included 'splice' or 'junction'. Otherwise, singleExport2mrf will not be able to convert the splice junction coordinates correctly.
 
The output includes sequences and quality scores. If one wants the alignment only:
singleExport2mrf < file.export.txt | cut -f1 > file.mrf
 

<center>[[#top|Top]]</center>

==== sam2mrf ====

sam2mrf converts [[#SAM|SAM]] format into [[#MRF|MRF]].

'''Usage''':

sam2mrf

* Inputs: Takes [[#SAM|SAM]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Please note that for paired-end data, sam2mrf requires the two mates of a pair to be on consecutive lines. You may want to sort the [[#SAM|SAM]] file first.

Example: <pre>sort -r file.sam | sam2mrf > file.mrf </pre>

 

=== Genome annotation tools ===

The following tools are helpful in manipulating annotation files.

<center>[[#top|Top]]</center>
==== createSpliceJunctionLibrary ====

This program is used to create a splice junction library from an annotation set. It creates all pair-wise splice junctions within a transcript.

'''Usage''':

createSpliceJunctionLibrary <file.2bit> <file.annotation> <sizeExonOverlap>

* Inputs: None
* Outputs: Reports the splice junctions in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** sizeExonOverlap - defines the number of nucleotides included from each exon
* ''Optional arguments''
** None

Example output:

>chr1|12162|12612|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGG
>chr1|12162|13220|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG
>chr1|12656|13220|65
CAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG

The identifier for each splice junction consists of '''four''' items:

1. Chromosome
2. Start position (with respect to the "+" strand, zero-based) of the splice junction within the first exon
3. Start position (with respect to the "+" strand, zero-based) of the splice junction within the second exon
4. Size of the exon overlap
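
The four items can be unpacked from a FASTA header as sketched below in Python. This is an illustration only: the function name <code>parse_junction_id</code> is hypothetical, and the two returned regions rest on the assumption that each side of the junction covers exactly sizeExonOverlap nucleotides starting at the given zero-based position.

```python
def parse_junction_id(header):
    """Split a splice junction identifier into chromosome and two regions.

    Assumption: each region is (start, start + size), zero-based with an
    excluded end, where size is the exon overlap from the identifier.
    """
    chrom, start1, start2, size = header.lstrip(">").split("|")
    start1, start2, size = int(start1), int(start2), int(size)
    return chrom, (start1, start1 + size), (start2, start2 + size)


junction = parse_junction_id(">chr1|12162|12612|65")
print(junction)
```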

'''Note''': Internally the program uses ''twoBitToFa'' (part of BLAT package). Thus, the executable must be in the PATH.

 

<center>[[#top|Top]]</center>

==== mergeTranscripts ====

Module to merge a set of transcripts from the same gene.

Obtain unique exons from various transcript isoforms based on:
# longest isoform
# composite model (union of the exons from the different transcript isoforms)
# intersection (intersection of the exons of the different transcript isoforms)

'''Usage''':

mergeTranscripts <knownIsoforms.txt> <file.annotation> <longestIsoform|compositeModel|intersection>

* Inputs: None
* Outputs: Reports a new annotation set of merged transcripts in [[#Interval|Interval]] format
* ''Required arguments''
** knownIsoforms.txt - file that determines which transcript isoforms belong together (see format below)
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < longestIsoform | compositeModel | intersection > - determines how transcript isoforms are selected/merged (see the three options listed above)
* ''Optional arguments''
** None

The file knownIsoforms.txt should have two columns (tab-delimited) and no header:

1. ID (int). Transcripts with the same id belong to the same gene.
2. Name of the transcript.

Example:

1 uc009vip.1
1 uc001aaa.2
2 uc009vis.1
2 uc001aae.2
2 uc009viu.1
2 uc009vit.1
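
To make the grouping and the composite model concrete, the following Python sketch groups transcript names by gene ID and computes the union of the exons of several isoforms. It is not the implementation used by mergeTranscripts; the function names are hypothetical, the exon coordinates in the example are made up, and the sketch assumes zero-based, end-exclusive exons and merges exons that overlap or directly touch.

```python
from collections import defaultdict


def group_isoforms(lines):
    """Map each gene ID to the transcript names listed in knownIsoforms.txt."""
    genes = defaultdict(list)
    for line in lines:
        gene_id, name = line.rstrip("\n").split("\t")
        genes[int(gene_id)].append(name)
    return dict(genes)


def composite_model(transcripts):
    """Union of exons; each transcript is a list of (start, end) exons."""
    exons = sorted(exon for transcript in transcripts for exon in transcript)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching exon: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


genes = group_isoforms(["1\tuc009vip.1", "1\tuc001aaa.2", "2\tuc009vis.1"])
union = composite_model([[(100, 200), (300, 400)],
                         [(150, 250), (300, 350), (500, 600)]])
print(genes)
print(union)
```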

 

<center>[[#top|Top]]</center>
==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== Gene expression analysis ===

<center>[[#top|Top]]</center>
==== mrfQuantifier ====

Module to calculate expression values (RPKM). Given a set of mapped reads in MRF and an annotation set (representing exons, transcripts, or gene models) mrfQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

mrfQuantifier <file.annotation> <singleOverlap|multipleOverlap>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
** < singleOverlap | multipleOverlap > - singleOverlap: reads that overlap with multiple annotated features are ignored; multipleOverlap: reads that overlap with multiple annotated features are counted multiple times.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure below). Therefore, the normalization is also done at the nucleotide level.

[[Image:mrfQuantifier.png|thumb|1000px|center|Determining overlaps between annotation entries and reads]]
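
The nucleotide-level normalization described above can be written out as a short calculation. The following Python sketch is only an illustration of that description and not the mrfQuantifier source code; the function names are hypothetical and the coordinates are assumed to be zero-based with an excluded end for both the annotation and the read blocks.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of nucleotides shared by two end-exclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def expression_value(exons, read_blocks, total_mapped_nt):
    """Overlapping nucleotides per million mapped nucleotides per kb."""
    covered = sum(overlap(es, ee, bs, be)
                  for es, ee in exons
                  for bs, be in read_blocks)
    length = sum(ee - es for es, ee in exons)
    return covered / (total_mapped_nt / 1e6) / (length / 1e3)


exons = [(1000, 2000)]                      # one exon of 1 kb
read_blocks = [(1950, 2000), (1990, 2040)]  # 50 nt and 10 nt overlap the exon
value = expression_value(exons, read_blocks, total_mapped_nt=1_000_000)
print(value)
```

In this made-up example the second read only partially overlaps the exon, so only 10 of its 50 nucleotides are counted.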

 

<center>[[#top|Top]]</center>

==== bgrQuantifier ====

Module to calculate expression values (RPKM) from a signal track in bedGraph (bgr) format. Given a signal track and an annotation set (representing exons, transcripts, or gene models) bgrQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

bgrQuantifier <file.annotation>

* Inputs: Takes [[#BedGraph|BedGraph]] file from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure in the [[#mrfQuantifier|mrfQuantifier]] section). Therefore, the normalization is also done at the nucleotide level.

 

=== Visualization tools ===

The following programs are useful for converting [[#MRF|MRF]] into data formats that can be viewed in a genome browser.

<center>[[#top|Top]]</center>
==== mrf2wig ====

Generates signal track ([[#WIG|WIG]]) of mapped reads from a [[#MRF|MRF]] file. By default, the values in the WIG file are normalized by the total number of mapped reads per million.
Only positions with non-zero values are reported.

'''Usage''':

mrf2wig <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#WIG|WIG]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.wig
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized

 

<center>[[#top|Top]]</center>

==== mrf2gff ====

Generates a [[#GFF|GFF]] file of mapped splice junction reads from a [[#MRF|MRF]] file.

'''Usage''':

mrf2gff <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#GFF|GFF]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.gff
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrf2bgr ====

Module to convert [[#MRF|MRF]] to [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph]. Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph], where the counts are normalized by the total number of mapped reads per million.

'''Usage''':

mrf2bgr <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.bgr
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized.

 

=== Segmentation of mapped reads ===

<center>[[#top|Top]]</center>
==== wigSegmenter ====

Module to segment a [[#WIG|WIG]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

wigSegmenter <wigPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome-specific [[#WIG|WIG]] input file.
* ''Required arguments''
** wigPrefix - prefix used to generate the [[#WIG|WIG]] files using [[#mrf2wig|mrf2wig]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None
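
The roles of <threshold>, <maxGap>, and <minRun> can be sketched in Python as follows. This is a minimal illustration of a maxGap-minRun style segmentation, not the wigSegmenter source: the function name <code>max_gap_min_run</code>, the use of <code>>=</code> for the threshold comparison, and the inclusive treatment of <code>min_run</code> are assumptions.

```python
def max_gap_min_run(signal, threshold, max_gap, min_run):
    """Segment a sparse signal {position: value} into (start, end) regions.

    Positions missing from `signal` are treated as zero. Returned
    coordinates are inclusive and use the same system as the input.
    """
    hits = sorted(pos for pos, value in signal.items() if value >= threshold)
    segments = []
    if not hits:
        return segments
    start = prev = hits[0]
    for pos in hits[1:]:
        gap = pos - prev - 1  # positions below threshold between two hits
        if gap > max_gap:
            # The gap is too long to bridge: close the current region.
            if prev - start + 1 >= min_run:
                segments.append((start, prev))
            start = pos
        prev = pos
    if prev - start + 1 >= min_run:
        segments.append((start, prev))
    return segments


signal = {10: 5, 11: 5, 12: 1, 13: 5, 14: 5, 30: 5}
regions = max_gap_min_run(signal, threshold=2, max_gap=1, min_run=3)
```

With these settings the result is the single region <code>(10, 14)</code>: the one-position dip at 12 is bridged because it does not exceed <code>max_gap</code>, while the isolated position 30 is shorter than <code>min_run</code> and is dropped.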

 

<center>[[#top|Top]]</center>
==== bgrSegmenter ====

Module to segment a [[#BedGraph|BedGraph]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

bgrSegmenter <bgrPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#BedGraph|BedGraph]] input file.
* ''Required arguments''
** bgrPrefix - prefix used to generate the [[#BedGraph|BedGraph]] files using [[#mrf2bgr|mrf2bgr]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None

 

=== Annotation statistics tools ===

The following modules are useful for calculating annotation statistics given a set of mapped reads.

<center>[[#top|Top]]</center>
==== mrfAnnotationCoverage ====

Module to calculate annotation coverage. Sample a set of mapped reads and determine the fraction of transcripts (specified in file.annotation) that have at least <coverageFactor>-times uniform coverage.

'''Usage''':

mrfAnnotationCoverage <file.annotation> <numTotalReads> <numReadsToSample> <coverageFactor>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Reports the fraction of transcripts that have at least <coverageFactor>-times uniform coverage to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** numTotalReads - total number of reads in the [[#MRF|MRF]] input file
** numReadsToSample - number of reads to sample from the [[#MRF|MRF]] input file
** coverageFactor - minimum level of uniform coverage required across a transcript
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfMappingBias ====

Module to calculate mapping bias for a given annotation set. Aggregates mapped reads that overlap with transcripts (specified in file.annotation) and outputs the counts over a standardized transcript (divided into 100 equally sized bins), where 0 represents the 5' end of the transcript and 1 denotes the 3' end of the transcript. This analysis is done in a strand-specific way.

'''Usage''':

mrfMappingBias <file.annotation>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the number of mapped reads for each bin of the standardized transcript to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
* ''Optional arguments''
** None
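
The idea of a standardized transcript can be sketched in Python as follows. The helper <code>transcript_bin</code> is hypothetical and, for simplicity, treats the transcript as one contiguous genomic span with inclusive coordinates; whether the module bins along spliced (exonic) coordinates is not stated here, so this should be read as an illustration only.

```python
NUM_BINS = 100


def transcript_bin(position, tx_start, tx_end, strand):
    """Map a genomic position to one of NUM_BINS bins along a transcript.

    tx_start and tx_end are inclusive genomic coordinates. Bin 0 is the
    5' end and bin NUM_BINS - 1 the 3' end, so the direction is reversed
    for transcripts on the '-' strand.
    """
    length = tx_end - tx_start + 1
    if strand == "-":
        offset = tx_end - position
    else:
        offset = position - tx_start
    if not 0 <= offset < length:
        raise ValueError("position lies outside the transcript")
    return offset * NUM_BINS // length
```

For a transcript spanning 1001 to 2000, position 1001 falls into bin 0 on the "+" strand and into bin 99 on the "-" strand.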

 

=== MRF selection utilities ===

The following utilities are helpful to select subsets of an [[#MRF|MRF]] file. It should be noted that these utilities operate on ''existing'' MRF files.

<center>[[#top|Top]]</center>
==== mrfSampler ====

Randomly select a subset of [[#MRF|MRF]] entries.

'''Usage''':

mrfSampler <proportionOfReadsToSample>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** proportionOfReadsToSample - fraction of reads to sample (on average)
* ''Optional arguments''
** None
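
Sampling a fraction of reads "on average" can be done by keeping each read independently with the given probability. The Python sketch below shows this approach under stated assumptions: the function name <code>sample_mrf</code> and the optional <code>seed</code> parameter are hypothetical, and passing comment and header lines through unchanged is a choice made for this example rather than documented mrfSampler behavior.

```python
import random


def sample_mrf(lines, proportion, seed=None):
    """Yield comment and header lines, plus each read with probability `proportion`."""
    rng = random.Random(seed)
    header_seen = False
    for line in lines:
        if line.startswith("#"):
            yield line  # comment line
        elif not header_seen:
            header_seen = True
            yield line  # first non-comment line is the header
        elif rng.random() < proportion:
            yield line  # read kept by an independent random draw


mrf = ["# example", "AlignmentBlocks",
       "chr1:+:2001:2050:1:50", "chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50"]
subset = list(sample_mrf(mrf, 0.5, seed=42))
```

Because each read is drawn independently, the number of reads in the output varies from run to run and only approaches the requested proportion for large inputs.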

 

<center>[[#top|Top]]</center>
==== mrfSelectRegion ====

Select reads that overlap with a specified genomic region.

'''Usage''':

mrfSelectRegion <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** targetName:targetStart-targetEnd - region of interest
* ''Optional arguments''
** None
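
The region argument and the overlap test can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the region uses the same one-based, inclusive coordinates as the MRF alignment blocks and that a read is selected if any of its blocks overlaps the region.

```python
def parse_region(region):
    """Split 'targetName:targetStart-targetEnd' into (name, start, end)."""
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return name, int(start), int(end)


def read_overlaps_region(alignment_blocks, region):
    """True if any block of an MRF AlignmentBlocks field overlaps the region."""
    name, start, end = parse_region(region)
    for mate in alignment_blocks.split("|"):      # paired-end mates
        for block in mate.split(","):             # alignment blocks
            target, _strand, t_start, t_end = block.split(":")[:4]
            if target == name and int(t_start) <= end and int(t_end) >= start:
                return True
    return False


paired = "chr2:-:3001:3051:1:51|chr11:+:4001:4051:1:51"
hit = read_overlaps_region(paired, "chr11:4050-4100")
```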

 

<center>[[#top|Top]]</center>
==== mrfSelectSpliced ====

Select reads that span a splice junction.

'''Usage''':

mrfSelectSpliced

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 

'''Note''': The purpose of mrfSelectSpliced is to extract mapped reads that align to a splice junction from an '''existing MRF file'''. It is important to note that this utility is not used to convert the output of a specific mapping program.

 

<center>[[#top|Top]]</center>
==== mrfSubsetByTargetName ====

Split up an [[#MRF|MRF]] file by chromosome.

'''Usage''':

mrfSubsetByTargetName <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs a separate [[#MRF|MRF]] file for each chromosome using the following naming convention: <prefix>_chrXXX.mrf.
* ''Required arguments''
** prefix - prefix used for the output files.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfSelectAnnotated ====

Module to select a subset of reads that overlap with a specified annotation set.

'''Usage''':

mrfSelectAnnotated <file.annotation> <include|exclude>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (one transcript per line).
** < include | exclude > - include: report reads that overlap with ''exonic'' regions of the annotation set; exclude: report reads that do not overlap with ''exonic'' regions of the annotation set
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrfRegionCount ====

Module to count the total number of reads in a specified region.

'''Usage''':

mrfRegionCount <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the total number of reads in a specified region to STDOUT.
* ''Required arguments''
** targetName:targetStart-targetEnd - specifies the region of interest
* ''Optional arguments''
** None

 

=== Auxiliary utilities ===

This section includes various data format conversion utilities.

<center>[[#top|Top]]</center>
==== bed2interval ====

Utility to convert [[#BED|BED]] format into [[#Interval|Interval]] format.

'''Usage''':

bed2interval

* Inputs: Takes data in [[#BED|BED]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2bed ====

Utility to convert [[#Interval|Interval]] format into [[#BED|BED]] format.

'''Usage''':

interval2bed

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#BED|BED]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2gff ====

Utility to convert [[#Interval|Interval]] format into [[#GFF|GFF]] format.

'''Usage''':

interval2gff <trackName>

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#GFF|GFF]] format to STDOUT
* ''Required arguments''
** trackName - track name used in the [[#GFF|GFF]] file
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== gff2interval ====

Utility to convert [[#GFF|GFF]] format into [[#Interval|Interval]] format.

'''Usage''':

gff2interval

* Inputs: Takes data in [[#GFF|GFF]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== export2fastq ====

Module to generate FASTQ sequences from an ELAND export file.

'''Usage''':

export2fastq

* Inputs: Takes an ELAND export file from STDIN
* Outputs: Reports the extracted sequences in FASTQ format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrf2sam ====

Module to convert [[#MRF|MRF]] to [[#SAM|SAM]].

'''Usage''':

mrf2sam

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#SAM|SAM]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

RSEQtools

2011-11-03T11:06:06Z

Asboner: /* singleExport2mrf */

<center>[http://archive.gersteinlab.org/proj/rnaseq/rseqtools '''RSEQtools Main Page''']</center>

__TOC__

== Introduction ==

The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify that person, raising significant privacy concerns. In order to address these issues we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools that uses this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. In addition to the anonymization afforded by this format it also facilitates the decoupling of the alignment of reads from downstream analyses.

 

== Citation ==

Lukas Habegger*, Andrea Sboner*, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein. '''RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries'''. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq643?ijkey=GSZOpzLAEOqJtJ4&keytype=ref Bioinformatics] 2010,
doi: 10.1093/bioinformatics/btq643

 

== Overview ==

The following sections provide documentation for the modules that are part of RSEQtools (http://rseqtools.gersteinlab.org/). This documentation is intended for end users and can also be found at http://info.gersteinlab.org/RSEQtools.
'''RSEQtools is implemented in C''' and uses a general C library called BIOS.

The full '''documentation for developers''' can be found here:
* RSEQtools: http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/
* BIOS: http://archive.gersteinlab.org/proj/rnaseq/doc/bios/

 

== Data formats ==

<center>[[#top|Top]]</center>
=== Mapped Read Format (MRF) ===

The Mapped Read Format (MRF) is a flat file format consisting of '''three''' components. It is closely associated with the software components of [http://rseqtools.gersteinlab.org RSEQtools].

1. Comment lines. Comment lines are optional and start with a '#' character.
2. Header line. The header line is required and specifies the type of each column.
3. Mapped reads. Each read (single-end or paired-end) is represented by one line.

'''Required''' column:

* AlignmentBlocks, each alignment block must contain the following attributes: TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd

''Optional'' columns:

* Sequence
* QualityScores
* QueryId

Example file:

<pre>
# Comments
# Required field: Blocks [TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd]
# Optional fields: Sequence,QualityScores,QueryId
AlignmentBlocks
chr1:+:2001:2050:1:50
chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50
chr2:-:3001:3051:1:51|chr11:+:4001:4051:1:51
chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51
contigA:+:5001:5200:1:200,contigB:-:1200:1400:200:400
</pre>

Notes:

* Paired-end reads are separated by ‘|’
* Alignment blocks are separated by ‘,’
* Features of a block are separated by ‘:’
* Columns are tab-delimited
* Columns can be arranged in any order
* Coordinates are '''one-based''' and '''closed (inclusive)'''
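
The separator rules listed above are enough to parse an AlignmentBlocks field. The following Python sketch illustrates this; the <code>Block</code> type and the function <code>parse_alignment_blocks</code> are names chosen for this example and are not part of RSEQtools.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    target_name: str
    strand: str
    target_start: int  # one-based, inclusive
    target_end: int    # one-based, inclusive
    query_start: int
    query_end: int


def parse_alignment_blocks(field: str) -> List[List[Block]]:
    """Parse one AlignmentBlocks field into a list of mates, each a list of blocks."""
    mates = []
    for mate in field.split("|"):        # '|' separates paired-end reads
        blocks = []
        for item in mate.split(","):     # ',' separates alignment blocks
            name, strand, t_start, t_end, q_start, q_end = item.split(":")
            blocks.append(Block(name, strand, int(t_start), int(t_end),
                                int(q_start), int(q_end)))
        mates.append(blocks)
    return mates


# Spliced paired-end read taken from the example file above.
mates = parse_alignment_blocks(
    "chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51")
```

Here the first mate consists of two blocks (a spliced alignment) and the second mate of a single block.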
 

'''Use MRF for confidential data'''

It is straightforward to use MRF to separate the confidential information, i.e. the sequences, from the alignment data. The MRF file can be split into two files: one file can include ''AlignmentBlocks'' and ''QueryID'', whereas a second file would contain ''Sequence'' and ''QueryID''. From a practical viewpoint it is also easy to create these two files.

Assuming we have the columns AlignmentBlocks, Sequence, and QueryID as column 1, 2, and 3, respectively:
$ cut -f1,3 file.mrf > alignments.mrf
$ cut -f2,3 file.mrf > sequences.mrf

* alignments.mrf would contain the alignment data and the query ID; it can be freely shared since it does not include confidential information.
* sequences.mrf would contain the sequence data and the query ID; since the sequences could potentially be used to identify an individual, this file may be subject to more stringent rules.
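
Because MRF columns can be arranged in any order, selecting columns by header name is more robust than relying on fixed column positions. The Python sketch below shows one way to do this; <code>select_columns</code> is a hypothetical helper, and dropping comment lines from the output is a simplification made for this example.

```python
def select_columns(mrf_lines, wanted):
    """Yield tab-delimited lines restricted to the columns named in `wanted`."""
    indices = None
    for line in mrf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank and comment lines are dropped in this sketch
        fields = line.split("\t")
        if indices is None:
            # The first non-comment line is the header naming each column.
            indices = [fields.index(name) for name in wanted]
        yield "\t".join(fields[i] for i in indices)


mrf = [
    "# example",
    "Sequence\tAlignmentBlocks\tQueryId",
    "ACGT\tchr1:+:1:4:1:4\tread1",
]
alignments = list(select_columns(mrf, ["AlignmentBlocks", "QueryId"]))
sequences = list(select_columns(mrf, ["Sequence", "QueryId"]))
```

The first list corresponds to the shareable alignment file and the second to the restricted sequence file; the shared QueryId column allows the two to be joined again later.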
 
<center>[[#top|Top]]</center>

=== Interval ===

The Interval format consists of '''eight''' tab-delimited columns and is used to represent genomic intervals such as genes.
This format is closely associated with the [http://homes.gersteinlab.org/people/lh372/SOFT/bios/intervalFind_8c.html intervalFind module], which is part of [http://homes.gersteinlab.org/people/lh372/SOFT/bios/index.html BIOS]. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" ''Bioinformatics'' 2007;23:1386-1393 [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386].

1. Name of the interval
2. Chromosome
3. Strand
4. Interval start (with respect to the "+")
5. Interval end (with respect to the "+")
6. Number of sub-intervals
7. Sub-interval starts (with respect to the "+", comma-delimited)
8. Sub-interval ends (with respect to the "+", comma-delimited)

Example file:

uc001aaw.1 chr1 + 357521 358460 1 357521 358460
uc001aax.1 chr1 + 410068 411702 3 410068,410854,411258 410159,411121,411702
uc001aay.1 chr1 - 552622 554252 3 552622,553203,554161 553066,553466,554252
uc001aaz.1 chr1 + 556324 557910 1 556324 557910
uc001aba.1 chr1 + 558011 558705 1 558011 558705

In this example the intervals represent transcripts, while the sub-intervals denote exons.

Note: the coordinates in the Interval format are '''zero-based''' and the '''end coordinate is not included'''.
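
As an illustration of the eight-column layout and the zero-based, end-exclusive convention, the following Python sketch parses one Interval line and computes its exonic length. The function names and the dictionary keys are choices made for this example only.

```python
def parse_interval(line):
    """Parse one tab-delimited line of the eight-column Interval format."""
    name, chrom, strand, start, end, count, sub_starts, sub_ends = \
        line.rstrip("\n").split("\t")
    starts = [int(x) for x in sub_starts.split(",") if x]
    ends = [int(x) for x in sub_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(count)):
        raise ValueError("sub-interval count does not match the coordinate lists")
    return {
        "name": name,
        "chromosome": chrom,
        "strand": strand,
        "start": int(start),
        "end": int(end),
        "sub_intervals": list(zip(starts, ends)),
    }


def exonic_length(interval):
    """Sum of sub-interval lengths; with end-exclusive coordinates this is end - start."""
    return sum(end - start for start, end in interval["sub_intervals"])


line = "uc001aax.1\tchr1\t+\t410068\t411702\t3\t410068,410854,411258\t410159,411121,411702"
transcript = parse_interval(line)
```

For the transcript uc001aax.1 from the example file, the three exons have lengths 91, 267, and 444, giving an exonic length of 802 nucleotides.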

 

<center>[[#top|Top]]</center>

=== BED ===

The BED format is used to represent contiguous genomic regions. It consists of '''three''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End

Example file:

chr1 1000 5000
chr3 500 600
chrX 4000 4250

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC]

Note: the coordinates in the BED format are '''zero-based''' and the '''end coordinate is not included'''.

 

<center>[[#top|Top]]</center>

=== BedGraph ===

The BedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state. It consists of '''four''' required columns (tab-delimited).

1. Chromosome
2. Start
3. End
4. Value

Example file:

chr1 1000 5000 12.3
chr3 500 600 3
chrX 4000 4250 54

Full documentation can be found at [ftp://hgdownload.cse.ucsc.edu/apache/htdocs-rr/goldenPath/help/bedgraph.html UCSC]

Note: the coordinates in the BedGraph format are '''zero-based''' and the '''end coordinate is not included'''.

 

<center>[[#top|Top]]</center>

=== WIG ===

The WIG format is used to represent dense and continuous genomic data. There are two options for formatting wiggle data: '''variableStep''' and '''fixedStep'''.

In the context of [http://rseqtools.gersteinlab.org/ RSEQtools], the variable step formatting is used and only positions with '''non-zero values''' are represented.

Example file:

track type=wiggle_0 name="test_chr22"
variableStep chrom=chr22 span=1
17535712 1.67
17535713 1.67
17535714 1.67
17535715 1.67
17535716 1.67

Full documentation can be found at [http://genome.ucsc.edu/goldenPath/help/wiggle.html UCSC]

Note: the coordinates in the WIG format are '''zero-based'''.

 

<center>[[#top|Top]]</center>

=== GFF ===

The GFF format is used to describe genes and other features. It consists of '''nine''' tab-delimited columns.

1. Name
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Group

Example file:

browser hide all
track name="chr11" visibility=2
chr11 MRF feature 46772115 46772161 . - . TG5
chr11 MRF feature 46772668 46772695 . - . TG5
chr11 MRF feature 118521207 118521252 . + . TG21
chr11 MRF feature 118526315 118526343 . + . TG21

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format3 UCSC]

Note: the coordinates in the GFF format are '''one-based''' and the '''end coordinate is included'''.
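
Since the formats in this section use two different coordinate conventions, conversions between them must shift the start coordinate. A minimal Python sketch, with hypothetical function names, is given below.

```python
def half_open_to_closed(start, end):
    """Zero-based, end-exclusive (BED, BedGraph, Interval) -> one-based, inclusive (GFF, MRF)."""
    return start + 1, end


def closed_to_half_open(start, end):
    """One-based, inclusive (GFF, MRF) -> zero-based, end-exclusive (BED, BedGraph, Interval)."""
    return start - 1, end


gff_start, gff_end = half_open_to_closed(1000, 5000)
```

Only the start changes: the BED interval 1000-5000 becomes 1001-5000 in one-based inclusive coordinates, and the length (4000 nucleotides) is the same in both systems.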

 

<center>[[#top|Top]]</center>
=== PSL ===

The PSL format represents alignments from the BLAT alignment program.

Full documentation can be found at [http://genome.ucsc.edu/FAQ/FAQformat.html#format2 UCSC]

 

<center>[[#top|Top]]</center>
=== SAM ===

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Full documentation can be found at [http://samtools.sourceforge.net/ SAMtools]

 

 

== List of programs ==

This is the documentation for the end-users. The full documentation for developers can be found [http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/files.html here].

=== Format conversion utilities ===

The following programs convert the output from various alignment programs into [[#MRF|MRF]].

<center>[[#top|Top]]</center>
==== bowtie2mrf ====

bowtie2mrf converts read alignments from Bowtie into [[#MRF|MRF]].

'''Usage''':

bowtie2mrf <genomic|junctions|paired> [-sequence] [-qualityScores] [-IDs]

* Inputs: Takes [http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output Bowtie output] from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** genomic - convert single-end reads that were aligned against a genomic reference sequence using Bowtie
** junctions - convert single-end reads that were aligned against a splice junction library (generated by '''createSpliceJunctionLibrary''') using Bowtie
** paired - convert paired-end reads that were aligned using Bowtie
* ''Optional arguments''
** sequence - include the read sequence in the [[#MRF|MRF]] output
** qualityScores - include the quality scores of the read in the [[#MRF|MRF]] output
** IDs - include the read IDs in the [[#MRF|MRF]] output
 
'''Note''': bowtie2mrf assumes that bowtie was run using the default option for the -B parameter
-B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary. Otherwise, bowtie2mrf will not be able to convert the splice junction coordinates correctly.

 
<center>[[#top|Top]]</center>

==== psl2mrf ====

psl2mrf converts read alignments from BLAT into [[#MRF|MRF]].

'''Usage''':

psl2mrf

* Inputs: Takes BLAT alignments in [[#PSL|PSL]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== singleExport2mrf ====

singleExport2mrf converts single-end read alignments from ELAND (export file) into [[#MRF|MRF]].

'''Usage''':

singleExport2mrf

* Inputs: Takes ELAND single-end alignments in export format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 
'''Note''': If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary and that its file name included 'splice' or 'junction'. Otherwise, singleExport2mrf will not be able to convert the splice junction coordinates correctly.

 

<center>[[#top|Top]]</center>

==== sam2mrf ====

sam2mrf converts [[#SAM|SAM]] format into [[#MRF|MRF]]

'''Usage''':

sam2mrf

* Inputs: Takes [[#SAM|SAM]] format from STDIN
* Outputs: Reports [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

Please note that for paired-end data, sam2mrf requires the two mates of a pair to be on consecutive lines. You may want to sort the [[#SAM|SAM]] file first.

Example: <pre>sort -r file.sam | sam2mrf > file.mrf </pre>

 

=== Genome annotation tools ===

The following tools are helpful in manipulating annotation files.

<center>[[#top|Top]]</center>
==== createSpliceJunctionLibrary ====

This program is used to create a splice junction library from an annotation set. It creates all pair-wise splice junctions within a transcript.

'''Usage''':

createSpliceJunctionLibrary <file.2bit> <file.annotation> <sizeExonOverlap>

* Inputs: None
* Outputs: Reports the splice junctions in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** sizeExonOverlap - defines the number of nucleotides included from each exon
* ''Optional arguments''
** None

Example output:

>chr1|12162|12612|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGG
>chr1|12162|13220|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG
>chr1|12656|13220|65
CAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG

The identifier for each splice junction consists of '''four''' items:

1. Chromosome
2. Start position (with respect to the "+", zero-based) of the splice junction within the first exon
3. Start position (with respect to the "+", zero-based) of the splice junction within the second exon
4. Size of the exon overlap
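
The four items of the identifier can be recovered with a simple split on the "|" character. The Python sketch below is illustrative only; the function name and the dictionary keys are not part of RSEQtools.

```python
def parse_junction_id(identifier):
    """Split a splice junction identifier such as '>chr1|12162|12612|65'."""
    chrom, first, second, overlap = identifier.lstrip(">").split("|")
    return {
        "chromosome": chrom,
        "first_exon_start": int(first),    # zero-based, "+" strand
        "second_exon_start": int(second),  # zero-based, "+" strand
        "exon_overlap": int(overlap),
    }


junction = parse_junction_id(">chr1|12162|12612|65")
```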

'''Note''': Internally the program uses ''twoBitToFa'' (part of BLAT package). Thus, the executable must be in the PATH.

 

<center>[[#top|Top]]</center>

==== mergeTranscripts ====

Module to merge a set of transcripts from the same gene.

Obtain unique exons from various transcript isoforms based on:
# longest isoform
# composite model (union of the exons from the different transcript isoforms)
# intersection (intersection of the exons of the different transcript isoforms)
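
The three strategies can be sketched in Python on lists of exons given as zero-based, end-exclusive <code>(start, end)</code> pairs. This is an illustration under stated assumptions and not the mergeTranscripts source: the function names are hypothetical, and measuring the "longest" isoform by its summed exon length is an assumption.

```python
def merge_intervals(intervals):
    """Union of (start, end) intervals, merged where they overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def longest_isoform(isoforms):
    """Isoform with the largest summed exon length (assumed definition)."""
    return max(isoforms, key=lambda exons: sum(e - s for s, e in exons))


def composite_model(isoforms):
    """Union of the exons of all isoforms."""
    return merge_intervals(exon for exons in isoforms for exon in exons)


def intersection_model(isoforms):
    """Regions that are exonic in every isoform."""
    common = merge_intervals(isoforms[0])
    for exons in isoforms[1:]:
        other = merge_intervals(exons)
        common = [(max(a, c), min(b, d))
                  for a, b in common for c, d in other
                  if max(a, c) < min(b, d)]
    return common


iso_a = [(100, 200), (300, 400)]
iso_b = [(150, 250), (300, 350)]
```

For these two isoforms the composite model is <code>[(100, 250), (300, 400)]</code>, the intersection is <code>[(150, 200), (300, 350)]</code>, and <code>iso_a</code> is the longest isoform.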

'''Usage''':

mergeTranscripts <knownIsoforms.txt> <file.annotation> <longestIsoform|compositeModel|intersection>

* Inputs: None
* Outputs: Reports a new annotation set of merged transcripts in [[#Interval|Interval]] format
* ''Required arguments''
** knownIsoforms.txt - file that determines which transcript isoforms belong together (see format below)
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < longestIsoform | compositeModel | intersection > - determines how transcript isoforms are selected/merged (see the three strategies listed above)
* ''Optional arguments''
** None

The file knownIsoforms.txt should have two columns (tab-delimited) and no header:

1. ID (int). Transcripts with the same id belong to the same gene.
2. Name of the transcript.

Example:

1 uc009vip.1
1 uc001aaa.2
2 uc009vis.1
2 uc001aae.2
2 uc009viu.1
2 uc009vit.1

 

<center>[[#top|Top]]</center>
==== interval2sequences ====

Module to retrieve genomic/exonic sequences for an annotation set.

'''Usage''':

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

* Inputs: None
* Outputs: Reports the extracted sequences in FASTA format
* ''Required arguments''
** file.2bit - genome reference sequence in [http://genome.ucsc.edu/FAQ/FAQformat.html#format7 2bit format]
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
* ''Optional arguments''
** None

 

=== Gene expression analysis ===

<center>[[#top|Top]]</center>
==== mrfQuantifier ====

Module to calculate expression values (RPKM). Given a set of mapped reads in MRF and an annotation set (representing exons, transcripts, or gene models) mrfQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

mrfQuantifier <file.annotation> <singleOverlap|multipleOverlap>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
** < singleOverlap | multipleOverlap > - singleOverlap: reads that overlap with multiple annotated features are ignored; multipleOverlap: reads that overlap with multiple annotated features are counted multiple times.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure below). Therefore, the normalization is also done at the nucleotide level.

[[Image:mrfQuantifier.png|thumb|1000px|center|Determining overlaps between annotation entries and reads]]
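
The nucleotide-level counting and normalization can be written out as a short Python sketch. It follows the description in this section but is not the mrfQuantifier source: the function names are hypothetical, and read blocks and exons are assumed to have already been converted to the same one-based, inclusive coordinate system.

```python
def overlap_nt(block, exon):
    """Number of nucleotides shared by two one-based, inclusive intervals."""
    return max(0, min(block[1], exon[1]) - max(block[0], exon[0]) + 1)


def expression_value(read_blocks, exons, total_mapped_nt):
    """Overlapping nucleotides per million mapped nucleotides per kb of annotation."""
    covered = sum(overlap_nt(block, exon)
                  for block in read_blocks for exon in exons)
    annotation_length = sum(end - start + 1 for start, end in exons)
    return covered / (total_mapped_nt / 1e6) / (annotation_length / 1e3)


# One 50-nt read block that only partially overlaps a 100-nt exon.
value = expression_value([(91, 140)], [(101, 200)], total_mapped_nt=1_000_000)
```

In this example only the 40 overlapping nucleotides (101 to 140) are counted, so with one million mapped nucleotides and a 0.1 kb annotation the value is 40 / 1 / 0.1 = 400.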

 

<center>[[#top|Top]]</center>

==== bgrQuantifier ====

Module to calculate expression values (RPKM) from a signal track in bedGraph (bgr) format. Given a signal track and an annotation set (representing exons, transcripts, or gene models) bgrQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

'''Usage''':

bgrQuantifier <file.annotation>

* Inputs: Takes [[#BedGraph|BedGraph]] file from STDIN.
* Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See [[#Interval|Interval]] for more details.
* ''Optional arguments''
** None
 
'''Note''': All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure in the [[#mrfQuantifier|mrfQuantifier]] section). Therefore, the normalization is also done at the nucleotide level.

 

=== Visualization tools ===

The following programs are useful for converting [[#MRF|MRF]] into data formats that can be viewed in a genome browser.

<center>[[#top|Top]]</center>
==== mrf2wig ====

Generates signal track ([[#WIG|WIG]]) of mapped reads from a [[#MRF|MRF]] file. By default, the values in the WIG file are normalized by the total number of mapped reads per million.
Only positions with non-zero values are reported.

'''Usage''':

mrf2wig <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#WIG|WIG]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.wig
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized

 

<center>[[#top|Top]]</center>

==== mrf2gff ====

Generates a [[#GFF|GFF]] file of mapped splice junction reads from a [[#MRF|MRF]] file.

'''Usage''':

mrf2gff <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [[#GFF|GFF]] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.gff
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrf2bgr ====

Module to convert [[#MRF|MRF]] to [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph]. Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph], where the counts are normalized by the total number of mapped reads per million.

'''Usage''':

mrf2bgr <prefix> [doNotNormalize]

* Inputs: Takes [[#MRF|MRF]] from STDIN.
* Outputs: Generates a [http://genome.ucsc.edu/goldenPath/help/bedgraph.html BedGraph] file for each chromosome occurring in the [[#MRF|MRF]] input file.
* ''Required arguments''
** prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.bgr
* ''Optional arguments''
** doNotNormalize - the counts are NOT normalized.

 

=== Segmentation of mapped reads ===

<center>[[#top|Top]]</center>
==== wigSegmenter ====

Module to segment a [[#WIG|WIG]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

wigSegmenter <wigPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#WIG|WIG]] input files.
* ''Required arguments''
** wigPrefix - prefix used to generate the [[#WIG|WIG]] files using [[#mrf2wig|mrf2wig]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None
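
The maxGap-minRun idea can be illustrated with a minimal sketch. It is not the wigSegmenter source code: the function name segment_signal, the in-memory list of per-position values, and the choice of treating values equal to the threshold as "above" are assumptions made for illustration only.

```python
def segment_signal(values, threshold, max_gap, min_run):
    """Return half-open (start, end) index pairs of segments found by maxGap-minRun."""
    segments = []
    start = None       # first index of the segment currently being built
    last_above = None  # most recent index with a value at or above the threshold
    for i, value in enumerate(values):
        if value < threshold:
            continue
        if start is None:
            start = i
        elif i - last_above - 1 > max_gap:
            # The run of below-threshold positions is too long: close the segment.
            if last_above - start + 1 >= min_run:
                segments.append((start, last_above + 1))
            start = i
        last_above = i
    # Close the final segment, if any, applying the same minimum-length rule.
    if start is not None and last_above - start + 1 >= min_run:
        segments.append((start, last_above + 1))
    return segments
```

For example, with a threshold of 5, a maxGap of 2 and a minRun of 3, the values [0, 5, 6, 0, 0, 7, 8, 0, 0, 0, 0, 9, 0] yield the single segment (1, 7): the two-position gap is bridged, while the isolated value at index 11 is shorter than minRun and is dropped.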

 

<center>[[#top|Top]]</center>
==== bgrSegmenter ====

Module to segment a [[#BedGraph|BedGraph]] signal track using the [http://www.ncbi.nlm.nih.gov/pubmed/15979196 maxGap-minRun algorithm]. The output is a set of transcriptionally active regions (TARs) in [[#BED|BED]] format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

'''Usage''':

bgrSegmenter <bgrPrefix> <threshold> <maxGap> <minRun>

* Inputs: None
* Outputs: Generates a [[#BED|BED]] file for each chromosome occurring in the [[#BedGraph|BedGraph]] input file.
* ''Required arguments''
** bgrPrefix - prefix used to generate the [[#BedGraph|BedGraph]] files using [[#mrf2bgr|mrf2bgr]]
** threshold - level at which the segmentation is performed
** maxGap - maximum number of consecutive positions that can have values less than the threshold
** minRun - minimal length of a TAR
* ''Optional arguments''
** None

 

=== Annotation statistics tools ===

The following modules are useful for calculating annotation statistics given a set of mapped reads.

<center>[[#top|Top]]</center>
==== mrfAnnotationCoverage ====

Module to calculate annotation coverage. Samples a set of mapped reads and determines the fraction of transcripts (specified in file.annotation) that have at least <coverageFactor>-times uniform coverage.

'''Usage''':

mrfAnnotationCoverage <file.annotation> <numTotalReads> <numReadsToSample> <coverageFactor>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Reports the fraction of transcripts that have at least <coverageFactor>-times uniform coverage to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
** numTotalReads - total number of reads in the [[#MRF|MRF]] input file
** numReadsToSample - number of reads to sample from the [[#MRF|MRF]] input file
** coverageFactor - minimum level of uniform coverage required across a transcript
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfMappingBias ====

Module to calculate mapping bias for a given annotation set. Aggregates mapped reads that overlap with transcripts (specified in file.annotation) and outputs the counts over a standardized transcript (divided into 100 equally sized bins), where 0 represents the 5' end of the transcript and 1 denotes the 3' end. This analysis is done in a strand-specific way.

'''Usage''':

mrfMappingBias <file.annotation>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the number of mapped reads for each bin of the standardized transcript to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (each line represents one transcript)
* ''Optional arguments''
** None
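
The binning onto a standardized transcript can be sketched as follows. This is an illustration rather than the mrfMappingBias implementation: the names transcript_bin and bias_profile are hypothetical, and the transcript is treated as one contiguous, 1-based closed interval, which ignores introns.

```python
NUM_BINS = 100  # the standardized transcript is divided into 100 equally sized bins


def transcript_bin(position, tx_start, tx_end, strand):
    """Map a position inside [tx_start, tx_end] to a bin, with bin 0 at the 5' end."""
    length = tx_end - tx_start + 1
    # On the minus strand the 5' end is at tx_end, so the offset is measured from there.
    offset = position - tx_start if strand == "+" else tx_end - position
    if not 0 <= offset < length:
        raise ValueError("position lies outside the transcript")
    return offset * NUM_BINS // length


def bias_profile(read_positions, tx_start, tx_end, strand):
    """Count read positions per bin of the standardized transcript."""
    counts = [0] * NUM_BINS
    for position in read_positions:
        counts[transcript_bin(position, tx_start, tx_end, strand)] += 1
    return counts
```

Measuring the offset from the 5' end on either strand is what keeps the profile strand-specific: a read at the highest genomic coordinate of a minus-strand transcript falls into bin 0.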

 

=== MRF selection utilities ===

The following utilities are helpful to select subsets of an [[#MRF|MRF]] file. It should be noted that these utilities operate on ''existing'' MRF files.

<center>[[#top|Top]]</center>
==== mrfSampler ====

Randomly select a subset of [[#MRF|MRF]] entries.

'''Usage''':

mrfSampler <proportionOfReadsToSample>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** proportionOfReadsToSample - fraction of reads to sample (on average)
* ''Optional arguments''
** None
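
The phrase "on average" suggests that each entry is kept or dropped by an independent random draw. A minimal sketch of that idea, assuming (as in the grep commands used elsewhere in this documentation) that the header line contains "AlignmentBlocks" and that comment lines start with "#"; the function name sample_mrf_lines is hypothetical and this is not the mrfSampler source code:

```python
import random


def sample_mrf_lines(lines, proportion, rng=None):
    """Yield header/comment lines unchanged and each entry with probability `proportion`."""
    rng = rng or random.Random()
    for line in lines:
        if "AlignmentBlocks" in line or line.startswith("#"):
            yield line  # header and comment lines are always kept
        elif rng.random() < proportion:
            yield line  # entry selected by an independent random draw
```

With this scheme the number of sampled entries varies from run to run and only its expected value equals the requested proportion of the input.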

 

<center>[[#top|Top]]</center>
==== mrfSelectRegion ====

Select reads that overlap with a specified genomic region.

'''Usage''':

mrfSelectRegion <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** targetName:targetStart-targetEnd - region of interest
* ''Optional arguments''
** None
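
As an illustration of the region argument, the sketch below parses a targetName:targetStart-targetEnd string and tests whether an alignment block overlaps it. The helper names parse_region and overlaps are hypothetical, and the use of 1-based closed intervals is an assumption.

```python
def parse_region(region):
    """Split 'targetName:targetStart-targetEnd' into (name, start, end)."""
    name, _, coords = region.rpartition(":")
    start, _, end = coords.partition("-")
    return name, int(start), int(end)


def overlaps(block_start, block_end, region_start, region_end):
    """True if two closed intervals share at least one position."""
    return block_start <= region_end and block_end >= region_start
```

A read would then be selected when one of its alignment blocks lies on the same target and overlaps the parsed interval.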

 

<center>[[#top|Top]]</center>
==== mrfSelectSpliced ====

Select reads that span a splice junction.

'''Usage''':

mrfSelectSpliced

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None
 

'''Note''': The purpose of mrfSelectSpliced is to extract mapped reads that align to a splice junction from an '''existing MRF file'''. It is important to note that this utility is not used to convert the output of a specific mapping program.

 

<center>[[#top|Top]]</center>
==== mrfSubsetByTargetName ====

Split up an [[#MRF|MRF]] file by chromosome.

'''Usage''':

mrfSubsetByTargetName <prefix>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs a separate [[#MRF|MRF]] file for each chromosome using the following naming convention: <prefix>_chrXXX.mrf.
* ''Required arguments''
** prefix - prefix used for the output files.
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== mrfSelectAnnotated ====

Module to select a subset of reads that overlap with a specified annotation set.

'''Usage''':

mrfSelectAnnotated <file.annotation> <include|exclude>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#MRF|MRF]] to STDOUT
* ''Required arguments''
** file.annotation - annotation set in [[#Interval|Interval]] format (one transcript per line).
** < include | exclude > - include: report reads that overlap with ''exonic'' regions of the annotation set; exclude: report reads that do not overlap with ''exonic'' regions of the annotation set
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrfRegionCount ====

Module to count the total number of reads in a specified region.

'''Usage''':

mrfRegionCount <targetName:targetStart-targetEnd>

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs the total number of reads in a specified region to STDOUT.
* ''Required arguments''
** targetName:targetStart-targetEnd - specifies the region of interest
* ''Optional arguments''
** None

 

=== Auxiliary utilities ===

This section includes various data format conversion utilities.

<center>[[#top|Top]]</center>
==== bed2interval ====

Utility to convert [[#BED|BED]] format into [[#Interval|Interval]] format.

'''Usage''':

bed2interval

* Inputs: Takes data in [[#BED|BED]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2bed ====

Utility to convert [[#Interval|Interval]] format into [[#BED|BED]] format.

'''Usage''':

interval2bed

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#BED|BED]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== interval2gff ====

Utility to convert [[#Interval|Interval]] format into [[#GFF|GFF]] format.

'''Usage''':

interval2gff <trackName>

* Inputs: Takes data in [[#Interval|Interval]] format from STDIN
* Outputs: Outputs data in [[#GFF|GFF]] format to STDOUT
* ''Required arguments''
** trackName - track name used in the [[#GFF|GFF]] file
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>

==== gff2interval ====

Utility to convert [[#GFF|GFF]] format into [[#Interval|Interval]] format.

'''Usage''':

gff2interval

* Inputs: Takes data in [[#GFF|GFF]] format from STDIN
* Outputs: Outputs data in [[#Interval|Interval]] format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== export2fastq ====

Module to generate FASTQ sequences from an ELAND export file.

'''Usage''':

export2fastq

* Inputs: Takes an ELAND export file from STDIN
* Outputs: Reports the extracted sequences in FASTQ format to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

 

<center>[[#top|Top]]</center>
==== mrf2sam ====

Module to convert [[#MRF|MRF]] to [[#SAM|SAM]].

'''Usage''':

mrf2sam

* Inputs: Takes [[#MRF|MRF]] from STDIN
* Outputs: Outputs [[#SAM|SAM]] to STDOUT
* ''Required arguments''
** None
* ''Optional arguments''
** None

FusionSeq Test Datasets

2011-08-07T09:38:40Z

Asboner:

{{FusionSeqHeader}}
Two datasets are available to test FusionSeq: NCIH660 and GM12878 cell-line data. These datasets are part of the FusionSeq dataset published in [http://genomebiology.com/2010/11/10/R104/abstract Genome Biology, 2010;11:R104]. Please note that the full set, including cancer tissue samples, is available at [http://www.ncbi.nlm.nih.gov/gap?term=phs000311.v1.p1 dbGaP (accession phs000311.v1.p1)], where confidentiality issues are properly handled. Here we provide the cell-line data in different formats:

==[[RSEQtools#Mapped_Read_Format (MRF)|Mapped Read Format (MRF)]]==
This is the format required by FusionSeq. [http://rseqtools.gersteinlab.org/ RSEQtools] provides several conversion tools to generate MRF files from the output of the most popular alignment tools.
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.mrf.gz GM12878.mrf.gz]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.mrf.gz NCIH660.mrf.gz]
Please read the '[[How to execute FusionSeq]]' section for more details on how to use these files.

==Auxiliary data==
In order to properly score the fusion candidates, gfrConfidenceValues requires an external '''meta''' file.
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.meta GM12878.meta]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.meta NCIH660.meta]

'''NB:''' please make sure that the *.meta files are tab-delimited.

The junction-sequence identifier module requires aligning all reads against the junction library. All the reads, including those that did not map, should be used in order to find as much support as possible for the breakpoint junction.
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878_allReads.txt.gz GM12878_allReads.txt.gz]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660_allReads.txt.gz NCIH660_allReads.txt.gz]
Please read the '[[How to execute FusionSeq]]' section for more details on how to use these files.

[[FusionSeq_List of programs#gfrBlacklistFilter|gfrBlackListFilter]] allows you to specify a list of candidates to be excluded. The tab-delimited file includes only the gene symbols of the two genes, such as:
LOC388160 LOC388161
LOC388161 LOC388161

An example of a blacklist file can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/datasets/blackList.txt here].

==FASTQ==
[http://en.wikipedia.org/wiki/FASTQ_format FASTQ] is a text-based format for storing both a biological sequence and its corresponding quality scores. Each tarball includes two FASTQ files, one for each end.
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.fastq.tar.gz GM12878.fastq.tar.gz]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.fastq.tar.gz NCIH660.fastq.tar.gz]

==BAM==
[http://samtools.sourceforge.net/ BAM] is the compressed binary version of the [http://samtools.sourceforge.net/ SAM (Sequence Alignment/Map)] format. We provide both BAM files and their corresponding index files (*.bai) so that they can be viewed with the [http://www.broadinstitute.org/igv/ Integrative Genomics Viewer (IGV)], a high-performance visualization tool from the Broad Institute for interactive exploration of large, integrated datasets.
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.bam http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.bam]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.bam.bai http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.bam.bai]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.bam http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.bam]
* [http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.bam.bai http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.bam.bai]
You can download the files locally or load them into IGV directly. See instructions at http://www.broadinstitute.org/igv/.

FusionSeq FAQ

2011-07-14T19:22:42Z

Asboner: /* gfrConfidenceValues: Cannot find the .meta file */

{{FusionSeqHeader}}
=General Questions=
==Does FusionSeq work with ''my favorite'' alignment tool?==
The format of the paired-end reads that is "understood" by FusionSeq is [[RSEQtools#Mapped_Read_Format|Mapped Read Format (MRF)]]. We provide several tools to convert the output of the most common alignment programs and formats, including SAM/BAM, into MRF. Please take a look at [http://rseqtools.gersteinlab.org/ RSEQtools] for more information, and specifically at [[RSEQtools#Format_conversion_utilities|Format conversion utilities]].

==Does FusionSeq work with colorspace paired-end reads?==
FusionSeq has been developed to be as independent as possible of the sequencing technology and the alignment tool. However, extensive testing was conducted on the Illumina Genome Analyzer II platform only.

==Where can I obtain the annotation data for hg19?==
Annotation data for hg19 can be found [[FusionSeq_Requirements#Human_genome_GRCh37.2Fhg19|here]].

==Can I use FusionSeq with ''my favorite species''?==
In principle, you can run FusionSeq using any paired-end RNA-Seq data. However, you would need to provide the corresponding data that is currently used for human, i.e.:
# a genome sequence, in 2bit format
# a gene annotation set in interval format, including ''composite'' models of genes
# the sequences of the composite models in the gene annotation set
# a mapping between your gene annotation and TreeFam (optional, used by gfrLargeScaleHomologyFilter)
# a list of the repetitive regions, in interval format (optional, used by gfrRepeatMaskerFilter)
# a ribosomal sequence library in 2bit format (optional, used by gfrRibosomalFilter)
# the mapping between your gene annotation and other descriptive information, e.g. gene symbols, descriptions, etc. (optional, used by gfrAddInfo)

==Where can I find some data sets to test FusionSeq?==
Please find some test data sets [[FusionSeq_Test_Datasets|here]].

==Is there a demo version of FusionSeq?==
A demo version of the web interface of FusionSeq is available [http://dynamic.gersteinlab.org/people/asboner/FusionSeq/geneFusions_cgi here]. You can access the results described in the paper by typing the sample ID (e.g. 106_T, 1700_D, etc.).

==Where can I find more information?==
The most up-to-date user documentation for FusionSeq is available [[FusionSeq|here]]. If you are looking for the developer documentation, you can find it [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/documentation/index.html here].

==The BOWTIE_INDEXES directory is used for reference indexes as well as for temporary index files. However, I have a centralized repository of indexed genomes and cannot create the temporary files in that directory==
A workaround for this issue is to create a local directory where you have write permission. This solves the problem of generating temporary index files when running the junction-sequence identifier module. To also have the indexed genome and transcriptome in the same folder, you can link them symbolically, for example:
$ cd /path/to/local/folder/
$ ln -s /path/to/centralized/repository/hg18_nh/ .
$ ln -s /path/to/centralized/repository/hg18_knownGeneAnnotationTranscriptCompositeModel/ .

Now, BOWTIE_INDEXES in geneFusionConfig.h should point to /path/to/local/folder/ where the user can generate temporary files:
#define BOWTIE_INDEXES /path/to/local/folder

==How can I cite FusionSeq?==
Please cite this publication:
* Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. '''FusionSeq: a modular framework for finding gene fusions by analyzing Paired-End RNA-Sequencing data'''. ''Genome Biol'' 21 Oct. 2010; '''11''':R104 [http://dx.doi.org/10.1186/gb-2010-11-10-r104 doi:10.1186/gb-2010-11-10-r104]

=Compilation troubleshooting=

==Where can I find the BIOS library, required for FusionSeq?==
As described in [[Requirements]], the [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

==TROOT.h: No such file or directory==
This error occurs because the compiler cannot find the TROOT.h file. This file is part of [http://root.cern.ch/drupal/ ROOT], a framework for mathematical and statistical analysis. If you have installed [http://root.cern.ch/drupal/ ROOT], please make sure that you have defined ROOTSYS as the path to the ROOT folder and added its bin directory to your PATH:
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
Please also see [[Installation_and_Configuration_of_FusionSeq#Installing_and_configuring_ROOT|Installing and configuring ROOT]] for more details.

=Running issues=
==FusionSeq does not find the annotation datasets. However, geneFusionConfig.h specifies their correct location and the files are present.==
This error:
ls_createFromFile '$HOME/path/to/data/annotation_data.txt'
occurs because environment variables, such as $HOME, are not expanded. Please use full path names in geneFusionConfig.h to specify directory locations.

==I followed the instructions, but I still get many WARNINGs. Is this expected?==
Yes, every program in FusionSeq provides some logging information. We recommend capturing the log data by redirecting STDERR (e.g. '2> fusionseq.log').

==geneFusions: Segmentation Fault==
There are a number of reasons why one gets this error. One possibility is the lack of sequences in the MRF file. Although MRF does not require the inclusion of sequences to be valid, sequences are required by geneFusions. Please ensure that sequences are present in the MRF file.

==gfrConfidenceValues: Cannot find the .meta file==
The .meta file is required to run [[FusionSeq_List_of_programs#gfrConfidenceValues|gfrConfidenceValues]]. This is a tab-delimited file including the number of mapped reads. A simple way to generate this file is to run:
$ MAPPED=$(grep -v "AlignmentBlock" file.mrf | grep -v "#" | wc -l); printf "Mapped_reads\t%d\n" $MAPPED > file.meta

The final file should look like:
Mapped_reads 123456789

==Paired-end reads and bowtie: I aligned each end separately. How do I convert the alignment file to MRF?==
To convert bowtie alignments into MRF when the two ends are aligned separately, the two ends must be on consecutive lines. This can be partially achieved by concatenating and sorting the two alignment files, e.g. cat end_1.bowtie end_2.bowtie | sort > alignment.bowtie. However, in some cases only one end is mapped, which creates "singletons" in the alignment file, where only one end is reported. Since there have been many requests regarding this issue, we decided to share an "internal" utility: bowtiePairedFix. The binary file can be downloaded here:

* [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/tarballs/bowtiePairedFix.linux64 bowtiePairedFix (GNU/Linux x86_64)]
* [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/tarballs/bowtiePairedFix.MacOs.10.6.7 bowtiePairedFix (MacOs 10.6.7)]
The conversion command is:
cat end_1.bowtie end_2.bowtie | sort | bowtiePairedFix | bowtie2mrf paired -sequence > data.mrf 2> data.mrf.log

Please note that this program is also provided "as is".

FusionSeq List of programs

2011-07-14T19:20:41Z

Asboner: /* gfrConfidenceValues */

{{FusionSeqHeader}}
== Data formats ==
FusionSeq uses a few data formats to perform its operations.

=== Mapped Read Format (MRF) ===
This format is defined in the context of [http://rseqtools.gersteinlab.org/ RSEQtools]. More details can be found [http://info.gersteinlab.org/RSEQtools#Data_formats here].

===Gene Fusion Report (GFR)===
This file format defines the relevant information for each fusion transcript candidate. The rationale is that different filters can be applied to exclude “false positive” artificial fusions, starting from an initial set. We also provide a parser that interprets this format, allowing the user to easily propagate any changes to this format. For a given fusion candidate involving gene A and gene B, the basic GFR format requires the following fields:
# the ID of the fusion candidate (''id''): typically it contains the sample name and a unique number separated by an underscore. The number is padded with zeros for consistency;
# ''SPER'', ''DASPER'' and ''RESPER'': scoring of the fusion candidate;
# Number of inter-transcript reads (''numInter''), i.e. the number of pairs having the ends mapped to the two genes;
# P-value of the insert size distribution analysis for the fusion transcript. Since we do not know the actual composition of the fusion transcript, we compute the p-value for both directions: AB (where gene A is upstream of gene B; ''pValueAB'') and BA (where gene B is upstream of gene A; ''pValueBA'');
# Mean insert-size value of the minimal fusion transcript fragment. As for the p-values, we compute both AB and BA versions (''interMeanAB'', ''interMeanBA'');
# Number of intra-transcript reads for gene A (''numIntra1'') and gene B (''numIntra2''), respectively, i.e. the number of pairs where both ends map to the same gene;
# The type of the fusion (''fusionType''): cis, when both genes are on the same chromosome, or trans, otherwise;
# Name(s) of the transcripts (''nameTranscript''): all the UCSC gene IDs of the isoforms of each gene in the annotation separated by the pipe symbol '|';
# Chromosome of the genes (''chromosomeTranscript'');
# Strand information (''strandTranscript'');
# Start and end coordinates of the longest transcript for both genes (''startTranscript'', ''endTranscript'');
# Number of exons in the composite model for both genes (''numExonTranscript'');
# Coordinates of the exons in the composite model (''exonCoordinatesTranscript''): each exon is separated by the pipe symbol '|' and start and end coordinates are comma-separated;
# Exon-pair count: it describes which elements are connected and corresponding number of inter-transcript reads;
# interReads: the pair-read type, as well as the exons and the coordinates of the reads joining the two genes. Pair-type, exon number, start and end coordinates are reported as a comma-separated list, with the pipe symbol '|' separating the different pairs. The pair-read type encodes the different categories into which two reads can be classified with respect to the gene annotation set:
#* 1 : exon-exon
#* 2 : exon-intron
#* 3 : intron-exon
#* 4 : intron-intron
#* 5 : intron-boundary
#* 6 : exon-boundary
#* 7 : boundary-exon
#* 8 : boundary-intron
#* 9 : boundary-boundary
# Reads of the transcripts: the actual sequence of all the inter-reads.
# Pair-count: a summary of the number of reads for each category and joined exons (see interReads for the category definition). The field reports the pair-reads type, the number of reads, the two exons that are joined by the pair as a comma-separated list. The different pair types are separated by the pipe "|" symbol.
The GFR format can include additional optional information computed in the subsequent processing. For example, it is possible to add gene symbols (''geneSymbolTranscript'') and descriptions (''descriptionTranscript'') from the UCSC knownGene annotation set.
<center>[[#top|Top]]</center>

=== Breakpoint data format (BP) ===
Similarly to [[#Gene_Fusion_Report_(GFR)|GFR]], the junction-sequence identifier uses a standard format to capture the results of this analysis. For each tile that has at least one read aligned to it, it reports the following comma-separated fields:
# chromosome, start and end coordinates of the first tile, using UCSC notation: “chr:start-end”, although the intervals are 1-based and closed;
# chromosome, start and end coordinates of the second tile
# All the sequences of the reads mapped to that tile with the offset information, separated by the pipe symbol.
For example, one line may read as:
chr21:38764851-38764892,chr21:41758661-41758702,31:GTAGAATCATTCATTTCATTCTTGCAAACCAGCCTGCTTGGCCAGGAGGCA|30:TGTAGAATCATTCATTTCATTCTTGCAAACCAGCCTGCTTGGCCAGGAGGC
where two reads support this specific junction.
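
A line in this format can be parsed with a few string splits. The sketch below is only an illustration based on the example line above; the function names parse_tile and parse_bp_line are hypothetical and are not part of FusionSeq.

```python
def parse_tile(tile):
    """Split 'chr:start-end' into (chromosome, start, end)."""
    chrom, coords = tile.rsplit(":", 1)
    start, end = coords.split("-")
    return chrom, int(start), int(end)


def parse_bp_line(line):
    """Return (first tile, second tile, [(offset, sequence), ...]) for one BP line."""
    tile1, tile2, reads_field = line.rstrip("\n").split(",", 2)
    reads = []
    for entry in reads_field.split("|"):  # reads are separated by the pipe symbol
        offset, sequence = entry.split(":", 1)
        reads.append((int(offset), sequence))
    return parse_tile(tile1), parse_tile(tile2), reads
```

Applied to the example line, the parser returns the two chr21 tiles and a list of two reads, with offsets 31 and 30.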

<center>[[#top|Top]]</center>

=Core programs=
==Fusion detection module==
==== geneFusions ====

geneFusions identifies potential fusion transcript candidates from an alignment file.

'''Usage''':

geneFusions prefix minNumberOfReads < sample.mrf > fusions.gfr

* Inputs: Takes an [[#Mapped_Read_Format_(MRF)|MRF]] file from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** prefix - the main ID of each candidate, i.e. prefix_0001, prefix_0002, etc.
** minNumberOfReads - the minimum number of reads required to include a candidate
* ''Optional arguments''
** none
'''Note''': sample.mrf must include the sequences of the reads.

<center>[[#top|Top]]</center>

==== gfrClassify ====

gfrClassify assigns each fusion candidate to a specific category: inter-, intra-chromosomal, read-through, or cis. Please see the [http://dx.doi.org/10.1186/gb-2010-11-10-r104 publication] for a description of these classes.

'''Usage''':

gfrClassify < fileIN.gfr > fileOUT.gfr

* Inputs: Takes a [[#Gene_Fusion_Report_(GFR)|GFR]] file from STDIN
* Outputs: Reports a [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT

<center>[[#top|Top]]</center>

==Filtration cascade module==
===Mis-alignment filters===

----

==== gfrLargeScaleHomologyFilter ====
It removes potential fusion transcript candidates if the two genes are paralogs. It uses [http://www.treefam.org/ TreeFam] to establish whether two genes have similar sequences.

'''Usage''':

gfrLargeScaleHomologyFilter < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT

<center>[[#top|Top]]</center>

==== gfrSmallScaleHomologyFilter ====

It removes candidates that have high similarity between small regions within the two genes, where the reads actually map.

'''Usage''':

gfrSmallScaleHomologyFilter < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT

<center>[[#top|Top]]</center>

==== gfrRepeatMaskerFilter ====

Some reads may be aligned to repetitive regions in the genome, due to the low sequence complexity of those regions, and may result in artificial fusion candidates. This filter removes reads mapped to those regions. If the number of reads left is less than a threshold, the candidate is removed.

'''Usage''':

gfrRepeatMaskerFilter repeatMasker.interval minNumberOfReads < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** repeatMasker.interval - the interval file with the coordinates of the repetitive regions
** minNumberOfReads - minimum number of reads overlapping the repetitive regions in order to remove the candidate

<center>[[#top|Top]]</center>

===Random pairing of transcript fragments===

----

==== gfrAbnormalInsertSizeFilter ====

gfrAbnormalInsertSizeFilter removes candidates with an insert-size bigger than the normal insert-size. The fusion candidate insert-size is computed on the ''minimal fusion transcript fragment''.

'''Usage''':

gfrAbnormalInsertSizeFilter pvalueCutOff < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** pvalueCutOff - the p-value threshold above which we keep the fusion transcript candidates

<center>[[#top|Top]]</center>

===Combination of mis-alignment and random pairing===

----

==== gfrRibosomalFilter ====

gfrRibosomalFilter removes candidates that have similarity with ribosomal genes. The rationale is that reads coming from highly expressed genes, such as ribosomal genes, are more likely to be mis-aligned and assigned to a different gene.

'''Usage''':

gfrRibosomalFilter < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT

<center>[[#top|Top]]</center>

==== gfrExpressionConsistencyFilter ====

It removes candidates that have a higher number of chimeric PE reads than PE reads aligned to the corresponding individual genes. This filter would only consider cases where the chimeric reads are mapped to introns of the two genes.

'''Usage''':

gfrExpressionConsistencyFilter < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT

<center>[[#top|Top]]</center>

===Other filters===

----

==== gfrPCRFilter ====

gfrPCRFilter removes candidates in which the same read is over-represented, yielding a “spike-in-like” signal, i.e. a narrow signal with a high peak.

'''Usage''':

gfrPCRFilter offsetCutoff minNumUniqueRead
* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** offsetCutoff - the minimum number of different starting positions
** minNumUniqueRead - the minimum number of unique reads required to include a candidate

<center>[[#top|Top]]</center>

==== gfrAnnotationConsistencyFilter ====

It removes candidates involving genes with a specific description, such as ribosomal genes, pseudogenes, etc.

'''Usage''':

gfrAnnotationConsistencyFilter string < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** string - a string identifying the element to remove, e.g. pseudogene

<center>[[#top|Top]]</center>

==== gfrProximityFilter ====

It removes candidates that are likely due to mis-annotation of the 5' or 3' ends of the genes.

'''Usage''':

gfrProximityFilter offset < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** offset - the minimum distance (in nucleotides) between the two genes to keep the candidate
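
The distance criterion can be sketched as below. This is an illustration under stated assumptions, not the gfrProximityFilter implementation: the function names are hypothetical, the distance is measured between the closest ends of the two genes, and candidates whose genes lie on different chromosomes are assumed to be always kept.

```python
def gene_distance(start_a, end_a, start_b, end_b):
    """Distance in nucleotides between two genes on the same chromosome (0 if they overlap)."""
    if start_b > end_a:
        return start_b - end_a
    if start_a > end_b:
        return start_a - end_b
    return 0


def keep_candidate(chrom_a, start_a, end_a, chrom_b, start_b, end_b, offset):
    """Keep the candidate unless the two genes are closer than `offset` nucleotides."""
    if chrom_a != chrom_b:
        return True  # assumption: proximity only matters on the same chromosome
    return gene_distance(start_a, end_a, start_b, end_b) >= offset
```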

<center>[[#top|Top]]</center>

==== gfrBlackListFilter ====

It removes candidates specified by the user in a file.

'''Usage''':

gfrBlackListFilter blackList.txt < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** blackList.txt - the file with the candidates to remove. This is a simple two-column, tab-delimited file listing the two gene symbols. For example:
<pre>
LOC388160 LOC388161
LOC388161 LOC388161
LOC440498 LOC440498
</pre>
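
A file in this format is easy to load and query. The sketch below is illustrative only: load_blacklist and is_blacklisted are hypothetical helper names, and treating a pair as order-insensitive is an assumption that may not match the behaviour of gfrBlackListFilter.

```python
def load_blacklist(lines):
    """Read two-column, tab-delimited lines into a set of gene-symbol pairs."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:  # skip blank or malformed lines
            pairs.add(frozenset(fields))
    return pairs


def is_blacklisted(gene_a, gene_b, pairs):
    """True if the two gene symbols form a blacklisted pair, in either order."""
    return frozenset((gene_a, gene_b)) in pairs
```

Using frozenset also covers entries in which both columns carry the same gene symbol, as in the second and third lines of the example.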

<center>[[#top|Top]]</center>

==== gfrSpliceJunctionFilter ====

It removes candidates if the reads can be aligned to a splice junction library. '''NB''': this filter should be used only if the alignment was not performed against a splice junction library.

'''Usage''':

gfrSpliceJunctionFilter splice_junction_library < fileIN.gfr > fileOUT.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** splice_junction_library - the splice_junction_library in 2bit format

<center>[[#top|Top]]</center>

==== gfrMitochondrialFilter ====

It removes candidates including a mitochondrial gene. '''NB''': this filter should be used only if the alignment was performed against a gene annotation set with mitochondrial genes.

'''Usage''':

gfrMitochondrialFilter < fileIN.gfr > fileOUT.gfr

* Input: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Output: Reports the filtered [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT

<center>[[#top|Top]]</center>

===Scoring the candidates===

----

==== gfrConfidenceValues ====

gfrConfidenceValues computes the scores SPER, DASPER, and RESPER for each candidate. SPER is the number of '''S'''upportive '''PE R'''eads per candidate, i.e. the normalized number of inter-transcript reads.

<math>SPER=\frac {m}{N_{mapped}} \times 10^6</math>

where <math>m</math> is the number of inter-transcript reads, i.e. PE-reads connecting two different genes, and <math>N_{mapped}</math> is the total number of mapped PE reads.
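
The SPER formula can be checked with a short Python sketch. The function name <code>sper</code> and the read count of 50 are made up for the illustration; the number of mapped reads is the one used in the .meta example on this page.

```python
def sper(inter_transcript_reads, mapped_reads):
    """Supportive PE reads per million mapped PE reads."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return inter_transcript_reads / mapped_reads * 1e6


# Hypothetical candidate supported by 50 inter-transcript PE reads
# in a data set with 236549456 mapped PE reads.
example_sper = sper(50, 236549456)
```

For these numbers SPER is roughly 0.21, i.e. about one supportive PE read for every five million mapped reads.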

DASPER ('''D'''ifference between the observed and the '''A'''nalytically computed '''SPER''') and RESPER ('''R'''atio between the observed SPER and the '''E'''mpirically computed '''SPER''') compare the observed SPER with two slightly different ways to compute the expectation.
More information about these scores is available in [http://genomebiology.com/2010/11/10/R104/abstract Genome Biology, 2010;11:R104].

'''Usage''':

gfrConfidenceValues prefix
* Input: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Output: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** prefix - the prefix of the .meta file
'''Note''': gfrConfidenceValues expects that the file prefix.meta is available. This tab-delimited file includes some meta information about the MRF dataset. The minimum required set of information is the number of mapped reads with the following syntax as an example:
Mapped_reads 236549456
This represents the number of PE reads that were mapped. It is possible to determine this number by counting the number of lines in the [[#Mapped Read Format (MRF)|MRF]] file. For example:
grep -v AlignmentBlocks file.mrf | grep -v "#" | wc -l

A one-line command to generate this is:
$ MAPPED=$(grep -v "AlignmentBlock" file.mrf | grep -v "#" | wc -l); printf "Mapped_reads\t%d\n" $MAPPED > file.meta
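
For users who prefer a script, the same counting logic can be sketched in Python. This only mirrors the two <code>grep -v</code> filters of the shell command; the function names <code>count_mapped_reads</code> and <code>format_meta</code> are hypothetical and not part of FusionSeq.

```python
def count_mapped_reads(mrf_lines):
    """Count the lines that contain neither 'AlignmentBlock' nor '#'.

    This mirrors: grep -v "AlignmentBlock" | grep -v "#" | wc -l
    """
    return sum(
        1
        for line in mrf_lines
        if "AlignmentBlock" not in line and "#" not in line
    )


def format_meta(mapped_reads):
    """Return the tab-delimited line expected in the .meta file."""
    return "Mapped_reads\t%d\n" % mapped_reads
```

A possible use is <code>open("file.meta", "w").write(format_meta(count_mapped_reads(open("file.mrf"))))</code>, where the file names are placeholders.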

<center>[[#top|Top]]</center>

==== gfrConfidenceValueTranscript [deprecated]====

gfrConfidenceValueTranscript computes LSPER ('''L'''ocal '''SPER''') as the number of inter-transcript PE reads supporting the fusion divided by the average gene expression value. However, since in many cases, only one allele contributes to the fusion transcript, the expression of the fusion transcript may not correlate with the expression of the genes generating it. Please find at [http://genomebiology.com/2010/11/10/R104/abstract Genome Biology 2010;11:R104] more details about why this score may impair the correct ranking of the candidates (see Additional file 1, text and Figure S6).

'''Usage''':

gfrConfidenceValueTranscript prefix
* Input: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Output: Reports [[#Gene_Fusion_Report_(GFR)|GFR]] to STDOUT
* ''Required arguments''
** prefix - the prefix of the .composite.expression file
'''Note''': gfrConfidenceValueTranscript expects that the file prefix.composite.expression is available. This file includes the expression values for the genes that are part of the transcript. The best way to compute it is with [[RSEQtools#mrfQuantifier|mrfQuantifier]] from [http://rseqtools.gersteinlab.org/ RSEQtools], e.g.:
mrfQuantifier knownGeneTranscriptCompositeModel.txt multipleOverlap < prefix.mrf > prefix.composite.expression
where knownGeneTranscriptCompositeModel.txt is the interval file representing the UCSC knownGenes library.
<center>[[#top|Top]]</center>

==Junction-sequence identification module==

==== gfr2bpJunctions ====

It generates the splice junction library and two files to be run on a cluster to perform the indexing and the mapping in parallel.

'''Usage''':

gfr2bpJunctions <file.gfr> <tileSize> <sizeFlankingRegion> <minDASPER>

* Input: file.gfr - [[#Gene_Fusion_Report_(GFR)|GFR]] file (not from STDIN)
* Outputs: Two files with all the jobs to be run on a cluster:
** file_joblist1.txt: the instructions to index the library and align the reads to the junction library.
** file_joblist2.txt: the instructions to aggregate the results of the alignment.
* Required parameters:
** tileSize: the number of nucleotides in each tile. For example, a 50 nt read may be aligned to an 80 nt junction, thus tileSize would be 40.
** sizeFlankingRegion: the size, in nucleotides, of the flanking region around the exons.
** minDASPER: the minimum DASPER value for which the breakpoint analysis is performed.
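
The relation between read length and tileSize in the example above can be worked through with a small Python sketch. It assumes, for illustration only, that a junction is the concatenation of two tiles of equal length with the breakpoint between them; the function names and the <code>min_overlap</code> parameter are hypothetical.

```python
def junction_length(tile_size):
    """Length of a junction assumed to consist of two tiles of equal size."""
    return 2 * tile_size


def spanning_offsets(read_length, tile_size, min_overlap=1):
    """List the read start positions (0-based) that span the breakpoint.

    A read spans the breakpoint when it fits inside the junction and covers
    at least min_overlap nucleotides of each tile.
    """
    junction = junction_length(tile_size)
    first = max(0, tile_size - read_length + min_overlap)
    last = min(junction - read_length, tile_size - min_overlap)
    return list(range(first, last + 1)) if last >= first else []
```

Under these assumptions, a 50 nt read on an 80 nt junction (tileSize 40) has 31 possible starting positions, all of which span the breakpoint because the read is longer than a single tile.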

<center>[[#top|Top]]</center>

==== validateBpJunctions ====

It validates the junctions, i.e. excludes those junctions with sequence similarity to the other regions of the genome.

'''Usage''':

validateBpJunctions < fileIN.bp > fileOUT.bp

* Input: a [[#Breakpoint data format (BP)|BP]] file from STDIN
* Output: a [[#Breakpoint data format (BP)|BP]] to STDOUT

<center>[[#top|Top]]</center>

==== bpFilter ====

It filters out junctions for a number of reasons (see the parameters).

'''Usage''':

bpFilter <minNumReads> <minNumUniqueOffsets> <minNumReadsForKS> <pValueCutoffForKS> <numPossibleOffsets>

* Input: [[#Breakpoint Data Format (BP)|BP]] file from STDIN
* Output: [[#Breakpoint Data Format (BP)|BP]] file to STDOUT
* Required parameters
** minNumReads: minimum number of reads aligned to the junction to be kept
** minNumUniqueOffsets: minimum number of unique reads aligned to the junction to avoid PCR artifacts
** minNumReadsForKS: minimum number of reads required to perform a Kolmogorov-Smirnov (KS) test, which compares the distribution of the reads with a uniform one. The test is performed only if there are sufficient reads.
** pValueCutoffForKS: the pvalue cut-off if a KS test is used
** numPossibleOffsets: number of possible offsets, i.e. starting positions of the reads
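
To make the role of these parameters more concrete, here is a minimal Python sketch of the two simplest checks and of a KS statistic against a uniform distribution. It is not the bpFilter implementation: the function names are hypothetical, "unique offsets" is interpreted as distinct read start positions, and the discrete offsets are compared with a continuous uniform distribution on [0, numPossibleOffsets), which is an approximation. The p-value computation is omitted.

```python
def passes_count_filters(offsets, min_num_reads, min_num_unique_offsets):
    """Check the read-count and unique-offset thresholds for one junction."""
    return (
        len(offsets) >= min_num_reads
        and len(set(offsets)) >= min_num_unique_offsets
    )


def ks_statistic_uniform(offsets, num_possible_offsets):
    """One-sample KS statistic D of the offsets against a uniform distribution."""
    n = len(offsets)
    if n == 0:
        raise ValueError("at least one offset is required")
    scaled = sorted(offset / num_possible_offsets for offset in offsets)
    d = 0.0
    for i, x in enumerate(scaled):
        # Largest gap between the empirical CDF and the uniform CDF F(x) = x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d
```

Reads piled up on a single offset, as expected from PCR artifacts, give a large D, whereas evenly spread offsets give a small one.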
<center>[[#top|Top]]</center>

==== bp2wig ====

It generates a signal track from the reads aligned to the junction.

'''Usage''':

bp2wig file.bp

* Input: [[#Breakpoint Data Format (BP)|BP]] file (not from STDIN)
* Output: WIGGLE files showing the support of the junction

<center>[[#top|Top]]</center>

==== bp2alignment ====

It generates a text representation of the reads aligned to the junction.

'''Usage''':

bp2alignment

* Input: [[#Breakpoint Data Format (BP)|BP]] file from STDIN
* Output: a text file to STDOUT reporting the reads aligned to the junction. For example:
<pre>
Tile 1: chr21:38717312-38717353
Tile 2: chr21:41801877-41801918
Number of reads spanning breakpoint: 5

AGGAGGGTTC CTGCCGCGCTCCAGGCGGCGCTCCCCGCCCCTCGCCCTCCG
ATTCATCAGGAGAGTTC CTACCGCGCTCCAGGCGGCGCTCCCCGCCCCTCG
CCACACTGCATTCATCAGGAGAGTTC CTGCCGCGCTCCAGGCGGCGCTCCC
CTTCCCGCCTTCGGCCACACTGCATTCATCAGGAGAGTTC CTGCCGCGCTC
CTTCCCGCCTTCGGCCACACTGCATTCATCAGGAGAGTTC CTGCCGCGCTC
TCTTCCCGCCTTTGGCCACACTGCATTCATCAGGAGAGTTC CTGCCGCGCTCCAGGCGGCGCTCCCCGCCCCTCGCCCTCCG
</pre>

<center>[[#top|Top]]</center>

==== bowtie2bp ====

Utility for debugging. It converts the alignment to the fusion junction library from bowtie output to [[#Breakpoint Data Format(BP)|BP]].

'''Usage''':

bowtie2bp

* Input: bowtie alignment file from STDIN
* Output: [[#Breakpoint Data Format(BP)|BP]] file to STDOUT.

<center>[[#top|Top]]</center>

=Auxiliary modules=
==== gfr2images ====

It generates a schematic illustration depicting which regions of the two genes are connected by PE reads.

'''Usage''':

gfr2images < fileIN.gfr

* Pre-requisite: the pairCount column should be present in the [[#Gene_Fusion_Report_(GFR)|GFR]] file. [[#gfrCountPairTypes|gfrCountPairTypes]] should be executed first.
* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: JPEG images labeled SampleID_0000#.jpg, where '''SampleID''' is the unique identification of the fusion candidate present in the [[#Gene_Fusion_Report_(GFR)|GFR]] file.

<center>[[#top|Top]]</center>

==== gfr2fasta ====

It generates two FASTA files, one for each gene, with the PE-reads connecting them.

'''Usage''':

gfr2fasta < fileIN.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: FASTA files labeled SampleID_0000#_[1|2].fasta, where '''SampleID''' is the unique identification of the fusion candidate present in the [[#Gene_Fusion_Report_(GFR)|GFR]] file and [1|2] indicates which gene they correspond to.

<center>[[#top|Top]]</center>

==== gfr2bed ====

It generates two BED files of the fusion transcript reads; one per gene.

'''Usage''':

gfr2bed < fileIN.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: BED files labeled SampleID_0000#_[1|2].bed, where '''SampleID''' is the unique identification of the fusion candidate present in the [[#Gene_Fusion_Report_(GFR)|GFR]] file and [1|2] indicates which gene they correspond to.

<center>[[#top|Top]]</center>

==== gfr2gff ====

It generates a GFF file for all candidates on the same chromosome reporting all PE-reads for visualization in genome browsers such as UCSC.

'''Usage''':

gfr2gff < fileIN.gfr

* Inputs: [[#Gene_Fusion_Report_(GFR)|GFR]] from STDIN
* Outputs: GFF files labeled SampleID_0000#.gff, where '''SampleID''' is the unique identification of the fusion candidate present in the [[#Gene_Fusion_Report_(GFR)|GFR]] file.

<center>[[#top|Top]]</center>

==== export2mrf ====

It generates an MRF file from the export files of a Genome Analyzer II run.

'''Usage''':

export2mrf prefix file_1_export.txt file_2_export.txt > file.mrf

* Inputs:
** prefix: the sample ID
** file_1_export.txt: the export file for the first end
** file_2_export.txt: the export file for the second end
* Outputs:
** An [[RSEQtools#Mapped_Read_Format_.28MRF.29|MRF]] file with all the mapped reads to STDOUT
** prefix_allReads.txt: the list of all the reads for the breakpoint analysis. Each line reports only a read
** prefix.meta: a summary file (tab-delimited) including the total number of reads and the number of mapped reads. This file is a prerequisite for [[#gfrConfidenceValues|gfrConfidenceValues]].

<center>[[#top|Top]]</center>

==== gfrAddInfo ====

It includes additional information about the fusion transcript candidates such as gene symbols and gene description. This is a pre-requisite for [[#gfrBlackListFilter|gfrBlackListFilter]] and [[#gfrAnnotationConsistencyFilter|gfrAnnotationConsistencyFilter]].

'''Usage''':

gfrAddInfo < fileIN.gfr > fileOUT.gfr

* Pre-requisite:
** An external file that includes all descriptive information about the annotation set. The format of this file should follow kgXref.txt (from UCSC). We use kgXref.txt for human, but this can be changed in [[#geneFusionConfig|geneFusionConfig]].
* Input:
** A [[#Gene Fusion Report (GFR)|GFR]] file from STDIN.
* Output:
** A [[#Gene Fusion Report (GFR)|GFR]] file to STDOUT with 4 additional columns:
**# geneSymbolTranscript1
**# geneSymbolTranscript2
**# descriptionTranscript1
**# descriptionTranscript2

<center>[[#top|Top]]</center>

====gfrCountPairTypes====

It counts how many PE reads are assigned to each category, i.e. exon-exon, exon-intron, etc. This is used by the CGI programs to summarize the statistics of the fusion candidate.

'''Usage''':

gfrCountPairTypes < fileIN.gfr > fileOUT.gfr

* Input:
** A [[#Gene Fusion Report (GFR)|GFR]] file from STDIN.
* Output:
** A [[#Gene Fusion Report (GFR)|GFR]] file to STDOUT with 1 additional column (pairCount). See [[#Gene Fusion Report (GFR)|GFR]] for a description of this field.

<center>[[#top|Top]]</center>

FusionSeq FAQ

2011-07-14T19:19:03Z

Asboner: /* geneFusions: Segmentation Fault */

{{FusionSeqHeader}}
=General Questions=
==Does FusionSeq work with ''my favorite'' alignment tool?==
The format of the paired-end reads that is "understood" by FusionSeq is [[RSEQtools#Mapped_Read_Format|Mapped Read Format (MRF)]]. We provide several conversion tools from most common alignment programs and formats, including SAM/BAM, to represent mapped reads using MRF. Please take a look at [http://rseqtools.gersteinlab.org/ RSEQtools] for more information, and specifically to: [[RSEQtools#Format_conversion_utilities|Format conversion utilities]].

==Does FusionSeq work with colorspace paired-end reads?==
FusionSeq has been developed to be as independent as possible of the sequencing technology and the alignment tool. However, extensive testing was conducted on the Illumina Genome Analyzer II platform only.

==Where can I obtain the annotation data for hg19?==
Annotation data for hg19 can be found [[FusionSeq_Requirements#Human_genome_GRCh37.2Fhg19|here]].

==Can I use FusionSeq with ''my favorite species''?==
In principle, you can run FusionSeq using any paired-end RNA-Seq data. However, you would need to provide the corresponding data that is currently used for human, i.e.:
# a genome sequence, in 2bit format
# a gene annotation set in interval format; including ''composite'' models of genes
# the sequences of the composite models in the gene annotation set
# a mapping between your gene annotation and TreeFam (optional, used by gfrLargeScaleHomologyFilter)
# a list of the repetitive regions, in interval format (optional, used by gfrRepeatMaskerFilter)
# a ribosomal sequence library in 2bit format (optional, used by gfrRibosomalFilter)
# the mapping between your gene annotation and other descriptive information, e.g. gene symbols, descriptions, etc. (optional, used by gfrAddInfo)

==Where can I find some data sets to test FusionSeq?==
Please find some test data sets [[FusionSeq_Test_Datasets|here]].

==Is there a demo version of FusionSeq?==
A demo version of the web-interface of FusionSeq is available [http://dynamic.gersteinlab.org/people/asboner/FusionSeq/geneFusions_cgi here]. You can access the results described in the paper, by typing the sample ID (e.g. 106_T, 1700_D, etc.).

==Where can I find more information?==
The most up-to-date user documentation for FusionSeq is available [[FusionSeq|here]]. If you look for the developer's documentation, you can find it [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/documentation/index.html here].

==The BOWTIE_INDEXES directory is used for reference indexes as well as for temporary index files. However, I have a centralized repository of indexed genomes and cannot create the temporary files in that directory==
A workaround for this issue is to create a local directory where one has write permission. This solves the problem of generating temporary index files when running the junction-sequence identifier module. To also have the indexed genome and transcriptome in the same folder, one could link them symbolically, for example:
$ cd /path/to/local/folder/
$ ln -s /path/to/centralized/repository/hg18_nh/ .
$ ln -s /path/to/centralized/repository/hg18_knownGeneAnnotationTranscriptCompositeModel/ .

Now, BOWTIE_INDEXES in geneFusionConfig.h should point to /path/to/local/folder/ where the user can generate temporary files:
#define BOWTIE_INDEXES /path/to/local/folder

==How can I cite FusionSeq?==
Please cite this publication:
* Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. '''FusionSeq: a modular framework for finding gene fusions by analyzing Paired-End RNA-Sequencing data'''. ''Genome Biol'' 21 Oct. 2010; '''11''':R104 [http://dx.doi.org/10.1186/gb-2010-11-10-r104 doi:10.1186/gb-2010-11-10-r104]

=Compilation troubleshooting=

==Where can I find the BIOS library, required for FusionSeq?==
As described in [[Requirements]], the [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

==TROOT.h: No such file or directory==
This error occurs because the compiler does not find TROOT.h file. This file is part of [http://root.cern.ch/drupal/ ROOT], a framework for mathematical and statistical analysis. If you have installed [http://root.cern.ch/drupal/ ROOT], please make sure that you have defined ROOTSYS as the path to the ROOT folder and added it to your PATH:
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
Please also see [[Installation_and_Configuration_of_FusionSeq#Installing_and_configuring_ROOT|Installing and configuring ROOT]] for more details.

=Running issues=
==FusionSeq does not find the annotation datasets. However, geneFusionConfig.h specifies their correct location and the files are present.==
This error:
ls_createFromFile '$HOME/path/to/data/annotation_data.txt'
occurs because environment variables, such as $HOME, are not interpreted. Please use full path names in geneFusionConfig.h to specify directory locations.

==I followed the instructions, but I still get many WARNINGs. Is this expected?==
Yes, every program in FusionSeq provides some logging information. We recommend capturing the log data by redirecting STDERR (e.g. '2> fusionseq.log').

==geneFusions: Segmentation Fault==
There are a number of reasons why one may get this error. One possibility is the lack of sequences in the MRF file. Although MRF does not require the inclusion of sequences to be valid, sequences are required by geneFusions. Please ensure that sequences are present in the MRF file.

==gfrConfidenceValues: Cannot find the .meta file==
The .meta file is required to run gfrConfidenceValues. This is a tab-delimited file including the number of mapped reads. A simple way to generate this file is to run:
$ MAPPED=$(grep -v "AlignmentBlock" file.mrf | grep -v "#" | wc -l); printf "Mapped_reads\t%d\n" $MAPPED > file.meta

The final file should look like:
Mapped_reads 123456789
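
Because the file must be tab-delimited, a quick check of an existing .meta file can help to rule out formatting problems. The following Python sketch is only a sanity check under that assumption; the function name <code>read_mapped_reads</code> is hypothetical and the parsing done by gfrConfidenceValues itself may differ.

```python
def read_mapped_reads(meta_text):
    """Return the Mapped_reads value from the content of a .meta file.

    Raises ValueError if no tab-delimited Mapped_reads entry is present.
    """
    for line in meta_text.splitlines():
        fields = line.split("\t")
        if len(fields) == 2 and fields[0] == "Mapped_reads":
            return int(fields[1])
    raise ValueError("no tab-delimited Mapped_reads entry found")
```

A file in which the key and the value are separated by spaces instead of a tab is rejected by this check.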

==Paired-end reads and bowtie: I aligned each end separately. How do I convert the alignment file to MRF?==
To convert bowtie alignment into MRF when ends are aligned separately, we require the two ends to be on subsequent lines. This could be partially achieved by concatenating and sorting the two alignment files, e.g. cat end_1.bowtie end_2.bowtie | sort > alignment.bowtie. However, in some cases, only one end is mapped, thus creating "singletons" in the alignment file, where only one end is reported. Since there have been many requests regarding this issue, we decided to share an "internal" utility: bowtiePairedFix. Here you can download the binary file:

* [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/tarballs/bowtiePairedFix.linux64 bowtiePairedFix (GNU/Linux x86_64)]
* [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/tarballs/bowtiePairedFix.MacOs.10.6.7 bowtiePairedFix (MacOs 10.6.7)]
The conversion command is:
cat end_1.bowtie end_2.bowtie | sort | bowtiePairedFix | bowtie2mrf paired -sequence > data.mrf 2> data.mrf.log

Please note that this program is also provided "as is".
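
To show what a "singleton" looks like after concatenating and sorting, here is a minimal Python sketch that separates complete pairs from singletons. It does not reproduce bowtiePairedFix, whose handling of singletons is not described here. It assumes that the read name is the first tab-delimited column and that the two ends carry the suffixes <code>/1</code> and <code>/2</code>; the function names are hypothetical.

```python
def read_id(line):
    """Return the read name without a trailing /1 or /2 end suffix."""
    name = line.split("\t", 1)[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def split_pairs_and_singletons(sorted_lines):
    """Group consecutive lines of a sorted alignment file into pairs."""
    pairs, singletons = [], []
    i = 0
    while i < len(sorted_lines):
        current = sorted_lines[i]
        has_next = i + 1 < len(sorted_lines)
        if has_next and read_id(sorted_lines[i + 1]) == read_id(current):
            # Both ends were mapped and are on subsequent lines.
            pairs.append((current, sorted_lines[i + 1]))
            i += 2
        else:
            # Only one end was mapped.
            singletons.append(current)
            i += 1
    return pairs, singletons
```

The sketch only identifies which lines lack a mate; the conversion to MRF still requires the command shown above.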

Tools

2011-06-28T12:27:58Z

Asboner:

[http://molmovdb.org/molmovdb/morph Morph Server] generates a plausible pathway between two conformations of a protein or nucleic acid structure. A large number of statistics and several high-quality movies are output.

[http://bioinfo.mbb.yale.edu/ExpressYourself ExpressYourself] is an interactive platform for background correction, normalization, scoring, and quality assessment of raw microarray data.

[http://spine.nesg.org SPINE] is our laboratory-information management system (LIMS) for the [http://www.nesg.org NorthEast Structural Genomics Consortium]. The online version is restricted to consortium users, but most of the code is freely available for download. 

[http://pseudogene.org Pseudogene.org] is a collection of resources related to our efforts to survey eukaryotic genomes for pseudogene sequences, "pseudo-fold" usage, amino-acid composition, and single-nucleotide polymorphisms (SNPs) to help elucidate the relationships between pseudogene families across several
organisms.

[http://tiling.gersteinlab.org Tiling] is under construction.

[http://networks.gersteinlab.org/genome/interactions/networks/ TopNet] is an automated web tool designed to calculate topological parameters and compare different sub-networks for any given network.

A number of programs for calculating properties of protein and nucleic
acid structures have been collected into a [http://geometry.molmovdb.org single distribution]. Included are a library of utility functions for dealing
with structures, and a convenient interactive command-line interpreter.

A new algorithm for [http://bioinfo.mbb.yale.edu/expression/cluster local clustering of expression data] to find timeshifted and/or inverted
relationships is available as C source code.

[http://yeasthub.gersteinlab.org Yeasthub] is a semantic web-based application which demonstrates how a life sciences data warehouse can be built using a native Resource Description Framework (RDF) data store. This data warehouse allows integration of different types of yeast genome data provided by different resources in different formats including the tabular and RDF formats.

[http://pubnet.gersteinlab.org/ PubNet] is a web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing for graphical visualization, textual navigation, and topological analysis.

[http://tyna.gersteinlab.org/tyna/ tYNA] (TopNet-like Yale Network Analyzer) is a Web system for managing, comparing and mining multiple networks, both directed and undirected. tYNA efficiently implements methods that have proven useful in network analysis, including identifying defective cliques, finding small network motifs (such as feed-forward loops), calculating global statistics (such as the clustering coefficient and eccentricity), and identifying hubs and bottlenecks etc.

[http://helix.gersteinlab.org/ HIT] (Helix Interaction Tool) is a web-based comprehensive package of tools for analyzing helix-helix interactions in proteins.

[http://www.gersteinlab.org/proj/BoCaTFBS/ BoCaTFBS] is a boosted cascade learner to refine the binding sites suggested by ChIP-chip experiments. This tool is based on a data mining approach combining noisy data from ChIP-chip experiments with known binding site patterns. BoCaTFBS uses boosted cascades of classifiers for optimum efficiency, in which components are alternating decision trees; it exploits interpositional correlations; and it explicitly integrates massive negative information from ChIP-chip experiments.

[http://purelight.biology.yale.edu:8080/servlets-examples/procat.html ProCAT] is a data analysis approach for protein microarrays. ProCAT corrects for background bias and spatial artifacts, identifies significant signals, filters nonspecific spots, and normalizes the resulting signal to protein abundance. ProCAT provides a powerful and flexible new approach for analyzing many types of protein microarrays.

[http://tilescope.gersteinlab.org/ Tilescope] is an online analysis pipeline for high-density tiling microarray data. Tilescope normalizes signals between channels and across arrays, combines replicate experiments, scores each array element, and identifies genomic features. The program is designed with a modular, three-tiered architecture that facilitates parallelism, and a user-friendly graphical interface that presents results in an organized web page, downloadable for further analysis.

[http://proteomics.gersteinlab.org PARE] (Protein Abundance and mRNA Expression) is a tool for comparing protein abundance and mRNA expression data. In addition to globally comparing the quantities of protein and mRNA, PARE allows users to select subsets of proteins for focused study (based on functional categories and complexes). Furthermore, it highlights correlation outliers, which are potentially worth further examination.

[http://hub.gersteinlab.org/ir-supp/ HUB] is a tool for leveraging the structure of the semantic web to enhance information retrieval for proteomics. It helps proteomics researchers quickly retrieve relevant information from the web and the biomedical literature.

[http://coevolution.gersteinlab.org/coevolution/ Coevolution analysis of protein residues]: this is an integrated online system that enables comparative analyses of residue coevolution with a comprehensive set of commonly used scoring functions, including Statistical Coupling Analysis (SCA), Explicit Likelihood of Subset Variation (ELSC), mutual information and correlation-based methods.

[http://archive.gersteinlab.org/proj/rnaseq/rseqtools/ RSEQtools] is a suite of tools that use Mapped Read Format (MRF) for the analysis of RNA-Seq experiments. MRF is a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows.

[http://rnaseq.gersteinlab.org/fusionseq/ FusionSeq] is a computational framework for detecting chimeric transcripts from paired-end RNA-seq experiments. It provides a ranked list of fusion transcripts candidates that can be further evaluated via experimental methods.

[http://act.gersteinlab.org/ ACT] (aggregation and correlation toolbox) is an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 Genomes Project. It can generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It can also correlate related tracks and analyze them for saturation.

GersteinInfo:Privacy policy

2011-06-16T19:09:17Z

Asboner:

=Handling of personal data communicated by visitors to this website will be carried out under the following terms:=
No private or personal information is collected. However, if you are eligible to edit the wiki, i.e. you are a member of the Gerstein Lab, you will be required to log in first. A username, chosen by the user, is thus collected only for the purpose of identifying the editor of these pages. An email address is also collected if the user voluntarily provides it. The email address is used only to allow the user to retrieve the password; it is not used by the administrator of the site.

GersteinInfo:Privacy policy

2011-06-13T11:20:59Z

Asboner:

By visiting this website, no private or personal information is collected.

Main Page

2011-06-13T11:18:22Z

Asboner: /* Important Public Items */

This wiki hosts general, public information about the

'''[http://gersteinlab.org Gerstein Lab]'''.

Lab members should consult the [http://wiki.gersteinlab.org/labinfo/ private lab wiki] for lab-specific information (login as user "public"). There's also a wiki for [http://wiki.gersteinlab.org/meetings lab meetings] (login as user "private"). To edit this wiki, contact the [[User:Infoadmin | Infoadmin]]

== Important Public Items ==
*Public [[Documents]] about the lab
*[[FAQ]] (new) on programs
*Lab [[Calendar]]
*[http://www.gersteinlab.org/index.html.1jun11 Old version] of lab homepage
*[[Permissions]]
*Prominent [[public wiki pages linked from elsewhere]]

== Info for New People ==
*[[General Information for New Undergraduates]] thinking about working in the lab
* Information for new people starting in the lab from the private wiki ([http://wiki.gersteinlab.org/labinfo/Group_Meeting_and_JC_Procedure], [http://wiki.gersteinlab.org/labinfo/Bad_Times_and_Contact_Info], [http://wiki.gersteinlab.org/labinfo/New_People_Entering_Lab], [http://wiki.gersteinlab.org/labinfo/Staff])

== Info for Lab Members and Collaborators ==
*[http://wiki.gersteinlab.org/labinfo/Lab_Computing_Resources Lab Computing Resources] page.
*[http://wiki.gersteinlab.org/labinfo/Making_conference_calls_or_lab_related_calls Making phone calls] (SKYPE information included) page.
*[http://wiki.gersteinlab.org/labinfo/Using_copier Using copier] page.
*[http://www.facebook.com/group.php?gid=54856837624 Gerstein Lab Facebook Page]
*[http://bioinfo.mbb.yale.edu/pipermail/web/ Public WEB Mailing List] (no longer in use)
* [[Streamlining Draft Flow]] (Ideas for streamlining the process of drafting and submitting papers)
** Way to list Mark's [[Affiliation]] on papers
** See [[xxmg at gersteinlab.org | xxmg@gersteinlab.org]] for correct address to use for paper correspondence.
** Note that '''xxmg at bioinfo.mbb.yale.edu''' for paper correspondence is deprecated.
* [[Pointers on Powerpoints]] and [[Pointers on Grant Sections]]
* [[Recommendation Letters]]
* Lab [[Resources Document]] (NIH form)
* Google Groups: [http://group.gersteinlab.org homepage], for [http://docs.google.com/a/gersteinlab.org DOCS]
* Some Useful University Policies
** Snippets from [[MB&B Policy for Graduate Students on Vacation and Travel]] and [[Policy on Postdoc Appointments]]
** Policies on visitors: [http://provost.yale.edu/minors-in-labs Policy on Minors in Labs] ([http://archive.gersteinlab.org/docs/2010/06.02/Policy-on-Minors-in-Labs.pdf old]), [http://provost.yale.edu/policy-access-university-labs-and-research-facilities Policy on Access to the Lab]
** Travel per diem information: [http://www.yale.edu/ppdev/Guides/bluepages.pdf Yale Blue Pages]
* Useful links: [http://www.yale.edu/its/accounts/netid.html Yale NetID System], [http://www.yale.edu/its/network/vpn_faq.html Yale VPN FAQ], [http://www.yale.edu/its/network/wireless/faq.html Yale Wireless FAQ], [http://www.yale.edu/its/telecom/dialing.html Yale Dialing Instructions], [https://config.mail.yale.edu Configuring Yale email], [http://www.yale.edu/ris/main.html Poster Printing]
* [http://maguro.cs.yale.edu:8000/Center_for_High_Performance_Computation_in_Biology_and_Biomedicine Yale High Performance Computing Center]
* Information on [[what grant to charge something to]]
* Information on [[Meeting Invites]]
* [http://araman.mbgnet/nagios/ System Status] MBGNet LAN access only.
* [http://info.gersteinlab.org/Tools Tools] page

GersteinInfo:Privacy policy

2011-06-13T11:18:06Z

Asboner: Created page with 'By visiting this website, no collection of private information is carried out.'

By visiting this website, no collection of private information is carried out.

Main Page

2011-06-13T11:16:40Z

Asboner: /* Important Public Items */

This wiki hosts general, public information about the

'''[http://gersteinlab.org Gerstein Lab]'''.

Lab members should consult the [http://wiki.gersteinlab.org/labinfo/ private lab wiki] for lab-specific information (login as user "public"). There's also a wiki for [http://wiki.gersteinlab.org/meetings lab meetings] (login as user "private"). To edit this wiki, contact the [[User:Infoadmin | Infoadmin]]

== Important Public Items ==
*Public [[Documents]] about the lab
*[[FAQ]] (new) on programs
*Lab [[Calendar]]
*[http://www.gersteinlab.org/index.html.1jun11 Old version] of lab homepage
*[[Permissions]]
*[[Privacy]]
*Prominent [[public wiki pages linked from elsewhere]]

== Info for New People ==
*[[General Information for New Undergraduates]] thinking about working in the lab
* Information for new people starting in the lab from the private wiki ([http://wiki.gersteinlab.org/labinfo/Group_Meeting_and_JC_Procedure], [http://wiki.gersteinlab.org/labinfo/Bad_Times_and_Contact_Info], [http://wiki.gersteinlab.org/labinfo/New_People_Entering_Lab], [http://wiki.gersteinlab.org/labinfo/Staff])

== Info for Lab Members and Collaborators ==
*[http://wiki.gersteinlab.org/labinfo/Lab_Computing_Resources Lab Computing Resources] page.
*[http://wiki.gersteinlab.org/labinfo/Making_conference_calls_or_lab_related_calls Making phone calls] (SKYPE information included) page.
*[http://wiki.gersteinlab.org/labinfo/Using_copier Using copier] page.
*[http://www.facebook.com/group.php?gid=54856837624 Gerstein Lab Facebook Page]
*[http://bioinfo.mbb.yale.edu/pipermail/web/ Public WEB Mailing List] (no longer in use)
* [[Streamlining Draft Flow]] (Ideas for streamlining the process of drafting and submitting papers)
** Way to list Mark's [[Affiliation]] on papers
** See [[xxmg at gersteinlab.org | xxmg@gersteinlab.org]] for correct address to use for paper correspondence.
** Note that '''xxmg at bioinfo.mbb.yale.edu''' for paper correspondence is deprecated.
* [[Pointers on Powerpoints]] and [[Pointers on Grant Sections]]
* [[Recommendation Letters]]
* Lab [[Resources Document]] (NIH form)
* Google Groups: [http://group.gersteinlab.org homepage], for [http://docs.google.com/a/gersteinlab.org DOCS]
* Some Useful University Policies
** Snippets from [[MB&B Policy for Graduate Students on Vacation and Travel]] and [[Policy on Postdoc Appointments]]
** Policies on visitors: [http://provost.yale.edu/minors-in-labs Policy on Minors in Labs] ([http://archive.gersteinlab.org/docs/2010/06.02/Policy-on-Minors-in-Labs.pdf old]), [http://provost.yale.edu/policy-access-university-labs-and-research-facilities Policy on Access to the Lab]
** Travel per diem information: [http://www.yale.edu/ppdev/Guides/bluepages.pdf Yale Blue Pages]
* Useful links: [http://www.yale.edu/its/accounts/netid.html Yale NetID System], [http://www.yale.edu/its/network/vpn_faq.html Yale VPN FAQ], [http://www.yale.edu/its/network/wireless/faq.html Yale Wireless FAQ], [http://www.yale.edu/its/telecom/dialing.html Yale Dialing Instructions], [https://config.mail.yale.edu Configuring Yale email], [http://www.yale.edu/ris/main.html Poster Printing]
* [http://maguro.cs.yale.edu:8000/Center_for_High_Performance_Computation_in_Biology_and_Biomedicine Yale High Performance Computing Center]
* Information on [[what grant to charge something to]]
* Information on [[Meeting Invites]]
* [http://araman.mbgnet/nagios/ System Status] MBGNet LAN access only.
* [http://info.gersteinlab.org/Tools Tools] page

Installation and Configuration of FusionSeq

2011-06-07T17:01:35Z

Asboner: /* (versions 0.7.0 and later) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a general-purpose C library with definitions for manipulating strings and arrays, parsers, and other routines related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

The GD library is NOT required for the core analysis. If you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

=====(versions 0.7.0 and later) =====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>
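The symbolic-link approach above can also be wrapped in a small shell function. The following is only a sketch: the function name <code>link_prefixes</code> is hypothetical and not part of FusionSeq, and it assumes that every package was installed under its own prefix with the usual ''include'' and ''lib'' subdirectories. Prefixes should be given as absolute paths so that the links resolve from any working directory.

```shell
#!/bin/sh
# link_prefixes DEST PREFIX...   (hypothetical helper, not part of FusionSeq)
# Symlink PREFIX/include/* into DEST/include and PREFIX/lib/* into DEST/lib.
link_prefixes() {
    dest=$1
    shift
    mkdir -p "$dest/include" "$dest/lib" || return 1
    for prefix in "$@"; do
        for sub in include lib; do
            # Skip prefixes that lack this subdirectory.
            [ -d "$prefix/$sub" ] || continue
            for entry in "$prefix/$sub"/*; do
                # An unmatched glob stays literal; skip it.
                [ -e "$entry" ] || continue
                # -f replaces a link left over from an earlier run.
                ln -sf "$entry" "$dest/$sub/"
            done
        done
    done
}
```

For the layout used above, the call would be <code>link_prefixes ~/fusionseq /path/to/libbios /path/to/libmrf /path/to/gsl /path/to/gd</code>.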

=====(versions up to 0.6.1)=====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, the detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(versions 0.7.0 and later)=====
If ROOT is installed in the default folder, it will generate a subfolder 'root' for both the include and the lib files. If [http://root.cern.ch/drupal/ ROOT] is installed in a non-standard location, however, this does not occur. Hence, an approach similar to the one above can be adopted to properly link the [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

Also, for some versions of ROOT, one may get the following error:
<pre>
[...]/root/include/Rtypes.h:35:67: error: snprintf.h: No such file or directory
[...]/root/include/Rtypes.h:36:68: error: strlcpy.h: No such file or directory
</pre>
This is because ROOT provides its own copies of these header files, which in the layout above reside in the ''root'' subfolder. One workaround is thus to create symbolic links to them:
<pre>
$ cd ~/fusionseq/include
$ ln -s root/snprintf.h .
$ ln -s root/strlcpy.h .
</pre>
This should solve it.

=====(versions up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following commands to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. In addition, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in the same way as the configuration file of the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Here is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environmental variables such as $HOME, etc.
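Because values such as $HOME are not expanded, it can help to scan the configuration file before running the analysis. The function below is a minimal sketch and not part of FusionSeq; the name <code>check_rc</code> is hypothetical. It assumes the KEY="VALUE" layout with <code>//</code> comment lines shown above, and reports every entry whose value contains a <code>$</code> or starts with <code>~</code>.

```shell
#!/bin/sh
# check_rc FILE   (hypothetical helper, not part of FusionSeq)
# Print FILE:LINE: ENTRY for each suspicious entry; return 1 if any is found.
check_rc() {
    awk '
        /^[ \t]*\/\// { next }                      # skip // comment lines
        /^[A-Za-z_0-9]+=/ {
            value = substr($0, index($0, "=") + 1)
            gsub(/^"|"$/, "", value)                # drop surrounding quotes
            if (index(value, "$") > 0 || substr(value, 1, 1) == "~") {
                printf "%s:%d: %s\n", FILENAME, NR, $0
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A typical call would be <code>check_rc ~/.fusionseqrc</code>. Relative file names such as hg18.2bit are deliberately not flagged, since the *_FILENAME entries are combined with their directory entries.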

=====(versions up to 0.6.1)=====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. To compile the auxiliary modules, however, the Makefile needs to be edited first (see [[#Auxiliary modules|Auxiliary modules]]), and the visualization tools, i.e. the CGIs, require some additional variables to be defined (see [[#Installing CGIs|Installing CGIs]]). The compilation then just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location of and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two fasta files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphic libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to interpret CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
$ export FUSIONSEQCGITARGET="-b target_architecture" # optional: specify only if the CGIs run on a machine with
# a different architecture than the core programs. On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
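The expected layout can be checked, and created where missing, with a few lines of shell. This is a minimal sketch and not part of FusionSeq: the function names are hypothetical, and the argument is assumed to be the directory configured as WEB_DATA_DIR. Filling the directories with the output of the programs listed above remains the user's task.

```shell
#!/bin/sh
# Subdirectories the CGIs expect inside WEB_DATA_DIR, as listed above.
WEB_SUBDIRS="ALIGNMENTS BED FASTA GFF IMAGES WIGS"

# missing_web_dirs DIR: print each expected subdirectory that DIR lacks.
missing_web_dirs() {
    for sub in $WEB_SUBDIRS; do
        [ -d "$1/$sub" ] || echo "$sub"
    done
}

# make_web_dirs DIR: create any expected subdirectory that is missing.
make_web_dirs() {
    for sub in $WEB_SUBDIRS; do
        mkdir -p "$1/$sub" || return 1
    done
}
```

For example, <code>missing_web_dirs /path/to/data</code> lists what is absent, and <code>make_web_dirs /path/to/data</code> creates the empty skeleton.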

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

FusionSeq FAQ

2011-05-19T18:08:59Z

Asboner: /* Paired-end reads and bowtie: I aligned each end separately. How do I convert the alignment file to MRF? */

{{FusionSeqHeader}}
=General Questions=
==Does FusionSeq work with ''my favorite'' alignment tool?==
The format of the paired-end reads that is "understood" by FusionSeq is [[RSEQtools#Mapped_Read_Format|Mapped Read Format (MRF)]]. We provide several conversion tools from most common alignment programs and formats, including SAM/BAM, to represent mapped reads using MRF. Please take a look at [http://rseqtools.gersteinlab.org/ RSEQtools] for more information, and specifically to: [[RSEQtools#Format_conversion_utilities|Format conversion utilities]].

==Does FusionSeq work with colorspace paired-end reads?==
FusionSeq has been developed to be as independent as possible of the sequencing technology and the alignment tool. However, extensive testing was conducted on the Illumina Genome Analyzer II platform only.

==Where can I obtain the annotation data for hg19?==
Annotation data for hg19 can be found [[FusionSeq_Requirements#Human_genome_GRCh37.2Fhg19|here]].

==Can I use FusionSeq with ''my favorite species''?==
In principle, you can run FusionSeq using any paired-end RNA-Seq data. However, you would need to provide the corresponding data that is currently used for human, i.e.:
# a genome sequence, in 2bit format
# a gene annotation set in interval format; including ''composite'' models of genes
# the sequences of the composite models in the gene annotation set
# a mapping between your gene annotation and TreeFam (optional, used by gfrLargeScaleHomologyFilter)
# a list of the repetitive regions, in interval format (optional, used by gfrRepeatMaskerFilter)
# a ribosomal sequence library in 2bit format (optional, used by gfrRibosomalFilter)
# the mapping between your gene annotation and other descriptive information, e.g. gene symbols, descriptions, etc. (optional, used by gfrAddInfo)

==Where can I find some data sets to test FusionSeq?==
Please find some test data sets [[FusionSeq_Test_Datasets|here]].

==Is there a demo version of FusionSeq?==
A demo version of the web-interface of FusionSeq is available [http://dynamic.gersteinlab.org/people/asboner/FusionSeq/geneFusions_cgi here]. You can access the results described in the paper, by typing the sample ID (e.g. 106_T, 1700_D, etc.).

==Where can I find more information?==
The most up-to-date user documentation for FusionSeq is available [[FusionSeq|here]]. If you are looking for the developer's documentation, you can find it [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/documentation/index.html here].

==The BOWTIE_INDEXES directory is used for reference indexes as well as for temporary index files. However, I have a centralized repository of indexed genomes and cannot create the temporary files in that directory==
A workaround for this issue is to create a local directory where one has write permission. This solves the problem of generating temporary index files when running the junction-sequence identifier module. To also have the indexed genome and transcriptome in the same folder, one can link them symbolically, for example:
$ cd /path/to/local/folder/
$ ln -s /path/to/centralized/repository/hg18_nh/ .
$ ln -s /path/to/centralized/repository/hg18_knownGeneAnnotationTranscriptCompositeModel/ .

Now, BOWTIE_INDEXES in geneFusionConfig.h should point to /path/to/local/folder/, where the user can generate temporary files:
#define BOWTIE_INDEXES "/path/to/local/folder"

==How can I cite FusionSeq?==
Please cite this publication:
* Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. '''FusionSeq: a modular framework for finding gene fusions by analyzing Paired-End RNA-Sequencing data'''. ''Genome Biol'' 21 Oct. 2010; '''11''':R104 [http://dx.doi.org/10.1186/gb-2010-11-10-r104 doi:10.1186/gb-2010-11-10-r104]

=Compilation troubleshooting=

==Where can I find the BIOS library, required for FusionSeq?==
As described in [[Requirements]], the [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

==TROOT.h: No such file or directory==
This error occurs because the compiler cannot find the TROOT.h file. This file is part of [http://root.cern.ch/drupal/ ROOT], a framework for mathematical and statistical analysis. If you have installed [http://root.cern.ch/drupal/ ROOT], please make sure that you have defined ROOTSYS as the path to the ROOT folder and added its bin directory to your PATH:
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
Please also see [[Installation_and_Configuration_of_FusionSeq#Installing_and_configuring_ROOT|Installing and configuring ROOT]] for more details.

=Running issues=
==FusionSeq does not find the annotation datasets. However, geneFusionConfig.h specifies their correct location and the files are present.==
This error:
ls_createFromFile '$HOME/path/to/data/annotation_data.txt'
occurs because environment variables, such as $HOME, are not interpreted. Please use full path names in geneFusionConfig.h to specify directory locations.

==I followed the instructions, but I still get many WARNINGs. Is this expected?==
Yes, every program in FusionSeq provides some logging information. We recommend capturing the log data by redirecting STDERR (e.g. '2> fusionseq.log').
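The redirection keeps the log messages apart from the actual results. The snippet below only illustrates the mechanism: <code>fake_module</code> is a hypothetical stand-in for a FusionSeq program, assumed to write its results to STDOUT and its log messages to STDERR, and the file names are arbitrary.

```shell
#!/bin/sh
# fake_module stands in for any FusionSeq program (hypothetical):
# results go to STDOUT, log messages go to STDERR.
fake_module() {
    echo "WARNING: example log message" >&2
    echo "result line"
}

# Write to a scratch directory so the example leaves no files behind elsewhere.
workdir=$(mktemp -d)

# '>' captures the results, '2>' captures the log.
fake_module > "$workdir/result.txt" 2> "$workdir/fusionseq.log"
```

After this runs, the warning is found only in fusionseq.log, while result.txt contains only the result line.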

==geneFusions: Segmentation Fault==
There are a number of reasons why one may get this error. One possibility is that the sequences are missing from the MRF file. Although MRF does not require the inclusion of sequences to be valid, sequences are required by geneFusions. Please ensure that sequences are present in the MRF file.
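A quick way to check for sequences is to inspect the MRF header line. The function below is a sketch, not part of FusionSeq or RSEQtools; the name <code>mrf_has_sequences</code> is hypothetical. It assumes the MRF layout described in the [[RSEQtools#Mapped_Read_Format|MRF documentation]]: comment lines start with <code>#</code>, and the first remaining line is a tab-delimited header naming the columns, with <code>Sequence</code> being the optional column that holds the reads.

```shell
#!/bin/sh
# mrf_has_sequences FILE   (hypothetical helper)
# Return 0 if the MRF header line lists a Sequence column, 1 otherwise.
mrf_has_sequences() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }      # skip comment and empty lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == "Sequence") found = 1
            exit                      # only the header line is inspected
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example: <code>mrf_has_sequences data.mrf || echo "no sequences in data.mrf" >&2</code>.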

==Paired-end reads and bowtie: I aligned each end separately. How do I convert the alignment file to MRF?==
To convert bowtie alignments into MRF when the ends are aligned separately, we require the two ends to be on consecutive lines. This can be partially achieved by concatenating and sorting the two alignment files, e.g. cat end_1.bowtie end_2.bowtie | sort > alignment.bowtie. However, in some cases only one end is mapped, which creates "singletons" in the alignment file, i.e. reads for which only one end is reported. Since there have been many requests regarding this issue, we decided to share an "internal" utility: bowtiePairedFix. Here you can download the binary file:

* [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/tarballs/bowtiePairedFix.linux64 bowtiePairedFix (GNU/Linux x86_64)]
* [http://archive.gersteinlab.org/proj/rnaseq/fusionseq/tarballs/bowtiePairedFix.MacOs.10.6.7 bowtiePairedFix (MacOs 10.6.7)]
The conversion command is:
cat end_1.bowtie end_2.bowtie | sort | bowtiePairedFix | bowtie2mrf paired -sequence > data.mrf 2> data.mrf.log

Please note that this program is also provided "as is".
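To get a sense of how many singletons the two alignment files contain, the read identifiers can be counted directly. This sketch does not reproduce what bowtiePairedFix does; the function name is hypothetical. It assumes tab-delimited bowtie output with the read name in the first column, mate suffixes <code>/1</code> and <code>/2</code> on the read names, and at most one reported alignment per end.

```shell
#!/bin/sh
# list_singletons FILE...   (hypothetical helper)
# Print the read IDs that occur exactly once across the given alignment files.
list_singletons() {
    awk -F '\t' '
        {
            id = $1
            sub(/\/[12]$/, "", id)    # drop a trailing /1 or /2 mate suffix
            seen[id]++
        }
        END {
            for (id in seen)
                if (seen[id] == 1) print id
        }
    ' "$@"
}
```

For example, <code>list_singletons end_1.bowtie end_2.bowtie | wc -l</code> gives the number of reads for which only one end was mapped.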

Installation and Configuration of FusionSeq

2011-05-17T18:48:25Z

Asboner: /* (versions 0.7.0 and later) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a general-purpose C library with definitions for manipulating strings and arrays, parsers, and other routines related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

The GD library is NOT required for the core analysis. If you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

=====(versions 0.7.0 and later) =====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

=====(versions up to 0.6.1)=====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, the detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(versions 0.7.0 and later)=====
If ROOT is installed in the default folder, it will generate a subfolder 'root' for both the include and the lib files. If [http://root.cern.ch/drupal/ ROOT] is installed in a non-standard location, however, this does not occur. Hence, an approach similar to the one above can be adopted to properly link the [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

=====(versions up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following commands to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file that holds the configuration information as KEY=VALUE pairs. The user needs to edit this file in the same way as the configuration file of the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
<pre>
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
</pre>

An example of the configuration file follows.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
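A quick way to catch entries that violate this rule is to scan the file for values that contain a shell variable or start with a tilde. The sketch below is an illustration only, not a FusionSeq tool; the function name find_nonabsolute_entries is hypothetical, and it assumes the KEY="VALUE" layout with // comment lines shown above.

```shell
# Hypothetical helper (not part of FusionSeq): print the KEY=VALUE lines of a
# .fusionseqrc-style file whose value contains '$' or starts with '~'.
# Exit status is 0 when at least one such line is found, 1 otherwise.
find_nonabsolute_entries() {
    rcfile="$1"
    # Drop // comment lines first, then match values such as "$HOME/x" or "~/x".
    grep -v '^[[:space:]]*//' "$rcfile" | grep -E '^[A-Z0-9_]+="?(~|[^"]*\$)'
}
```

Running <code>find_nonabsolute_entries "$FUSIONSEQ_CONFPATH"</code> should print nothing for a correctly edited file.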

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # build the core analysis programs
$ make all      # build the core analysis programs and the auxiliary programs
$ make cgi      # compile the visualization/summary tools (see Installing CGIs)
$ make deploy   # install the visualization/summary tools on the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables describing the locations of the different tools needs to be specified:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, run gcc -dumpmachine to obtain target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
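The expected layout can be created in one step before the files are copied in. The following is a minimal sketch under the assumption that the data directory is passed as an argument; the function name make_data_dirs is hypothetical and not part of FusionSeq.

```shell
# Hypothetical helper (not part of FusionSeq): create the subdirectories that
# the CGI programs expect under the data directory (WEB_DATA_DIR).
make_data_dirs() {
    base="$1"
    for d in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        # -p creates missing parents and is a no-op if the directory exists
        mkdir -p "$base/$d"
    done
}
```

For example, <code>make_data_dirs /path/to/data</code>, using the same path that is assigned to WEB_DATA_DIR. The directories still have to be populated with the output of the programs listed above.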

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T18:48:02Z

Asboner: /* Installing and configuring ROOT */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>
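The 64-bit note above applies to both the GSL and the GD build. As a sketch only, the flag can be chosen from the machine type reported by uname; the function name arch_flag is hypothetical, and treating only x86_64/amd64 as 64-bit is an assumption that has to be adjusted for other platforms.

```shell
# Hypothetical helper: print -m64 on 64-bit x86 machines, nothing otherwise.
# Assumption: uname -m reports x86_64 or amd64 on such machines.
arch_flag() {
    case "$(uname -m)" in
        x86_64|amd64) echo "-m64" ;;
        *)            echo "" ;;
    esac
}

# Usage with the configure commands shown above:
#   ./configure --prefix=/path/to/installation/ CFLAGS="$(arch_flag)"
```

Whatever value is used, it should be the same for every library so that all of them are built for the same architecture.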

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir ~/fusionseq/include
$ mkdir ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

=====(versions up to 0.6.1)=====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include the GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, detailed instructions for installing [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(versions 0.7.0 and later)=====
If ROOT is installed in the default folder, it generates a 'root' subfolder for both the include and the lib files. With a non-standard location for [http://root.cern.ch/drupal/ ROOT], however, this does not occur. Hence, an approach similar to the one above can be adopted to properly link the [http://root.cern.ch/drupal/ ROOT] files for FusionSeq:
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

=====(versions up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following commands to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file that holds the configuration information as KEY=VALUE pairs. The user needs to edit this file in the same way as the configuration file of the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
<pre>
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
</pre>

An example of the configuration file follows.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # build the core analysis programs
$ make all      # build the core analysis programs and the auxiliary programs
$ make cgi      # compile the visualization/summary tools (see Installing CGIs)
$ make deploy   # install the visualization/summary tools on the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables describing the locations of the different tools needs to be specified:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, run gcc -dumpmachine to obtain target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T18:47:39Z

Asboner: /* (versions up to 0.6.1) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions on the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and fill them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>
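When the symbolic-link approach is not wanted, the flag strings can also be assembled from a list of installation prefixes. The following is a minimal sketch, assuming each package was installed with the usual include/ and lib/ subdirectories; join_flags is a hypothetical helper and the prefixes are placeholders.

```shell
# Build a string such as "-I/a/include -I/b/include" from installation prefixes.
# Usage: join_flags <flag> <subdir> <prefix>...
# The body runs in a subshell, so its variables do not leak into the caller.
join_flags() (
    flag=$1
    subdir=$2
    shift 2
    out=""
    for prefix in "$@"; do
        prefix=${prefix%/}                        # drop one trailing slash, if any
        out="$out${out:+ }$flag$prefix/$subdir"   # separate entries with a space
    done
    printf '%s\n' "$out"
)

# Placeholder prefixes; replace them with the real installation directories.
CPPFLAGS=$(join_flags -I include /path/to/libbios /path/to/libmrf /path/to/gsl /path/to/gd)
LDFLAGS=$(join_flags -L lib /path/to/libbios /path/to/libmrf /path/to/gsl /path/to/gd)
export CPPFLAGS LDFLAGS
echo "CPPFLAGS=$CPPFLAGS"
echo "LDFLAGS=$LDFLAGS"
```

Keeping the prefixes in one place means a relocated library only requires changing a single argument.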

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, the detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(version 0.7.0 and later)=====
If ROOT is installed in the default folder, it generates a subfolder 'root' for both the include and the lib files. With a non-standard location for [http://root.cern.ch/drupal/ ROOT], however, this does not happen. Hence, an approach similar to the one above can be adopted to properly link the [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>
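If ROOT's root-config utility is on the PATH, it can report the include and library directories, so they do not have to be typed by hand. The sketch below assumes that root-config is available and that its --incdir and --libdir options behave as in standard ROOT installations; link_root_files is a hypothetical helper name.

```shell
# Create <target>/include/root and <target>/lib/root and fill them with symbolic
# links to the directories reported by root-config.
# The body runs in a subshell, so "exit" only leaves the function.
link_root_files() (
    target=$1
    command -v root-config >/dev/null 2>&1 || {
        echo "root-config not found in PATH" >&2
        exit 1
    }
    incdir=$(root-config --incdir) || exit 1
    libdir=$(root-config --libdir) || exit 1
    mkdir -p "$target/include/root" "$target/lib/root" || exit 1
    ln -sf "$incdir"/* "$target/include/root/" || exit 1
    ln -sf "$libdir"/* "$target/lib/root/" || exit 1
)

# Example call (left commented out so that nothing is changed by accident):
# link_root_files "$HOME/fusionseq"
```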

=====(version up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file that holds the configuration information as KEY=VALUE pairs. The user needs to edit this file in much the same way as in the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
<pre>
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
</pre>

Here is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
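
The absolute-path rule can be checked before running the analysis. The following is a minimal sketch, assuming the KEY="VALUE" layout and the // comment style of the example above; check_rc_paths is a hypothetical helper, and the set of keys it treats as paths (names ending in _DIR, BOWTIE_INDEXES, and BLAT_TWO_BIT_TO_FA) is taken from that example only.

```shell
# List entries of a .fusionseqrc-style file whose path value is relative or
# contains a shell variable or "~". Exits with status 1 if any entry is reported.
check_rc_paths() (
    rc=$1
    status=0
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            ''|//*) continue ;;                        # blank lines and // comments
        esac
        case $key in
            *_DIR|BOWTIE_INDEXES|BLAT_TWO_BIT_TO_FA) ;;  # keys that hold paths
            *) continue ;;
        esac
        value=${value#\"}                              # strip surrounding quotes
        value=${value%\"}
        case $value in
            http://*|https://*) ;;                     # WEB_PUB_DIR holds a URL
            *'$'*|'~'*) echo "$key: avoid shell variables and ~: $value"; status=1 ;;
            /*) ;;                                     # absolute path, fine
            *) echo "$key: not an absolute path: $value"; status=1 ;;
        esac
    done < "$rc"
    exit "$status"
)

# Demonstration on a throwaway file with one good and one bad entry.
demo=$(mktemp)
cat > "$demo" <<'EOF'
// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
RIBOSOMAL_DIR="$HOME/ribosomal"
EOF
check_rc_paths "$demo" || echo "please fix the entries listed above"
rm -f "$demo"
```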

=====(versions up to 0.6.1)=====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make         # for the core analysis elements
$ make all     # for the core analysis elements as well as the auxiliary programs
$ make cgi     # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy  # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location of the paired reads and the connection between them; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two fasta files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
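
The directory layout above can be prepared and inspected with a few lines of shell. This is a minimal sketch, not part of FusionSeq: make_data_dirs and report_empty are hypothetical helpers, and the demonstration uses a temporary directory in place of the real WEB_DATA_DIR.

```shell
# Create the six subdirectories that the CGI programs expect under a data directory.
make_data_dirs() (
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        mkdir -p "$1/$sub" || exit 1
    done
)

# Print the expected subdirectories that are missing or still empty.
report_empty() (
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        if [ -z "$(ls -A "$1/$sub" 2>/dev/null)" ]; then
            echo "$sub"
        fi
    done
)

# Demonstration in a temporary directory; use the WEB_DATA_DIR path in practice.
demo=$(mktemp -d)
make_data_dirs "$demo"
echo "Subdirectories still waiting for data:"
report_empty "$demo"
rm -rf "$demo"
```

Running report_empty after the gfr2* and bp2* programs have finished gives a quick indication of which outputs have not yet been copied to the web server.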

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).
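
To find out which directory to put in the Makefile when the linker cannot find a library, one can search a few candidate directories for the library file. The sketch below is an illustration only: find_lib_dir is a hypothetical helper, and the candidate directories are guesses that differ between systems (some distributions keep libraries in architecture-specific subdirectories).

```shell
# Print the first candidate directory that contains lib<name>.so, lib<name>.a,
# or lib<name>.dylib. Exits with status 1 if none of them does.
# Usage: find_lib_dir <name> <dir>...
find_lib_dir() (
    name=$1
    shift
    for dir in "$@"; do
        for ext in so a dylib; do
            if [ -e "$dir/lib$name.$ext" ]; then
                printf '%s\n' "$dir"
                exit 0
            fi
        done
    done
    exit 1
)

# Check the libraries needed by gfr2images in a few common locations.
for lib in png jpeg z freetype; do
    if dir=$(find_lib_dir "$lib" /usr/lib /usr/lib64 /usr/local/lib /usr/X11/lib); then
        echo "$lib: -L$dir"
    else
        echo "$lib: not found in the directories checked"
    fi
done
```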

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T18:47:26Z

Asboner: /* (versions 0.7.0 and later) -- coming soon */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
Before installing FusionSeq, these external packages need to be installed (see [[Requirements]]). Please follow the instructions provided with each package. Once they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing files, and other tasks related to bioinformatic analysis. It requires the GSL library, which on most systems can be installed with the following commands (for details, please refer to the instructions on the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

On a 64-bit system, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and fill them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, the detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(version 0.7.0 and later)=====
If ROOT is installed in the default folder, it generates a subfolder 'root' for both the include and the lib files. With a non-standard location for [http://root.cern.ch/drupal/ ROOT], however, this does not happen. Hence, an approach similar to the one above can be adopted to properly link the [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

=====(version up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

=====(versions 0.7.0 and later)=====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file that holds the configuration information as KEY=VALUE pairs. The user needs to edit this file in much the same way as in the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
<pre>
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
</pre>

Here is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.

=====(versions up to 0.6.1)=====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make         # for the core analysis elements
$ make all     # for the core analysis elements as well as the auxiliary programs
$ make cgi     # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy  # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location of the paired reads and the connection between them; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two fasta files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T18:46:57Z

Asboner: /* Installing and configuring ROOT */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64bit system is used, add CFLAGS=-m64 in the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed in most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>
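The symbolic-link approach above can also be scripted. The following is a minimal sketch and not part of the FusionSeq distribution: the helper name link_tree and the variable FUSIONSEQ_LOCAL are hypothetical, and the /path/to/... sources are the same placeholders used above. Source directories should be given as absolute paths so that the links resolve from any location.

```shell
#!/bin/sh
# link_tree SRC DEST: symlink every entry of SRC into DEST.
# Hypothetical helper for illustration; SRC should be an absolute path.
link_tree() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "skipping $src: not a directory" >&2
        return 0
    fi
    mkdir -p "$dest"
    for f in "$src"/*; do
        # Skip the unexpanded pattern when SRC is empty.
        [ -e "$f" ] || continue
        ln -sf "$f" "$dest/"
    done
}

# Assumed location of the local include/ and lib/ directories.
FUSIONSEQ_LOCAL=${FUSIONSEQ_LOCAL:-$HOME/fusionseq}

# Placeholder package locations; replace with the real installation prefixes.
for pkg in /path/to/libbios /path/to/libmrf /path/to/gsl /path/to/gd; do
    link_tree "$pkg/include" "$FUSIONSEQ_LOCAL/include"
    link_tree "$pkg/lib" "$FUSIONSEQ_LOCAL/lib"
done

export CPPFLAGS="-I$FUSIONSEQ_LOCAL/include"
export LDFLAGS="-L$FUSIONSEQ_LOCAL/lib"
```

Packages whose directories do not exist are skipped with a message on standard error, so the script can be re-run after each package is installed.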

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, the detailed instructions to install [http://root.cern.ch/drupal/ ROOT] can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====(version 0.7.0 and later)=====
If ROOT is installed in the default folder, it will generate a subfolder 'root' for both the include and lib files. With a non-standard location for [http://root.cern.ch/drupal/ ROOT], however, this does not occur. Hence, an approach similar to the one above can be adopted to properly link [http://root.cern.ch/drupal/ ROOT] files for FusionSeq.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

=====(version up to 0.6.1)=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>
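Before compiling FusionSeq, it can help to confirm that these variables took effect in the current shell. The sketch below is an illustration only: the function name check_root is hypothetical, and it relies on the root-config program that is installed with ROOT.

```shell
#!/bin/sh
# check_root: report whether ROOT looks usable from this shell.
# Hypothetical helper, not part of FusionSeq or ROOT.
check_root() {
    if [ -z "${ROOTSYS:-}" ]; then
        echo "ROOTSYS is not set"
        return 1
    fi
    if [ ! -d "$ROOTSYS" ]; then
        echo "ROOTSYS does not point to a directory: $ROOTSYS"
        return 1
    fi
    if command -v root-config >/dev/null 2>&1; then
        # root-config ships with ROOT and prints the installed version.
        echo "ROOT version: $(root-config --version)"
    else
        echo "root-config not found; is \$ROOTSYS/bin on PATH?"
        return 1
    fi
}

check_root || echo "ROOT is not configured for this shell" >&2
```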

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way as in the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

An example of the configuration file follows.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environmental variables such as $HOME, etc.
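A quick sanity check of an edited .fusionseqrc can catch missing keys and non-absolute values before the first run. This is a minimal sketch under stated assumptions: the function name check_fusionseqrc is hypothetical, the list of keys is copied from the required section of the example above, and the variable FUSIONSEQ_CONFPATH is taken from the text. It does not reproduce how FusionSeq itself parses the file.

```shell
#!/bin/sh
# check_fusionseqrc FILE: report required keys that are missing from FILE
# and KEY=VALUE lines whose value contains a '$'.
# Hypothetical helper, not part of FusionSeq.
check_fusionseqrc() {
    conf=$1
    status=0
    if [ ! -r "$conf" ]; then
        echo "cannot read configuration file: $conf" >&2
        return 2
    fi
    for key in BOWTIE_INDEXES BOWTIE_GENOME BOWTIE_COMPOSITE \
               BLAT_TWO_BIT_TO_FA BLAT_DATA_DIR BLAT_TWO_BIT_DATA_FILENAME \
               TRANSCRIPT_COMPOSITE_MODEL_DIR \
               TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME \
               TRANSCRIPT_COMPOSITE_MODEL_FILENAME \
               ANNOTATION_DIR KNOWN_GENE_XREF_FILENAME \
               KNOWN_GENE_TREE_FAM_FILENAME \
               RIBOSOMAL_DIR RIBOSOMAL_FILENAME; do
        if ! grep -q "^${key}=" "$conf"; then
            echo "missing key: $key"
            status=1
        fi
    done
    # Print (with line numbers) any setting that refers to a shell variable.
    if grep -n '^[A-Z_]*=.*\$' "$conf"; then
        echo "the lines above use environment variables; use absolute paths"
        status=1
    fi
    return $status
}

# Usage: check the file named by FUSIONSEQ_CONFPATH, if it is set.
if [ -n "${FUSIONSEQ_CONFPATH:-}" ]; then
    check_fusionseqrc "$FUSIONSEQ_CONFPATH"
fi
```

The function returns 0 when every required key is present and no value contains a '$', 1 when a problem is reported, and 2 when the file cannot be read.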

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location of, and the connection between, the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
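The expected layout can be prepared and checked with a short script. The following is a sketch for illustration: the function name make_data_dirs is hypothetical, and the directory passed to it should be the value configured as WEB_DATA_DIR.

```shell
#!/bin/sh
# make_data_dirs DIR: create the sub-directories the CGIs expect under DIR
# and report any that are still empty.
# Hypothetical helper, not part of FusionSeq.
make_data_dirs() {
    data_dir=$1
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        mkdir -p "$data_dir/$sub" || return 1
        # An empty directory means the producing program has not been run yet.
        if [ -z "$(ls -A "$data_dir/$sub")" ]; then
            echo "empty: $data_dir/$sub"
        fi
    done
}

# Usage (the path is a placeholder; use the value of WEB_DATA_DIR):
# make_data_dirs /path/to/data
```

Each directory reported as empty still needs the output of the corresponding program listed above.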

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).
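
For the "cannot find -lpng" case, one way to find the directory to enter in the Makefile is to look for the library file the linker needs (libNAME.so or libNAME.a). The sketch below is an illustration under stated assumptions: the function name find_lib is hypothetical, and the list of searched directories is a guess that should be adjusted for the local system.

```shell
#!/bin/sh
# find_lib NAME DIR...: print the first DIR that contains libNAME.so or
# libNAME.a; return 1 if none does.
# Hypothetical helper, not part of FusionSeq.
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        for f in "$dir/lib$name.so" "$dir/lib$name.a"; do
            if [ -e "$f" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    return 1
}

# Usage: search some common locations (assumed; adjust for your system).
for lib in png jpeg z freetype; do
    if dir=$(find_lib "$lib" /usr/lib64 /usr/lib /usr/local/lib /usr/X11/lib); then
        echo "$lib: -L$dir"
    else
        echo "$lib: not found in the searched directories"
    fi
done
```

The directory printed for each library is the value to place after -L in the corresponding Makefile variable (PNGLIB, JPEGLIB, ZLIB, FREETYPELIB).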

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T18:44:35Z

Asboner: /* (versions 0.7.0 and later) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64bit system is used, add CFLAGS=-m64 in the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed in most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, the detailed instructions to install ROOT can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====version up to 0.6.1=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way as in the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

An example of the configuration file follows.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environmental variables such as $HOME, etc.

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location of, and the connection between, the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the jQuery and jQuery UI JavaScript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T18:43:58Z

Asboner: /* Installing and configuring ROOT */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>
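
For the 64-bit case mentioned above, the flag can be chosen from the machine name instead of by hand. This is a minimal sketch under stated assumptions: the helper <code>arch_cflags</code> is hypothetical, and it only treats x86_64/amd64 as machines where -m64 applies, since that option is specific to GCC on x86.

```shell
# Hypothetical helper: print the CFLAGS value suggested in the text for a
# given machine name (as reported by `uname -m`), or an empty line otherwise.
arch_cflags() {
    case "$1" in
        x86_64|amd64) printf '%s\n' "-m64" ;;
        *)            printf '\n' ;;
    esac
}

machine=$(uname -m)
flags=$(arch_cflags "$machine")
# Only print the configure command here; run it from the library source tree.
echo "./configure --prefix=/path/to/installation/ ${flags:+CFLAGS=$flags}"
```

The same value should then be used for every library, so that all of them are built for the same architecture.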

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
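
When several packages live in non-standard locations, the two variables can be assembled from a list of installation prefixes. The sketch below is illustrative only: <code>build_flags</code> is a hypothetical helper, and it assumes each prefix contains the usual <code>include</code> and <code>lib</code> sub-directories.

```shell
# Hypothetical helper: turn a list of installation prefixes into compiler flags.
# $1 is the flag letter (I or L), $2 the sub-directory, the rest are prefixes.
build_flags() {
    letter=$1
    sub=$2
    shift 2
    flags=""
    for prefix in "$@"; do
        flags="$flags -$letter$prefix/$sub"
    done
    # Drop the leading space before printing.
    printf '%s\n' "${flags# }"
}

CPPFLAGS=$(build_flags I include /path/to/libbios /path/to/gsl)
LDFLAGS=$(build_flags L lib /path/to/libbios /path/to/gsl)
export CPPFLAGS LDFLAGS
echo "$CPPFLAGS"
echo "$LDFLAGS"
```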
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and use symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

'''NOTE''': If ROOT is installed in the default folder, it will generate a subfolder 'root' both for the include and lib files. In the case of a non-standard location for ROOT, however, this doesn't occur. Hence, an approach similar to the one above can be adopted to properly link the ROOT files.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS], a few variables need to be set before compiling the library. Here is an example of the procedure in a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include the GSL library when compiling, but it is not a requirement. '''NOTE''': for Ubuntu users, detailed instructions to install ROOT can be found [http://cometpeak.com/2011/05/building-and-installing-root-on-ubuntu-11-04-x86_64/ here].

=====version up to 0.6.1=====
Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file .fusionseqrc is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way as in the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Below is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
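
A quick way to catch violations of this rule is to scan the file for values that still contain a shell variable or a leading tilde. This is a minimal sketch, not a FusionSeq tool: the function name <code>find_unexpanded_paths</code> and the two-entry demonstration file are assumptions made for illustration.

```shell
# Hypothetical check: list KEY=VALUE lines whose value starts with a tilde or
# contains a '$', i.e. values that are not plain absolute paths.
find_unexpanded_paths() {
    # Skip // comment lines, then keep the offending KEY=VALUE lines.
    grep -v '^[[:space:]]*//' "$1" | grep -E '^[A-Z_]+="?(~|.*\$)'
}

# Demonstration on a small stand-in for .fusionseqrc.
demo_rc=$(mktemp)
cat > "$demo_rc" <<'EOF'
// Location of the bowtie indexes
BOWTIE_INDEXES="$HOME/bowtie/Indexes"
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
EOF
find_unexpanded_paths "$demo_rc"
```

The function prints the offending lines and returns a non-zero status when it finds none, so it can be used directly in an <code>if</code> test.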

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# optional: specify only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the jQuery and jQuery UI JavaScript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).
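
Before editing the Makefile, it can help to find out which directory actually holds the missing library. The following is a minimal sketch: <code>find_lib_dir</code> is a hypothetical helper, and the demonstration uses an empty stand-in file in a temporary directory instead of a real library.

```shell
# Hypothetical helper: print the first directory that contains lib<name>.*
# $1 is the library name (e.g. png), the remaining arguments are candidates.
find_lib_dir() {
    lib=$1
    shift
    for dir in "$@"; do
        for f in "$dir"/lib"$lib".*; do
            if [ -e "$f" ]; then
                printf '%s\n' "$dir"
                return 0
            fi
        done
    done
    return 1
}

# Demonstration with a throw-away directory and an empty stand-in file.
demo_lib=$(mktemp -d)
: > "$demo_lib/libpng.so"
find_lib_dir png /nonexistent "$demo_lib"
```

On a real system one would pass candidate directories such as /usr/lib64, /usr/lib and /usr/X11/lib, then use the result for the corresponding PNGLIB or JPEGLIB entry in the Makefile.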

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-17T14:09:46Z

Asboner: /* (versions 0.7.0 and later) */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (libmrf, libbios, GD, GSL, etc.) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
If one doesn't want to list all relevant directories, a convenient approach is to create local ''include'' and ''lib'' directories and use symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/libbios/include/* .
$ ln -s /path/to/libmrf/include/* .
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/libbios/lib/* .
$ ln -s /path/to/libmrf/lib/* .
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

'''NOTE''': If ROOT is installed in the default folder, it will generate a subfolder 'root' both for the include and lib files. In the case of a non-standard location for ROOT, however, this doesn't occur. Hence, an approach similar to the one above can be adopted to properly link the ROOT files.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS], a few variables need to be set before compiling the library. Here is an example of the procedure in a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include the GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file .fusionseqrc is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way as in the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Below is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to execute CGI programs. We tested our tools on an Apache web server. First, a set of variables describing the locations of the different tools needs to be specified:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-12T11:57:20Z

Asboner: /* (versions 0.7.0 and later) -- coming soon */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64bit system is used, add CFLAGS=-m64 in the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed in most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) ====
Starting from version 0.7.0 (alpha), libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (GD, GSL, etc) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
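When several packages live under their own installation prefixes, the two flag strings can also be assembled programmatically. The following is a minimal sketch, assuming each prefix contains the usual include/ and lib/ subdirectories; <code>print_build_flags</code> is a hypothetical helper written for this page and is not part of FusionSeq, libbios, or libmrf.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FusionSeq): print export statements
# for CPPFLAGS and LDFLAGS built from a list of installation prefixes.
# Assumes every prefix has include/ and lib/ subdirectories.
print_build_flags() {
    cppflags=""
    ldflags=""
    for prefix in "$@"; do
        # Strip one trailing slash so the generated paths read cleanly.
        prefix=${prefix%/}
        cppflags="$cppflags -I$prefix/include"
        ldflags="$ldflags -L$prefix/lib"
    done
    # Remove the leading space left by the first concatenation.
    echo "export CPPFLAGS=\"${cppflags# }\""
    echo "export LDFLAGS=\"${ldflags# }\""
}
```

For example, <code>eval "$(print_build_flags /path/to/gsl /path/to/gd)"</code> would set both variables in the current shell before running ./configure.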
One convenient approach is to create local include and lib directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>
Hence, one could simply define:
<pre>
$ export CPPFLAGS="-I/home/user/fusionseq/include"
$ export LDFLAGS="-L/home/user/fusionseq/lib"
</pre>

'''NOTE''': If ROOT is installed in the default folder, it will generate a subfolder 'root' both for the include and lib files. In the case of non-standard location for ROOT, however, this doesn't occur. Hence, the same approach as above can be adopted to properly link root files.
<pre>
$ mkdir ~/fusionseq/include/root
$ cd ~/fusionseq/include/root
$ ln -s /path/to/root/include/* .
$ mkdir ~/fusionseq/lib/root
$ cd ~/fusionseq/lib/root
$ ln -s /path/to/root/lib/* .
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in the same way as the configuration file of previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Here is an example of the configuration file:
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environmental variables such as $HOME, etc.
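Before running the analysis, it may help to scan the configuration file for values that break this rule. The following is a minimal sketch, assuming the KEY="VALUE" layout shown above; <code>check_rc_paths</code> is a hypothetical helper and not part of FusionSeq. It only catches values that contain a <code>$</code> or begin with <code>~</code>; other relative paths are not detected.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FusionSeq): report KEY=VALUE lines
# whose value starts with '~' or contains '$'. Comment lines beginning with
# '//' never match because the pattern is anchored to a KEY= prefix.
check_rc_paths() {
    rc_file=$1
    if grep -n -E '^[A-Z0-9_]+="?(~|[^"]*\$)' "$rc_file"; then
        echo "error: use absolute paths in $rc_file" >&2
        return 1
    fi
    return 0
}
```

Calling <code>check_rc_paths ~/.fusionseqrc</code> prints each offending line with its line number and returns a non-zero status if any are found.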

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two fasta files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to execute CGI programs. We tested our tools on an Apache web server. First, a set of variables describing the locations of the different tools needs to be specified:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set only if the CGIs run on a machine with a different architecture than the core programs.
# On the web server hosting the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
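
The layout above can be prepared in one step before copying the result files into place. The following is a minimal sketch; <code>make_data_dirs</code> is a hypothetical helper, not part of FusionSeq, and its argument is assumed to be the same path that WEB_DATA_DIR points to.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FusionSeq): create the subdirectories
# that the CGI programs expect under the data directory (WEB_DATA_DIR).
make_data_dirs() {
    data_dir=$1
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        # -p creates missing parents and leaves existing directories alone.
        mkdir -p "$data_dir/$sub" || return 1
    done
}
```

For example, <code>make_data_dirs /path/to/data</code> creates any of the six subdirectories that are missing; populating them with the output of the gfr2* and bp2* programs remains the user's responsibility.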

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-12T11:37:20Z

Asboner: /* (versions 0.7.0 and later) -- coming soon */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, parsing, and other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64bit system is used, add CFLAGS=-m64 in the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed in most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Starting from version 0.7.0, libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

'''Note''': if headers and libraries of required packages (GD, GSL, etc) are not installed in a standard location, one would need to set the paths using CPPFLAGS and LDFLAGS. For example:
<pre>
$ export CPPFLAGS="-I/path/to/header/files -I/path/2/header/files ..."
$ export LDFLAGS="-L/path/to/lib/files -L/path/to/lib/files ..."
</pre>
One convenient approach is to create local include and lib directories and populate them with symbolic links to the relevant files. For example:
<pre>
$ mkdir -p ~/fusionseq/include
$ mkdir -p ~/fusionseq/lib
$ cd ~/fusionseq/include
$ ln -s /path/to/gsl/include/* .
$ ln -s /path/to/gd/include/* .
$ cd ~/fusionseq/lib
$ ln -s /path/to/gsl/lib/* .
$ ln -s /path/to/gd/lib/* .
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in the same way as the configuration file of previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Here is an example of the configuration file:
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environmental variables such as $HOME, etc.

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environmental variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make // for the core analysis elements
$ make all // for the core analysis elements as well as the auxiliary programs
$ make cgi // for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy // for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
$ export FUSIONSEQCGITARGET="-b target_architecture" // optional: to be specified only if using the CGIs on a machine with
a different architecture than the core programs. On the webserver with the CGIs, execute gcc -dumpmachine to get the target_architecture.
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-11T18:08:21Z

Asboner: /* Auxiliary modules */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings, arrays, parsers, and more, related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Starting from version 0.7.0, libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file much as the configuration file was edited in previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
 $ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Here is an example of the configuration file:
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
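
The absolute-path rule can be checked before running the analysis. The following is a minimal sketch, assuming a POSIX shell; the function name <code>check_absolute_paths</code> is hypothetical and is not part of FusionSeq. It only looks for the characters <code>$</code> and <code>~</code> outside comment lines, so it is a rough heuristic rather than a full validation of the file.

```shell
# Hypothetical helper (not part of FusionSeq): print every non-comment line
# of a configuration file that contains "$" or "~", which suggests an
# environment variable or a home-directory shortcut instead of an absolute path.
# Lines starting with "//" are treated as comments and ignored.
# Returns 0 if nothing suspicious is found, 1 if such lines exist,
# and 2 if the file cannot be read.
check_absolute_paths() {
    conf_file="$1"
    [ -r "$conf_file" ] || return 2
    if grep -v '^[[:space:]]*//' "$conf_file" | grep '[$~]'; then
        return 1
    fi
    return 0
}
```

For example, <code>check_absolute_paths /path/2/home/.fusionseqrc</code> prints the offending lines, if any.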

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis), and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make // for the core analysis elements
$ make all // for the core analysis elements as well as the auxiliary programs
$ make cgi // for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy // for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile:
<pre>
GDDIR = /path/to/gd/gd-2.0.35/
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
$ export FUSIONSEQCGITARGET="-b target_architecture" // optional: to be specified only if using the CGIs on a machine with
a different architecture than the core programs. On the webserver with the CGIs, execute gcc -dumpmachine to get the target_architecture.
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
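
The directory layout above can be created in one step. The following is a minimal sketch, assuming a POSIX shell; the function name <code>make_web_data_dirs</code> is hypothetical and is not part of FusionSeq. Its argument should be the same absolute path that is assigned to WEB_DATA_DIR. It only creates the empty directories; populating them remains the user's task.

```shell
# Hypothetical helper (not part of FusionSeq): create the sub-directories
# that the CGI programs expect inside the data directory (WEB_DATA_DIR).
make_web_data_dirs() {
    data_dir="$1"
    if [ -z "$data_dir" ]; then
        echo "usage: make_web_data_dirs /absolute/path/to/data" >&2
        return 1
    fi
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        # -p creates missing parents and leaves existing directories untouched
        mkdir -p "$data_dir/$sub" || return 1
    done
}
```

For example, <code>make_web_data_dirs /path/to/data</code>.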

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the JQuery and JQuery UI Javascript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

FusionSeq Download

2011-05-11T17:49:35Z

Asboner: /* Early acces version */

{{FusionSeqHeader}}

Before downloading FusionSeq, please read the [[FusionSeq_Requirements|requirements]] first.
===Download ===
The latest stable version (2010.10.30) is: [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_0.6.1.tar.gz FusionSeq ver. 0.6.1].

Previous versions can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/oldversions/ here]. See the [[FusionSeq_ChangeLog|changelog]] for more details.

===Early access version===
Here you can download the early access version of FusionSeq. This is considered a pre-release version, thus we would much appreciate any feedback. Please also note that starting from version 0.7.0, we adopted a different approach to install and run FusionSeq. Specifically:
# Installation from source code is simplified and uses the standard autoconf/automake, etc. tools. See [[Installation and Configuration of FusionSeq]].
# The configuration file is now an external text file. This enables the use of binary files on supported systems, without the need to build FusionSeq from source.

=====Source code=====
* [http://rnaseq.gersteinlab.org/fusionseq/tarballs/fusionseq-0.7.0.tar.gz version 0.7.0 (alpha)]

=====Binaries=====
* coming soon

===Important note for both the stable release and the early access one===
<pre>
THIS PACKAGE (FusionSeq) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>
Also, consult the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage. We have chosen a [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].

FusionSeq Requirements

2011-05-07T11:53:39Z

Asboner: /* Human genome GRCh37/hg19 */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because of its modularity, different programs need specific libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools that one would need to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the [[FusionSeq_Download#Binaries|binaries]] (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.
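
A quick way to verify the PATH requirement is sketched below, assuming a POSIX shell. The function name <code>missing_tools</code> is hypothetical and is not part of FusionSeq; it only checks that the executables can be found and does not check their versions.

```shell
# Hypothetical helper (not part of FusionSeq): print the names of the given
# executables that cannot be found on PATH.
# Returns 0 when all of them are found, 1 otherwise.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Print the missing names (if any) without the leading space
    echo "${missing# }"
    [ -z "$missing" ]
}
```

For example, <code>missing_tools bowtie bowtie-build blat twoBitToFa</code> prints an empty line if everything is available.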

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. As a reference, we tested FusionSeq with gsl-1.14.
=====(versions 0.7.0 and later)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
Please note that these libraries are for the "early access" version of FusionSeq.

=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please ensure that you have read all the requirements (including [[#Data_Requirements|Data Requirements]]) and have downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq. As a reference, we tested FusionSeq with gd-2.0.35.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be properly indexed to be used by bowtie. Please see the bowtie instructions for performing this operation. As a rough guide, you would need to run something like:
$ bowtie-build -f hg18_nh.fa /path/to/bowtie/Index/hg18_nh/hg18_nh
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.
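
The concatenation described above can be sketched as follows, assuming a POSIX shell. The function name <code>build_nh_fasta</code> is hypothetical and is not part of FusionSeq. It also assumes that chromFa.zip has already been unpacked into a directory of per-chromosome files named <code>chr*.fa</code>, with haplotype files containing <code>_hap</code> and unplaced sequences containing <code>_random</code> in their names; please check the actual file names before relying on it.

```shell
# Hypothetical helper (not part of FusionSeq): concatenate the per-chromosome
# FASTA files found in a directory into a single file, skipping haplotype
# ("_hap") and "_random" sequences.
# Assumption: files are named chr*.fa and the output file name does not match
# that pattern inside the same directory.
build_nh_fasta() {
    chrom_dir="$1"
    out_file="$2"
    : > "$out_file" || return 1          # create or truncate the output file
    for fa in "$chrom_dir"/chr*.fa; do
        [ -f "$fa" ] || continue         # no match: the glob stays literal
        case "$(basename "$fa")" in
            *_random*|*_hap*) continue ;;
        esac
        cat "$fa" >> "$out_file" || return 1
    done
}
```

For example, <code>build_nh_fasta /path/to/chromFa hg18_nh.fa</code> produces the input for the bowtie-build command shown above.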

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
  /path/to/bowtie/Index/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)].

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-07T11:53:23Z

Asboner: /* Provided */

FusionSeq Requirements

2011-05-07T11:53:00Z

Asboner: /* External */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because of its modularity, different programs need specific libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools that one would need to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the [[FusionSeq_Download#Binaries|binaries]] (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. As a reference, we tested FusionSeq with gsl-1.14.
=====(versions 0.7.0 and later)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
Please note that these libraries are for the "early access" version of FusionSeq.

=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please ensure that you have read all the requirements (including [[#Data_Requirements|Data Requirements]]) and have downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq. As a reference, we tested FusionSeq with gd-2.0.35.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be properly indexed to be used by bowtie. Please see the bowtie instructions for performing this operation. As a rough guide, you would need to run something like:
$ bowtie-build -f hg18_nh.fa /path/to/bowtie/Index/hg18_nh/hg18_nh
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.
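
The concatenation described above can be sketched as follows. This is only an illustration: it assumes that chromFa.zip unpacks into one FASTA file per sequence, that haplotype files are named <code>*_hap*.fa</code> and "random" files <code>*_random.fa</code>, and the function names and directory paths are hypothetical. Please check the file names in your own download before relying on it.

```shell
#!/bin/sh
# List the per-chromosome FASTA files in a directory, skipping haplotype
# and "random" sequences (ASSUMED to be named *_hap*.fa and *_random.fa).
list_main_chromosomes() {
    for f in "$1"/chr*.fa; do
        [ -e "$f" ] || continue          # no match: the glob stays literal
        case "$(basename "$f")" in
            *_random.fa|*_hap*.fa) ;;    # excluded from the index
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# Concatenate the selected files into a single FASTA file.
# $1 = directory with the unpacked chromFa files, $2 = output file
build_nh_fasta() {
    : > "$2"                             # start from an empty output file
    list_main_chromosomes "$1" | while IFS= read -r f; do
        cat "$f" >> "$2"
    done
}

# Example (directory and file names are placeholders):
# unzip chromFa.zip -d chromFa
# build_nh_fasta chromFa hg18_nh.fa
```

The output file should be written outside the input directory, or at least under a name that does not match <code>chr*.fa</code>, so that it is not picked up as an input on a second run.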

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.
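
A quick way to verify this before running the analysis is sketched below. The function name and the directory path are hypothetical; only the two file names come from the list above.

```shell
#!/bin/sh
# Succeed only if the interval file and the sequence file of the composite
# model are both present in the given directory ($1).
composite_model_ready() {
    model=knownGeneAnnotationTranscriptCompositeModel
    [ -f "$1/$model.txt" ] && [ -f "$1/$model.fa" ]
}

# Example: /path/to/annotation is a placeholder for the unpacked tarball.
if composite_model_ready /path/to/annotation; then
    echo "composite model files found"
else
    echo "composite model files missing or split across directories"
fi
```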

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it is expected to work.

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-07T11:52:09Z

Asboner: /* Drawing tools */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because FusionSeq is modular, different programs need different libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools needed to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the [[FusionSeq_Download#Binaries|binaries]] (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.
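
The PATH requirement above can be checked with a short script such as the following sketch. The function name <code>missing_tools</code> is hypothetical; <code>command -v</code> is the standard POSIX way to ask the shell whether a command name can be resolved.

```shell
#!/bin/sh
# Print the name of every tool that the shell cannot resolve.
# Prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the name cannot be resolved
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The executables named in this section.
missing=$(missing_tools bowtie blat twoBitToFa)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All alignment tools were found on PATH."
fi
```

If a tool is reported as missing, add the directory that contains its executable to the PATH variable in your shell start-up file and open a new shell.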

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. As a reference, we tested FusionSeq with gsl-1.14.
=====(versions 0.7.0 and later)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
Please note that these libraries are for the "early access" version of FusionSeq.

=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please make sure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq. As a reference, we tested FusionSeq with gd-2.0.35.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be indexed before bowtie can use it. Please see the bowtie instructions for this operation. As an indication, you would need to run something like:
<pre>
$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
</pre>
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it is expected to work.

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-07T11:51:38Z

Asboner: /* Scientific and bioinformatics libraries */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because FusionSeq is modular, different programs need different libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools needed to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the [[FusionSeq_Download#Binaries|binaries]] (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. As a reference, we tested FusionSeq with gsl-1.14.
=====(versions 0.7.0 and later)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
Please note that these libraries are for the "early access" version of FusionSeq.

=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please make sure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be indexed before bowtie can use it. Please see the bowtie instructions for this operation. As an indication, you would need to run something like:
<pre>
$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
</pre>
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it is expected to work.

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-07T11:50:15Z

Asboner: /* Software Requirements */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because FusionSeq is modular, different programs need different libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools needed to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the [[FusionSeq_Download#Binaries|binaries]] (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS].
=====(versions 0.7.0 and later)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
Please note that these libraries are for the "early access" version of FusionSeq.

=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please make sure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be indexed before bowtie can use it. Please see the bowtie instructions for this operation. As an indication, you would need to run something like:
<pre>
$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
</pre>
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it is expected to work.

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-07T11:46:52Z

Asboner: /* (versions 0.7.0 and later -- coming soon) */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because FusionSeq is modular, different programs need different libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools needed to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the binaries (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS].
=====(versions 0.7.0 and later)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
Please note that these libraries are for the "early access" version of FusionSeq.

=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please make sure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be indexed before bowtie can use it. Please see the bowtie instructions for this operation. As an indication, you would need to run something like:
<pre>
$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
</pre>
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it is expected to work.

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-07T11:45:21Z

Asboner: /* (versions up to 0.6.1) */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because FusionSeq is modular, different programs need different libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools needed to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the binaries (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS].
=====(versions 0.7.0 and later -- coming soon)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] (or libbios and libmrf -- depending on the version) can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please make sure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and plotting the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be indexed before bowtie can use it. Please see the bowtie instructions for this operation. As an indication, you would need to run something like:
<pre>
$ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
</pre>
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it is expected to work.

<center>[[#top|Top]]</center>

FusionSeq

2011-05-07T11:43:57Z

Asboner:

<center>[http://rnaseq.gersteinlab.org/fusionseq/ FusionSeq main web page]</center>
This document provides the information for downloading, installing, compiling and running FusionSeq. Please note that these tools were tested on a multi-node computing cluster with Red Hat Linux as the operating system and PBS as the scheduler. FusionSeq programs are written in C and should compile on most Unix/Linux platforms. We used the gcc compiler (version 3.4.6 20060404) to compile the source code.
FusionSeq is not a plug-and-play program: it requires the user to compile, install and run a set of programs. Please read the [[Requirements|requirements]] '''before''' [[#Download|downloading]] FusionSeq.

'''IMPORTANT''': Starting from version 0.7.0, the configuration is read from an external file, which allows users to simply download the binary files for their platform. We still provide the source code, which now uses the more standard autoconf/automake tools, thus simplifying the installation of FusionSeq. Please note that at the time of writing (May 7th, 2011), version 0.7.0 is still in alpha. Any feedback will be very much appreciated.

If you have any questions, please check the [[FusionSeq_FAQ|FAQ]] or send an email to [mailto:fusionseq-faq@gersteinlab.org fusionseq-faq@gersteinlab.org]

==[[FusionSeq Requirements]]==
List of required programs and data. Please read this section '''before''' downloading FusionSeq.

==[[FusionSeq Download|Download]]==
Links to FusionSeq source code and data sets.

==[[Installation and Configuration of FusionSeq]]==
Instructions to install and configure FusionSeq.

==[[How to execute FusionSeq]]==
An example workflow of FusionSeq.

==[[FusionSeq_List of programs|List of programs]]==
Description of all the FusionSeq programs.

==[[FusionSeq_Test Datasets|Test Datasets]]==
A few datasets to test FusionSeq installation.

==[http://dynamic.gersteinlab.org/people/asboner/FusionSeq/geneFusions_cgi Demo]==
You can see some of the results of FusionSeq as described in the paper. Use the sample IDs reported in Table 1, e.g. 106_T, 1700_D.

==[[FusionSeq_FAQ|Frequently Asked Questions (FAQ)]]==
Solutions to common problems. If your issue is not described in the [[FusionSeq_FAQ|FAQ]], please send an email to [mailto:fusionseq-faq@gersteinlab.org fusionseq-faq@gersteinlab.org]

==[[FusionSeq Gallery|Gallery]]==
Some figures about FusionSeq.

==[[FusionSeq_Papers|In the news et al.]]==
News or scientific papers referring to FusionSeq.

FusionSeq Download

2011-05-07T11:36:02Z

Asboner:

{{FusionSeqHeader}}

Before downloading FusionSeq, please read the [[FusionSeq_Requirements|requirements]] first.
===Download===
The latest stable version (2010.10.30) is: [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_0.6.1.tar.gz FusionSeq ver. 0.6.1].

Previous versions can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/oldversions/ here]. See the [[FusionSeq_ChangeLog|changelog]] for more details.

===Early access version===
Here you can download the early access version of FusionSeq. This is considered a pre-release version, thus we would much appreciate any feedback. Please also note that starting from version 0.7.0, we adopted a different approach to install and run FusionSeq. Specifically:
# Installation from source code is simplified and uses the standard autoconf/automake tools. See [[Installation and Configuration of FusionSeq]].
# The configuration file is now an external text file. This enables the use of binary files on supported systems, without the need to build FusionSeq from scratch.

=====Source code=====
* [http://rnaseq.gersteinlab.org/fusionseq/tarballs/fusionseq-0.7.0.tar.gz version 0.7.0 (alpha)]

=====Binaries=====
* coming soon

===Important note for both the stable release and the early access one===
<pre>
THIS PACKAGE (FusionSeq) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>
Also, consult the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage. We have chosen a [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].

FusionSeq Download

2011-05-07T00:55:24Z

Asboner:

{{FusionSeqHeader}}

The latest stable version (2010.10.30) is: [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_0.6.1.tar.gz FusionSeq ver. 0.6.1].

Previous versions can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/oldversions/ here]. See the [[FusionSeq_ChangeLog|changelog]] for more details.

'''Important Note'''
<pre>
THIS PACKAGE (FusionSeq) IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
</pre>
Also, consult the [http://www.gersteinlab.org/misc/permissions.html Permissions Page] on the Gerstein Lab webpage. We have chosen a [http://creativecommons.org/licenses/by-nc/2.5/legalcode Creative Commons license (Attribution-NonCommercial)].

Installation and Configuration of FusionSeq

2011-05-07T00:53:54Z

Asboner: /* Installing and configuring libbios, libmrf, or BIOS */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, for parsing, and for other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>
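
For a 64-bit build, the CFLAGS setting mentioned above can be passed to <code>./configure</code> as a VAR=VALUE argument. The sketch below is illustrative only: the helper <code>arch_cflags</code> is a hypothetical name, and it assumes that only x86_64/amd64 machines should receive <code>-m64</code>.

```shell
#!/bin/sh
# arch_cflags [MACHINE]
# Hypothetical helper: prints "CFLAGS=-m64" when MACHINE (default: the
# output of `uname -m`) is a 64-bit x86 machine, and prints nothing otherwise.
arch_cflags() {
    machine=${1:-$(uname -m)}
    case "$machine" in
        x86_64|amd64) printf '%s\n' 'CFLAGS=-m64' ;;
        *) : ;;   # other machines: leave the compiler defaults untouched
    esac
}
```

It could then be used as <code>./configure $(arch_cflags) --prefix=/path/to/installation/</code> for both GSL and GD, so that all libraries are built for the same architecture.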

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Starting from version 0.7.0, libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>
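
Whether these variables took effect can be verified with a short check. This is a sketch under stated assumptions: <code>check_root_env</code> is a hypothetical name, and it relies on ROOT providing the <code>root-config</code> program in its bin/ directory.

```shell
#!/bin/sh
# check_root_env
# Hypothetical helper: succeeds only if ROOTSYS points to a directory and
# root-config (shipped with ROOT) can be found through PATH.
check_root_env() {
    if [ -z "${ROOTSYS:-}" ]; then
        echo "ROOTSYS is not set" >&2
        return 1
    fi
    if [ ! -d "$ROOTSYS" ]; then
        echo "ROOTSYS is not a directory: $ROOTSYS" >&2
        return 1
    fi
    if ! command -v root-config >/dev/null 2>&1; then
        echo "root-config is not on PATH" >&2
        return 1
    fi
    return 0
}
```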

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following commands to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way to the configuration file of previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
<pre>
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
</pre>

An example of the configuration file follows.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
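
The absolute-path rule can be checked mechanically. The following is a minimal sketch, not part of FusionSeq: the function name <code>check_fusionseqrc</code> is hypothetical, and it assumes that every key ending in _DIR should hold either an absolute path or a URL, as in the example above.

```shell
#!/bin/sh
# check_fusionseqrc FILE
# Hypothetical helper: scans a KEY=VALUE configuration file and reports
#   - any value containing "$" or "~" (shell variables are not expanded), and
#   - any *_DIR value that is neither an absolute path nor an http(s) URL.
# Returns non-zero if at least one suspicious entry is found.
check_fusionseqrc() {
    awk '
        /^[ \t]*\/\// { next }               # skip // comment lines
        !/=/          { next }               # skip lines without KEY=VALUE
        {
            key = substr($0, 1, index($0, "=") - 1)
            val = substr($0, index($0, "=") + 1)
            gsub(/"/, "", val)               # drop the surrounding quotes
            if (val ~ /[$~]/) {
                printf "%s: contains a shell variable or ~: %s\n", key, val
                bad = 1
            } else if (key ~ /_DIR$/ && val !~ /^(\/|https?:\/\/)/) {
                printf "%s: not an absolute path: %s\n", key, val
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, <code>check_fusionseqrc "$HOME/.fusionseqrc"</code> prints one line per suspicious entry and returns a non-zero status if any is found.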

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[FusionSeq_Requirements|Requirements]] page to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphic libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# optional: to be specified only if using the CGIs on a machine with a different architecture than the core programs.
# On the web server with the CGIs, execute gcc -dumpmachine to get the target_architecture.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
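
The expected layout can be prepared in advance so that the output of each program has a place to go. This is only a sketch: the function name <code>make_web_data_dirs</code> is hypothetical, and its argument stands for whatever directory WEB_DATA_DIR is set to.

```shell
#!/bin/sh
# make_web_data_dirs DIR
# Hypothetical helper: creates the six subdirectories that the CGI programs
# expect under the data directory. Existing directories are left untouched.
make_web_data_dirs() {
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        mkdir -p "$1/$sub" || return 1
    done
    return 0
}
```

For example, <code>make_web_data_dirs /path/to/data</code> creates the directories only; populating them with the files generated by the programs listed above remains the responsibility of the user.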

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the jQuery and jQuery UI JavaScript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-07T00:52:27Z

Asboner: /* Installing and configuring FusionSeq */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of general-purpose definitions for manipulating strings and arrays, for parsing, and for other tasks related to bioinformatic analysis. It requires the GSL library, which, on most systems, can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
====(versions 0.7.0 and later) -- coming soon====
Starting from version 0.7.0, libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] a few variables need to be set before compiling the library. Here is an example of the procedure on a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==
Please refer to [[#(versions up to 0.6.1)|version 0.6.1]] for the instructions for the stable version.

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, it is sufficient to run the following commands to install FusionSeq:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in a similar way to the configuration file of previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
<pre>
$ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc
</pre>

An example of the configuration file follows.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[FusionSeq_Requirements|Requirements]] page to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.

Once the configuration file is ready, the core modules can be compiled and installed. However, for compiling the auxiliary modules, the Makefile needs to be properly edited (see [[#Auxiliary modules|Auxiliary modules]]). Moreover, for the visualization tools, i.e. the CGIs, some additional variables need to be defined (see [[#Installing CGIs|Installing CGIs]]). Once the configuration file is set up, the compilation just requires:
<pre>
$ make          # for the core analysis elements
$ make all      # for the core analysis elements as well as the auxiliary programs
$ make cgi      # for compiling the visualization/summary tools (see Installing CGIs)
$ make deploy   # for installing the visualization/summary tools to the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location and the connection between the paired reads; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphic libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly defined, the installation goes on as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set FUSIONSEQCGITARGET only if the CGIs run on a machine with a different
# architecture than the core programs. To get target_architecture, execute
# gcc -dumpmachine on the web server hosting the CGIs.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.
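
The directory layout listed above can be created in one step. The following is a minimal sketch and not part of FusionSeq: the function name make_web_data_dirs is hypothetical, and the directory passed to it stands in for the value of WEB_DATA_DIR.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FusionSeq): create the six
# subdirectories that the CGI programs expect under the data directory.
# Usage: make_web_data_dirs /path/to/data
make_web_data_dirs() {
    data_dir=$1
    for sub in ALIGNMENTS BED FASTA GFF IMAGES WIGS; do
        mkdir -p "$data_dir/$sub" || return 1
    done
}
```

The function only creates empty directories; filling them with the output of the gfr2* and bp2* programs remains the responsibility of the user.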

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the jQuery and jQuery UI JavaScript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64-bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-07T00:49:05Z

Asboner: /* Installing and configuring FusionSeq */

{{FusionSeqHeader}}
==Installing GSL and GD libraries==
In order to install FusionSeq, these external packages need to be installed first (see [[Requirements]]). Please follow the instructions provided with each package. After they are installed, the first step for FusionSeq is the installation and configuration of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]. BIOS is a C library of useful general definitions for strings, arrays, parsers, and more, related to bioinformatic analysis. It requires the GSL library, which on most systems can be installed with the following commands (for details, please refer to the specific instructions at the [http://www.gnu.org/software/gsl/ GNU Scientific Library] website):
<pre>
$ cd /path/to/gslSource/
$ ./configure --prefix=/path/to/installation/
$ make
$ make install
</pre>

If a 64-bit system is used, add CFLAGS=-m64 to the ./configure command. Similarly, the [http://www.boutell.com/gd/ GD library] can be installed on most systems with:
<pre>
$ cd /path/to/gdSource/
$ ./configure --prefix=/path/to/installation/ --with-jpeg=/path/to/jpegLib/
$ make
$ make install
</pre>

Although the GD library is NOT required for the core analysis, if you want to use it, please make sure that png, jpeg, zlib, freetype 2.x, and xpm are properly installed and linked. These are required by [[FusionSeq_List of programs#gfr2images|gfr2images]] in order to create a schematic illustration depicting the connected regions of the two genes. See [[#Installing_and_configuring_FusionSeq|Installing and configuring FusionSeq]] for setting the appropriate environment variables.

'''Note''': we used gsl-1.14 and gd-2.0.35.
<center>[[#top|Top]]</center>

==Installing and configuring libbios, libmrf, or BIOS==
====(versions 0.7.0 and later) -- coming soon====
Starting from version 0.7.0, libbios and libmrf need to be installed, in this order. Typically, one would run:
<pre>
$ cd /path/to/libbios/
$ ./configure --prefix=/path/to/libbios
$ make
$ make install
</pre>

Similarly, for libmrf, one would run:
<pre>
$ cd /path/to/libmrf/
$ ./configure --prefix=/path/to/libmrf
$ make
$ make install
</pre>

====(versions up to 0.6.1)====
To install [http://rnaseq.gersteinlab.org/doc/bios/ BIOS], a few variables need to be set before compiling the library. Here is an example of the procedure in a bash shell:
<pre>
$ export BIOINFOCONFDIR=/pathToBios/conf/
$ export BIOINFOGSLDIR=/pathToGsl/
$ cd /pathToBios/
$ make
$ make prod
</pre>

Please refer to [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] documentation for additional information.

<center>[[#top|Top]]</center>

==Installing and configuring ROOT==
To install [http://root.cern.ch/drupal/ ROOT], please follow the instructions on the website. You may also include the GSL library when compiling, but it is not a requirement. Once ROOT is installed, a few variables need to be defined in order to properly use this library with FusionSeq.
<pre>
$ export ROOTSYS=/path/to/ROOT/
$ export PATH=$ROOTSYS/bin:$PATH
</pre>
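
Before compiling FusionSeq it can help to confirm that ROOTSYS points to a usable installation. The sketch below is not part of FusionSeq; the function name check_rootsys is hypothetical, and it assumes that the ROOT installation provides the root-config script in its bin/ directory.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given directory (default: $ROOTSYS)
# exists and contains an executable bin/root-config.
# Usage: check_rootsys            (uses $ROOTSYS)
#        check_rootsys /path/to/ROOT
check_rootsys() {
    root_dir=${1:-${ROOTSYS:-}}
    [ -n "$root_dir" ] && [ -d "$root_dir" ] && [ -x "$root_dir/bin/root-config" ]
}
```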

<center>[[#top|Top]]</center>

==Installing and configuring FusionSeq==

====(versions 0.7.0 and later) -- coming soon====
Once the required packages are properly installed, to install FusionSeq it is sufficient to run:
<pre>
$ tar xzvf fusionseq-0.7.0.tar.gz
$ cd fusionseq-0.7.0/
$ ./configure --prefix=/path/to/fusionseq/
$ make
$ make optional
$ make install
</pre>

This procedure installs all core and optional programs into bin/ and the libraries into lib/. Moreover, a configuration file, .fusionseqrc, is copied to your home directory. This is a text file containing the configuration information as KEY=VALUE pairs. The user needs to edit this file in the same way as the configuration file of the previous versions. '''IMPORTANT''': the location of this file must be assigned to FUSIONSEQ_CONFPATH, e.g.:
 $ export FUSIONSEQ_CONFPATH=/path/2/home/.fusionseqrc

Here is an example of the configuration file.
<pre>
.fusionseqrc:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/path/to/bowtie/Indexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/path/to/blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/path/to/blat/Data/Dir"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/path/to/transcript/Composite/Model"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
ANNOTATION_DIR="/path/to/annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
KNOWN_GENE_XREF_FILENAME="kgXref.txt"
// conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt"

// Location and filename of the ribosomal library
RIBOSOMAL_DIR="/path/to/ribosomal/Dir"
RIBOSOMAL_FILENAME="ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
WEB_URL_CGI="http://cgiURL"
// location of the data directory on the web server, as seen from the web server
WEB_DATA_DIR="/path/to/data"
// URL of the data directory on the web server
WEB_DATA_LINK="http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
UCSC_GENOME_BROWSER_FLANKING_REGION=500
// URL of the public website (non cgi)
WEB_PUB_DIR="http://publicURL"
// Location of the structural data for Circos
WEB_SDATA_DIR="/path/to/structural/Data/Circos"
// Location of Circos installation
WEB_CIRCOS_DIR="/path/to/circos"
</pre>

'''NB:''' use '''absolute''' paths in .fusionseqrc, i.e. avoid using environment variables such as $HOME, etc.
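
To double-check an edited configuration file, individual values can be read back from the shell. The following is a minimal sketch of the KEY=VALUE layout shown in the example above; it is not the parser used by FusionSeq, and the function name rc_get is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to KEY in a KEY=VALUE file,
# skipping "//" comment lines and removing surrounding double quotes.
# Usage: rc_get BOWTIE_GENOME /path/2/home/.fusionseqrc
rc_get() {
    awk -v key="$1" '
        /^[ \t]*\/\// { next }                  # skip comment lines
        {
            eq = index($0, "=")
            if (eq == 0) next                   # not a KEY=VALUE line
            if (substr($0, 1, eq - 1) != key) next
            value = substr($0, eq + 1)
            gsub(/^"|"[ \t]*$/, "", value)      # strip surrounding quotes
            print value
            exit
        }
    ' "$2"
}
```

An empty result means that the key is not defined in the file.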

====(versions up to 0.6.1) ====
FusionSeq is composed of several programs divided into a set of core modules (to identify the candidate fusion transcripts), and a set of additional modules (to create images, BED, GFF and other auxiliary files for visualization and analysis) and CGIs for visualization of the results. To run the analysis, only the core modules are required. For the auxiliary modules, one needs to specify the location of the Drawing tool libraries by editing the specific section in the Makefile. For installing the visualization tools, please read [[#Installing CGIs|Installing CGIs]].

Before starting with the installation of FusionSeq, '''please read''' the [[#FusionSeq_Requirements|Requirements]] section to make sure all data sets and external tools are available.

To install FusionSeq, the file geneFusionsConfig.h needs to be edited to specify the locations of the annotation files and other required tools:
<pre>
geneFusionConfig.h:

// --------------------------------- This section is required ---------------------------------
// Location of the bowtie indexes of the human genome and the composite model
#define BOWTIE_INDEXES "/path2bowtieIndexes"
// the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
#define BOWTIE_GENOME "hg18_nh"
// the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
#define BOWTIE_COMPOSITE "hg18_knownGeneAnnotationTranscriptCompositeModel"

// Pointer to the program twoBitToFa part of the blat suite
#define BLAT_TWO_BIT_TO_FA "/path2blat/twoBitToFa"
// Location and filename of the reference genome in 2bit format (to be used by blat)
#define BLAT_DATA_DIR "/path2blatDataDir"
#define BLAT_TWO_BIT_DATA_FILENAME "hg18.2bit"

// Location and name of the transcript composite model sequence and interval files
#define TRANSCRIPT_COMPOSITE_MODEL_DIR "/path2transcriptCompositeModel"
#define TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME "knownGeneAnnotationTranscriptCompositeModel.fa"
#define TRANSCRIPT_COMPOSITE_MODEL_FILENAME "knownGeneAnnotationTranscriptCompositeModel.txt"

// location of the annotation files
#define ANNOTATION_DIR "/path2annotationFiles"
// conversion of knownGenes to gene symbols, description, etc.
#define KNOWN_GENE_XREF_FILENAME "kgXref.txt"
// conversion of knownGenes to TreeFam
#define KNOWN_GENE_TREE_FAM_FILENAME "knownToTreefam.txt"

// Location and filename of the ribosomal library
#define RIBOSOMAL_DIR "/path2ribosomalDir"
#define RIBOSOMAL_FILENAME "ribosomal.2bit"

// ----------------------- This section is optional: visualization tools -------------------------
// URL of the cgi directory on the web server
#define WEB_URL_CGI "http://cgiURL"
// location of the data directory on the web server, as seen from the web server
#define WEB_DATA_DIR "/path2data"
// URL of the data directory on the web server
#define WEB_DATA_LINK "http://dataURL"
// Number of nucleotides flanking the region (for UCSC Genome Browser)
#define UCSC_GENOME_BROWSER_FLANKING_REGION 500
// URL of the public website (non cgi)
#define WEB_PUB_DIR "http://publicURL"
// Location of the structural data for Circos
#define WEB_SDATA_DIR "/path2structuralDataCircos"
// Location of Circos installation
#define WEB_CIRCOS_DIR "/path2circos"
</pre>

'''NB:''' use '''absolute''' paths in geneFusionsConfig.h, i.e. avoid using environment variables such as $HOME, $FUSIONSEQWEBDIR, etc.
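
Because these values are compiled into the programs as C string literals, a shell variable such as $HOME would not be expanded. A quick scan of the header before compiling can catch this; the sketch below is an illustration only, and the function name check_no_env_refs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every #define line that contains a "$"
# (with its line number) and return 1 if any is found, 0 otherwise.
# Usage: check_no_env_refs /path/to/config_header.h
check_no_env_refs() {
    if grep -n '^[[:space:]]*#define.*[$]' "$1"; then
        return 1
    fi
    return 0
}
```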

Once the configuration file is ready, the core modules can be compiled and installed. Compiling the auxiliary modules additionally requires editing the Makefile (see [[#Auxiliary modules|Auxiliary modules]]), and the visualization tools, i.e. the CGIs, require some additional variables to be defined (see [[#Installing CGIs|Installing CGIs]]). Compilation then just requires:
<pre>
$ make          # core analysis modules
$ make all      # core analysis modules and auxiliary programs
$ make cgi      # compile the visualization/summary tools (see Installing CGIs)
$ make deploy   # install the visualization/summary tools on the web server
</pre>

<center>[[#top|Top]]</center>

==Auxiliary modules==
These modules generate a set of useful data files for interpreting and visualizing the results. For example, [[FusionSeq_List of programs#gfr2gff|gfr2gff]] generates the GFF files that can be displayed with the UCSC Genome Browser to show the location of the paired reads and the connection between them; [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]] generates two FASTA files containing the sequences of the reads, one for each end. Most of these modules do not require additional configuration, with the exception of [[FusionSeq_List of programs#gfr2images|gfr2images]]. This tool creates a schematic for each candidate showing which exons are connected by paired-end reads. It uses graphics libraries whose locations need to be specified in the Makefile (section "optional parameters"). Here is an example of how to edit the Makefile.
<pre>
GDDIR = /path/to/gd/gd-2.0.35
GDINC = -I$(GDDIR)/include
GDLIB = -L$(GDDIR)/lib
PNGLIB = -L/usr/lib64
JPEGLIB = -L/usr/X11/lib
ZLIB = -L/usr/lib
FREETYPELIB = -L/usr/lib64
</pre>
Note that GDINC and GDLIB are automatically defined once GDDIR is set. Once the Makefile is properly edited, the installation proceeds as usual. However, to fully exploit the auxiliary modules, the CGIs should also be installed.

<center>[[#top|Top]]</center>

==Installing CGIs==
The visualization tools are useful for displaying the results of the analysis, although they are completely independent of the analysis itself. These tools require a web server able to run CGI programs. We tested our tools on an Apache web server. First, a set of variables needs to be specified, describing the locations of the different tools:
<pre>
$ export FUSIONSEQWEBSERVER=web_server_name
$ export FUSIONSEQWEBUSER=webuserID
$ export FUSIONSEQCGIDIR=/path/to/cgiDir
# Optional: set FUSIONSEQCGITARGET only if the CGIs run on a machine with a different
# architecture than the core programs. To get target_architecture, execute
# gcc -dumpmachine on the web server hosting the CGIs.
$ export FUSIONSEQCGITARGET="-b target_architecture"
</pre>
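
Since make cgi and make deploy rely on these variables, it can be convenient to verify that the required ones are set before building. This is a minimal sketch under that assumption, not a FusionSeq tool; the function name check_cgi_env is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print the name of every required variable
# that is unset or empty, and return 1 if any is missing.
# FUSIONSEQCGITARGET is optional, so it is not checked.
# Usage: check_cgi_env && make cgi
check_cgi_env() {
    missing=0
    for name in FUSIONSEQWEBSERVER FUSIONSEQWEBUSER FUSIONSEQCGIDIR; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            missing=1
        fi
    done
    return $missing
}
```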
The CGI programs assume a certain directory structure in your data directory (WEB_DATA_DIR):
<pre>
./ALIGNMENTS
./BED
./FASTA
./GFF
./IMAGES
./WIGS
</pre>
BED, FASTA, GFF, and IMAGES contain the data generated by [[FusionSeq_List of programs#gfr2bed|gfr2bed]], [[FusionSeq_List of programs#gfr2fasta|gfr2fasta]], [[FusionSeq_List of programs#gfr2gff|gfr2gff]] and [[FusionSeq_List of programs#gfr2images|gfr2images]], respectively. ALIGNMENTS and WIGS contain the results of the junction-sequence identification analysis, namely [[FusionSeq_List of programs#bp2alignment|bp2alignment]] and [[FusionSeq_List of programs#bp2wig|bp2wig]]. The user is required to ensure that these directories contain the expected files.

One of the CGI applications is SeqViz, which is used to visualize Paired-End RNA-Seq reads. SeqViz requires the software package Circos in order to perform visualization. Please visit the [http://mkweb.bcgsc.ca/circos/ Circos] website to download the latest version of Circos and for detailed information on installing Circos and the required Perl modules.

There is a set of CSS style sheets, JavaScript scripts, and annotation files required for the CGIs. They may be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_WebFiles_1.0.tar.gz here]. In the Web Files tarball, the following files are included:
* The /web folder contains the CSS style sheets, images, and a distribution of the jQuery and jQuery UI JavaScript libraries that are used by the CGI applications. Copy the contents of this folder to a non-CGI directory such as public_html. Set WEB_PUB_DIR to this directory.
* The /IMAGES folder contains images required by showDetails_cgi. Copy this folder to the directory specified by WEB_DATA_LINK.
* The /structdata folder contains genomic structure and annotation data files for Circos. Copy the contents of this folder to a directory for which Circos has sufficient permissions. Set WEB_SDATA_DIR to this directory.

<center>[[#top|Top]]</center>

==Troubleshooting==
Here are some common issues when installing FusionSeq and the associated libraries:

* libraries compiled for different architectures:
:: Make sure you installed and configured all libraries for the same architecture. For example, if you have a 64-bit machine, use the flag CFLAGS=-m64 in the configure command.
* /usr/bin/ld: cannot find -lpng (or -ljpeg)
:: This usually occurs when compiling the optional program gfr2images, which creates the schematic images of the connected exons between the two genes. You need to define the location of the libraries in the Makefile (see [[#Auxiliary modules|Auxiliary modules]]).
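
When the linker reports a missing library, the directory to put in the Makefile can be located by searching the usual library locations. The following is a minimal sketch; the function name find_lib_dir is hypothetical, and the candidate directories are passed as arguments (those in the usage line are the ones from the Makefile example, which may differ on your system).

```shell
#!/bin/sh
# Hypothetical helper: print the first directory among the arguments that
# contains a shared or static build of lib<name>; return 1 if none does.
# Usage: find_lib_dir png /usr/lib64 /usr/lib /usr/X11/lib
find_lib_dir() {
    lib=$1
    shift
    for dir in "$@"; do
        for f in "$dir/lib$lib".so* "$dir/lib$lib".a "$dir/lib$lib".dylib; do
            if [ -e "$f" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    return 1
}
```

The printed directory, prefixed with -L, is the value expected by the corresponding Makefile variable (for example PNGLIB or JPEGLIB).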

<center>[[#top|Top]]</center>

Installation and Configuration of FusionSeq

2011-05-06T23:55:49Z

Asboner: /* Installing GSL and GD libraries */

Installation and Configuration of FusionSeq

2011-05-06T23:51:22Z

Asboner: /* Installing and configuring BIOS */

FusionSeq Requirements

2011-05-06T22:14:50Z

Asboner: /* Scientific and bioinformatics libraries */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because of its modularity, different programs need specific libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools that one would need to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the binaries (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.
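
Whether the executables are reachable through the PATH can be checked from the shell. This is a minimal sketch using the standard command -v builtin; the function name missing_tools is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each named program that cannot be found on the
# PATH, and return 1 if at least one is missing.
# Usage: missing_tools bowtie bowtie-build blat twoBitToFa
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return $missing
}
```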

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS].
=====(versions 0.7.0 and later -- coming soon)=====
* Starting with version 0.7.0, two new libraries are required:
** [http://rnaseq.gersteinlab.org/doc/bios/ libbios], which replaces the old BIOS, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libbios-1.1.0.tar.gz here].
** [http://rnaseq.gersteinlab.org/doc/mrf/ libmrf] can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/libmrf-1.0.0.tar.gz here]
=====(versions up to 0.6.1)=====
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] library: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please ensure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and to plot the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be properly indexed to be used by bowtie. Please see the bowtie documentation for performing this operation. As an indication, you would need to run something like:
 $ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.
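
The concatenation described above can be sketched as follows. This is an illustration, not a FusionSeq script: the function name build_hg18_nh is hypothetical, and it assumes that chromFa.zip has been unpacked into one directory with one file per sequence, where haplotype files match *_hap*.fa and "random" files match *_random.fa. Please check the file names in your download before relying on these patterns.

```shell
#!/bin/sh
# Hypothetical helper: concatenate all chr*.fa files of a directory into one
# FASTA file, skipping the haplotype and "random" sequences.
# Usage: build_hg18_nh /path/to/chromFa /path/to/hg18_nh.fa
# The output file should not itself match chr*.fa inside the input directory.
build_hg18_nh() {
    src_dir=$1
    out=$2
    : > "$out"                                  # start from an empty file
    for fa in "$src_dir"/chr*.fa; do
        case $(basename "$fa") in
            *_random.fa|*_hap*.fa) continue ;;  # excluded sequences
        esac
        if [ -f "$fa" ]; then
            cat "$fa" >> "$out"
        fi
    done
}
```

The resulting file can then be passed to bowtie-build as in the command above.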

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.
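
The co-location requirement can be verified with a short check. This is a minimal sketch; the function name check_composite_model is hypothetical, and the directory argument stands in for the location where the two files were unpacked.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both composite-model files (the
# interval file and the sequence file) are present in the given directory.
# Usage: check_composite_model /path/to/transcript/Composite/Model
check_composite_model() {
    dir=$1
    base=knownGeneAnnotationTranscriptCompositeModel
    [ -f "$dir/$base.txt" ] && [ -f "$dir/$base.fa" ]
}
```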

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it should work nevertheless.

<center>[[#top|Top]]</center>

FusionSeq Requirements

2011-05-06T22:01:36Z

Asboner: /* Software Requirements */

{{FusionSeqHeader}}
==Software Requirements==
FusionSeq requires several additional packages to be installed in order to carry out the analysis and visualize the results. Because of its modularity, different programs need specific libraries. Some data sets are also required for the analysis (see [[#Data Requirements|Data Requirements]]). Here we describe the complete set of tools that one would need to run the analysis as we do in our lab. The modules should be installed in the listed order.

'''Note''': the following instructions apply if one wants to compile FusionSeq from the source code (all versions). Alternatively, one can download the binaries (version 0.7.0 and later).

===Alignment tools===
* [http://bowtie-bio.sourceforge.net/index.shtml bowtie] (64bit)
* [http://users.soe.ucsc.edu/~kent/src/ Blat (source)] [http://genome-test.cse.ucsc.edu/~kent/exe/ (binaries)]
Please make sure that blat and bowtie executables are part of the PATH, i.e. they can be accessed and executed from any location on your file system. Moreover, make sure that twoBitToFa is also downloaded from the blat package and part of the PATH.

===Scientific and bioinformatics libraries===
* [http://www.gnu.org/software/gsl/ GNU Scientific Library (GSL)]: this library is required for the compilation of [http://rnaseq.gersteinlab.org/doc/bios/ BIOS].
* [http://rnaseq.gersteinlab.org/doc/bios/ BIOS]: this library can be downloaded as part of [http://rseqtools.gersteinlab.org RSEQtools], a computational framework to analyze RNA-Seq data, or it can be downloaded as a separate component from [http://rnaseq.gersteinlab.org/fusionseq/tarballs/bios_0.9.0.tar.gz here].

Instructions to install [http://www.gnu.org/software/gsl/ GSL] and [http://rnaseq.gersteinlab.org/doc/bios/ BIOS] can be found in '''[[Installation and Configuration of FusionSeq]]'''. However, please ensure that you have read all the requirements (including [[#Data Requirements|Data Requirements]]) and downloaded all the libraries and packages needed.

===Drawing tools===
* [http://www.boutell.com/gd/ GD library]: The gd library is used to create schematic images of the PE reads connecting the two genes. It is required by [[FusionSeq_List_of_programs#gfr2images|gfr2images]], which is an optional component of FusionSeq.

===Data analysis===
* [http://root.cern.ch/drupal/ ROOT]: this is a very powerful mathematical and computational framework. In the context of FusionSeq, it is used to perform a Kolmogorov-Smirnov analysis for filtering the breakpoint junctions and to plot the insert-size distribution.

<center>[[#top|Top]]</center>

==Data Requirements==
Here is the list of required data for a comprehensive use of FusionSeq tools.

===External===
*[http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ Homo Sapiens Reference genome (hg18)]: the user should download both chromFa.zip and hg18.2bit.
The human genome needs to be properly indexed to be used by bowtie. Please see the bowtie documentation for performing this operation. As an indication, you would need to run something like:
 $ bowtie-build -f hg18_nh.fa /path2bowtieIndex/hg18_nh/hg18_nh
where '''hg18_nh.fa''' corresponds to the concatenation of all human chromosomes from chromFa.zip ''without'' the different haplotypes and the "random" sequences.

===Provided===
The following data sets (for hg18), bundled in a tarball, can be downloaded [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg18_1.1.tar.gz here (hg18)]. For hg19 see [[#Human genome GRCh37/hg19|below]].

* knownGeneAnnotationTranscriptCompositeModel.txt - the interval file with the coordinates of the composite models
* knownGeneAnnotationTranscriptCompositeModel.fa - the sequences of all the composite transcripts
* kgXref.txt - the mapping between the UCSC knownGene annotation set and other information (RefSeq, gene symbols and description etc.)
* knownToTreefam.txt - the mapping between UCSC knownGene annotation and TreeFam
* hg18_repeatMasker.interval - the interval file, i.e. the file with the coordinates, of the repetitive regions
* ribosomal.2bit - the ribosomal sequences in 2bit format

The composite model needs to be indexed by bowtie:
<pre>
$ bowtie-build -f knownGeneAnnotationTranscriptCompositeModel.fa \
    /path2bowtieIndex/hg18_knownGeneAnnotationTranscriptCompositeModel/hg18_knownGeneAnnotationTranscriptCompositeModel
</pre>
knownGeneAnnotationTranscriptCompositeModel.txt (the interval file) and knownGeneAnnotationTranscriptCompositeModel.fa (the sequences) should be located in the same directory.

Although we extensively used the UCSC knownGene annotation set, it is worth mentioning that it is possible to use other gene annotation sets. However, in this case, the same information, and in the same format, should be provided to the corresponding programs.

=====Human genome GRCh37/hg19=====
The corresponding version of these files for hg19 can be found [http://rnaseq.gersteinlab.org/fusionseq/tarballs/FusionSeq_Annotation_Data_hg19_1.0.tar.gz here (hg19)]. Please note that this data set has not been thoroughly tested yet, although it should work nevertheless.

<center>[[#top|Top]]</center>