FusionSeq List of programs

From GersteinInfo

Revision as of 01:55, 21 August 2010 by Asboner (Talk | contribs)

Data formats

FusionSeq uses a few data formats to perform its operations.

Mapped Read Format (MRF)

This format is defined in the context of RSEQtools. More details can be found here.

Gene Fusion Report (GFR)

This file format defines the relevant information for each fusion transcript candidate. The rationale is that, starting from an initial set of candidates, different filters can be applied to exclude artificial fusions (“false positives”). We also provide a parser that interprets this format, so that any change to the format can be propagated easily. For a given fusion candidate involving gene A and gene B, the basic GFR format requires the following fields:

  1. ID: the ID of the fusion candidate: typically it contains the sample name and a unique number separated by an underscore. The number is padded with zeros for consistency;
  2. SPER, DASPER and RESPER: scoring of the fusion candidate;
  3. Number of inter-transcript reads, i.e. the number of pairs having the ends mapped to the two genes;
  4. P-value of the insert size distribution analysis for the fusion transcript. Since we do not know the actual composition of the fusion transcript, we computed the p-value for both directions: AB (where gene A is upstream of gene B) and BA (where gene B is upstream of gene A);
  5. Number of intra-transcript reads for gene A and gene B, respectively, i.e. the number of pairs where both ends map to the same gene;
  6. The type of the fusion: cis, when both genes are on the same chromosome, or trans, otherwise;
  7. Name(s) of the transcripts: all the UCSC gene IDs of the isoforms of each gene in the annotation separated by the pipe symbol '|';
  8. Chromosome of the genes;
  9. Strand information;
  10. Start and end coordinates of the longest transcript for both genes;
  11. Number of exons in the composite model for both genes;
  12. Coordinates of the exons in the composite model: each exon is separated by the pipe symbol '|' and start and end coordinates are comma-separated;
  13. Exon-pair count: it describes which exons are connected and the number of inter-reads;
  14. Inter-reads: the exon and the coordinates of the reads that join the two genes. Exon number, start and end coordinates are reported as comma-separated, with the pipe symbol '|' separating the different pairs;
  15. Reads of the transcripts: the actual sequence of all the inter-reads.

The GFR format can include additional optional information computed in the subsequent processing. For example, it is possible to add gene symbols and descriptions from the UCSC knownGene annotation set.
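
The pipe- and comma-separated fields described in items 12 and 14 can be unpacked with a few lines of code. The following is a minimal Python sketch, not the parser distributed with FusionSeq: the function names and the sample strings are hypothetical, and it assumes that exon numbers and coordinates are integers.

```python
def parse_exon_coordinates(field):
    """Split 'start,end|start,end|...' into a list of (start, end) tuples."""
    exons = []
    for item in field.split("|"):
        start, end = item.split(",")
        exons.append((int(start), int(end)))
    return exons


def parse_inter_reads(field):
    """Split 'exon,start,end|exon,start,end|...' into (exon, start, end) tuples."""
    reads = []
    for item in field.split("|"):
        exon, start, end = item.split(",")
        reads.append((int(exon), int(start), int(end)))
    return reads


# Made-up values, for illustration only
print(parse_exon_coordinates("100,200|300,450"))
print(parse_inter_reads("1,150,200|2,300,350"))
```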


Breakpoint data format (BP)

Similarly to GFR, the junction-sequence identifier uses a standard format to capture the results of its analysis. For each tile that has at least one read aligned to it, it reports the following comma-separated fields:

  1. chromosome, start and end coordinates of the first tile, using UCSC notation: “chr:start-end”, although the intervals are 1-based and closed;
  2. chromosome, start and end coordinates of the second tile;
  3. All the sequences of the reads mapped to that tile with the offset information, separated by the pipe symbol.

For example, one line may read as:

chr21:38764851-38764892,chr21:41758661-41758702,31:GTAGAATCATTCATTTCATTCTTGCAAACCAGCCTGCTTGGCCAGGAGGCA|30:TGTAGAATCATTCATTTCATTCTTGCAAACCAGCCTGCTTGGCCAGGAGGC

where two reads support this specific junction.
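
As an illustration, the example line above can be split into its components as follows. This is a minimal Python sketch under the assumption, inferred from the example, that each read is written as "offset:sequence"; the function names are hypothetical and are not part of FusionSeq.

```python
def parse_tile(tile):
    """Parse 'chr:start-end' into (chromosome, start, end)."""
    chromosome, interval = tile.split(":")
    start, end = interval.split("-")
    return chromosome, int(start), int(end)


def parse_bp_line(line):
    """Parse one BP line into its two tiles and a list of (offset, sequence)."""
    first, second, reads_field = line.strip().split(",")
    reads = []
    for entry in reads_field.split("|"):
        # Assumption: the number before the colon is the offset of the read
        offset, sequence = entry.split(":")
        reads.append((int(offset), sequence))
    return parse_tile(first), parse_tile(second), reads


line = ("chr21:38764851-38764892,chr21:41758661-41758702,"
        "31:GTAGAATCATTCATTTCATTCTTGCAAACCAGCCTGCTTGGCCAGGAGGCA|"
        "30:TGTAGAATCATTCATTTCATTCTTGCAAACCAGCCTGCTTGGCCAGGAGGC")
tile_a, tile_b, reads = parse_bp_line(line)
print(tile_a, tile_b, len(reads))
```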


Core programs

Fusion detection module

geneFusions

geneFusions identifies potential fusion transcript candidates from an alignment file.

Usage:

geneFusions prefix minNumberOfReads < sample.mrf > fusions.gfr
  • Inputs: Takes an MRF file from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • prefix - the main ID of each candidate, i.e. prefix_0001, prefix_0002, etc.
    • minNumberOfReads - the minimum number of reads required to include a candidate
  • Optional arguments
    • none
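
The ID scheme described for the prefix argument (prefix, underscore, zero-padded number) can be reproduced as in the sketch below. The function name is hypothetical, and the padding width of 4 is only inferred from the example prefix_0001; geneFusions may use a different width.

```python
def candidate_id(prefix, number, width=4):
    """Build an ID such as 'prefix_0001'; the width of 4 is an assumption."""
    return f"{prefix}_{number:0{width}d}"


print(candidate_id("prefix", 1))  # prefix_0001
print(candidate_id("prefix", 2))  # prefix_0002
```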

Filtration cascade

Mis-alignment filters

gfrLargeScaleHomologyFilter

It removes potential fusion transcript candidates if the two genes are paralogs. It uses TreeFam to establish whether two genes have similar sequences.

Usage:

gfrLargeScaleHomologyFilter < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT

gfrSmallScaleHomologyFilter

It removes candidates that show high similarity between small regions within the two genes, where the reads actually map.

Usage:

gfrSmallScaleHomologyFilter < fileIN.gfr > fileOUT.gfr 
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT

gfrRepeatMaskerFilter

Some reads may be aligned to repetitive regions of the genome because of the low sequence complexity of those regions, which may result in artificial fusion candidates. This filter removes the reads mapped to those regions. If the number of reads left is less than a threshold, the candidate is removed.

Usage:

gfrRepeatMaskerFilter repeatMasker.interval minNumberOfReads < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • repeatMasker.interval - the interval file with the coordinates of the repetitive regions
    • minNumberOfReads - minimum number of reads overlapping the repetitive regions in order to remove the candidate

Random pairing of transcript fragments

gfrAbnormalInsertSizeFilter

gfrAbnormalInsertSizeFilter removes candidates with an insert-size bigger than the normal insert-size. The fusion candidate insert-size is computed on the minimal fusion transcript fragment.

Usage:

gfrAbnormalInsertSizeFilter pvalueCutOff < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • pvalueCutOff - the p-value threshold above which we keep the fusion transcript candidates

Combination of mis-alignment and random pairing

gfrRibosomalFilter

gfrRibosomalFilter removes candidates that have similarity with ribosomal genes. The rationale is that reads coming from highly expressed genes, such as ribosomal genes, are more likely to be mis-aligned and assigned to a different gene.

Usage:

gfrRibosomalFilter < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT

gfrExpressionConsistency

It removes candidates that do not have reads aligned to the corresponding genes.

Usage:

gfrExpressionConsistency < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT

Other filters

gfrPCRFilter

gfrPCRFilter removes candidates in which the same read is over-represented, yielding a “spike-in-like” signal, i.e. a narrow signal with a high peak.

Usage:

gfrPCRFilter offsetCutoff minNumUniqueRead < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • offsetCutoff - the minimum number of different starting positions
    • minNumUniqueRead - the minimum number of unique reads required to include a candidate
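
The two thresholds can be read as a pair of counts over the inter-reads of a candidate. The sketch below is one possible interpretation, not the actual implementation of gfrPCRFilter: the function name is hypothetical, and treating "unique reads" as distinct read sequences is an assumption.

```python
def passes_pcr_filter(reads, offset_cutoff, min_num_unique_reads):
    """Return True if a candidate would be kept.

    reads: list of (start_position, sequence) tuples for one candidate.
    Assumption: 'unique reads' means distinct read sequences.
    """
    distinct_starts = {start for start, _ in reads}
    unique_sequences = {sequence for _, sequence in reads}
    return (len(distinct_starts) >= offset_cutoff
            and len(unique_sequences) >= min_num_unique_reads)


# Made-up data: the same read repeated five times resembles a PCR artifact
spike = [(100, "ACGTACGT")] * 5
spread = [(100, "ACGTACGT"), (104, "ACGTTTGA"), (110, "GATTACAG")]
print(passes_pcr_filter(spike, 2, 2))   # False
print(passes_pcr_filter(spread, 2, 2))  # True
```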

gfrAnnotationConsistencyFilter

It removes candidates involving genes with a specific description, such as ribosomal genes, pseudogenes, etc.

Usage:

gfrAnnotationConsistencyFilter string < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • string - a string identifying the element to remove, ex. pseudogene

gfrProximityFilter

It removes candidates that are likely due to mis-annotation of the 5' or 3' ends of the genes.

Usage:

gfrProximityFilter offset < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • offset - the minimum distance (in nucleotides) between the two genes to keep the candidate
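
The distance test implied by the offset argument can be sketched as follows, using the gene coordinates and the cis/trans type stored in the GFR. This is only an illustration under stated assumptions: the function names are hypothetical, coordinates are taken as 1-based closed intervals, and always keeping trans candidates is a guess about how the filter treats genes on different chromosomes.

```python
def gene_distance(start_a, end_a, start_b, end_b):
    """Nucleotides between two closed intervals; 0 if they touch or overlap."""
    return max(0, max(start_a, start_b) - min(end_a, end_b) - 1)


def keep_candidate(fusion_type, start_a, end_a, start_b, end_b, offset):
    """Keep a candidate if its genes are at least 'offset' nucleotides apart."""
    if fusion_type != "cis":
        # Assumption: no distance is defined across chromosomes, so keep it
        return True
    return gene_distance(start_a, end_a, start_b, end_b) >= offset


# Made-up coordinates, for illustration only
print(gene_distance(100, 200, 300, 400))                 # 99
print(keep_candidate("cis", 100, 200, 300, 400, 1000))   # False
print(keep_candidate("trans", 100, 200, 300, 400, 1000)) # True
```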

gfrBlackListFilter

It removes candidates specified by the user in a file.

Usage:

gfrBlackListFilter blackList.txt < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • blackList.txt - the file with the candidates to remove. The format of this file is a simple two-column, tab-delimited list of the two gene symbols of each pair. For example:
LOC388160	LOC388161
LOC388161	LOC388161
LOC440498	LOC440498
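
A black list in this format can be loaded and queried as in the following Python sketch. The function names are hypothetical, and treating a pair as unordered (A/B equals B/A) is an assumption that may not match the behavior of gfrBlackListFilter.

```python
import io


def read_black_list(handle):
    """Read tab-delimited gene-symbol pairs into a set of unordered pairs."""
    pairs = set()
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        gene_a, gene_b = line.split("\t")
        # Assumption: the order of the two symbols does not matter
        pairs.add(frozenset((gene_a, gene_b)))
    return pairs


def is_black_listed(gene_a, gene_b, pairs):
    """Return True if the gene pair appears in the black list."""
    return frozenset((gene_a, gene_b)) in pairs


# The three example lines above, held in memory instead of blackList.txt
example = io.StringIO(
    "LOC388160\tLOC388161\n"
    "LOC388161\tLOC388161\n"
    "LOC440498\tLOC440498\n"
)
pairs = read_black_list(example)
print(is_black_listed("LOC388161", "LOC388160", pairs))
```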

gfrSpliceJunctionFilter

It removes candidates if the reads can be aligned to a splice junction library. NB: this filter should be used only if the alignment was not performed against a splice junction library.

Usage:

gfrSpliceJunctionFilter splice_junction_library < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • splice_junction_library - the splice_junction_library in 2bit format
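
Because each filter described above reads GFR from STDIN and reports GFR to STDOUT, several filters can be chained with shell pipes. The Python sketch below only assembles such a command line as a string; the helper name, the choice and order of filters, and the cutoff value 0.01 are illustrative assumptions, not recommended settings.

```python
import shlex


def build_pipeline(steps, input_path, output_path):
    """Join filter commands with pipes, reading input_path and writing output_path."""
    parts = [shlex.join(step) for step in steps]
    parts[0] += " < " + shlex.quote(input_path)    # first filter reads the file
    parts[-1] += " > " + shlex.quote(output_path)  # last filter writes the file
    return " | ".join(parts)


# Hypothetical selection of filters and arguments
steps = [
    ["gfrLargeScaleHomologyFilter"],
    ["gfrSmallScaleHomologyFilter"],
    ["gfrAbnormalInsertSizeFilter", "0.01"],
]
command = build_pipeline(steps, "fileIN.gfr", "fileOUT.gfr")
print(command)
```

The resulting string can be pasted into a shell; running it requires the FusionSeq programs to be installed and available on the PATH.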

Junction-sequence identification

Auxiliary modules

gfr2images

It generates a schematic illustration depicting which regions of the two genes are connected by PE reads.

Usage:

gfr2images < fileIN.gfr > fileOUT.gfr
  • Inputs: GFR from STDIN
  • Outputs: Reports GFR to STDOUT
  • Required arguments
    • none


... to be continued
