FusionSeq Test Datasets
From GersteinInfo
Latest revision as of 09:38, 7 August 2011
Two datasets are available to test FusionSeq: NCIH660 and GM12878 cell-line data. These datasets are part of the FusionSeq dataset published in Genome Biology, 2010;11:R104 (http://genomebiology.com/2010/11/10/R104/abstract). The full set, including cancer tissue samples, is available at dbGaP (accession phs000311.v1.p1, http://www.ncbi.nlm.nih.gov/gap?term=phs000311.v1.p1), where access is controlled to address confidentiality requirements. The cell-line data are provided here in the following formats:
Mapped Read Format (MRF)
This is the format required by FusionSeq. RSEQtools provides several conversion tools that generate MRF files from the output of the most popular alignment tools.
- http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.mrf.gz
Please read the 'How to execute FusionSeq' section for more details on how to use these files.
Auxiliary data
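To score the fusion candidates properly, gfrConfidenceValues requires an external meta file for each dataset:
- http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.meta
- http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.meta
NB: please make sure that the *.meta files are tab-delimited.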
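A quick check for the tab-delimited requirement can be sketched in Python. This is only an illustration: the layout of the meta file is not described on this page, so the sketch assumes nothing more than that every non-empty line should contain at least one tab. The function name and the sample lines are hypothetical.

```python
def lines_without_tabs(lines):
    """Return the 1-based numbers of non-empty lines that contain no tab."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        # Blank lines are ignored; any other line must contain a tab.
        if text.strip() and "\t" not in text:
            bad.append(number)
    return bad


# Hypothetical content, invented for illustration only.
sample = ["fieldA\tvalue1\n", "fieldB value2\n", "\n"]
print(lines_without_tabs(sample))  # [2]
```

To check a downloaded file, pass an open file handle instead of the list, for example `lines_without_tabs(open("GM12878.meta"))`. A common cause of failures is a text editor that silently replaces tabs with spaces.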
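The junction sequence identifier module aligns all reads against the junction library. All reads, including those that did not map, should be used in order to find as much support as possible for the breakpoint junction.
- http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878_allReads.txt.gz
- http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660_allReads.txt.gz
Please read the 'How to execute FusionSeq' section for more details on how to use these files.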
gfrBlackListFilter allows you to specify a list of candidates to be excluded. The tab-delimited file contains only the gene symbols of the two genes of each candidate, one pair per line, such as:
 LOC388160	LOC388161
 LOC388161	LOC388161
An example of a blacklist file can be downloaded from http://rnaseq.gersteinlab.org/fusionseq/datasets/blackList.txt.
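The blacklist layout described above (two tab-separated gene symbols per line) can be validated before it is handed to gfrBlackListFilter. The following Python sketch is not part of FusionSeq; the function name is hypothetical, and it keeps each pair in the order given because this page does not say whether the filter treats pairs as ordered.

```python
def parse_blacklist(lines):
    """Parse tab-delimited gene-symbol pairs into a set of (gene1, gene2) tuples."""
    pairs = set()
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text.strip():
            continue  # skip blank lines
        fields = text.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated gene symbols, "
                f"got {len(fields)}"
            )
        pairs.add((fields[0], fields[1]))
    return pairs


# The two example lines shown above, with an explicit tab between the symbols.
example = ["LOC388160\tLOC388161\n", "LOC388161\tLOC388161\n"]
print(sorted(parse_blacklist(example)))
```

A line separated by spaces instead of a tab raises a ValueError that names the offending line, which makes formatting mistakes easy to locate.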
FASTQ
FASTQ is a text-based format for storing both a biological sequence and its corresponding quality scores. Each tarball includes two FASTQ files, one for each end.
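As an illustration of the format, the Python sketch below reads FASTQ records and walks through the two ends of a paired-end dataset in parallel. It assumes the common four-lines-per-record layout (identifier, sequence, separator, qualities) with no wrapped sequence lines, and that both files list the reads in the same order. The read names and sequences in the example are invented.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    lines = iter(handle)
    while True:
        record = [line.rstrip("\r\n") for line in islice(lines, 4)]
        if not record:
            return  # end of input
        if (len(record) != 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed FASTQ record")
        identifier, sequence, _, quality = record
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield identifier[1:], sequence, quality


# Invented records standing in for the two files of one tarball.
end1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n")
end2 = io.StringIO("@read1/2\nTTGA\n+\nHHHH\n")
for (id1, seq1, _), (id2, seq2, _) in zip(read_fastq(end1), read_fastq(end2)):
    print(id1, seq1, id2, seq2)
```

With real data, replace the `io.StringIO` objects with open file handles for the two FASTQ files extracted from a tarball.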
BAM
BAM is the compressed binary version of the SAM (Sequence Alignment/Map) format. We provide both the BAM files and their corresponding index files (*.bai) so that they can be viewed with the Integrative Genomics Viewer (IGV), a high-performance visualization tool from the Broad Institute for interactive exploration of large, integrated datasets.
- http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.bam
- http://rnaseq.gersteinlab.org/fusionseq/datasets/GM12878.bam.bai
- http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.bam
- http://rnaseq.gersteinlab.org/fusionseq/datasets/NCIH660.bam.bai
You can download the files locally or load them into IGV directly. See instructions at http://www.broadinstitute.org/igv/.
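For local viewing, each BAM file needs its index file next to it under the same name plus the .bai suffix. The Python sketch below downloads both files for each cell line. It is a convenience sketch only: the helper names are hypothetical, and it assumes the URLs listed above are still reachable.

```python
import os
import urllib.request

BASE_URL = "http://rnaseq.gersteinlab.org/fusionseq/datasets/"
SAMPLES = ("GM12878", "NCIH660")


def bam_urls(sample):
    """Return the (BAM, index) URL pair for one cell line."""
    bam = f"{BASE_URL}{sample}.bam"
    return bam, bam + ".bai"


def download_bams(dest_dir="."):
    """Download every BAM file and its index into dest_dir (needs network access)."""
    os.makedirs(dest_dir, exist_ok=True)
    for sample in SAMPLES:
        for url in bam_urls(sample):
            # Keep the remote file name so the index stays paired with its BAM.
            target = os.path.join(dest_dir, url.rsplit("/", 1)[1])
            urllib.request.urlretrieve(url, target)


print(bam_urls("GM12878"))
```

Calling `download_bams("fusionseq_bam")` would place the four files in one directory, from which the BAM files can be opened in IGV.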