RSEQtools/workflow

From GersteinInfo
Revision as of 11:57, 9 June 2010 by Infoadmin (talk | contribs) (Created page with '<center>[http://archive.gersteinlab.org/proj/rnaseq/rseqtools '''RSEQtools Main Page''']</center> __TOC__ == Example workflow == This section shows a simple workflow for proce…')
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.
RSEQtools Main Page

Example workflow

This section shows a simple workflow for processing a RNA-Seq data set. A sample data set that was originally published in PNAS is used for this example. Specifically, this sample data set (human embryonic stem cells) consists of 27 nucleotides single-end reads. The reads are aligned to the human reference genome (hg18) and to a splice junction library generated from the UCSC Known Genes annotation set using Bowtie.


#top

Prerequisites


  • Download the following data files:
    • chromFa.zip - Human reference genome (hg18), one file per chromosome.
    • hg18.2bit contains the complete hg18 Human Genome in the 2bit format.
    • knownGene.txt.gz - UCSC Known Genes annotation set.
    • knownIsoforms.txt.gz Links together various transcripts (UCSC Known Genes) of a gene into a cluster.
    • sample.fastq.gz - Sample data (hESC; 27nt; single-end; 5 million reads, FASTQ format).


  • Decompress the files
 gunzip knownGene.txt.gz knownIsoforms.txt.gz
 unzip chromFa.zip 


  • Convert knownGene.txt into the Interval format
 cut -f1,2,3,4,5,8,9,10 knownGene.txt > knownGene.interval


  • Generate the composite gene models
 mergeTranscripts knownIsoforms.txt knownGene.interval compositeModel > knownGene_composite.interval


  • Create a splice junction library based on the UCSC annotation set
 createSpliceJunctionLibrary hg18.2bit knownGene.interval 20 > knownGene_2x20_spliceJunctions.fa


  • Build bowtie index for the reference genome
 bowtie-build -f chr1.fa,chr10.fa,chr11.fa,chr12.fa,chr13.fa,chr14.fa,chr15.fa,chr16.fa,chr17.fa,chr18.fa,chr19.fa,
                 chr20.fa,chr21.fa,chr22.fa,chr2,chr3.fa,chr4.fa,chr5.fa,chr6.fa,chr7.fa,chr8.fa,chr9.fa,chrX.fa,chrY.fa hg18

  • Build bowtie index for the splice junction liabrary
 bowtie-build -f knownGene_2x20_spliceJunctions.fa knownGene_2x20_spliceJunctions


#top

Alignment

  • Bowtie alignment of the RNA-Seq reads to the human reference genome
 bowtie -q -p 4 hg18 sample.fastq sample_hg18.bowtie


  • Bowtie alignment of the RNA-Seq reads to the splice junction library
 bowtie -q -p 4 knownGene_2x20_spliceJunctions sample.fastq sample_splices.bowtie


#top

Conversion

  • Convert bowtie output to MRF
 bowtie2mrf genomic < sample_hg18.bowtie > sample_hg18.mrf    
 bowtie2mrf junctions < sample_junctions.bowtie > sample_junctions.mrf    


#top

Downstream analyses

  • Calculate gene expression values using the composite gene models
 cat sample_hg18.mrf sample_junctions.mrf | mrfQuantifier knownGene_composite.interval multipleOverlap > sample.geneExpression


  • Generate a signal track files in WIG format of the mapped reads
 cat sample_hg18.mrf sample_junctions.mrf | mrf2wig sample


  • Generate GFF files representing the splice junctions
 cat sample_hg18.mrf sample_junctions.mrf | mrf2gff sample


  • Determine whether there is a mapping bias
 cat sample_hg18.mrf sample_junctions.mrf | mrfMappingBias knownGene_composite.interval > sample.mappingBias 


  • Calculate the annotation coverage for the UCSC KnownGene composite gene models
 cat sample_hg18.mrf sample_junctions.mrf | wc -l
 Result: 4760475
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 500000  1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 1000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 1500000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 2000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 2500000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 3000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 3500000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 4000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 4500000 1