RSEQtools/workflow

From GersteinInfo

Revision as of 11:57, 9 June 2010 by Infoadmin (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
RSEQtools Main Page

Contents


Example workflow

This section shows a simple workflow for processing a RNA-Seq data set. A sample data set that was originally published in PNAS is used for this example. Specifically, this sample data set (human embryonic stem cells) consists of 27 nucleotides single-end reads. The reads are aligned to the human reference genome (hg18) and to a splice junction library generated from the UCSC Known Genes annotation set using Bowtie.


#top

Prerequisites


  • Download the following data files:
    • chromFa.zip - Human reference genome (hg18), one file per chromosome.
    • hg18.2bit contains the complete hg18 Human Genome in the 2bit format.
    • knownGene.txt.gz - UCSC Known Genes annotation set.
    • knownIsoforms.txt.gz Links together various transcripts (UCSC Known Genes) of a gene into a cluster.
    • sample.fastq.gz - Sample data (hESC; 27nt; single-end; 5 million reads, FASTQ format).


  • Decompress the files
 gunzip knownGene.txt.gz knownIsoforms.txt.gz
 unzip chromFa.zip 


  • Convert knownGene.txt into the Interval format
 cut -f1,2,3,4,5,8,9,10 knownGene.txt > knownGene.interval


  • Generate the composite gene models
 mergeTranscripts knownIsoforms.txt knownGene.interval compositeModel > knownGene_composite.interval


  • Create a splice junction library based on the UCSC annotation set
 createSpliceJunctionLibrary hg18.2bit knownGene.interval 20 > knownGene_2x20_spliceJunctions.fa


  • Build bowtie index for the reference genome
 bowtie-build -f chr1.fa,chr10.fa,chr11.fa,chr12.fa,chr13.fa,chr14.fa,chr15.fa,chr16.fa,chr17.fa,chr18.fa,chr19.fa,
                 chr20.fa,chr21.fa,chr22.fa,chr2,chr3.fa,chr4.fa,chr5.fa,chr6.fa,chr7.fa,chr8.fa,chr9.fa,chrX.fa,chrY.fa hg18

  • Build bowtie index for the splice junction liabrary
 bowtie-build -f knownGene_2x20_spliceJunctions.fa knownGene_2x20_spliceJunctions


#top

Alignment

  • Bowtie alignment of the RNA-Seq reads to the human reference genome
 bowtie -q -p 4 hg18 sample.fastq sample_hg18.bowtie


  • Bowtie alignment of the RNA-Seq reads to the splice junction library
 bowtie -q -p 4 knownGene_2x20_spliceJunctions sample.fastq sample_splices.bowtie


#top

Conversion

  • Convert bowtie output to MRF
 bowtie2mrf genomic < sample_hg18.bowtie > sample_hg18.mrf    
 bowtie2mrf junctions < sample_junctions.bowtie > sample_junctions.mrf    


#top

Downstream analyses

  • Calculate gene expression values using the composite gene models
 cat sample_hg18.mrf sample_junctions.mrf | mrfQuantifier knownGene_composite.interval multipleOverlap > sample.geneExpression


  • Generate a signal track files in WIG format of the mapped reads
 cat sample_hg18.mrf sample_junctions.mrf | mrf2wig sample


  • Generate GFF files representing the splice junctions
 cat sample_hg18.mrf sample_junctions.mrf | mrf2gff sample


  • Determine whether there is a mapping bias
 cat sample_hg18.mrf sample_junctions.mrf | mrfMappingBias knownGene_composite.interval > sample.mappingBias 


  • Calculate the annotation coverage for the UCSC KnownGene composite gene models
 cat sample_hg18.mrf sample_junctions.mrf | wc -l
 Result: 4760475
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 500000  1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 1000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 1500000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 2000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 2500000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 3000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 3500000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 4000000 1 
 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 4500000 1 


Personal tools