RSEQtools/workflow
From GersteinInfo
Contents |
Example workflow
This section shows a simple workflow for processing a RNA-Seq data set. A sample data set that was originally published in PNAS is used for this example. Specifically, this sample data set (human embryonic stem cells) consists of 27 nucleotides single-end reads. The reads are aligned to the human reference genome (hg18) and to a splice junction library generated from the UCSC Known Genes annotation set using Bowtie.
Prerequisites
- Software download:
- Download the following data files:
- chromFa.zip - Human reference genome (hg18), one file per chromosome.
- hg18.2bit contains the complete hg18 Human Genome in the 2bit format.
- knownGene.txt.gz - UCSC Known Genes annotation set.
- knownIsoforms.txt.gz Links together various transcripts (UCSC Known Genes) of a gene into a cluster.
- sample.fastq.gz - Sample data (hESC; 27nt; single-end; 5 million reads, FASTQ format).
- Decompress the files
gunzip knownGene.txt.gz knownIsoforms.txt.gz unzip chromFa.zip
- Convert knownGene.txt into the Interval format
cut -f1,2,3,4,5,8,9,10 knownGene.txt > knownGene.interval
- Generate the composite gene models
mergeTranscripts knownIsoforms.txt knownGene.interval compositeModel > knownGene_composite.interval
- Create a splice junction library based on the UCSC annotation set
createSpliceJunctionLibrary hg18.2bit knownGene.interval 20 > knownGene_2x20_spliceJunctions.fa
- Build bowtie index for the reference genome
bowtie-build -f chr1.fa,chr10.fa,chr11.fa,chr12.fa,chr13.fa,chr14.fa,chr15.fa,chr16.fa,chr17.fa,chr18.fa,chr19.fa, chr20.fa,chr21.fa,chr22.fa,chr2,chr3.fa,chr4.fa,chr5.fa,chr6.fa,chr7.fa,chr8.fa,chr9.fa,chrX.fa,chrY.fa hg18
- Build bowtie index for the splice junction liabrary
bowtie-build -f knownGene_2x20_spliceJunctions.fa knownGene_2x20_spliceJunctions
Alignment
- Bowtie alignment of the RNA-Seq reads to the human reference genome
bowtie -q -p 4 hg18 sample.fastq sample_hg18.bowtie
- Bowtie alignment of the RNA-Seq reads to the splice junction library
bowtie -q -p 4 knownGene_2x20_spliceJunctions sample.fastq sample_splices.bowtie
Conversion
- Convert bowtie output to MRF
bowtie2mrf genomic < sample_hg18.bowtie > sample_hg18.mrf bowtie2mrf junctions < sample_junctions.bowtie > sample_junctions.mrf
Downstream analyses
- Calculate gene expression values using the composite gene models
cat sample_hg18.mrf sample_junctions.mrf | mrfQuantifier knownGene_composite.interval multipleOverlap > sample.geneExpression
- Generate a signal track files in WIG format of the mapped reads
cat sample_hg18.mrf sample_junctions.mrf | mrf2wig sample
- Generate GFF files representing the splice junctions
cat sample_hg18.mrf sample_junctions.mrf | mrf2gff sample
- Determine whether there is a mapping bias
cat sample_hg18.mrf sample_junctions.mrf | mrfMappingBias knownGene_composite.interval > sample.mappingBias
- Calculate the annotation coverage for the UCSC KnownGene composite gene models
cat sample_hg18.mrf sample_junctions.mrf | wc -l Result: 4760475
cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 500000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 1000000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 1500000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 2000000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 2500000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 3000000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 3500000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 4000000 1 cat sample_hg18.mrf sample_junctions.mrf | mrfAnnotationCoverage knownGene_composite.interval 4760475 4500000 1