RSEQtools

From GersteinInfo

Jump to navigation Jump to search

RSEQtools Main Page

Introduction

The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify that person, raising significant privacy concerns. In order to address these issues we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confi-dential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools that uses this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. In addition to the anonymization afforded by this format it also facilitates the decoupling of the alignment of reads from downstream analyses.

Citation

Lukas Habegger*, Andrea Sboner*, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein. RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics 2010, doi: 10.1093/bioinformatics/btq643

Overview

The following sections provide documenation for the modules that are part of RSEQtools (http://rseqtools.gersteinlab.org/). This documentation is intended for the end-users and can also be found at http://info.gersteinlab.org/RSEQtools. RSEQtools is implemented in C and uses a general C library called BIOS.

The full documentation for developers can be found here:

RSEQtools: http://archive.gersteinlab.org/proj/rnaseq/doc/mrf/
BIOS: http://archive.gersteinlab.org/proj/rnaseq/doc/bios/

Data formats

Mapped Read Format (MRF)

The Mapped Read Format (MRF) flat file consists of three components and this format is closely associated with the software components of RSEQtools

1. Comment lines. Comment lines are optional and start with a '#' character.
2. Header line. The header line is required and specifies the type of each column.
3. Mapped reads. Each read (single-end or paired-end) is represented by on line.

Required column:

* AlignmentBlocks, each alignment block must contain the following attributes: TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd

Optional columns:

* Sequence
* QualityScores
* QueryId

Example file:

# Comments
# Required field: Blocks [TargetName:Strand:TargetStart:TargetEnd:QueryStart:QueryEnd]
# Optional fields: Sequence,QualityScores,QueryId
AlignmentBlocks
chr1:+:2001:2050:1:50
chr1:+:2001:2025:1:25,chr1:+:3001:3025:26:50
chr2:-:3001:3051:1:51|chr11:+:4001:4051:1:51
chr2:-:6021:6050:1:30,chr2:-:7031:7051:31:51|chr11:+:4001:4051:1:51
contigA:+:5001:5200:1:200,contigB:-:1200:1400:200:400

Notes:

* Paired-end reads are separated by ‘|’
* Alignment blocks are separated by ‘,’
* Features of a block are separated by ‘:’
* Columns are tab-delimited
* Columns can be arranged in any order
* Coordinates are one-based and closed (inclusive)

Use MRF for confidential data

It is straightforward to use MRF to separate the confidential information, i.e. the sequences, from the alignment data. The MRF file can be split in 2 files: one file can include AlignmentBlocks and QueryID, whereas a second file would can contain Sequence and QueryID. From a practical viewpoint it is also easy to create these two files.

Assuming we have the columns AlignmentBlocks, Sequence, and QueryID as column 1, 2, and 3, respectively:

$ cut -f1,3 file.mrf > alignments.mrf
$ cut -f2,3 file.mrf > sequences.mrf

alignments.mrf would contain the alignment data and the query ID; it can be freely shared since it does not include confidential information;
sequences.mrf would contain the sequence data; which, potentially, could be used to identify an individual and thus may be subjected to more stringent rules.

Interval

The Interval format consists of eight tab-delimited columns and is used to represent genomic intervals such as genes. This format is closely associated with the intervalFind module, which is part of BIOS. This module efficiently finds intervals that overlap with a query interval. The underlying algorithm is based on containment sublists: Alekseyenko, A.V., Lee, C.J. "Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases" Bioinformatics 2007;23:1386-1393 [1].

1.   Name of the interval
2.   Chromosome 
3.   Strand
4.   Interval start (with respect to the "+")
5.   Interval end (with respect to the "+")
6.   Number of sub-intervals
7.   Sub-interval starts (with respect to the "+", comma-delimited)
8.   Sub-interval end (with respect to the "+", comma-delimited)

Example file:

uc001aaw.1      chr1    +       357521  358460  1       357521  358460
uc001aax.1      chr1    +       410068  411702  3       410068,410854,411258    410159,411121,411702
uc001aay.1      chr1    -       552622  554252  3       552622,553203,554161    553066,553466,554252
uc001aaz.1      chr1    +       556324  557910  1       556324  557910
uc001aba.1      chr1    +       558011  558705  1       558011  558705

In this example the intervals represent a transcripts, while the sub-intervals denote exons.

Note: the coordinates in the Interval format are zero-based and the end coordinate is not included.

BED

The BED format is used to represent contiguous genomic regions. It consists of three required columns (tab-delimited).

1.   Chromosome
2.   Start
3.   End

Example file:

chr1  1000  5000
chr3  500   600
chrX  4000  4250

Full documentation can be found at UCSC

Note: the coordinates in the BED format are zero-based and the end coordinate is not included.

BedGraph

The BedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state. It consists of four required columns (tab-delimited).

1.   Chromosome
2.   Start
3.   End 
4.   Value

Example file:

chr1  1000   5000   12.3 
chr3  500     600      3
chrX  4000  4250    54

Full documentation can be found at UCSC

Note: the coordinates in the BedGraph format are zero-based and the end coordinate is not included.

WIG

The WIG format is used to represent dense and continuous genomic data. There are two options for formatting wiggle data: variableStep and fixedStep.

In the context of RSEQtools, the variable step formatting is used and only positions with non-zero values are represented.

Example file:

track type=wiggle_0 name="test_chr22"
variableStep chrom=chr22 span=1
17535712	1.67
17535713	1.67
17535714	1.67
17535715	1.67
17535716	1.67

Full documentation can be found at UCSC

Note: the coordinates in the WIG format are zero-based.

GFF

The GFF format is used to describe genes and other features. It consists of nine tab-delimited columns.

1. Name 
2. Source 
3. Feature
4. Start
5. End 
6. Score
7. Strand
8. Frame
9. Group

Example file:

browser hide all
track name="chr11" visibility=2
chr11	MRF	feature	46772115	46772161	.	-	.	TG5
chr11	MRF	feature	46772668	46772695	.	-	.	TG5
chr11	MRF	feature	118521207	118521252	.	+	.	TG21
chr11	MRF	feature	118526315	118526343	.	+	.	TG21

Full documentation can be found at UCSC

Note: the coordinates in the BED format are one-based and the end coordinate is included.

PSL

The PSL format represents alignments from the BLAT alignment program.

Full documentation can be found at UCSC

SAM

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Full documentation can be found at SAMtools

List of programs

This is the documentation for the end-users. The full documentation for developers can be found here.

Format conversion utilities

The following programs convert the output from various alignment programs into MRF.

bowtie2mrf

bowtie2mrf converts read alignments from Bowtie into MRF.

Usage:

bowtie2mrf <genomic|junctions|paired> [-sequence] [-qualityScores] [-IDs]

Inputs: Takes Bowtie output from STDIN
Outputs: Reports MRF to STDOUT
Required arguments
- genomic - convert single-end reads that were aligned against a genomic reference sequence using Bowtie
- junctions - convert single-end reads that were aligned against a splice junction library (generated by createSpliceJunctionLibrary) using Bowtie
- paired - convert paired-end reads that were aligned using Bowtie
Optional arguments
- sequence - include the read sequence in the MRF output
- qualityScores - include the quality scores of the read in the MRF output
- IDs - include the read IDs in the MRF output

Note: bowtie2mrf assumes that bowtie was run using the default option for the -B parameter

-B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)

Note: If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary. Otherwise, bowtie2mrf will not be able to convert the splice junction coordinates correctly.

psl2mrf

psl2mrf converts read alignments from BLAT into MRF.

Usage:

psl2mrf

Inputs: Takes BLAT alignments in PSL format from STDIN
Outputs: Reports MRF to STDOUT
Required arguments
- None
Optional arguments
- None

singleExport2mrf

singleExport2mrf converts single-end read alignments from ELAND (export file) into MRF.

Usage:

singleExport2mrf

Inputs: Takes ELAND single-end alignments in export format from STDIN
Outputs: Reports MRF to STDOUT
Required arguments
- None
Optional arguments
- None

Note: If a splice junction library is used during the alignment step, it is important that the splice junction library was generated by createSpliceJunctionLibrary and that its file name included 'splice' or 'junction'. Otherwise, singleExport2mrf will not be able to convert the splice junction coordinates correctly.
The output includes sequences and quality scores. If one wants the alignment only:

singleExport2mrf < file.export.txt | cut -f1 > file.mrf

mrfSorter

mrfSorter sort each MRF line according to its coordinate format, i.e. the left-most coordinate would be reported first, regardless of strand.

Usage:

mrfSorter

Inputs: Takes MRF format from STDIN
Outputs: Reports MRF to STDOUT
Required arguments
- None
Optional arguments
- None

Example:

 mrfSorter <file.mrf > file.sorted.mrf

sam2mrf

sam2mrf converts SAM format into MRF

Usage:

sam2mrf

Inputs: Takes SAM format from STDIN
Outputs: Reports MRF to STDOUT
Required arguments
- None
Optional arguments
- None

Please note that for paired-end data, sam2mrf requires the mate pairs to be on subsequent lines. You may want to sort the SAM file first. Please note that you would need to run mrfSorter after sam2mrf to make a valid MRF file.

Example:

samtools sort file.sam file.sorted; sam2mrf < file.sorted.sam | mrfSorter > file.mrf

Genome annotation tools

The following tools are helpful in manipulating annotation files.

createSpliceJunctionLibrary

This program is used to create a splice junction library from an annotation set. It creates all pair-wise splice junctions within a transcript.

Usage:

createSpliceJunctionLibrary <file.2bit> <file.annotation> <sizeExonOverlap>

Inputs: None
Outputs: Reports the slice junctions in FASTA format
Required arguments
- file.2bit - genome reference sequence in 2bit format
- file.annotation - annotation set in Interval format (each line represents one transcript)
- sizeExonOverlap - defines the number of nucleotides included from each exon
Optional arguments
- None

Example output:

>chr1|12162|12612|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGG
>chr1|12162|13220|65
AGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG
>chr1|12656|13220|65
CAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGG

The identifier for each splice junction consists of four items:

1.   Chromosome
2.   Start position (with respect to the "+", zero-based) of the splice junction within the first exon  
3.   Start position (with respect to the "+", zero-based) of the splice junction within the second exon  
4.   Size of the exon overlap

Note: Internally the program uses twoBitToFa (part of BLAT package). Thus, the executable must be in the PATH.

mergeTranscripts

Module to merge a set of transcripts from the same gene.

Obtain unique exons from various transcript isoforms based on:

longest isoform
composite model (union of the exons from the different transcript isoforms)
intersection (intersection of the exons of the different transcript isoforms)

Usage:

mergeTranscripts <knownIsoforms.txt> <file.annotation> <longestIsoform|compositeModel|intersection>

Inputs: None
Outputs: Reports a new annotation set of merged transcripts in Interval format
Required arguments
- knownIsoforms.txt - file that determines which transcript isoforms belong together (see format below)
- file.annotation - annotation set in Interval format (each line represents one transcript)
- < longestIsoform | compositeModel | intersection > - determines how transcript isoforms are selected/merged:
Optional arguments
- None

The file knownIsoforms.txt should have two columns (tab-delimited) and no header:

1. ID (int). Transcripts with the same id belong to the same gene.
2. Name of the transcript.

Example:

1  uc009vip.1
1  uc001aaa.2
2  uc009vis.1
2  uc001aae.2
2  uc009viu.1
2  uc009vit.1

interval2sequences

Module to retrieve genomic/exonic sequences for an annotation set.

Usage:

interval2sequences <file.2bit> <file.annotation> <exonic|genomic>

Inputs: None
Outputs: Reports the extracted sequences in FASTA format
Required arguments
- file.2bit - genome reference sequence in 2bit format
- file.annotation - annotation set in Interval format (each line represents one transcript)
- < exonic | genomic > - exonic means that only the exonic regions are extracted, while genomic indicates that the intronic sequences are extracted as well
Optional arguments
- None

Gene expression analysis

mrfQuantifier

Module to calculate expression values (RPKM). Given a set of mapped reads in MRF and an annotation set (representing exons, transcripts, or gene models) mrfQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

Usage:

mrfQuantifier <file.annotation> <singleOverlap|multipleOverlap>

Inputs: Takes MRF from STDIN.
Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
Required arguments
- file.annotation - annotation set in Interval format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See Interval for more details.
- < singleOverlap | multipleOverlap > - singleOverlap: reads that overlap with multiple annotated features are ignored; multipleOverlap: reads that overlap with multiple annotated features are counted multiple times.
Optional arguments
- None

Note: All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure below). Therefore, the normalization is also done at the nucleotide level.

Determining overlaps between annotation entries and reads

bgrQuantifier

Module to calculate expression values (RPKM) from a signal track in bedGraph (bgr) format. Given a signal track and an annotation set (representing exons, transcripts, or gene models) bgrQuantifier calculates an expression value for each annotation entry. This is done by counting all the nucleotides from the reads that overlap with a given annotation entry. Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

Usage:

bgrQuantifier <file.annotation>

Inputs: Takes BedGraph file from STDIN.
Outputs: Reports the gene expression values to STDOUT in a two-column format (tab-delimited). The first column refers to the name of the annotated feature. The second column refers to the expression values (RPKM; read coverage normalized per million mapped nucleotides and the length of the annotation model per kb [see note below]). The output is sorted by the first column.
Required arguments
- file.annotation - annotation set in Interval format (gene expression measurements: one line per gene model; exon expression measurements: one line per exon). See Interval for more details.
Optional arguments
- None

Note: All counts are performed at the nucleotide level. For example, if a read partially overlaps with an exon of a gene model, then only the overlapping nucleotides are counted (please refer to the figure below). Therefore, the normalization is also done at the nucleotide level.

Visualization tools

The following programs are useful for converting MRF into data formats that can be viewed in a genome browser.

mrf2wig

Generates signal track (WIG) of mapped reads from a MRF file. By default, the values in the WIG file are normalized by the total number of mapped reads per million. Only positions with non-zero values are reported.

Usage:

mrf2wig <prefix> [doNotNormalize]

Inputs: Takes MRF from STDIN.
Outputs: Generates a WIG file for each chromosome occurring in the MRF input file.
Required arguments
- prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.wig
Optional arguments
- doNotNormailze - the counts are NOT normalized

mrf2gff

Generates a GFF file of mapped splice junction reads from a MRF file.

Usage

mrf2gff <prefix>

Inputs: Takes MRF from STDIN.
Outputs: Generates a GFF file for each chromosome occurring in the MRF input file.
Required arguments
- prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.gff
Optional arguments
- None

mrf2bgr

Module to convert MRF to BedGraph. Generates a BedGraph, where the counts are normalized by the total number of mapped reads per million.

Usage:

mrf2bgr <prefix> [doNotNormalize]

Inputs: Takes MRF from STDIN.
Outputs: Generates a BedGraph file for each chromosome occurring in the MRF input file.
Required arguments
- prefix - specifies the prefix used to generate the output files. The following naming convention is used: prefix_chrXXX.bgr
Optional arguments
- doNotNormailze - the counts are NOT normalized.

Segmentation of mapped reads

wigSegmenter

Module to segment a WIG signal track using the maxGap-minRun algorithm. The output is a set of transcriptionally active regions (TARs) in BED format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

Usage:

wigSegmenter <wigPrefix> <threshold> <maxGap> <minRun>

Inputs: None
Outputs: Generates a BED file for each chromosome occurring in the MRF input file.
Required arguments
- wigPrefix - prefix used to generate the WIG files using mrf2wig
- threshold - level at which the segmentation is performed
- maxGap - maximum number of consecutive positions that can have values less than the threshold
- minRun - minimal length of a TAR
Optional arguments
- None

bgrSegmenter

Module to segment a BedGraph signal track using the maxGap-minRun algorithm. The output is a set of transcriptionally active regions (TARs) in BED format. This type of analysis is particularly useful in identifying novel transcribed regions such as non-coding RNAs.

Usage:

bgrSegmenter <bgrPrefix> <threshold> <maxGap> <minRun>

Inputs: None
Outputs: Generates a BED file for each chromosome occurring in the BedGraph input file.
Required arguments
- bgrPrefix - prefix used to generate the BedGraph files using mrf2bgr
- threshold - level at which the segmentation is performed
- maxGap - maximum number of consecutive positions that can have values less than the threshold
- minRun - minimal length of a TAR
Optional arguments
- None

Annotation statistics tools

The following modules are useful for calculating annotation statistics given a set of mapped reads.

mrfAnnotationCoverage

Module to calculate annotation coverage. Sample a set of mapped reads and determine the fraction of transcripts (specified in file.annotation) that have at least <coverageFactor>-times uniform coverage.

Usage:

mrfAnnotationCoverage <file.annotation> <numTotalReads> <numReadsToSample> <coverageFactor>

Inputs: Takes MRF from STDIN
Outputs: Reports the fraction of transcripts that have at least <coverageFactor>-times uniform coverage to STDOUT
Required arguments
- file.annotation -
- numTotalReads - total number of reads in the MRF input file
- numReadsToSample - number of reads to sample from the MRF input file
- coverageFactor - minimum level of uniform coverage required across a transcript
Optional arguments
- None

mrfMappingBias

Module to calculate mapping bias for a given annotation set. Aggregates mapped reads that overlap with transcripts (specified in file.annotation) and outputs the counts over a standardized transcript (divided into 100 equally sized bins) where 0 represents the 5' end of the transcript and 1 denotes the 3' end of the transcripts. This analysis is done in a strand specific way.

Usage:

mrfMappingBias <file.annotation>

Inputs: Takes MRF from STDIN
Outputs: Outputs the number of mapped reads for each bin of the standardized transcript to STDOUT
Required arguments
- file.annotation - annotation set in Interval format (each line represents one transcript)
Optional arguments
- None

MRF selection utilities

The following utilities are helpful to select subsets of an MRF file. It should be noted that these utilities operate on existing MRF files.

mrfSampler

Randomly select a subset of MRF entries.

Usage:

mrfSampler <proportionOfReadsToSample>

Inputs: Takes MRF from STDIN
Outputs: Outputs MRF to STDOUT
Required arguments
- proportionOfReadsToSample - fraction of reads to sample (on average)
Optional arguments
- None

mrfSelectRegion

Select reads that overlap with a specified genomic region.

Usage:

mrfSelectRegion <targetName:targetStart-targetEnd>

Inputs: Takes MRF from STDIN
Outputs: Outputs MRF to STDOUT
Required arguments
- targetName:targetStart-targetEnd - region of interest
Optional arguments
- None

mrfSelectSpliced

Select reads that span a splice junction.

Usage:

mrfSelectSpliced

Inputs: Takes MRF from STDIN
Outputs: Outputs MRF to STDOUT
Required arguments
- None
Optional arguments
- None

Note: The purpose of mrfSelectSpliced is to extract mapped reads that align to a splice junction from an existing MRF file. It is important to note that this utility is not used to convert the output of a specific mapping program.

mrfSubsetByTargetName

Split up an MRF file by chromosome.

Usage:

mrfSubsetByTargetName <prefix>

Inputs: Takes #MRF\MRF from STDIN
Outputs: Outputs a separate MRF file for each chromosome using the following naming convention: <prefix>_chrXXX.mrf.
Required arguments
- prefix - prefix used for the output files.
Optional arguments
- None

mrfSelectAnnotated

Module to select a subset of reads that overlap with a specified annotation set.

Usage:

mrfSelectAnnotated <file.annotation> <include|exclude>

Inputs: Takes MRF from STDIN
Outputs: Outputs MRF to STDOUT
Required arguments
- file.annotation - annotation set in Interval format (one transcript per line).
- < include | exclude > - include: report reads that overlap with exonic regions of the annotation set; exclude: report reads that do not overlap with exonic regions of the annotation set
Optional arguments
- None

mrfRegionCount

Module to count the total number of reads in a specified region.

Usage:

 mrfRegionCount <targetName:targetStart-targetEnd>

Inputs: Takes MRF from STDIN
Outputs: Outputs the total number of reads in a specified region to STDOUT.
Required arguments
- targetName:targetStart-targetEnd, specifies the region of interest
Optional arguments
- None

Auxiliary utilities

This section includes various data format conversion utilities.

bed2interval

Utility to convert BED format into Interval format.

Usage:

bed2interval

Inputs: Takes data in BED format from STDIN
Outputs: Outputs data in Interval format to STDOUT
Required arguments
- None
Optional arguments
- None

interval2bed

Utility to convert Interval format into BED format.

Usage:

interval2bed

Inputs: Takes data in Interval format from STDIN
Outputs: Outputs data in BED format to STDOUT
Required arguments
- None
Optional arguments
- None

interval2gff

Utility to convert Interval format into GFF format.

Usage:

interval2gff <trackName>

Inputs: Takes data in Interval format from STDIN
Outputs: Outputs data in GFF format to STDOUT
Required arguments
- trackName - track name used in the GFF file
Optional arguments
- None

gff2interval

Utility to convert GFF format into Interval format.

Usage:

gff2interval

Inputs: Takes data in GFF format from STDIN
Outputs: Outputs data in Interval format to STDOUT
Required arguments
- None
Optional arguments
- None

export2fastq

Module to generate FASTQ sequences from an ELAND export file.

Usage:

export2fastq

Inputs: Takes an ELAND export file from STDIN
Outputs: Reports the extracted sequences in FASTQ format to STDOUT
Required arguments
- None
Optional arguments
- None

mrf2sam

Module to convert MRF to SAM.

Usage:

mrf2sam

Inputs: Takes MRF from STDIN
Outputs: Outputs SAM to STDOUT
Required arguments
- None
Optional arguments
- None

Retrieved from "https://info.gersteinlab.org/index.php?title=RSEQtools&oldid=2354"

Navigation menu