ACT Tool

From GersteinInfo


Revision as of 06:45, 27 June 2010

Aggregation features:

Python script [Aggregation], the main download, includes small example files and full documentation. It runs efficiently on large data sets.

Update: the script no longer requires numpy to run.

Other versions: zip file [Aggregation-old] containing other drafts of the code downloads (in Perl, C++, and Matlab). The Genomic Signal Aggregator website code [GSA] and its documentation are also available.

Example data for aggregation [agg-data]: data for the examples used in the ACT paper. For use with Agg.tar; see the walkthrough.

Prototype website [Web-ACT], with sample run files but limited to small data sets, and Genomic Signal Aggregator [Zlab-ACT], also for limited data sets but with some extra visualization features.

Gallery: for aggregation, contains an explanatory PowerPoint presentation and example figures generated using a variety of methods.




Here is some information on http://act.gersteinlab.org (Aggregation & Correlation Toolbox).


ACT can be used as a starting point for other downstream analyses. Aggregation can be used in conjunction with Genome Structure Correction to determine whether the enrichment of a given signal with respect to anchor points is significant, for example whether binding around a structural site is significant relative to the experiment's background signal level (ENCODE Project Consortium 2007). This correction takes into account the fact that a "random" distribution of anchors on the genome arises from a distinctly non-uniform distribution. Correlation analysis can also feed into downstream principal component analysis, allowing for grouping of coregulating factors with their coregulated sites; this would simply involve diagonalization of the correlation matrix output from ACT. Saturation analysis can be used to inform future experimental design.
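As a rough illustration of feeding a correlation matrix into principal component analysis, the sketch below recovers only the first principal component using power iteration, which is a simpler stand-in for a full diagonalization. The function name `leading_component`, the example matrix, and the iteration count are assumptions made for illustration; none of them come from ACT, and a numerical library would normally be used for a complete eigendecomposition.

```python
from math import sqrt

def leading_component(matrix, iterations=200):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration).

    Hypothetical helper, not part of ACT. The start vector is arbitrary;
    the method fails if it happens to be orthogonal to the leading eigenvector.
    """
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(row[j] * vec[j] for j in range(n)) for row in matrix]
        norm = sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    # Rayleigh quotient of the unit-length vector gives the eigenvalue estimate.
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return eigenvalue, vec

# Made-up correlation matrix: tracks 0 and 1 are strongly correlated,
# track 2 is uncorrelated with both.
corr = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
value, loadings = leading_component(corr)
print(value, loadings)
```

In this made-up case the first component loads equally on tracks 0 and 1 and not on track 2, which is the kind of grouping the paragraph above describes.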



Abstract

We have implemented an efficient, multi-faceted toolbox for analyzing continuous signal or discrete region tracks from high-throughput genomic experiments, such as RNA-Seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms (SNPs) from dbSNP or the 1000 Genomes Project. We call our toolbox ACT (Aggregation and Correlation Toolbox). It is able to generate aggregate profiles of a given track around a set of anchor points, such as transcription start sites. It is also able to correlate related signal or region tracks, as well as analyze them for saturation, i.e. how much of a certain feature is covered with each succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive webservers (for use on small quantities of data), example data sets, documentation, and a gallery of outputs.



Multiple scripts and interactive webservers that perform each of these tasks are available. Here we explain the components of the toolbox in more detail and apply them to various examples.




Overview

Getting Started: select from one of the three icons on the ACT website

ACT is a toolbox for harvesting useful results from a vast sea of genomic experimental data. In particular, it is a set of scripts (Aggregation, Correlation, and Saturation) designed to be downloaded and used to analyze signal or hit tracks. These scripts, along with their supporting material (documentation, example files) can be accessed by clicking on their respective icons on the act.gersteinlab.org home page. Details of what each script is designed to do, i.e. what files it takes in, what it outputs, and important notes, are discussed below.

There are also several supporting features on the website such as a gallery and example files: these are also discussed below.

In particular, here's a common set of example files that runs on all the tools.

Aggregation

The aggregation script takes values from multiple points on a single genomic signal track and creates an average signal profile around a set of anchor points, such as transcription start sites (TSSs).

The main download is written in Python. Each run takes two input files: a signal or hit track (in the form of an sgr file or point file) and an annotations file in bed format. The output is a columnar file with explanatory headers; the files can be plotted in programs like gnuplot, Excel, or Matlab. The main download package has an R script in the samples folder which shows one way of plotting the output data with error bars.

Computing the "average signal profile" involves a number of computational choices: for example, the bin size, whether to use the median or mean of signals within a bin as the bin's value, and whether to use the median or mean of signals across all bins as the final value in the signal profile. Since the annotations file requires regions as input, there is also a choice as to whether to aggregate around only a single point (the 5' end of the region, such as a TSS) or to include the entire region in the aggregation. Options dealing with all of these choices are available in the main aggregation download. For an idea of how bin scaling over regions works, see the aggregation PowerPoint presentation in the gallery.

  • Specific use instructions

After downloading and unzipping the aggregation package, Agg.tar, the program can be run as follows (data files can be found under "Example Data" in the Aggregation section):

python ACT.py --nbins=50 --mbins=0 --radius=50000 hg17_ensembl.bed baf155.sgr > baf155_ensembl.out

where hg17_ensembl.bed is the annotations file and baf155.sgr is the signal track, placed in the same folder as ACT.py. An alternative run which would include the 3' boundary of each gene region can be performed using the following:

python ACT.py --nbins=50 --mbins=50 --radius=50000 --regions hg17_ensembl.bed baf155.sgr > baf155_ensembl.out

An aggregation run on point tracks (such as SNP lists) to determine average density can be performed as follows:

python ACT.py --nbins=50 --mbins=0 --radius=50000 --signalparser=PointParser gencode.pc.coords.chr1 YRI.snps.parsed.chr1 > YRI_gencode.out

Additional flags corresponding to other aggregation options are listed in the readme.

  • Contact

Robert Bjornson

Correlation

The correlation script takes multiple signal tracks of equal length and divides each one into bins, similarly to the aggregation script, except in this case the bins are not hinged around anchor points and they are generally wider (either hundreds or thousands of bases, depending on which script is chosen). Each bin is assigned a value based on the corresponding signal track values, and then the arrays of bins are correlated with each other in pairwise fashion. Ultimately, a matrix of correlation coefficients corresponding to the correlations between all signal tracks is obtained.

There are options in the correlation script allowing one to control bin (sliding window) size and the overlap of the bins (windows).
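The binning and pairwise correlation steps can be sketched as follows. This is a minimal illustration, not the code of either correlation tool: the function names `bin_track`, `pearson`, and `correlation_matrix`, the use of the window mean as the bin value, and the toy tracks are all assumptions.

```python
from math import sqrt
from statistics import mean

def bin_track(values, window, step):
    """Mean of each sliding window; step < window gives overlapping bins."""
    return [mean(values[i:i + window])
            for i in range(0, len(values) - window + 1, step)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(tracks, window, step):
    """Matrix of pairwise correlations between binned tracks."""
    binned = [bin_track(track, window, step) for track in tracks]
    return [[pearson(a, b) for b in binned] for a in binned]

# Made-up toy tracks of equal length.
tracks = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12],
    [6, 5, 4, 3, 2, 1],
]
for row in correlation_matrix(tracks, window=2, step=2):
    print(row)
```

Here `window` plays the role of the bin (sliding window) size and `step` controls the overlap between consecutive windows.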

There are two versions of the correlation tool. In Kevin Yip's version (the Corr/Sat bundle), a final correlation matrix is created based on the Spearman, Pearson, or normal score correlation between each pair of binned data sets. The other version is Correlation P.

In the Corr/Sat bundle, there is a .bat file with an example run. In Correlation P, an example run command is in the README.

  • Contact

Correlation P was written by Lucas Lochovsky. The Saturation/Correlation bundle was written by Kevin Yip.

Saturation

The saturation script determines the saturation level of a given feature after multiple genomic experiments.

Each input file corresponds to one experimental condition (e.g. one new individual), and each line in a file specifies a genomic location that has the biological phenomenon under study (e.g. tagged SNPs). Our implementation makes use of special data structures to avoid redundant counting. It normally takes less than a minute to generate the plot for up to 30 input files, each with a few thousand lines. To handle more files and files with more lines, the tool also provides an option to compute the coverage of a random sample of the input file combinations.

It produces saturation plots from a set of binary data files. Each line of the input file contains a genomic region in the following format:

<ID><tab><start><tab><end>

where
<ID> is the identifier of the region-at-large, such as the chromosome;
<start> is the starting position of the region;
<end> is the ending position of the region (this position is inside the region).

The y-axis could be the absolute number of nucleotides, or a fraction of an input total number of nucleotides, such as the total number of nucleotides of the coding transcripts in the example. To use the absolute number, input the total as 0.
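A saturation curve over such input can be sketched as below. This is an illustration under stated assumptions, not the tool's implementation: the function names `parse_regions` and `saturation_curve` are hypothetical, every nucleotide is stored individually in a set (the actual tool uses special data structures to avoid redundant counting), and the curve is taken to be the mean coverage over all combinations of k input files.

```python
from itertools import combinations
from statistics import mean

def parse_regions(lines):
    """Parse '<ID><tab><start><tab><end>' lines into a set of (ID, position)."""
    covered = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        region_id, start, end = line.split("\t")
        # The end position is inside the region, hence the + 1.
        for pos in range(int(start), int(end) + 1):
            covered.add((region_id, pos))
    return covered

def saturation_curve(experiments, total=0):
    """Mean coverage over all combinations of k experiments, for k = 1..n.

    experiments -- list of sets as returned by parse_regions
    total       -- 0 for absolute nucleotide counts, otherwise the
                   denominator used to report coverage as a fraction
    """
    curve = []
    for k in range(1, len(experiments) + 1):
        sizes = [len(set().union(*combo))
                 for combo in combinations(experiments, k)]
        coverage = mean(sizes)
        curve.append(coverage / total if total else coverage)
    return curve

# Made-up toy input: three "files" with one region each.
files = [["chr1\t1\t10"], ["chr1\t6\t15"], ["chr1\t11\t20"]]
experiments = [parse_regions(lines) for lines in files]
print(saturation_curve(experiments))            # absolute nucleotide counts
print(saturation_curve(experiments, total=20))  # fraction of 20 nucleotides
```

Enumerating every combination grows quickly with the number of files, which is the situation the random-sampling option described above is meant to address.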

An example file demonstrating how to use the saturation component can be found in saturation.bat.

  • Contact

Kevin Yip

Web ACT

Information about input signal tracks can be found here: http://tiling.mbb.yale.edu:8080/aggcorr/documents/sigfile_readme.htm

  • Aggregation

Note: this tool is based on the C++ version of the source code found in the "Other versions" compendium.

Parameters such as mbins and nbins are the same as described in Aggregation above.


  • Correlation

Information about the correlation web tool can be found here: http://tiling.mbb.yale.edu:8080/aggcorr/documents/corr_readme.htm

  • Contact

Justin Jee

Other

  • Citation

A paper describing this site and software is currently in preparation. For now, please reference act.gersteinlab.org if you use the tool.
