ACT Tool
From GersteinInfo
Revision as of 06:10, 27 June 2010
Aggregation features:
Python script [<A HREF=http://act.gersteinlab.org/Agg.tar>Aggregation</A>], the main download, includes small example files and full documentation. Runs efficiently on large data sets. Update: No longer requires numpy to run.
Other versions zip file [Aggregation-old]. Other drafts of code downloads (in Perl, C++, Matlab). Genomic Signal Aggregator code [GSA] and documentation can be found here.
Example data. Data for the examples used in the ACT paper. For use with Agg.tar, see the walkthrough.
Web ACT, with sample run files but limited to small data sets, and Genomic Signal Aggregator (Zlab), also limited to small data sets but with some extra visualization features.
Gallery. For aggregation, contains an explanatory PowerPoint and example figures generated using a variety of methods.
This page gives some information on http://act.gersteinlab.org (Aggregation & Correlation Toolbox).
Overview
ACT is a toolbox for harvesting useful results from a vast sea of genomic experimental data. In particular, it is a set of scripts (Aggregation, Correlation, and Saturation) designed to be downloaded and used to analyze signal or hit tracks. These scripts, along with their supporting material (documentation, example files), can be accessed by clicking on their respective icons on the act.gersteinlab.org home page. Details of what each script is designed to do (what files it takes in, what it outputs, and important notes) are discussed below.
The website also has several supporting features, such as a gallery and example files, which are discussed below as well.
Aggregation
The aggregation script takes values from multiple points on a single genomic signal track and creates an average signal profile around a set of anchor points, such as Transcription Start Sites (TSSs).
The main download is written in Python. Each run takes two input files: a signal or hit track (in the form of an sgr file or point file), and an annotations file in bed format. The output is a columnar file with explanatory headers, which can be plotted in programs like gnuplot, Excel, or MATLAB. The main download package has an R script in the samples folder which shows one way of plotting the output data with error bars.
Computing the "average signal profile" involves a number of computational choices: the bin size, whether to use the median or mean of signals within a bin as the bin's value, and whether to use the median or mean of signals across all bins as the final value in the signal profile. Since the annotations file requires regions as input, there is also a choice between aggregating around only a single point (the 5' end of the region, such as a TSS) and including the entire region in the aggregation. Options dealing with all of these choices are available in the main aggregation download. For an idea of how bin scaling over regions works, see the aggregation PowerPoint in the gallery.
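To make these choices concrete, here is a minimal Python sketch of aggregation around single anchor points. It is not the code in ACT.py: the function name aggregate_profile, its parameters, and the in-memory (position, value) signal format are assumptions made for illustration. In particular, n_bins here is the total number of bins across the window and is not claimed to match the meaning of --nbins. Strand and whole-region aggregation are ignored.

```python
from statistics import mean, median


def aggregate_profile(signal, anchors, radius, n_bins,
                      bin_stat=mean, profile_stat=mean):
    """Average signal profile around anchor points (illustrative sketch).

    signal  -- list of (position, value) pairs on one chromosome
    anchors -- list of anchor positions (e.g. TSS coordinates)
    radius  -- half-width of the window around each anchor
    n_bins  -- total number of bins across the window (hypothetical parameter)
    bin_stat     -- mean or median of the signal values inside one bin
    profile_stat -- mean or median of one bin's values across all anchors
    """
    per_bin = [[] for _ in range(n_bins)]  # per-anchor values for each bin
    for anchor in anchors:
        buckets = [[] for _ in range(n_bins)]
        for pos, value in signal:
            offset = pos - (anchor - radius)
            if 0 <= offset < 2 * radius:
                # Integer arithmetic keeps the bin index exact.
                buckets[offset * n_bins // (2 * radius)].append(value)
        for i, bucket in enumerate(buckets):
            if bucket:
                per_bin[i].append(bin_stat(bucket))
    # Bins that no anchor contributed to are reported as None.
    return [profile_stat(vals) if vals else None for vals in per_bin]


signal = [(90, 1.0), (110, 3.0), (190, 5.0), (210, 7.0)]
profile = aggregate_profile(signal, anchors=[100, 200], radius=20, n_bins=2)
```

Passing bin_stat=median or profile_stat=median switches either stage from mean to median, which mirrors the two mean/median choices described above.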
- Specific use instructions
After downloading and unpacking the aggregation package, Agg.tar, the program can be run as follows (data files can be found under "Example Data" in the Aggregation section):
python ACT.py --nbins=50 --mbins=0 --radius=50000 hg17_ensembl.bed baf155.sgr > baf155_ensembl.out
where hg17_ensembl.bed is the annotations file and baf155.sgr is the signal track, placed in the same folder as ACT.py. An alternative run which would include the 3' boundary of each gene region can be performed using the following:
python ACT.py --nbins=50 --mbins=50 --radius=50000 --regions hg17_ensembl.bed baf155.sgr > baf155_ensembl.out
An aggregation run on point tracks (such as SNP lists) to determine average density can be performed as follows:
python ACT.py --nbins=50 --mbins=0 --radius=50000 --signalparser=PointParser gencode.pc.coords.chr1 YRI.snps.parsed.chr1 > YRI_gencode.out
Additional tags corresponding to other aggregation options are listed in the readme.
- Contact
Robert Bjornson
Correlation
The correlation script takes multiple signal tracks of equal length and divides each one into bins, similarly to the aggregation script, except in this case the bins are not hinged around anchor points and they are generally wider (either hundreds or thousands of bases, depending on which script is chosen). Each bin is assigned a value based on the corresponding signal track values, and then the arrays of bins are correlated with each other in pairwise fashion. Ultimately, a matrix of correlation coefficients corresponding to the correlations between all signal tracks is obtained.
There are options in the correlation script allowing one to control bin (sliding window) size and the overlap of the bins (windows).
There are two versions of the correlation tool. In Kevin Yip's version (Corr/Sat bundle), a final correlation matrix is created based on either the Spearman, Pearson, or normal score correlation between each pair of binned data sets.
In the Corr/Sat bundle, there is a .bat file with an example run. In Correlation P, an example run command is in the README.
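The binning and pairwise correlation described in this section can be sketched in a few lines of plain Python. This is an illustration only, not the code of either correlation tool: the function names and the window and step parameters are hypothetical, each bin value is taken to be the window mean, and only the Pearson coefficient is shown.

```python
from math import sqrt


def bin_track(values, window, step):
    """Mean of each sliding window; step < window gives overlapping windows."""
    return [sum(values[i:i + window]) / window
            for i in range(0, len(values) - window + 1, step)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists.

    Assumes neither list is constant (a constant list would divide by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(tracks, window, step):
    """Bin every track, then correlate all pairs of binned tracks."""
    binned = [bin_track(t, window, step) for t in tracks]
    return [[pearson(a, b) for b in binned] for a in binned]


tracks = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12],
    [6, 5, 4, 3, 2, 1],
]
matrix = correlation_matrix(tracks, window=2, step=2)
```

In this toy input the second track is a scaled copy of the first and the third is its reverse, so the matrix entries are close to 1 and -1 respectively. A Spearman variant would apply the same formula to the ranks of the binned values.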
- Contact
Correlation P was written by Lucas Lochovsky. The Saturation/Correlation bundle was written by Kevin Yip.
Saturation
The saturation script determines the saturation level of a given feature after multiple genomic experiments.
Each input file corresponds to one experimental condition (e.g. one new individual), and each line in a file specifies a genomic location that has the biological phenomenon under study (e.g. tagged SNPs). Our implementation makes use of special data structures to avoid redundant counting. It normally takes less than a minute to generate the plot for up to 30 input files each with a few thousand lines. To handle more files and files with more lines, the tool also provides an option to compute the coverage of a random sample of the input file combinations.
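As a rough illustration of the idea, the sketch below computes the mean covered length for every number of input files, optionally on a random sample of the file combinations. It is not the tool's implementation: the source does not say which data structures the tool uses, so plain interval merging is substituted here to avoid double counting, and all function and parameter names are hypothetical. Regions are (ID, start, end) tuples with an inclusive end.

```python
import random
from itertools import combinations


def covered_length(regions):
    """Number of positions covered by the union of inclusive regions."""
    total = 0
    current = None  # region being extended: (id, start, end)
    for rid, start, end in sorted(regions):
        if current and current[0] == rid and start <= current[2] + 1:
            # Overlapping or adjacent: extend instead of counting twice.
            current = (rid, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1] + 1
            current = (rid, start, end)
    if current:
        total += current[2] - current[1] + 1
    return total


def saturation_curve(files, max_samples=None, seed=0):
    """Mean coverage for k = 1..n files (one list of regions per file).

    If max_samples is given, at most that many combinations are sampled
    for each k. This sketch still enumerates all combinations first, so
    it only suits small numbers of files.
    """
    rng = random.Random(seed)
    curve = []
    for k in range(1, len(files) + 1):
        combos = list(combinations(files, k))
        if max_samples is not None and len(combos) > max_samples:
            combos = rng.sample(combos, max_samples)
        coverages = [covered_length([r for f in combo for r in f])
                     for combo in combos]
        curve.append(sum(coverages) / len(coverages))
    return curve


files = [[("chr1", 1, 10)], [("chr1", 6, 15)]]
curve = saturation_curve(files)
```

A curve that flattens as k grows suggests that additional experiments add little new coverage.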
It produces saturation plots from a set of binary data files. Each line of the input file contains a genomic region in the following format:
<ID><tab><start><tab><end>
where
<ID> is the identifier of the region-at-large, such as the chromosome
<start> is the starting position of the region
<end> is the ending position of the region (this position is inside the region)
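A line in this format could be read with a small helper like the one below. The function names are hypothetical and the error handling is an assumption; the only point taken from the format itself is that the end position is inclusive, so a region spans end - start + 1 positions.

```python
def parse_region(line):
    """Parse one '<ID><tab><start><tab><end>' line into (id, start, end)."""
    region_id, start, end = line.rstrip("\n").split("\t")
    start, end = int(start), int(end)
    if end < start:
        raise ValueError("end position is before start: %r" % line)
    return region_id, start, end


def region_length(region):
    """Length of a region whose end position is inside the region."""
    _, start, end = region
    return end - start + 1


region = parse_region("chr1\t100\t200\n")
length = region_length(region)
```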
The y-axis could be the absolute number of nucleotides, or a fraction of an input total number of nucleotides, such as the total number of nucleotides of the coding transcripts in the example. To use the absolute number, input the total as 0.
An example file demonstrating how to use the saturation component can be found in saturation.bat.
- Contact
Kevin Yip
Web ACT
Information about input signal tracks can be found here: http://tiling.mbb.yale.edu:8080/aggcorr/documents/sigfile_readme.htm
- Aggregation
Note: based on C++ version of source code found in "Other versions" compendium
Parameters such as mbins and nbins are the same as described in Aggregation above.
- Correlation
Information about the correlation web tool can be found here: http://tiling.mbb.yale.edu:8080/aggcorr/documents/corr_readme.htm
- Contact
Justin Jee
Other
- Citation
A paper describing this site and software is currently in preparation. For now, please just reference act.gersteinlab.org if you use the tool.