Here is some information on the Aggregation & Correlation Toolbox (ACT.gersteinlab.org)
ACT is a toolbox for harvesting useful results from a vast sea of genomic experimental data. In particular, it is a set of scripts (Aggregation, Correlation, and Saturation) designed to be downloaded and used to analyze signal or hit tracks. These scripts, along with their supporting material (documentation, example files) can be accessed by clicking on their respective icons on the act.gersteinlab.org home page. The downloads are designed primarily with Unix/Linux users in mind. Details of what each script is designed to do, i.e. what files it takes in, what it outputs, and important notes, including dependencies, are discussed below.
General contact: jjmg [at] gersteinlab.org
The aggregation script, agg-py, takes values from multiple points on a single genomic signal track and creates an average signal profile around a set of anchor points, such as Transcription Start Sites (TSSs).
The main download is written in Python. Each run takes two input files: a signal or hit track (an sgr file or point file) and an annotations file in bed format. The output is a columnar file with explanatory headers, which can be plotted in programs such as gnuplot, Excel, or MATLAB. The samples folder of the main download package contains an R script that shows one way of plotting the output data with error bars.
It should be noted that in computing the "average signal profile" there are a number of computational choices to be made: for example, bin size, whether to use the median or mean of signals within a bin as the bin's value, and whether to use the median or mean of signals across all bins as the final value in the signal profile. Since the annotations file requires regions input, there is also a choice to be made as to whether to aggregate around only a single point (the 5' end of the region, such as TSS's) or to include the entire region in the aggregation. Options dealing with all of these choices are available in the main aggregation download. For an idea of how bin scaling over regions works, see the aggregation powerpoint in the gallery.
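The choices described above can be illustrated with a minimal Python sketch. This is not the ACT.py implementation: the function name `aggregate_profile`, its arguments, and the single-chromosome, point-anchor setup are all assumptions made for illustration. The `within` argument summarizes the signals of one anchor inside a bin, and `across` combines those per-anchor summaries into the final profile value.

```python
from statistics import mean, median

def aggregate_profile(signal, anchors, radius, nbins, within=median, across=median):
    """Hypothetical sketch of an average signal profile around anchor points.

    signal:  list of (position, value) pairs on one chromosome
    anchors: list of (position, strand) pairs, strand being '+' or '-'
    Returns 2*nbins values ordered upstream to downstream (None for empty bins).
    """
    bin_width = radius / nbins
    per_bin = [[] for _ in range(2 * nbins)]  # one summary per anchor per bin
    for anchor_pos, strand in anchors:
        buckets = [[] for _ in range(2 * nbins)]
        for pos, value in signal:
            offset = pos - anchor_pos
            if strand == "-":
                offset = -offset  # flip so negative offsets are always upstream
            if -radius <= offset < radius:
                index = min(int((offset + radius) // bin_width), 2 * nbins - 1)
                buckets[index].append(value)
        for i, values in enumerate(buckets):
            if values:
                per_bin[i].append(within(values))  # within-bin summary for this anchor
    # Combine the per-anchor summaries into one value per bin.
    return [across(values) if values else None for values in per_bin]

signal = [(90, 1.0), (95, 3.0), (105, 10.0)]
print(aggregate_profile(signal, [(100, "+")], radius=10, nbins=1))
print(aggregate_profile(signal, [(100, "+")], radius=10, nbins=1, within=mean))
```

Passing `mean` or `median` for `within` and `across` shows how the same data can yield different profiles depending on the choices made.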
Annotation file. Tab-delimited file containing several start and end positions, chromosome locations, and strand annotations, such as a bed file. The default structure of these files is like a bed file in which the chromosome is in column 1, 5’ position in column 2, 3’ position in column 3, and strand (“+” or “-”) in column 4. The program will automatically switch orientation based on strand. For this field, you may either upload your own file or use one of the common gene annotations found in the dropdown menu.
Signal file. Tab-delimited file containing signals and positions. Depending on what type of file is specified as well as how probewidth (see below) is defined, these can either represent discrete points with an independent probe width, or they can represent the signal for all locations until the next defined spot. Take, for example, the following excerpt from a signal file:
chr1 10 2.3
chr1 40 5.0
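The two readings of the two-point excerpt above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `parse_sgr` and `signal_at` are hypothetical, the probe is assumed to start at the listed position (rather than being centered on it), and in the "until the next defined spot" reading the last point is assumed to extend indefinitely. ACT itself may handle these details differently.

```python
from bisect import bisect_right

def parse_sgr(text):
    """Parse 'chrom position value' lines into {chrom: sorted [(pos, value), ...]}."""
    points = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, pos, value = line.split()
        points.setdefault(chrom, []).append((int(pos), float(value)))
    for track in points.values():
        track.sort()
    return points

def signal_at(points, chrom, query, probewidth=None):
    """Return the signal at 'query', or None if no point covers it.

    probewidth=None: each value holds until the next defined position.
    probewidth=w:    each value covers [position, position + w) only (assumption).
    """
    track = points.get(chrom, [])
    # Index of the last point whose position is <= query.
    i = bisect_right(track, (query, float("inf"))) - 1
    if i < 0:
        return None
    pos, value = track[i]
    if probewidth is not None and query >= pos + probewidth:
        return None
    return value

points = parse_sgr("chr1 10 2.3\nchr1 40 5.0")
print(signal_at(points, "chr1", 35))                 # value carried forward from position 10
print(signal_at(points, "chr1", 35, probewidth=20))  # outside the 20 bp probe at position 10
```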
--nbins. Specifies how many bins will be in each flanking region. For jobs that do not use scaled bins in the region between start and stop sites, there will be 2n bins in total: n bins on each side of the start site.
--regions. Optional tag. If included, the program will aggregate "radius" base pairs upstream of the start site, downstream of the stop site, AND the "region" between the start and stop site. The way it will do this is to create "mbins" number of bins between the start and stop site for each segment in the annotations file, and scale the bins based on the size of the segment. Note that, as a result, mbins must be smaller than the smallest segment in the annotation file.
--mbins. Optional field to be defined if stop and start sites (regions) are to be considered in the aggregation process. Defines the number of bins assigned to the region between start and stop sites (for a total number of m+2n bins).
--radius. Specifies the number of base pairs the program should analyze in both directions from either the start site or the start/stop pair, depending on the option specified. In the output, the program presents radius base pairs divided into nbins bins on each side: upstream and downstream of start sites if --regions is not used, or upstream of start sites and downstream of stop sites if it is.
--mingenelen. Tells the program not to consider annotations whose start/stop pairs are closer together than the given length.
--mean. By default, each gene contributes only its median signal (or number of probes, if the density option is selected) to the final “averaging” calculation in each bin, which avoids bias against shorter genes. If --mean is specified, the program reports the mean signal across all genes for each bin. Otherwise, it reports the median signal (of the median signal of each gene).
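The bin layout implied by --nbins, --mbins, --radius, and --regions can be sketched as below. This is a hedged illustration rather than the code in ACT.py: the function `bin_edges` is hypothetical, it handles only "+" strand annotations, and it returns half-open (low, high) intervals. Without regions it produces 2n flanking bins around the start site; with regions it produces n upstream bins, m bins scaled to the length of the annotation, and n bins downstream of the stop site.

```python
def bin_edges(start, end, radius, nbins, mbins=0, regions=False):
    """Hypothetical bin layout for one '+' strand annotation.

    Returns a list of (low, high) intervals: 2*nbins of them without regions,
    mbins + 2*nbins with regions.
    """
    flank = radius / nbins  # width of each flanking bin in base pairs
    upstream = [(start - radius + i * flank, start - radius + (i + 1) * flank)
                for i in range(nbins)]
    if not regions:
        downstream = [(start + i * flank, start + (i + 1) * flank)
                      for i in range(nbins)]
        return upstream + downstream
    if mbins <= 0:
        raise ValueError("regions mode needs mbins > 0")
    inner_width = (end - start) / mbins  # scaled to the length of this annotation
    inner = [(start + i * inner_width, start + (i + 1) * inner_width)
             for i in range(mbins)]
    downstream = [(end + i * flank, end + (i + 1) * flank)
                  for i in range(nbins)]
    return upstream + inner + downstream

print(len(bin_edges(1000, 2000, radius=100, nbins=2)))                         # 2n bins
print(len(bin_edges(1000, 2000, radius=100, nbins=2, mbins=4, regions=True)))  # m + 2n bins
```

Because the inner bins are scaled per annotation, a long gene and a short gene both contribute exactly m region bins, which is why the bin count must not exceed the length of the shortest segment.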
- Specific use instructions
Accepted file input types: .sgr (for signal track); abbreviated .bed file (for annotation track) Note: .wig signal tracks accepted in Web ACT only.
After unzipping, the main source file (which includes a detailed header with use instructions) is ACT.py. A makefile with example runs can be found in Agg/samples/Makefile
After downloading and unzipping the aggregation package, Agg.tar, the program can be run as follows (data files can be found under "Example Data" in the Aggregation section):
python ACT.py --nbins=50 --mbins=0 --radius=50000 hg17_ensembl.bed baf155.sgr > baf155_ensembl.out
where hg17_ensembl.bed is the annotations file and baf155.sgr is the signal track, placed in the same folder as ACT.py. An alternative run which would include the 3' boundary of each gene region can be performed using the following:
python ACT.py --nbins=50 --mbins=50 --radius=50000 --regions hg17_ensembl.bed baf155.sgr > baf155_ensembl.out
An aggregation run on point tracks (such as SNP lists) to determine average density can be performed as follows:
python ACT.py --nbins=50 --mbins=0 --radius=50000 --signalparser=PointParser gencode.pc.coords.chr1 YRI.snps.parsed.chr1 > YRI_gencode.out
There are additional tags corresponding to different aggregation options which can be viewed in the readme.
Update: no longer requires numpy to run
Update: Please see 'testrun.sh' in the samples folder for an example of how to link the text output from agg-py to the provided R script which generates graphical output.
The correlation script, corr-p, takes multiple signal tracks of equal length and divides each one into bins, similarly to the aggregation script, except that in this case the bins are not hinged around anchor points. Each bin is assigned a value based on the corresponding signal track values, and then the arrays of bins are correlated with each other in pairwise fashion. Ultimately, a matrix of correlation coefficients corresponding to the correlations between all signal tracks is obtained. corr-p is parallelized, runs on bed files, and uses Pearson's correlation coefficient to calculate the correlation matrix. For another version of the correlation tool, which takes in wig files and can use a variety of correlation methods, please see the corr-sat bundle.
There are options in the correlation script allowing one to control bin (sliding window) size and the overlap of the bins (windows).
- Parameters for corr-p
window size. Specifies the size of the sliding window in base pairs. For example, in Genome Res 17: 787-97 a sliding window size of 3kb was used, with an overlap of 1.5kb.
overlap size. Specifies the number of base pairs shared by consecutive windows as the window slides (1.5kb in the example above).
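How the window size and overlap interact can be sketched in Python. This is not the corr-p (Java) implementation: the function names are hypothetical, the per-window score is assumed to be the number of bases covered by bed intervals (corr-p may assign bin values differently), and the intervals within a track are assumed not to overlap each other. The step between consecutive windows is the window size minus the overlap.

```python
from math import sqrt

def sliding_windows(region_start, region_end, window, overlap):
    """List the (low, high) windows that fit entirely inside the region."""
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window size")
    windows = []
    low = region_start
    while low + window <= region_end:
        windows.append((low, low + window))
        low += step
    return windows

def covered_bases(intervals, windows):
    """Score each window by the bases it shares with (start, end) intervals."""
    return [sum(max(0, min(high, end) - max(low, start)) for start, end in intervals)
            for low, high in windows]

def pearson(x, y):
    """Pearson's correlation coefficient (undefined when a track is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# 3 kb windows with 1.5 kb overlap, as in the example above.
windows = sliding_windows(0, 9000, window=3000, overlap=1500)
track_a = covered_bases([(1000, 2000), (5000, 5600)], windows)
track_b = covered_bases([(1200, 2200), (7000, 7400)], windows)
print(len(windows), pearson(track_a, track_b))
```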
- Specific Use Instructions
Input file type: .bed
The tool is run as follows:
java -jar EncodeTfCor2.jar [genome file] [BED file directory] [window size] [overlap size]
where [genome file] is a list of regions for the tool to consider (see dist/human_genome_file.txt for an example), [BED file directory] is the folder containing bed files to correlate, and [window size] and [overlap] are as described in parameters.
A specific example run can be found in EXAMPLE.txt
NOTE: When the corr-sat-bundle package is downloaded from the website, in the file "config.txt" all lines except for chr22 are commented out. If you uncomment the lines for other chromosomes the program should work properly.
Unlike corr-p, the correlation tool in the corr-sat bundle can take in wig files with an associated signal as input.
chr1 32948 32980 2
chrX 23434 23800 10
It then computes a correlation value (Pearson, Spearman or normal-score) for each pair of binned datasets. It takes a .wig file as input, and creates a tab-delimited file as output.
The correlation script correlates signal tracks using a relatively small window size. In correlation.bat or correlation.sh this can be set by assigning a value to "bin."
The correlation script will first run a binning program which will create broader wig files based on the size specified by "bin." It will then correlate the created files using a separate correlation script. For the available correlation options, see correlation.sh or correlation.bat.
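The binning step can be pictured with the following Python sketch. It is an assumption-laden illustration, not the bundled binning program: the function `bin_intervals` is hypothetical, input lines are taken to be "chrom start end value" with half-open coordinates, and each bin is scored as the sum of signal times overlapping bases (the real tool may average or score bins differently).

```python
from collections import defaultdict

def bin_intervals(lines, bin_size):
    """Spread 'chrom start end value' intervals over fixed-size bins.

    Returns {(chrom, bin_index): score}, where score is the sum of
    value * overlapping bases (an assumed scoring rule).
    """
    bins = defaultdict(float)
    for line in lines:
        chrom, start, end, value = line.split()
        start, end, value = int(start), int(end), float(value)
        # Visit every bin touched by the half-open interval [start, end).
        for index in range(start // bin_size, (end - 1) // bin_size + 1):
            low = max(start, index * bin_size)
            high = min(end, (index + 1) * bin_size)
            bins[(chrom, index)] += value * (high - low)
    return dict(bins)

binned = bin_intervals(["chr1 32948 32980 2", "chrX 23434 23800 10"], bin_size=100)
for key in sorted(binned):
    print(key, binned[key])
```

Once every track has been reduced to the same set of bins, the binned values can be compared pairwise with the chosen correlation method.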
The program produces output files such as the following:
Pearson Correlation      chipseq_cfos_bin100.wig  chipseq_jund_bin100.wig  chipseq_max_bin100.wig  chipseq_pol2_bin100.wig
chipseq_cfos_bin100.wig  1                        0.15236918091939933      0.26114716638422464     0.14500447272313652
chipseq_jund_bin100.wig  0.15236918091939933      1                        0.4091055568644753      0.1422116076377861
chipseq_max_bin100.wig   0.26114716638422464      0.4091055568644753       1                       0.3309740106844516
chipseq_pol2_bin100.wig  0.14500447272313652      0.1422116076377861       0.3309740106844516      1
The resulting correlation matrix can be plotted as a heatmap using R or as a dendrogram using Phylip.
The saturation script determines the saturation level of a given feature after multiple genomic experiments.
Each input file corresponds to one experimental condition (e.g. one new individual), and each line in a file specifies a genomic location that has the biological phenomenon under study (e.g. tagged SNPs). Our implementation makes use of special data structures to avoid redundant counting. It normally takes less than a minute to generate the plot for up to 30 input files, each with a few thousand lines. To handle more files and files with more lines, the tool also provides an option to compute the coverage of a random sample of the input file combinations.
It produces saturation plots from a set of binary data files. Each line of the input file contains a genomic region in the following format:
<ID> <start> <end>
where <ID> is the identifier of the region-at-large, such as the chromosome; <start> is the starting position of the region; and <end> is the ending position of the region (this position is inside the region).
The y-axis could be the absolute number of nucleotides, or a fraction of an input total number of nucleotides, such as the total number of nucleotides of the coding transcripts in the example. To use the absolute number, input the total as 0.
It produces output files such as the one below:
Number of features  25 percentile  Median    75 percentile
1                   38511.0        43483.0   46449.0
2                   59867.0        63846.0   76010.0
3                   74606.0        83660.0   91301.0
4                   86739.0        98260.0   104150.0
5                   100967.0       108464.0  114645.0
6                   111935.0       117218.0  123894.0
7                   120987.0       126453.0  131483.0
8                   129865.0       135329.0  139093.0
9                   139195.0       143462.0  144009.0
10                  147732.0       147732.0  147732.0
In addition, it produces a saturation plot.
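The idea behind the saturation table can be sketched in Python as follows. This is a brute-force illustration, not the tool's implementation (which uses special data structures to avoid redundant counting): for each number k of input files it takes combinations of k files, counts the nucleotides covered by the union of their regions, and reports quartiles. The function names, the `max_samples` option, and the interpolation used for percentiles are assumptions. Region ends are treated as inclusive, as stated above, and a `total` of 0 means absolute counts.

```python
import random
from itertools import combinations

def covered_length(regions):
    """Nucleotides covered by the union of (id, start, end) regions, end inclusive."""
    by_id = {}
    for region_id, start, end in regions:
        by_id.setdefault(region_id, []).append((start, end))
    total = 0
    for spans in by_id.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:          # overlapping or adjacent: merge
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total

def percentile(sorted_values, q):
    """Linear interpolation between closest ranks (one of several conventions)."""
    pos = (len(sorted_values) - 1) * q
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (pos - low)

def saturation(files, total=0, max_samples=None, seed=0):
    """Rows of (k, 25th percentile, median, 75th percentile) for k = 1..len(files)."""
    rng = random.Random(seed)
    rows = []
    for k in range(1, len(files) + 1):
        combos = list(combinations(files, k))
        if max_samples is not None and len(combos) > max_samples:
            combos = rng.sample(combos, max_samples)  # random subset of combinations
        coverages = sorted(covered_length([r for f in combo for r in f])
                           for combo in combos)
        if total:
            coverages = [c / total for c in coverages]  # fraction of the given total
        rows.append((k, percentile(coverages, 0.25),
                     percentile(coverages, 0.5), percentile(coverages, 0.75)))
    return rows

files = [[("chr1", 1, 10)], [("chr1", 5, 14)], [("chr2", 1, 5)]]
for row in saturation(files):
    print(*row, sep="\t")
```

With all files combined there is only one combination, so the three quartiles coincide, as in the last row of the example output above.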
Information about input signal tracks can be found here: http://act.gersteinlab.org/sigfile_readme.htm
Information about the aggregation web tool can be found here: http://act.gersteinlab.org/agg_readme.htm
In the GSA, input signals are assigned to the nearest anchor in order to reduce the artifacts caused by subsets of anchors clustering together. For aggregation analysis of segment signals, such as sequencing tags, this requires the genomic scope of the experimental setup to be partitioned and accounted for as denominators. The techniques were used and proved effective in the ENCODE pilot and CTCF nucleosome papers.
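The nearest-anchor assignment can be illustrated with a short Python sketch. It is not taken from the GSA package: the function names are hypothetical, anchors and signals are assumed to lie on a single chromosome, ties are assumed to go to the left anchor, and the partitioning of the genomic scope into denominators is not shown.

```python
from bisect import bisect_left

def nearest_anchor(sorted_anchors, position):
    """Return the anchor closest to 'position' (ties go to the left anchor)."""
    i = bisect_left(sorted_anchors, position)
    if i == 0:
        return sorted_anchors[0]
    if i == len(sorted_anchors):
        return sorted_anchors[-1]
    left, right = sorted_anchors[i - 1], sorted_anchors[i]
    return left if position - left <= right - position else right

def assign_signals(anchors, signal):
    """Give each (position, value) signal to its nearest anchor only.

    Returns {anchor: [(offset from anchor, value), ...]}.
    """
    sorted_anchors = sorted(anchors)
    assigned = {anchor: [] for anchor in sorted_anchors}
    for position, value in signal:
        anchor = nearest_anchor(sorted_anchors, position)
        assigned[anchor].append((position - anchor, value))
    return assigned

anchors = [100, 120, 1000]
signal = [(105, 1.0), (115, 2.0), (500, 3.0), (900, 4.0)]
print(assign_signals(anchors, signal))
```

Because each signal is counted for one anchor only, two anchors that sit close together (100 and 120 here) do not both count the same nearby signal.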
Please see http://act.gersteinlab.org/gsa.tar for more information.
Information about the correlation web tool can be found here: http://act.gersteinlab.org/corr_readme.htm
J Jee*, J Rozowsky*, KY Yip*, L Lochovsky, R Bjornson, G Zhong, Z Zhang, Y Fu, J Wang, Z Weng, M Gerstein. ACT: Aggregation and Correlation Toolbox for Analyses of Genome Tracks. (2011) Bioinformatics
- RSeq Tools
RSeq Tools is a suite of tools for analyzing RNA-Seq experiment data. One of its functions is to generate a signal track from experimental data, which can be fed into ACT. RSeq Tools can be found here: http://archive.gersteinlab.org/proj/rnaseq/rseqtools/