ACT Tool

From GersteinInfo

Revision as of 23:34, 18 September 2010 by Mbg (Talk | contribs)

Here is some information on the Aggregation & Correlation Toolbox (ACT.gersteinlab.org)

Overview

Getting Started: select from one of the three icons on the ACT website

ACT is a toolbox for harvesting useful results from a vast sea of genomic experimental data. In particular, it is a set of scripts (Aggregation, Correlation, and Saturation) designed to be downloaded and used to analyze signal or hit tracks. These scripts, along with their supporting material (documentation, example files) can be accessed by clicking on their respective icons on the act.gersteinlab.org home page. The downloads are designed primarily with Unix/Linux users in mind. Details of what each script is designed to do, i.e. what files it takes in, what it outputs, and important notes, including dependencies, are discussed below.

There are also several supporting features, such as a gallery and an integrated example that uses whole-genome ChIP-Seq data.

General contact: jjmg [at] gersteinlab.org

agg-py

The aggregation script agg-py takes values from multiple points on a single genomic signal track and creates an average signal profile around a set of anchor points, such as Transcription Start Sites (TSSs).

The main download is written in Python. Each run takes two input files: a signal or hit track (in the form of an sgr file or point file), and an annotations file in bed format. The output is a columnar file with explanatory headers; the files can be plotted in programs like gnuplot, Excel, or MATLAB. The main download package has an R script in the samples folder which shows one way of plotting the output data with error bars.

Computing the "average signal profile" involves a number of computational choices: for example, bin size, whether to use the median or mean of signals within a bin as the bin's value, and whether to use the median or mean of signals across all bins as the final value in the signal profile. Because the annotations file specifies regions rather than single points, there is also a choice as to whether to aggregate around only a single point (the 5' end of the region, such as a TSS) or to include the entire region in the aggregation. Options dealing with all of these choices are available in the main aggregation download. For an idea of how bin scaling over regions works, see the aggregation PowerPoint in the gallery.

  • Parameters

Annotation file. Tab-delimited file containing several start and end positions, chromosome locations, and strand annotations, such as a bed file. The default structure of these files is like a bed file in which the chromosome is in column 1, 5’ position in column 2, 3’ position in column 3, and strand (“+” or “-”) in column 4. The program will automatically switch orientation based on strand. For this field, you may either upload your own file or use one of the common gene annotations found in the dropdown menu.

Signal file. Tab-delimited file containing signals and positions. Depending on what type of file is specified as well as how probewidth (see below) is defined, these can either represent discrete points with an independent probe width, or they can represent the signal for all locations until the next defined spot. Take, for example, the following excerpt from a signal file:

chr1     10        2.3
chr1     40        5.0

If probewidth is, say, 5, these two lines are read to mean that locations 10 through 15 on chromosome 1 have a signal of 2.3, and locations 40 through 45 have a signal of 5.0. If probewidth is left blank, these two lines are read to mean that locations 10 through 40 have a signal of 2.3. Both of these options require selecting the “sgr” format. The “wiggle” format allows one to upload a file containing four columns: chromosome, start position, stop position, and intensity, in that order. The “density” option allows one to upload a two-column file containing chromosome and location. The program will assume that all points have a width of 1 and will calculate the number of probes that fall into the various bins, rather than the average intensity.
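The two readings of the sgr excerpt above can be illustrated with a short Python sketch. This is not ACT's own parser: the function name sgr_intervals is hypothetical, and the handling of the last point on a chromosome (which has no following position) is an assumption.

```python
def sgr_intervals(lines, probewidth=None):
    """Turn sgr lines (chromosome, position, signal) into
    (chromosome, start, end, signal) intervals.

    Illustrative only. With a probewidth, each point covers
    position .. position + probewidth. Without one, each point
    extends to the next listed position on the same chromosome.
    """
    points = []
    for line in lines:
        chrom, pos, signal = line.split()
        points.append((chrom, int(pos), float(signal)))

    intervals = []
    for i, (chrom, pos, signal) in enumerate(points):
        if probewidth is not None:
            intervals.append((chrom, pos, pos + probewidth, signal))
        elif i + 1 < len(points) and points[i + 1][0] == chrom:
            intervals.append((chrom, pos, points[i + 1][1], signal))
        # Assumption: the last point on a chromosome has no following
        # position, so this sketch simply leaves it out.
    return intervals


excerpt = ["chr1     10        2.3", "chr1     40        5.0"]
print(sgr_intervals(excerpt, probewidth=5))
print(sgr_intervals(excerpt))
```

With probewidth=5 the sketch yields one interval per line (10 to 15 and 40 to 45); with probewidth left unset it yields a single interval from 10 to 40 carrying the signal 2.3.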

nbins upstream and downstream of a site
mbins between start and stop sites

Number of bins (n). Specifies how many bins will be in each flanking region. For jobs that do not use scaled bins in the region between start and stop sites, there will be a total of 2xn bins: n bins on each side of the start site.

Number of bins (m). Optional field to be defined if stop and start sites (regions) are to be considered in the aggregation process. Defines the number of bins assigned to the region between start and stop sites (for a total number of m+2n bins).

Length of flanking region. Specifies number of base pairs the program should analyze in both directions from either the start site or the start/stop pair, depending on the option specified.

Probe width. For signal files that present data for probes of a specific width (see Signal file above for use instructions).

Minimum gene length. Tells the program not to consider annotations for start/stop pairs that are shorter than a certain distance.

Include intergenic regions. If checked, tells the program to scale bins across the region between the start and stop sites. If unchecked, the program analyzes in both directions from the start site only.

Use mean. The program will automatically ensure that only the median signal of an individual gene (or number of probes, if the density option is selected) contributes to the final “averaging” calculation in each bin to avoid bias against shorter genes. If use mean is selected, the program will take the mean signal across all genes per bin and report the final result. Otherwise, it will use the median signal (of the median signal of each gene).
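The parameters above can be tied together with a small Python sketch of the single-anchor case (no scaled bins between start and stop sites). It is an illustration under stated assumptions, not the code in ACT.py: the function names flank_bin, total_bins, and aggregate_bin are hypothetical, bins are assumed to have equal width (flank length divided by n), and "use mean" is assumed to average the per-gene medians.

```python
from statistics import mean, median


def total_bins(nbins, mbins=0):
    """Total number of bins: n per flank plus m inside the region."""
    return mbins + 2 * nbins


def flank_bin(position, anchor, strand, nbins, radius):
    """Return the bin index (0 .. 2*nbins - 1) of a position, or None.

    Bins 0 .. nbins-1 lie upstream of the anchor and bins
    nbins .. 2*nbins-1 lie downstream. Assumes equal-width bins of
    radius / nbins base pairs (an assumption of this sketch).
    """
    offset = position - anchor
    if strand == "-":
        offset = -offset  # flip so negative offsets are always upstream
    if offset < -radius or offset >= radius:
        return None  # outside the flanking region
    bin_width = radius / nbins
    return int((offset + radius) // bin_width)


def aggregate_bin(signals_per_gene, use_mean=False):
    """Combine one bin's signals across genes.

    signals_per_gene holds one list of signal values per gene. Each
    gene contributes only its median, so longer genes with more probes
    do not dominate; the per-gene medians are then combined by median
    (default) or by mean.
    """
    per_gene = [median(values) for values in signals_per_gene if values]
    if not per_gene:
        return None
    return mean(per_gene) if use_mean else median(per_gene)


print(total_bins(50, 50))
print(flank_bin(100500, 100000, "-", nbins=50, radius=50000))
print(aggregate_bin([[1, 2, 9], [4], [10, 20]]))
print(aggregate_bin([[1, 2, 9], [4], [10, 20]], use_mean=True))
```

In the last two calls the per-gene medians are 2, 4, and 15, so the default result is their median and the use_mean result is their mean.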

  • Specific use instructions

Accepted file input types: .sgr, .wig (for signal track); abbreviated .bed file (for annotation track)

After unzipping, the main source file (which includes a detailed header with use instructions) is ACT.py. A makefile with example runs can be found in Agg/samples/Makefile.

After downloading and unzipping the aggregation package, Agg.tar, the program can be run as follows (data files can be found under "Example Data" in the Aggregation section):

python ACT.py --nbins=50 --mbins=0 --radius=50000 hg17_ensembl.bed baf155.sgr > baf155_ensembl.out

where hg17_ensembl.bed is the annotations file and baf155.sgr is the signal track, placed in the same folder as ACT.py. An alternative run which would include the 3' boundary of each gene region can be performed using the following:

python ACT.py --nbins=50 --mbins=50 --radius=50000 --regions hg17_ensembl.bed baf155.sgr > baf155_ensembl.out

An aggregation run on point tracks (such as SNP lists) to determine average density can be performed as follows:

python ACT.py --nbins=50 --mbins=0 --radius=50000 --signalparser=PointParser gencode.pc.coords.chr1 YRI.snps.parsed.chr1 > YRI_gencode.out

There are additional tags corresponding to different aggregation options which can be viewed in the readme.

Update: no longer requires numpy to run

corr-p

The correlation script corr-p takes multiple signal tracks of equal length and divides each one into bins, similarly to the aggregation script, except that in this case the bins are not hinged around anchor points. Each bin is assigned a value based on the corresponding signal track values, and then the arrays of bins are correlated with each other in pairwise fashion. The result is a matrix of correlation coefficients covering all pairs of signal tracks. corr-p is parallelized, runs on bed files, and uses Pearson's correlation coefficient to calculate the correlation matrix. For another version of the correlation tool, which takes in wig files and can use a variety of correlation methods, please see the corr-sat bundle.

There are options in the correlation script allowing one to control bin (sliding window) size and the overlap of the bins (windows).
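The pairwise step described above can be sketched in a few lines of Python. corr-p itself is a parallelized Java tool, so this is only an illustration of the computation: the function names and the toy track values are made up, and each track is assumed to have already been reduced to a list of per-bin values of equal length.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(tracks):
    """tracks maps a track name to its list of bin values.

    Returns a nested dict: matrix[a][b] is the correlation of a and b.
    """
    names = sorted(tracks)
    return {a: {b: pearson(tracks[a], tracks[b]) for b in names} for a in names}


# Hypothetical binned tracks, for illustration only.
tracks = {
    "trackA": [1.0, 2.0, 3.0],
    "trackB": [2.0, 4.0, 6.0],
    "trackC": [3.0, 2.0, 1.0],
}
matrix = correlation_matrix(tracks)
print(matrix["trackA"]["trackB"], matrix["trackA"]["trackC"])
```

The matrix is symmetric with ones on the diagonal, which matches the shape of the correlation output shown for the corr-sat bundle.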

Correlation parameters summary, from Web ACT
  • Parameters for corr-p

window size. Specifies the size of the sliding window in base pairs. For example, in Genome Res 17: 787-97 a sliding window size of 3kb was used, with an overlap of 1.5kb.

overlap size. The number of base pairs shared by consecutive windows as the sliding window moves along the track. For example, in Genome Res 17: 787-97 a sliding window size of 3kb was used, with an overlap of 1.5kb.
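To make the relationship between window size and overlap concrete, here is a minimal Python sketch that lists the windows over a region. It assumes that each window starts (window size minus overlap) base pairs after the previous one and that a trailing partial window is dropped; neither detail is confirmed by the corr-p documentation, and the function name sliding_windows is hypothetical.

```python
def sliding_windows(region_start, region_end, window_size, overlap):
    """Return (start, end) pairs of half-open windows over a region.

    Consecutive windows share `overlap` base pairs, so each window
    starts window_size - overlap after the previous one. A trailing
    partial window is dropped (an assumption of this sketch).
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be at least 0 and less than window_size")
    step = window_size - overlap
    windows = []
    start = region_start
    while start + window_size <= region_end:
        windows.append((start, start + window_size))
        start += step
    return windows


# The 3kb window with 1.5kb overlap from the example above, on a 9kb region.
for window in sliding_windows(0, 9000, 3000, 1500):
    print(window)
```

With these settings each base pair in the interior of the region falls into two windows, which is why the overlap smooths the binned signal.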

  • Specific Use Instructions

Input file type: .bed

The tool is run as follows:

java -jar EncodeTfCor2.jar [genome file] [BED file directory] [window size] [overlap size]

where [genome file] is a list of regions for the tool to consider (see dist/human_genome_file.txt for an example), [BED file directory] is the folder containing bed files to correlate, and [window size] and [overlap size] are as described in the parameters above.

A specific example run can be found in EXAMPLE.txt

corr-sat-bundle

  • Correlation

Unlike corr-p, the correlation tool in the corr-sat bundle can take in wig files with an associated signal as input.

chr1	32948	32980	2
chrX	23434	23800	10

It then computes a correlation value (Pearson, Spearman or normal-score) for each pair of binned datasets. It takes a .wig file as input, and creates a tab-delimited file as output.

The correlation script correlates signal tracks using a relatively small window size. In correlation.bat or correlation.sh this can be set by assigning a value to "bin."

The correlation script will first run a binning program which will create broader wig files based on the size specified by "bin." It will then correlate the created files using a separate correlation script. For the available correlation options, see correlation.sh or correlation.bat.
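The binning step can be pictured with the following Python sketch. How the bundled binning program actually combines values within a bin is not described here, so the sketch makes an explicit assumption: each bin receives the signal multiplied by the number of bases of the interval that fall inside it, with intervals treated as half-open. The function name bin_wig is hypothetical.

```python
from collections import defaultdict


def bin_wig(lines, bin_size):
    """Sum signal-times-bases into fixed-size bins.

    Each line has four columns: chromosome, start, stop, value.
    Intervals are treated as half-open [start, stop), and an interval
    that spans several bins is split between them. Both choices are
    assumptions of this sketch.
    """
    totals = defaultdict(float)
    for line in lines:
        chrom, start, stop, value = line.split()
        start, stop, value = int(start), int(stop), float(value)
        pos = start
        while pos < stop:
            bin_index = pos // bin_size
            bin_end = (bin_index + 1) * bin_size
            covered = min(stop, bin_end) - pos  # bases inside this bin
            totals[(chrom, bin_index)] += covered * value
            pos += covered
    return dict(totals)


lines = ["chr1\t32948\t32980\t2", "chrX\t23434\t23800\t10"]
for key, total in sorted(bin_wig(lines, 100).items()):
    print(key, total)
```

Once every track has been reduced to the same set of bins, the per-bin totals of any two tracks can be compared position by position, which is what the correlation step needs.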

The program produces output files such as the following:

Pearson Correlation	chipseq_cfos_bin100.wig	chipseq_jund_bin100.wig	chipseq_max_bin100.wig	chipseq_pol2_bin100.wig
chipseq_cfos_bin100.wig	1	0.15236918091939933	0.26114716638422464	0.14500447272313652
chipseq_jund_bin100.wig	0.15236918091939933	1	0.4091055568644753	0.1422116076377861
chipseq_max_bin100.wig	0.26114716638422464	0.4091055568644753	1	0.3309740106844516
chipseq_pol2_bin100.wig	0.14500447272313652	0.1422116076377861	0.3309740106844516	1

The resulting correlation matrix can be plotted as a heatmap using R or as a dendrogram using PHYLIP.
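Before plotting, the tab-delimited output can be loaded into a nested dictionary. The sketch below assumes the layout shown above (a header row of track names, then one row per track); the function name read_correlation_matrix is hypothetical and the values are shortened from the example output.

```python
def read_correlation_matrix(text):
    """Parse tab-delimited correlation output into matrix[row][column]."""
    rows = [line.split("\t") for line in text.strip().splitlines()]
    header = rows[0][1:]  # first cell is the method label, e.g. "Pearson Correlation"
    matrix = {}
    for row in rows[1:]:
        matrix[row[0]] = {name: float(value) for name, value in zip(header, row[1:])}
    return matrix


sample = (
    "Pearson Correlation\tchipseq_cfos_bin100.wig\tchipseq_jund_bin100.wig\n"
    "chipseq_cfos_bin100.wig\t1\t0.1524\n"
    "chipseq_jund_bin100.wig\t0.1524\t1\n"
)
matrix = read_correlation_matrix(sample)
print(matrix["chipseq_cfos_bin100.wig"]["chipseq_jund_bin100.wig"])
```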

  • Saturation

The saturation script allows us to determine the saturation level of a given feature after multiple genomic experiments.

Each input file corresponds to one experimental condition (e.g. one new individual), and each line in a file specifies a genomic location that has the biological phenomenon under study (e.g. tagged SNPs). Our implementation makes use of special data structures to avoid redundant counting. It normally takes less than a minute to generate the plot for up to 30 input files, each with a few thousand lines. To handle more files and files with more lines, the tool also provides an option to compute the coverage of a random sample of the input file combinations.

It produces saturation plots from a set of binary data files. Each line of the input file contains a genomic region in the following format:

<ID><tab><start><tab><end>

where <ID> is the identifier of the region-at-large, such as the chromosome; <start> is the starting position of the region; and <end> is the ending position of the region (this position is inside the region).

The y-axis could be the absolute number of nucleotides, or a fraction of an input total number of nucleotides, such as the total number of nucleotides of the coding transcripts in the example. To use the absolute number, input the total as 0.

It produces output files such as the one below:

Number of features	25 percentile	Median	75 percentile
1	38511.0	43483.0	46449.0
2	59867.0	63846.0	76010.0
3	74606.0	83660.0	91301.0
4	86739.0	98260.0	104150.0
5	100967.0	108464.0	114645.0
6	111935.0	117218.0	123894.0
7	120987.0	126453.0	131483.0
8	129865.0	135329.0	139093.0
9	139195.0	143462.0	144009.0
10	147732.0	147732.0	147732.0

In addition, it produces a saturation plot.
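The shape of the table above can be reproduced with a brute-force Python sketch: for every number of input files k, it takes every combination of k files, counts the nucleotides covered by their union, and reports the 25th percentile, median, and 75th percentile of those counts. This is not the tool's implementation, which uses special data structures to avoid redundant counting; the function names, the linear-interpolation percentile, and the toy regions are all assumptions made for illustration.

```python
from itertools import combinations


def covered_positions(regions):
    """regions: iterable of (region_id, start, end), end inclusive."""
    covered = set()
    for region_id, start, end in regions:
        covered.update((region_id, pos) for pos in range(start, end + 1))
    return covered


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list."""
    if len(sorted_values) == 1:
        return float(sorted_values[0])
    rank = fraction * (len(sorted_values) - 1)
    low = int(rank)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (rank - low)


def saturation_table(files, total=0):
    """Return rows of (k, 25th percentile, median, 75th percentile).

    files is a list of region lists, one per experimental condition.
    If total is 0 the absolute number of covered nucleotides is
    reported; otherwise coverage is given as a fraction of total.
    """
    position_sets = [covered_positions(regions) for regions in files]
    scale = float(total) if total else 1.0
    table = []
    for k in range(1, len(position_sets) + 1):
        coverages = sorted(
            len(set().union(*combo)) / scale
            for combo in combinations(position_sets, k)
        )
        table.append((k,) + tuple(percentile(coverages, f) for f in (0.25, 0.5, 0.75)))
    return table


# Toy input: three files with one region each.
files = [
    [("chr1", 1, 10)],
    [("chr1", 6, 15)],
    [("chr1", 21, 30)],
]
for row in saturation_table(files):
    print(row)
```

As in the example output, the last row has identical values in all three columns because there is only one way to combine all input files. The number of combinations grows quickly with the number of files, which is why sampling a random subset of combinations becomes useful for larger inputs.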

web-act

Information about input signal tracks can be found here: http://act.gersteinlab.org/sigfile_readme.htm

  • Aggregation

Information about the aggregation web tool can be found here: http://act.gersteinlab.org/agg_readme.htm

  • GSA

In the GSA, input signals are assigned to the nearest anchor in order to reduce the artifacts caused by subsets of anchors clustering together. For aggregation analysis of segment signals, such as sequencing tags, this requires the genomic scope of the experimental setup to be partitioned and accounted for as denominators. The techniques were used and proved effective in the ENCODE pilot and CTCF nucleosome papers.

Please see http://act.gersteinlab.org/gsa.tar for more information.
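The core idea of assigning each signal to its nearest anchor can be sketched in Python as follows. This is not the GSA code from gsa.tar: the function name nearest_anchor is hypothetical, anchors are assumed to be sorted coordinates on a single chromosome, and ties are broken toward the lower coordinate as an arbitrary choice.

```python
from bisect import bisect_left


def nearest_anchor(anchors, position):
    """Return the anchor closest to position.

    anchors must be a non-empty sorted list of coordinates on one
    chromosome. Ties go to the lower coordinate (an assumption of
    this sketch).
    """
    i = bisect_left(anchors, position)
    if i == 0:
        return anchors[0]
    if i == len(anchors):
        return anchors[-1]
    before, after = anchors[i - 1], anchors[i]
    return before if position - before <= after - position else after


anchors = [100, 500, 520]
for signal_position in (50, 290, 515, 1000):
    print(signal_position, nearest_anchor(anchors, signal_position))
```

Because each signal is counted for exactly one anchor, closely spaced anchors such as 500 and 520 no longer each collect the same nearby signals, which is the clustering artifact the GSA is meant to reduce.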

  • Correlation

Information about the correlation web tool can be found here: http://act.gersteinlab.org/corr_readme.htm

Other

  • Citation

A paper describing this site and software is currently in preparation. In the meantime, please reference act.gersteinlab.org if you use the tool.

  • RSeq Tools

RSeq Tools is a suite of tools for analyzing RNA-Seq experiment data. One of its functions is to generate a signal track from experimental data, which can be fed into ACT. RSeq Tools can be found here: http://archive.gersteinlab.org/proj/rnaseq/rseqtools/
