AlleleSeq

From GersteinInfo

Revision as of 16:37, 7 June 2013 by Public (Talk | contribs)
Jump to: navigation, search

Contents


General outline of pipeline

The basic goal of the pipeline is to take a large collection of reads generated from ChIP-seq or RNA-seq experiments associated with an individual and detect single nucleotide variants (SNVs) that correspond to significantly skewed number of reads. To do this, the pipeline starts with a preprocessing step, before the actual process.

(1) Pre-processing - diploid genome construction using vcf2diploid
In the Rozowsky et al. (2011) paper, the pre-processing step separate (phase) the child's diploid genome into its parental haplotypes based on the sequences of the parents.

(2) AlleleSeq pipeline - mapping and statistical testing using PIPELINE.mk package
a) Reads from ChIP-seq and RNA-seq experiments are aligned and mapped to both haplotype genomes. b) Then for each SNV position with mapped reads, we compare the allele frequencies observed in the two parental haplotypes.

vcf2diploid

Essentially, it constructs a personal genome integrating the the variants from the parents and child to the reference genome.

Installation

1. Download the tool.
2. Type

$make

Usage

java -Xmx10000m -jar vcf2diploid.jar -id sample_id -chr file1.fa file2.fa ... [-vcf file1.vcf file2.vcf ...] > logfile.txt
OPTIONS:
id          - (required) the ID of individual whose genome is being constructed (e.g., NA12878). The tool recognizes by this ID in the VCF file 
chr - (required) FASTA file(s) of reference sequence(s)
vcf - (required) VCF4.0 file(s) containing variants from parents and the individual
Xmx - max memory allocation for JAVA. In this example, 10GB was allocated. logfile.txt - stores the standard output produce from the run

Output

(a) Maternal and paternal FASTA files
These are the references used for the AlleleSeq pipeline.

(b) Map files
These are coordinate files that correspond to the variants on the parental genomes and the reference genome. This is especially important when insertions and deletions are included in the construction of the diploid genome, since the positions go out of sync in the personal and reference genomes.

(c) Chain files
Using the chain file, one can use the LiftOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

Please refer to the README of vcf2diploid for a more detailed description.

AlleleSeq Pipeline

Installation

AlleleSeq runs on LINUX/UNIX command line, and requires the following dependencies (at least the version stated or newer): (a) Python v2.5.1 (b) Bowtie v0.10.0.2 (Bowtie must be in your PATH)

For now, only Bowtie is supported. Some other aligners might be supported in future versions.

Input

(a) FASTA files for the paternal and maternal reference genomes.

(b) One or more collections of unmapped reads; zipped fastq formats "*.fastq.gz".
This is a typical file format for ChIP-seq and RNA-seq reads generated from Illumina. All files matching "*.fastq.gz" will be combined into one dataset.

---NOTE: If you have something other than fastq.gz, e.g. BAM, you will need to add a preliminary conversion step. For instance, bedtools or SAMtools can convert BAM to fastq. Currently, AlleleSeq does not automatically convert them.

---SANITY CHECK: you can use the make target "check" to print out the set of input fastq files. To do this, type "make check"; this should print the source files.

(c) SNV genotype file; tab-delimited file of a set of SNV locations for each of the trio: father, mother and child. AlleleSeq expects this format, with the header:

chr	pos	ref	Mat	Pat	Child	Phase
1	114116078	A	TT	TT	TA	MUTANT
1	114117743	G	CG	GG	GG	HOMO
1	114120250	C	AC	AA	CA	PHASED
1	114120756	A	CA	CC	AC	PHASED
col1: chromosome
col2: chromosomal position
col3: reference allele
col4: maternal genotype
col5: paternal genotype
col6: child (subject) genotype, ordered: MatPat if phased
col7: phasing status describes the phasing in the trio.  
     Possible values are:
     HETERO (all three heterozygous, no phasing possible)
     HOMO   (child is homozygous)
     PHASED (child heterozygous and at least one parent homozygous)
     MUTANT (child genotype is inconsistent with parents' genotype) 

This can be generated by the user or converted from VCF file with a simple helper PERL script provided together with the AlleleSeq package, vcf2snp.

vcf2snp -h for more information on running.

(d) CNV file; a set of normalized read depth values for all snp locations from a separate genomic sequencing experiment. This reports the read depth at that snp compared to overall coverage. This is used to filter out locations with very low or high coverage, which would tend to indicate copy number variation. The file should be in this format (with header):

chrm    snppos  rd
1       52066   0.902113
1       695745  0.909802
1       742429  0.976435

This can be generated from CNVnator (Abyzov et al. 2011).

(e) (optional) for ChIP-seq experiments, a set of "binding sites" for the transcription factor of interest, showing regions in the genome where the transcription factor binds. This will allow filtering of SNV locations to just those within binding sites. If BINDINGSITES is set to a non-existing file, no filtering will be done.

Typically, we used PeakSeq (Rozowsky J. et al. 2009) to determine binding regions. The pipeline expects a BED file (tab-delimited file with no header in this format: chr, start, end; only the first three columns will be read)

The format of chrom names must agree with the other files (usually chr#).

In PIPELINE.mk, set BNDS to this file. You can set it to an empty file, in which case depth will be set to 1.0 and no filtering will be done.

Usage

Personal tools