AlleleSeq

From GersteinInfo

(Difference between revisions)
Jump to: navigation, search
(General outline of pipeline)
(Usage)
 
(22 intermediate revisions not shown)
Line 22: Line 22:
=vcf2diploid=
=vcf2diploid=
 +
Essentially, it constructs a personal genome integrating the the variants from the parents and child to the reference genome.
 +
==Installation==
-
=AlleleSeq pipeline=
+
1. Download the tool. <br>
 +
2. Type
 +
$make
 +
 
 +
==Usage==
 +
java -Xmx10000m -jar vcf2diploid.jar -id sample_id -chr file1.fa file2.fa ... [-vcf file1.vcf file2.vcf ...] > logfile.txt
 +
 
 +
OPTIONS:
 +
id          - (required) the ID of individual whose genome is being constructed (e.g., NA12878). The tool recognizes by this ID in the VCF file <br>
 +
chr        - (required) FASTA file(s) of reference sequence(s) <br>
 +
vcf        - (required) VCF4.0 file(s) containing variants from parents and the individual <br>
 +
 +
Xmx        - max memory allocation for JAVA. In this example, 10GB was allocated.
 +
logfile.txt - stores the standard output produce from the run
 +
 
 +
==Output==
 +
'''(a) Maternal and paternal FASTA files''' <br>
 +
These are the references used for the AlleleSeq pipeline. <br><br>
 +
 
 +
'''(b) Map files''' <br>
 +
These are coordinate files that correspond to the variants on the parental genomes and
 +
the reference genome. This is especially important when insertions and deletions
 +
are included in the construction of the diploid genome, since the positions go out of sync in the personal and reference genomes.<br><br>
 +
 
 +
'''(c) Chain files''' <br>
 +
Using the chain file, one can use the [http://genome.ucsc.edu/goldenPath/help/chain.html LiftOver] tool to convert the annotation
 +
coordinates from reference genome to personal haplotypes. <br><br>
 +
 
 +
Please refer to the README of vcf2diploid for a more detailed description.<br>
 +
 
 +
=AlleleSeq Pipeline=
 +
 
 +
==Installation==
 +
AlleleSeq runs on LINUX/UNIX command line, and requires the following dependencies
 +
(at least the version stated or newer):
 +
(a) Python v2.5.1
 +
(b) Bowtie v0.10.0.2 (Bowtie must be in your PATH)
 +
 
 +
For now, only Bowtie is supported. Some other aligners might be supported in future versions.
 +
 
 +
==Input==
 +
'''(a)''' FASTA files for the paternal and maternal reference genomes. <br><br>
 +
 
 +
'''(b)''' One or more collections of unmapped reads; zipped fastq formats "*.fastq.gz". <br>
 +
This is a typical file format for ChIP-seq and RNA-seq reads generated from Illumina.   
 +
All files matching "*.fastq.gz" will be combined into one dataset. <br><br>
 +
 
 +
---NOTE: If you have something other than fastq.gz, e.g. BAM, you will need to add
 +
a preliminary conversion step. For instance, bedtools or SAMtools can convert BAM
 +
to fastq. Currently, AlleleSeq does not automatically convert them. <br><br>
 +
 
 +
---SANITY CHECK: you can use the make target "check" to print out the set of
 +
input fastq files. To do this, type "make check"; this should print the source files. <br><br>
 +
 
 +
'''(c)''' SNV genotype file; tab-delimited file of a set of SNV locations for each of
 +
the trio: father, mother and child. AlleleSeq expects this format: <br>
 +
 
 +
1 114116078 A TT TT TA MUTANT
 +
1 114117743 G CG GG GG HOMO
 +
1 114120250 C AC AA CA PHASED
 +
1 114120756 A CA CC AC PHASED
 +
 
 +
col1: chromosome
 +
col2: chromosomal position
 +
col3: reference allele
 +
col4: maternal genotype
 +
col5: paternal genotype
 +
col6: child (subject) genotype, ordered: MatPat if phased
 +
col7: phasing status describes the phasing in the trio. 
 +
      Possible values are:
 +
      HETERO (all three heterozygous, no phasing possible)
 +
      HOMO  (child is homozygous)
 +
      PHASED (child heterozygous and at least one parent homozygous)
 +
      MUTANT (child genotype is inconsistent with parents' genotype) <br>
 +
 
 +
This can be generated by the user or converted from VCF file with a simple helper
 +
PERL script provided together with the AlleleSeq package, vcf2snp. <br><br>
 +
 
 +
vcf2snp -h for more information on running.<br><br>
 +
 
 +
'''(d)''' CNV file; a set of normalized read depth values for all snp locations from a
 +
separate genomic sequencing experiment.  This reports the read depth at that snp
 +
compared to overall coverage. This is used to filter out locations with very low
 +
or high coverage, which would tend to indicate copy number variation.  The file
 +
should be in this format (with header): <br>
 +
 
 +
chrm    snppos  rd
 +
1      52066  0.902113
 +
1      695745  0.909802
 +
1      742429  0.976435
 +
 
 +
This can be generated from CNVnator (Abyzov et al. 2011).
 +
A set of scripts has been provided [http://alleleseq.gersteinlab.org/tools.html here] to help with the conversion process using CNVnator. It is by no means the only way. <br><br>
 +
 
 +
'''(e)''' (optional) for ChIP-seq experiments, a set of "binding sites" for the
 +
transcription factor of interest, showing regions in the genome where the
 +
transcription factor binds.  This will allow filtering of SNV locations to
 +
just those within binding sites. If BINDINGSITES is set to a non-existing file,
 +
no filtering will be done. <br>
 +
 
 +
Typically, we used PeakSeq (Rozowsky J. et al. 2009) to determine binding regions.
 +
The pipeline expects a BED file (tab-delimited file with no header in this format:
 +
chr, start, end; only the first three columns will be read) <br>
 +
 
 +
The format of chrom names must agree with the other files (usually chr#). <br>
 +
 
 +
In PIPELINE.mk, set BNDS to this file.  You can set it to an empty file, in which
 +
case depth will be set to 1.0 and no filtering will be done. <br>
 +
 
 +
==Usage==
 +
 
 +
'''STEP 1: Bowtie pre-processing for maternal (mat) and paternal (pat) genomes''' <br>
 +
This step is to build the index for Bowtie. AlleleSeq uses Bowtie by default.
 +
In principle, any aligner can be used. But the paths and parameters will have
 +
to be changed in PIPELINE.mk. <br><br>
 +
 
 +
(i) Create 2 folders, each for the mat and pat genomes. <br>
 +
(ii) In each pat or mat folder, the separate mat and pat FASTA files created from pre-processing (Section 3) should be soft-linked here. <br>
 +
(iii) Build the Bowtie indices, for instance, if an individual NA1 has only three chromosomes: <br>
 +
 
 +
bowtie-build 1_NA1_paternal.fa,2_NA1_paternal.fa,3_NA1_paternal.fa PatRef
 +
 
 +
--NOTE: the chromosome number is NOT prefixed by "chr". <br>
 +
 +
'''STEP 2: PIPELINE.mk''' <br>
 +
The actual pipeline driver is a makefile, "PIPELINE.mk". <br>
 +
 
 +
To run:
 +
 
 +
make -f /path/to/PIPELINE.mk
 +
 
 +
PIPELINE.mk contains the following parameters:<br>
 +
 
 +
BASE=/home/alleleSeq_run ## location of data and results <br>
 +
PL:=/home/AlleleSeq_v1.1 ## location of AlleleSeq tool <br>
 +
SNPS:=$(BASE)/snp.txt ## SNV genotype file of trio <br>
 +
CNVS:=$(BASE)/rd.txt ## CNV file for read depth <br>
 +
BNDS:=hits.bed ## binding site file; if NA, put any filename as placeholder <br>
 +
MAPS:=$(BASE)/personal_genome/%s_NA12878.map ## location of MAP files from vcf2diploid pre-processing <br>
 +
FDR_SIMS:=5 ## reads cut-off for simulations <br>
 +
FDR_CUTOFF:=0.1 ## false discovery rate q-value cutoff <br>
 +
 
 +
==Output==
 +
(a) counts.txt: Contains count information for all heterozygous SNV locations that have the minimum number of reads stipulated. <br>
 +
(b) counts.log: Contains information about why each SNV was judged significant. <br>
 +
(c) FDR.txt: the p-values and the corresponding q-values from the simulations <br>
 +
(d) interestingHets.txt:  This file contains information about all SNV locations
 +
that passed all these tests: <br>
 +
- in a binding site <br>
 +
- within range for cnv test <br>
 +
- significantly assymetric by FDR test <br>

Latest revision as of 21:29, 13 February 2014

Contents


General outline of pipeline

The basic goal of the pipeline is to take a large collection of reads generated from ChIP-seq or RNA-seq experiments associated with an individual and detect single nucleotide variants (SNVs) that correspond to significantly skewed number of reads. To do this, the pipeline starts with a preprocessing step, before the actual process.

(1) Pre-processing - diploid genome construction using vcf2diploid
In the Rozowsky et al. (2011) paper, the pre-processing step separate (phase) the child's diploid genome into its parental haplotypes based on the sequences of the parents.

(2) AlleleSeq pipeline - mapping and statistical testing using PIPELINE.mk package
a) Reads from ChIP-seq and RNA-seq experiments are aligned and mapped to both haplotype genomes. b) Then for each SNV position with mapped reads, we compare the allele frequencies observed in the two parental haplotypes.

vcf2diploid

Essentially, it constructs a personal genome integrating the the variants from the parents and child to the reference genome.

Installation

1. Download the tool.
2. Type

$make

Usage

java -Xmx10000m -jar vcf2diploid.jar -id sample_id -chr file1.fa file2.fa ... [-vcf file1.vcf file2.vcf ...] > logfile.txt
OPTIONS:
id          - (required) the ID of individual whose genome is being constructed (e.g., NA12878). The tool recognizes by this ID in the VCF file 
chr - (required) FASTA file(s) of reference sequence(s)
vcf - (required) VCF4.0 file(s) containing variants from parents and the individual
Xmx - max memory allocation for JAVA. In this example, 10GB was allocated. logfile.txt - stores the standard output produce from the run

Output

(a) Maternal and paternal FASTA files
These are the references used for the AlleleSeq pipeline.

(b) Map files
These are coordinate files that correspond to the variants on the parental genomes and the reference genome. This is especially important when insertions and deletions are included in the construction of the diploid genome, since the positions go out of sync in the personal and reference genomes.

(c) Chain files
Using the chain file, one can use the LiftOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

Please refer to the README of vcf2diploid for a more detailed description.

AlleleSeq Pipeline

Installation

AlleleSeq runs on LINUX/UNIX command line, and requires the following dependencies (at least the version stated or newer): (a) Python v2.5.1 (b) Bowtie v0.10.0.2 (Bowtie must be in your PATH)

For now, only Bowtie is supported. Some other aligners might be supported in future versions.

Input

(a) FASTA files for the paternal and maternal reference genomes.

(b) One or more collections of unmapped reads; zipped fastq formats "*.fastq.gz".
This is a typical file format for ChIP-seq and RNA-seq reads generated from Illumina. All files matching "*.fastq.gz" will be combined into one dataset.

---NOTE: If you have something other than fastq.gz, e.g. BAM, you will need to add a preliminary conversion step. For instance, bedtools or SAMtools can convert BAM to fastq. Currently, AlleleSeq does not automatically convert them.

---SANITY CHECK: you can use the make target "check" to print out the set of input fastq files. To do this, type "make check"; this should print the source files.

(c) SNV genotype file; tab-delimited file of a set of SNV locations for each of the trio: father, mother and child. AlleleSeq expects this format:

1	114116078	A	TT	TT	TA	MUTANT
1	114117743	G	CG	GG	GG	HOMO
1	114120250	C	AC	AA	CA	PHASED
1	114120756	A	CA	CC	AC	PHASED
col1: chromosome
col2: chromosomal position
col3: reference allele
col4: maternal genotype
col5: paternal genotype
col6: child (subject) genotype, ordered: MatPat if phased
col7: phasing status describes the phasing in the trio.  
     Possible values are:
     HETERO (all three heterozygous, no phasing possible)
     HOMO   (child is homozygous)
     PHASED (child heterozygous and at least one parent homozygous)
     MUTANT (child genotype is inconsistent with parents' genotype) 

This can be generated by the user or converted from VCF file with a simple helper PERL script provided together with the AlleleSeq package, vcf2snp.

vcf2snp -h for more information on running.

(d) CNV file; a set of normalized read depth values for all snp locations from a separate genomic sequencing experiment. This reports the read depth at that snp compared to overall coverage. This is used to filter out locations with very low or high coverage, which would tend to indicate copy number variation. The file should be in this format (with header):

chrm    snppos  rd
1       52066   0.902113
1       695745  0.909802
1       742429  0.976435

This can be generated from CNVnator (Abyzov et al. 2011). A set of scripts has been provided here to help with the conversion process using CNVnator. It is by no means the only way.

(e) (optional) for ChIP-seq experiments, a set of "binding sites" for the transcription factor of interest, showing regions in the genome where the transcription factor binds. This will allow filtering of SNV locations to just those within binding sites. If BINDINGSITES is set to a non-existing file, no filtering will be done.

Typically, we used PeakSeq (Rozowsky J. et al. 2009) to determine binding regions. The pipeline expects a BED file (tab-delimited file with no header in this format: chr, start, end; only the first three columns will be read)

The format of chrom names must agree with the other files (usually chr#).

In PIPELINE.mk, set BNDS to this file. You can set it to an empty file, in which case depth will be set to 1.0 and no filtering will be done.

Usage

STEP 1: Bowtie pre-processing for maternal (mat) and paternal (pat) genomes
This step is to build the index for Bowtie. AlleleSeq uses Bowtie by default. In principle, any aligner can be used. But the paths and parameters will have to be changed in PIPELINE.mk.

(i) Create 2 folders, each for the mat and pat genomes.
(ii) In each pat or mat folder, the separate mat and pat FASTA files created from pre-processing (Section 3) should be soft-linked here.
(iii) Build the Bowtie indices, for instance, if an individual NA1 has only three chromosomes:

bowtie-build 1_NA1_paternal.fa,2_NA1_paternal.fa,3_NA1_paternal.fa PatRef

--NOTE: the chromosome number is NOT prefixed by "chr".

STEP 2: PIPELINE.mk
The actual pipeline driver is a makefile, "PIPELINE.mk".

To run:

make -f /path/to/PIPELINE.mk

PIPELINE.mk contains the following parameters:

BASE=/home/alleleSeq_run ## location of data and results
PL:=/home/AlleleSeq_v1.1 ## location of AlleleSeq tool
SNPS:=$(BASE)/snp.txt ## SNV genotype file of trio
CNVS:=$(BASE)/rd.txt ## CNV file for read depth
BNDS:=hits.bed ## binding site file; if NA, put any filename as placeholder
MAPS:=$(BASE)/personal_genome/%s_NA12878.map ## location of MAP files from vcf2diploid pre-processing
FDR_SIMS:=5 ## reads cut-off for simulations
FDR_CUTOFF:=0.1 ## false discovery rate q-value cutoff

Output

(a) counts.txt: Contains count information for all heterozygous SNV locations that have the minimum number of reads stipulated.
(b) counts.log: Contains information about why each SNV was judged significant.
(c) FDR.txt: the p-values and the corresponding q-values from the simulations
(d) interestingHets.txt: This file contains information about all SNV locations that passed all these tests:
- in a binding site
- within range for cnv test
- significantly assymetric by FDR test

Personal tools