AlleleSeq
From GersteinInfo
(→General outline of pipeline) |
(→Usage) |
||
(27 intermediate revisions not shown) | |||
Line 9: | Line 9: | ||
actual process. | actual process. | ||
- | + | '''(1) Pre-processing - diploid genome construction using ''vcf2diploid''''' <br> | |
In the Rozowsky ''et al.'' (2011) paper, the | In the Rozowsky ''et al.'' (2011) paper, the | ||
pre-processing step separate (phase) the child's diploid genome into its parental | pre-processing step separate (phase) the child's diploid genome into its parental | ||
- | haplotypes based on the sequences of the parents. | + | haplotypes based on the sequences of the parents. <br> |
- | + | '''(2) AlleleSeq pipeline - mapping and statistical testing using PIPELINE.mk package''' <br> | |
- | + | a) Reads from ChIP-seq and RNA-seq experiments are aligned and mapped to both | |
- | haplotype genomes. | + | haplotype genomes. |
- | + | b) Then for each SNV position with mapped reads, we compare the allele | |
- | + | frequencies observed in the two parental haplotypes. <br> | |
- | frequencies observed in the two parental haplotypes. | + | |
=vcf2diploid= | =vcf2diploid= | ||
+ | Essentially, it constructs a personal genome integrating the the variants from the parents and child to the reference genome. | ||
+ | |||
+ | ==Installation== | ||
+ | |||
+ | 1. Download the tool. <br> | ||
+ | 2. Type | ||
+ | $make | ||
+ | |||
+ | ==Usage== | ||
+ | java -Xmx10000m -jar vcf2diploid.jar -id sample_id -chr file1.fa file2.fa ... [-vcf file1.vcf file2.vcf ...] > logfile.txt | ||
+ | |||
+ | OPTIONS: | ||
+ | id - (required) the ID of individual whose genome is being constructed (e.g., NA12878). The tool recognizes by this ID in the VCF file <br> | ||
+ | chr - (required) FASTA file(s) of reference sequence(s) <br> | ||
+ | vcf - (required) VCF4.0 file(s) containing variants from parents and the individual <br> | ||
+ | |||
+ | Xmx - max memory allocation for JAVA. In this example, 10GB was allocated. | ||
+ | logfile.txt - stores the standard output produce from the run | ||
+ | |||
+ | ==Output== | ||
+ | '''(a) Maternal and paternal FASTA files''' <br> | ||
+ | These are the references used for the AlleleSeq pipeline. <br><br> | ||
+ | |||
+ | '''(b) Map files''' <br> | ||
+ | These are coordinate files that correspond to the variants on the parental genomes and | ||
+ | the reference genome. This is especially important when insertions and deletions | ||
+ | are included in the construction of the diploid genome, since the positions go out of sync in the personal and reference genomes.<br><br> | ||
+ | |||
+ | '''(c) Chain files''' <br> | ||
+ | Using the chain file, one can use the [http://genome.ucsc.edu/goldenPath/help/chain.html LiftOver] tool to convert the annotation | ||
+ | coordinates from reference genome to personal haplotypes. <br><br> | ||
+ | |||
+ | Please refer to the README of vcf2diploid for a more detailed description.<br> | ||
+ | |||
+ | =AlleleSeq Pipeline= | ||
+ | |||
+ | ==Installation== | ||
+ | AlleleSeq runs on LINUX/UNIX command line, and requires the following dependencies | ||
+ | (at least the version stated or newer): | ||
+ | (a) Python v2.5.1 | ||
+ | (b) Bowtie v0.10.0.2 (Bowtie must be in your PATH) | ||
+ | |||
+ | For now, only Bowtie is supported. Some other aligners might be supported in future versions. | ||
+ | |||
+ | ==Input== | ||
+ | '''(a)''' FASTA files for the paternal and maternal reference genomes. <br><br> | ||
+ | |||
+ | '''(b)''' One or more collections of unmapped reads; zipped fastq formats "*.fastq.gz". <br> | ||
+ | This is a typical file format for ChIP-seq and RNA-seq reads generated from Illumina. | ||
+ | All files matching "*.fastq.gz" will be combined into one dataset. <br><br> | ||
+ | |||
+ | ---NOTE: If you have something other than fastq.gz, e.g. BAM, you will need to add | ||
+ | a preliminary conversion step. For instance, bedtools or SAMtools can convert BAM | ||
+ | to fastq. Currently, AlleleSeq does not automatically convert them. <br><br> | ||
+ | |||
+ | ---SANITY CHECK: you can use the make target "check" to print out the set of | ||
+ | input fastq files. To do this, type "make check"; this should print the source files. <br><br> | ||
+ | |||
+ | '''(c)''' SNV genotype file; tab-delimited file of a set of SNV locations for each of | ||
+ | the trio: father, mother and child. AlleleSeq expects this format: <br> | ||
+ | |||
+ | 1 114116078 A TT TT TA MUTANT | ||
+ | 1 114117743 G CG GG GG HOMO | ||
+ | 1 114120250 C AC AA CA PHASED | ||
+ | 1 114120756 A CA CC AC PHASED | ||
+ | |||
+ | col1: chromosome | ||
+ | col2: chromosomal position | ||
+ | col3: reference allele | ||
+ | col4: maternal genotype | ||
+ | col5: paternal genotype | ||
+ | col6: child (subject) genotype, ordered: MatPat if phased | ||
+ | col7: phasing status describes the phasing in the trio. | ||
+ | Possible values are: | ||
+ | HETERO (all three heterozygous, no phasing possible) | ||
+ | HOMO (child is homozygous) | ||
+ | PHASED (child heterozygous and at least one parent homozygous) | ||
+ | MUTANT (child genotype is inconsistent with parents' genotype) <br> | ||
+ | |||
+ | This can be generated by the user or converted from VCF file with a simple helper | ||
+ | PERL script provided together with the AlleleSeq package, vcf2snp. <br><br> | ||
+ | |||
+ | vcf2snp -h for more information on running.<br><br> | ||
+ | |||
+ | '''(d)''' CNV file; a set of normalized read depth values for all snp locations from a | ||
+ | separate genomic sequencing experiment. This reports the read depth at that snp | ||
+ | compared to overall coverage. This is used to filter out locations with very low | ||
+ | or high coverage, which would tend to indicate copy number variation. The file | ||
+ | should be in this format (with header): <br> | ||
+ | |||
+ | chrm snppos rd | ||
+ | 1 52066 0.902113 | ||
+ | 1 695745 0.909802 | ||
+ | 1 742429 0.976435 | ||
+ | |||
+ | This can be generated from CNVnator (Abyzov et al. 2011). | ||
+ | A set of scripts has been provided [http://alleleseq.gersteinlab.org/tools.html here] to help with the conversion process using CNVnator. It is by no means the only way. <br><br> | ||
+ | |||
+ | '''(e)''' (optional) for ChIP-seq experiments, a set of "binding sites" for the | ||
+ | transcription factor of interest, showing regions in the genome where the | ||
+ | transcription factor binds. This will allow filtering of SNV locations to | ||
+ | just those within binding sites. If BINDINGSITES is set to a non-existing file, | ||
+ | no filtering will be done. <br> | ||
+ | |||
+ | Typically, we used PeakSeq (Rozowsky J. et al. 2009) to determine binding regions. | ||
+ | The pipeline expects a BED file (tab-delimited file with no header in this format: | ||
+ | chr, start, end; only the first three columns will be read) <br> | ||
+ | |||
+ | The format of chrom names must agree with the other files (usually chr#). <br> | ||
+ | |||
+ | In PIPELINE.mk, set BNDS to this file. You can set it to an empty file, in which | ||
+ | case depth will be set to 1.0 and no filtering will be done. <br> | ||
+ | |||
+ | ==Usage== | ||
+ | |||
+ | '''STEP 1: Bowtie pre-processing for maternal (mat) and paternal (pat) genomes''' <br> | ||
+ | This step is to build the index for Bowtie. AlleleSeq uses Bowtie by default. | ||
+ | In principle, any aligner can be used. But the paths and parameters will have | ||
+ | to be changed in PIPELINE.mk. <br><br> | ||
+ | |||
+ | (i) Create 2 folders, each for the mat and pat genomes. <br> | ||
+ | (ii) In each pat or mat folder, the separate mat and pat FASTA files created from pre-processing (Section 3) should be soft-linked here. <br> | ||
+ | (iii) Build the Bowtie indices, for instance, if an individual NA1 has only three chromosomes: <br> | ||
+ | |||
+ | bowtie-build 1_NA1_paternal.fa,2_NA1_paternal.fa,3_NA1_paternal.fa PatRef | ||
+ | |||
+ | --NOTE: the chromosome number is NOT prefixed by "chr". <br> | ||
+ | |||
+ | '''STEP 2: PIPELINE.mk''' <br> | ||
+ | The actual pipeline driver is a makefile, "PIPELINE.mk". <br> | ||
+ | |||
+ | To run: | ||
+ | |||
+ | make -f /path/to/PIPELINE.mk | ||
+ | |||
+ | PIPELINE.mk contains the following parameters:<br> | ||
+ | BASE=/home/alleleSeq_run ## location of data and results <br> | ||
+ | PL:=/home/AlleleSeq_v1.1 ## location of AlleleSeq tool <br> | ||
+ | SNPS:=$(BASE)/snp.txt ## SNV genotype file of trio <br> | ||
+ | CNVS:=$(BASE)/rd.txt ## CNV file for read depth <br> | ||
+ | BNDS:=hits.bed ## binding site file; if NA, put any filename as placeholder <br> | ||
+ | MAPS:=$(BASE)/personal_genome/%s_NA12878.map ## location of MAP files from vcf2diploid pre-processing <br> | ||
+ | FDR_SIMS:=5 ## reads cut-off for simulations <br> | ||
+ | FDR_CUTOFF:=0.1 ## false discovery rate q-value cutoff <br> | ||
- | = | + | ==Output== |
+ | (a) counts.txt: Contains count information for all heterozygous SNV locations that have the minimum number of reads stipulated. <br> | ||
+ | (b) counts.log: Contains information about why each SNV was judged significant. <br> | ||
+ | (c) FDR.txt: the p-values and the corresponding q-values from the simulations <br> | ||
+ | (d) interestingHets.txt: This file contains information about all SNV locations | ||
+ | that passed all these tests: <br> | ||
+ | - in a binding site <br> | ||
+ | - within range for cnv test <br> | ||
+ | - significantly assymetric by FDR test <br> |
Latest revision as of 21:29, 13 February 2014
Contents |
General outline of pipeline
The basic goal of the pipeline is to take a large collection of reads generated from ChIP-seq or RNA-seq experiments associated with an individual and detect single nucleotide variants (SNVs) that correspond to significantly skewed number of reads. To do this, the pipeline starts with a preprocessing step, before the actual process.
(1) Pre-processing - diploid genome construction using vcf2diploid
In the Rozowsky et al. (2011) paper, the
pre-processing step separate (phase) the child's diploid genome into its parental
haplotypes based on the sequences of the parents.
(2) AlleleSeq pipeline - mapping and statistical testing using PIPELINE.mk package
a) Reads from ChIP-seq and RNA-seq experiments are aligned and mapped to both
haplotype genomes.
b) Then for each SNV position with mapped reads, we compare the allele
frequencies observed in the two parental haplotypes.
vcf2diploid
Essentially, it constructs a personal genome integrating the the variants from the parents and child to the reference genome.
Installation
1. Download the tool.
2. Type
$make
Usage
java -Xmx10000m -jar vcf2diploid.jar -id sample_id -chr file1.fa file2.fa ... [-vcf file1.vcf file2.vcf ...] > logfile.txt
OPTIONS: id - (required) the ID of individual whose genome is being constructed (e.g., NA12878). The tool recognizes by this ID in the VCF file
chr - (required) FASTA file(s) of reference sequence(s)
vcf - (required) VCF4.0 file(s) containing variants from parents and the individual
Xmx - max memory allocation for JAVA. In this example, 10GB was allocated. logfile.txt - stores the standard output produce from the run
Output
(a) Maternal and paternal FASTA files
These are the references used for the AlleleSeq pipeline.
(b) Map files
These are coordinate files that correspond to the variants on the parental genomes and
the reference genome. This is especially important when insertions and deletions
are included in the construction of the diploid genome, since the positions go out of sync in the personal and reference genomes.
(c) Chain files
Using the chain file, one can use the LiftOver tool to convert the annotation
coordinates from reference genome to personal haplotypes.
Please refer to the README of vcf2diploid for a more detailed description.
AlleleSeq Pipeline
Installation
AlleleSeq runs on LINUX/UNIX command line, and requires the following dependencies (at least the version stated or newer): (a) Python v2.5.1 (b) Bowtie v0.10.0.2 (Bowtie must be in your PATH)
For now, only Bowtie is supported. Some other aligners might be supported in future versions.
Input
(a) FASTA files for the paternal and maternal reference genomes.
(b) One or more collections of unmapped reads; zipped fastq formats "*.fastq.gz".
This is a typical file format for ChIP-seq and RNA-seq reads generated from Illumina.
All files matching "*.fastq.gz" will be combined into one dataset.
---NOTE: If you have something other than fastq.gz, e.g. BAM, you will need to add
a preliminary conversion step. For instance, bedtools or SAMtools can convert BAM
to fastq. Currently, AlleleSeq does not automatically convert them.
---SANITY CHECK: you can use the make target "check" to print out the set of
input fastq files. To do this, type "make check"; this should print the source files.
(c) SNV genotype file; tab-delimited file of a set of SNV locations for each of
the trio: father, mother and child. AlleleSeq expects this format:
1 114116078 A TT TT TA MUTANT 1 114117743 G CG GG GG HOMO 1 114120250 C AC AA CA PHASED 1 114120756 A CA CC AC PHASED
col1: chromosome col2: chromosomal position col3: reference allele col4: maternal genotype col5: paternal genotype col6: child (subject) genotype, ordered: MatPat if phased col7: phasing status describes the phasing in the trio. Possible values are: HETERO (all three heterozygous, no phasing possible) HOMO (child is homozygous) PHASED (child heterozygous and at least one parent homozygous) MUTANT (child genotype is inconsistent with parents' genotype)
This can be generated by the user or converted from VCF file with a simple helper
PERL script provided together with the AlleleSeq package, vcf2snp.
vcf2snp -h for more information on running.
(d) CNV file; a set of normalized read depth values for all snp locations from a
separate genomic sequencing experiment. This reports the read depth at that snp
compared to overall coverage. This is used to filter out locations with very low
or high coverage, which would tend to indicate copy number variation. The file
should be in this format (with header):
chrm snppos rd 1 52066 0.902113 1 695745 0.909802 1 742429 0.976435
This can be generated from CNVnator (Abyzov et al. 2011).
A set of scripts has been provided here to help with the conversion process using CNVnator. It is by no means the only way.
(e) (optional) for ChIP-seq experiments, a set of "binding sites" for the
transcription factor of interest, showing regions in the genome where the
transcription factor binds. This will allow filtering of SNV locations to
just those within binding sites. If BINDINGSITES is set to a non-existing file,
no filtering will be done.
Typically, we used PeakSeq (Rozowsky J. et al. 2009) to determine binding regions.
The pipeline expects a BED file (tab-delimited file with no header in this format:
chr, start, end; only the first three columns will be read)
The format of chrom names must agree with the other files (usually chr#).
In PIPELINE.mk, set BNDS to this file. You can set it to an empty file, in which
case depth will be set to 1.0 and no filtering will be done.
Usage
STEP 1: Bowtie pre-processing for maternal (mat) and paternal (pat) genomes
This step is to build the index for Bowtie. AlleleSeq uses Bowtie by default.
In principle, any aligner can be used. But the paths and parameters will have
to be changed in PIPELINE.mk.
(i) Create 2 folders, each for the mat and pat genomes.
(ii) In each pat or mat folder, the separate mat and pat FASTA files created from pre-processing (Section 3) should be soft-linked here.
(iii) Build the Bowtie indices, for instance, if an individual NA1 has only three chromosomes:
bowtie-build 1_NA1_paternal.fa,2_NA1_paternal.fa,3_NA1_paternal.fa PatRef
--NOTE: the chromosome number is NOT prefixed by "chr".
STEP 2: PIPELINE.mk
The actual pipeline driver is a makefile, "PIPELINE.mk".
To run:
make -f /path/to/PIPELINE.mk
PIPELINE.mk contains the following parameters:
BASE=/home/alleleSeq_run ## location of data and results
PL:=/home/AlleleSeq_v1.1 ## location of AlleleSeq tool
SNPS:=$(BASE)/snp.txt ## SNV genotype file of trio
CNVS:=$(BASE)/rd.txt ## CNV file for read depth
BNDS:=hits.bed ## binding site file; if NA, put any filename as placeholder
MAPS:=$(BASE)/personal_genome/%s_NA12878.map ## location of MAP files from vcf2diploid pre-processing
FDR_SIMS:=5 ## reads cut-off for simulations
FDR_CUTOFF:=0.1 ## false discovery rate q-value cutoff
Output
(a) counts.txt: Contains count information for all heterozygous SNV locations that have the minimum number of reads stipulated.
(b) counts.log: Contains information about why each SNV was judged significant.
(c) FDR.txt: the p-values and the corresponding q-values from the simulations
(d) interestingHets.txt: This file contains information about all SNV locations
that passed all these tests:
- in a binding site
- within range for cnv test
- significantly assymetric by FDR test