Gerstein Lab Research - 22 Jul. 2013
Mining Personal Genomes
The number of personal genomes sequenced is expected to increase rapidly in the next few years. Soon, sequencing one’s own genome may become as commonplace in medical care as getting an X-ray. Moreover, an individual’s window into biological science will increasingly be through the lens of his or her own genome. Addressing this, my laboratory focuses on integrating personal genomes with other biological data and on developing methods to assist in their interpretation. This endeavor has a number of aspects.
Human Genome Variation
First, we are very involved in the search for variants in personal genomes. We focus on a particular type of variant, structural variation, which involves the re-arrangement of blocks in the genome. It is believed that structural variants involve as many nucleotides in the genome as the better-known single-nucleotide polymorphisms, or SNPs (Mills et al., 2011; Korbel et al., 2008). We have developed a number of approaches for identifying structural variants in genomes. These involve looking at the consistency of read coverage over the genome (read depth), searching for special reads that split over breakpoints (split reads), analyzing unusual pair separations in paired end reads (PEM), and identifying and studying instances of fusion genes (Abyzov et al., 2011a,b; Korbel et al., 2009; Lam et al., 2010; Sboner et al., 2010b). Much of this work has taken place in the context of large international consortia, such as the 1000 Genomes Project, as well as in disease-focused programs such as those related to prostate cancer (Sboner et al., 2010a; Berger et al., 2011; Lin et al., 2013).
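As a rough illustration of the read-depth idea mentioned above, the following sketch flags runs of fixed-size windows whose coverage departs from the genome-wide median. It is a simplified stand-in written for this description, not the lab's published method: the function name, the window representation, and the `low`/`high` ratio thresholds are all assumptions made for the example.

```python
from statistics import median


def call_depth_segments(window_depth, low=0.6, high=1.4):
    """Flag runs of windows whose depth deviates from the median depth.

    window_depth: list of read counts, one per fixed-size genomic window.
    low, high: hypothetical ratio thresholds, chosen only for illustration.
    Returns (start_window, end_window_exclusive, label) tuples.
    """
    baseline = median(window_depth)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")
    segments = []
    run_start, run_label = None, None
    for i, depth in enumerate(window_depth):
        ratio = depth / baseline
        label = "loss" if ratio < low else "gain" if ratio > high else None
        if label != run_label:
            # Close the previous run (if it was flagged) and start a new one.
            if run_label is not None:
                segments.append((run_start, i, run_label))
            run_start, run_label = i, label
    if run_label is not None:
        segments.append((run_start, len(window_depth), run_label))
    return segments


if __name__ == "__main__":
    depths = [100, 98, 102, 50, 48, 101, 99, 150, 155, 100]
    print(call_depth_segments(depths))
```

A real caller would also correct for GC content and mappability and would use a statistical segmentation rather than fixed thresholds; the sketch only shows why consistent coverage is informative.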
Human Genome Annotation
After one has all of the variants in a personal genome, the next step is to attempt to understand what they mean. This often takes place through genome annotation, which provides biochemical and evolutionary context to each base. We are very involved in the international genome annotation efforts carried out by the ENCODE Consortium. We focus on a number of annotations in the genome, principally transcription-factor binding sites, non-coding RNAs (ncRNAs), and pseudogenes. In relation to the latter, we have developed numerous methods for identifying pseudogenes in the genome (Zhang et al., 2006). We consider these as genomic fossils that provide abundant details about human molecular history -- much more so than our genes, particularly when they are compared to pseudogenes in other organisms (Gerstein & Zheng, 2006). We were one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, illustrating the very different pseudogene complements in different organisms (Zhang et al., 2002a,b, 2003, 2004; Harrison et al., 2001, 2002a,c, 2003a,b; Zhang & Gerstein, 2003c,e; Liu et al., 2004a; Lam et al., 2008; Pseudogene.org). Moreover, we have found hints that some of the supposedly "dead" pseudogenes may actually harbour biochemical activity (Zheng et al., 2005, 2007a,b; Harrison et al., 2005; Pei et al., 2012; Sasidharan & Gerstein, 2008). In recent years, we have increasingly worked on ncRNAs. We have analyzed selective constraints on these in the context of data generated as part of the 1000 Genomes Project (Mu et al., 2011). We have developed a number of tools to process tiling arrays and next-generation sequencing to identify regions of intergenic transcription, which are often called transcriptionally active regions (TARs). Among these is RSEQtools for processing RNA-seq data (Habegger et al., 2011).
In addition, we have developed a tool set for integrating these data to find novel non-coding RNAs (Lu et al., 2011; incRNA). In relation to transcription-factor binding sites, we have developed methods for finding these elements by processing ChIP sequencing data, using the level of binding to statistically predict the expression of target genes, and putting the results into the framework of a network (see below; Zhang et al., 2008; Rozowsky et al., 2009; Yip et al., 2012; Gerstein et al., 2010, 2012; Cheng et al., 2011a,b, 2012).
Analysis of Networks
This leads into the next research topic in the laboratory, the interpretation of networks. Here we try to determine how many genes can act together as a unified system. A first step is identifying key points such as hubs and bottlenecks (Yu et al., 2004b, 2006, 2007). One of the most powerful aspects of the network representation is the fact that it can be applied to many different types of data, whether that data is biological or not. Thus, in addition to looking at transcription factor regulatory networks, we have also investigated protein-protein interactions and metabolic pathways. Moreover, as people have much more intuition for commonplace networks, such as those in social and computer systems, we have found that cross-disciplinary comparisons can help to elucidate system-level properties of biological networks (Yan et al., 2010; Bhardwaj et al., 2010, 2011a; Shou et al., 2011). Furthermore, we have developed a number of generic tools to build and analyze networks derived from genes and other forms of data in a consistent fashion (Douglas et al., 2005; Xia et al., 2004; Yu et al., 2004b, 2006; Yip et al., 2006; tYNA.gersteinlab.org, PubNet.gersteinlab.org). Because they are a fairly generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic expression data, 3D-protein structures, and even satellite imagery. In particular, using expression data, we have identified the transient nature of hubs and systematic patterns of connectivity rewiring in the regulatory network (Luscombe et al., 2004). We have connected interaction networks to 3-D structures, conceptualizing them in terms of physical interaction surfaces (Kim et al., 2006; Kim et al., 2008a; Bhardwaj et al., 2011b). 
Finally, we have shown how the usage of metabolic pathways in ocean metagenomic sequencing correlates with environmental variables gleaned from satellite imagery, potentially allowing them to be used as biosensors (Patel et al., 2010; Gianoulis et al., 2009).
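To make the distinction between hubs and bottlenecks concrete, here is a minimal sketch that assumes the usual graph-theoretic reading: a hub is a node of high degree, and a bottleneck is a node of high betweenness (many shortest paths run through it). The function name and the toy edge list are hypothetical. The betweenness calculation follows Brandes' standard algorithm for unweighted graphs and is not taken from the lab's own tools.

```python
from collections import deque


def degree_and_betweenness(edges):
    """Return (degree, betweenness) dicts for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    betweenness = {n: 0.0 for n in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        sigma = {n: 0 for n in adj}
        dist = {n: -1 for n in adj}
        preds = {n: [] for n in adj}
        sigma[source], dist[source] = 1, 0
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse breadth-first order.
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                betweenness[w] += delta[w]
    # Each unordered pair was visited from both of its endpoints.
    return degree, {n: b / 2 for n, b in betweenness.items()}


if __name__ == "__main__":
    # Two triangles joined through a single bridging node "X".
    toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "X"),
                 ("X", "d"), ("d", "e"), ("d", "f"), ("e", "f")]
    deg, btw = degree_and_betweenness(toy_edges)
    print("highest degree:", max(deg.values()))
    print("top bottleneck:", max(btw, key=btw.get))
```

In the toy graph the bridging node has only two neighbours yet carries every path between the two triangles, which shows how a bottleneck need not be a hub.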
Macromolecular Motions & Packing
We have investigated the molecular structure of many genes within networks. In particular, we have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains (Krebs & Gerstein, 1998, 2000; Alexandrov et al., 2005; Flores et al., 2005, 2006; Goh et al., 2004a; Gerstein & Echols, 2004; Echols et al., 2003; Krebs et al., 2002; MolMovDB.org). Part of this project involves devising a system for characterizing motions in a highly standardized fashion in terms of key statistics, such as the location of hinges and the degree of rotation about them. Our classification of motions is based on the interdigitated packing at internal interfaces (Gerstein et al., 1994b; Gerstein & Chothia, 1999). This classification scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and tight packing can greatly constrain a protein's mobility. We have developed tools for measuring and comparing the packing efficiency at different interfaces (e.g., inter-domain, protein surface, helix-helix, protein vs. RNA) using specialized geometric constructions (e.g., Voronoi polyhedra) (Voss & Gerstein, 2005, 2010; Tsai et al., 1999, 2001; Tsai & Gerstein, 2002; 3vee.molmovdb.org).
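One of the statistics mentioned above, the degree of rotation about a hinge, can be sketched as follows. The example assumes that the two conformations have already been superimposed on the fixed domain and that a 3x3 rotation matrix relating the moving domain's two orientations is available (obtaining it, for instance by least-squares superposition, is not shown). The function name is hypothetical, and this is a textbook calculation rather than the database's actual procedure.

```python
import math


def hinge_rotation_degrees(rotation):
    """Return the rotation angle, in degrees, encoded by a 3x3 rotation matrix.

    Uses the identity trace(R) = 1 + 2 * cos(theta).
    """
    trace = rotation[0][0] + rotation[1][1] + rotation[2][2]
    cos_theta = (trace - 1.0) / 2.0
    # Clamp to the valid range to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # A quarter turn about the z axis, used here purely as a test input.
    quarter_turn = [[0.0, -1.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]]
    print(hinge_rotation_degrees(quarter_turn))
```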
Genomics as a Big Data Discipline
Personal genomics is just one prominent example of the big-data revolution transforming the biological sciences. Simultaneously with this increase in biological data, computers and computation have had a transformative effect on the way information is handled, stored, and mined. These computational advances, of course, apply to many facets of life. My lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and applied math to bear on real questions and data in molecular biology. In particular, we have extensively applied simulation, machine learning, and database design. Often, we have engaged in experimental collaborations, in which we function as part of multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include DOE KBase, Brainspan, 1000 Genomes, ENCODE, and the Centers for Mendelian Genomics. As a discipline, genomics is an exemplar for how to use big data both to construct a resource and to answer questions. Consequently, it is one of the forefront application areas for the emerging field of data science. Perhaps genomics even provides lessons for other big data disciplines, such as web analytics and particle physics (Gerstein, 2012). Specifically, it is one of the main academic disciplines that has freely available large-scale datasets to organize and mine. Consequently, it provides an ideal training ground for future data scientists. Moreover, personal genomics will be one of the main bridges connecting the biological sciences to issues facing other big data disciplines. For instance, mining big data poses many questions related to personal privacy. The privacy implications of mining personal genomes are particularly acute, since they contain immutable personal information shared amongst relatives that can potentially be much more revealing in generations to come (Greenbaum et al., 2008, 2012; Greenbaum & Gerstein, 2009).
We have also examined how general issues associated with publishing and digital libraries relate to biomedical databases, and how various legal and security concerns significantly impact their interoperation (Smith et al., 2005; Greenbaum et al., 2004; Greenbaum & Gerstein, 2003; Gerstein & Junker, 2002; Gerstein, 1999a,b,c; Gerstein, 2000). We envision a future in which there will be less distinction between databases and journals. One will be able both to find understandable prose in database entries and to apply computation directly on specially constructed parts of journal articles (Seringhaus & Gerstein, 2007; Gerstein et al., 2007; Cheung et al., 2010). Such a scenario will help overcome many of the problems now facing biological databases, including quality control, attribution of credit, and error correction.
Future Directions
In the future the lab will continue along the research directions outlined above. We will emphasize topics in the emerging world of data science and also the analysis of networks. We will also apply the tools and techniques developed for analyzing the personal genomes of healthy individuals to disease genomes, particularly related to cancer.
Notes on References
This document is closely coupled to my publication list (papers.gersteinlab.org) in the following fashion: many publications since the lab opened in January 1997 up to the present time (April 2013) are referenced. The references are in the "Jones et al., 2002" format. However, if more than one paper matches a citation, a letter (a, b, c, and so on) is appended to the citation in the order in which the references occur in the publication list.
Note, to keep things simple:
(i) No attempt has been made to refer to the scientific literature generally, and this document should not be construed as a balanced review of the field.
(ii) Each paper and URL is only cited once in the text, even when it could potentially be referred to at multiple places in the text.
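The letter-suffix convention described above can be sketched in a few lines. The function name and the sample entries are placeholders, and the sketch assumes the publication list is already in the order used on papers.gersteinlab.org; it would also need extending if more than 26 papers shared one label.

```python
from collections import Counter
from string import ascii_lowercase


def label_citations(publications):
    """Build citation labels from (authors, year) pairs in list order.

    A letter is appended only when several papers share the same label.
    """
    totals = Counter(publications)
    seen = Counter()
    labels = []
    for authors, year in publications:
        key = (authors, year)
        suffix = ""
        if totals[key] > 1:
            # Letters follow the order of occurrence in the publication list.
            suffix = ascii_lowercase[seen[key]]
            seen[key] += 1
        labels.append(f"{authors}, {year}{suffix}")
    return labels


if __name__ == "__main__":
    # Placeholder entries, not taken from the actual publication list.
    sample = [("Jones et al.", 2002), ("Smith et al.", 2005), ("Jones et al.", 2002)]
    print(label_citations(sample))
```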
To quickly look at relevant references go to:
http://papers.gersteinlab.org/subject/best
http://papers.gersteinlab.org/subject/best-revs
Also:
http://papers.gersteinlab.org/subject/intro-to-lab
http://papers.gersteinlab.org/subject/intro-cs