Short MG Research Interests
My laboratory focuses on integrating personal genomes with other biological data and developing methods to assist in their interpretation. This endeavor has a number of aspects.
Human Genome Variation
Much of our effort is focused on structural variation. We have developed a number of approaches for identifying structural variants in genomes. These involve looking at the consistency of read coverage over the genome (read depth), searching for special reads that split over breakpoints (split reads), analyzing unusual pair separations in paired-end reads (PEM), and identifying and studying instances of fusion genes (Abyzov et al., 2011a,b; Korbel et al., 2009; Lam et al., 2010; Sboner et al., 2010b).
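The read-depth idea can be illustrated with a toy sketch. This is not the lab's published algorithm; it simply averages per-base depth in fixed windows and flags windows that deviate strongly from the median window. The function name `call_depth_windows`, the window size, and the `low`/`high` ratio thresholds are all assumptions made for illustration.

```python
from statistics import median


def call_depth_windows(depths, window=100, low=0.5, high=1.5):
    """Flag fixed-size windows whose mean depth deviates from the baseline.

    depths : per-base read depth along one chromosome (list of numbers)
    window : window size in bases (hypothetical default)
    low    : ratio to baseline at or below which a window is called a 'loss'
    high   : ratio to baseline at or above which a window is called a 'gain'

    Returns a list of (start, end, label) tuples, with end exclusive.
    """
    # Mean depth of each complete, non-overlapping window.
    means = []
    for start in range(0, len(depths) - window + 1, window):
        chunk = depths[start:start + window]
        means.append((start, sum(chunk) / window))
    if not means:
        return []

    # Use the median window as the baseline; it is robust to a few outliers.
    baseline = median(m for _, m in means)
    if baseline == 0:
        return []

    calls = []
    for start, mean_depth in means:
        ratio = mean_depth / baseline
        if ratio <= low:
            calls.append((start, start + window, "loss"))
        elif ratio >= high:
            calls.append((start, start + window, "gain"))
    return calls


# Synthetic example: 30x coverage with one half-depth and one double-depth window.
example = [30] * 300 + [15] * 100 + [30] * 200 + [60] * 100 + [30] * 300
print(call_depth_windows(example))
```

A real caller would also need to correct for GC content and mappability and to merge adjacent windows, none of which is attempted here.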
Human Genome Annotation
Genome annotation provides biochemical and evolutionary context to each base of the genome, and we work with international genome annotation efforts carried out by the ENCODE Consortium. We have developed numerous methods for identifying pseudogenes in the genome (Zhang et al., 2006). We were one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, illustrating the very different pseudogene complements in different organisms (Zhang et al., 2002a,b, 2003, 2004; Harrison et al., 2001, 2002a,c, 2003a,b; Zhang & Gerstein, 2003c,e; Liu et al., 2004a; Lam et al., 2008; Pseudogene.org). Moreover, we have found hints that some of the supposedly "dead" pseudogenes may actually harbour biochemical activity (Zheng et al., 2005, 2007a,b; Harrison et al., 2005; Pei et al., 2012; Sasidharan & Gerstein, 2008). In recent years, we have increasingly worked on ncRNAs. We have analyzed selective constraints on these in the context of data generated as part of The 1000 Genomes Project (Mu et al., 2011). We have developed a number of tools to process tiling arrays and next-generation sequencing to identify regions of intragenic transcription, which are often called transcriptionally active regions (TARs).
Analysis of Networks
We try to determine how many genes can act together as a unified system, which entails identifying key points such as hubs and bottlenecks (Yu et al., 2004b, 2006, 2007). We have also investigated protein-protein interactions and metabolic pathways. We have developed a number of generic tools to build and analyze networks derived from genes and other forms of data in a consistent fashion (Douglas et al., 2005; Xia et al., 2004; Yu et al., 2004b, 2006; Yip et al., 2006; tYNA.gersteinlab.org, PubNet.gersteinlab.org). Using expression data, we have identified the transient nature of hubs and systematic patterns of connectivity rewiring in the regulatory network (Luscombe et al., 2004). We have connected interaction networks to 3-D structures, conceptualizing them in terms of physical interaction surfaces (Kim et al., 2006; Kim et al., 2008a; Bhardwaj et al., 2011b). Finally, we have shown how the usage of metabolic pathways in ocean metagenomic sequencing correlates with environmental variables gleaned from satellite imagery, potentially allowing them to be used as biosensors (Patel et al., 2010; Gianoulis et al., 2009).
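As a minimal sketch of what identifying hubs and bottlenecks can involve, the code below assumes the common convention that hubs are high-degree nodes and bottlenecks are high-betweenness nodes. It computes both on a small undirected, unweighted graph using Brandes' algorithm. The graph, the node names, and the function names are hypothetical and are not taken from the lab's tools.

```python
from collections import deque


def degree(adj):
    """Number of neighbours of each node in an adjacency-set graph."""
    return {v: len(neigh) for v, neigh in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, visiting nodes farthest from s first.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: b / 2 for v, b in bc.items()}


# Two triangles (A, B, C) and (D, E, F) joined through a single node X.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "X"),
         ("X", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

deg = degree(graph)
bc = betweenness(graph)
print("highest degree:", sorted(n for n in deg if deg[n] == max(deg.values())))
print("highest betweenness:", max(bc, key=bc.get))
```

In this toy graph the node with the highest betweenness (X) has only two neighbours, which shows that a bottleneck need not be a hub.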
Macromolecular Motions & Packing
We have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains (Krebs & Gerstein, 1998, 2000; Alexandrov et al., 2005; Flores et al., 2005, 2006; Goh et al., 2004a; Gerstein & Echols, 2004; Echols et al., 2003; Krebs et al., 2002; MolMovDB.org). Part of this project involves devising a system for characterizing motions based on the interdigitated packing at internal interfaces (Gerstein et al., 1994b; Gerstein & Chothia, 1999). We have developed tools for measuring and comparing the packing efficiency at different interfaces (e.g., inter-domain, protein surface, helix-helix, protein vs. RNA) using specialized geometric constructions (e.g., Voronoi polyhedra) (Voss & Gerstein, 2005, 2010; Tsai et al., 1999, 2001; Tsai & Gerstein, 2002; 3vee.molmovdb.org).
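To make the notion of interpolating between conformations concrete, here is a deliberately naive sketch: straight-line interpolation of atomic coordinates between two structures with matched atoms. It stands in for, and is much simpler than, the simulation tools mentioned above; it applies no energy minimization or geometric restraints. The function name and the coordinate representation (lists of (x, y, z) tuples) are assumptions.

```python
def interpolate_conformations(start, end, n_frames):
    """Linearly interpolate between two conformations with matched atoms.

    start, end : lists of (x, y, z) tuples, one per atom, in the same order
    n_frames   : total number of frames, including both endpoints (>= 2)

    Returns a list of frames, each a list of (x, y, z) tuples.
    """
    if len(start) != len(end):
        raise ValueError("conformations must have the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (the endpoints)")

    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # runs from 0.0 (start) to 1.0 (end)
        frame = [
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Two-atom example: the second atom rotates 90 degrees about the first.
closed_form = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
open_form = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
for frame in interpolate_conformations(closed_form, open_form, 3):
    print(frame)
```

In the middle frame the two atoms are closer together than in either endpoint, because a straight-line path cuts across the rotation. This distortion is one reason practical morphing tools go beyond plain linear interpolation.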
Genomics as a Big Data Discipline
My lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and applied math to bear on real questions and data in molecular biology. We have engaged in experimental collaborations, in which we function as part of multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include DOE KBase, Brainspan, 1000 Genomes, ENCODE, and the Centers for Mendelian Genomics. As a discipline, genomics is an exemplar for how to use big data both to construct a resource and to answer questions. Consequently, it is one of the forefront application areas for the emerging field of data science. Perhaps genomics even provides lessons for other big data disciplines, such as web analytics and particle physics (Gerstein, 2012). We have also examined how general issues associated with publishing and digital libraries relate to biomedical databases, and how various legal and security concerns significantly impact their interoperation (Smith et al., 2005; Greenbaum et al., 2004; Greenbaum & Gerstein, 2003; Gerstein & Junker, 2002; Gerstein, 1999a,b,c; Gerstein, 2000).
Future Directions
We will emphasize topics in the emerging world of data science and also the analysis of networks. We will also apply the tools and techniques developed for analyzing the personal genomes of healthy individuals to disease genomes, particularly related to cancer.
Notes on References
This document is closely coupled to my publication list (papers.gersteinlab.org) in the following fashion: many publications from the opening of the lab in January 1997 up to the present time (April 2013) are referenced. The references are in the "Jones et al., 2002" format. However, if more than one paper matches a citation, a letter (a, b, c, and so on) is appended to the citation in the order in which the references occur in the publication list.
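The lettering rule can be sketched in a few lines. The function name `label_citations` and the input format (one citation key per paper, in publication-list order) are assumptions for illustration; the sketch also assumes that no key occurs more than 26 times.

```python
from collections import Counter
from string import ascii_lowercase


def label_citations(keys):
    """Append a, b, c, ... to citation keys that occur more than once.

    keys : citation keys such as "Jones et al., 2002", one per paper,
           in the order the papers appear in the publication list.
    Keys that occur only once are returned unchanged.
    """
    totals = Counter(keys)  # how many papers share each key
    seen = Counter()        # how many of each key have been labelled so far
    labels = []
    for key in keys:
        if totals[key] > 1:
            labels.append(key + ascii_lowercase[seen[key]])
            seen[key] += 1
        else:
            labels.append(key)
    return labels


print(label_citations([
    "Jones et al., 2002",
    "Smith et al., 2005",
    "Jones et al., 2002",
]))
```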
Note, to keep things simple:
(i) No attempt has been made to refer to the scientific literature generally, and this document should not be construed as a balanced review of the field.
(ii) Each paper and URL is cited only once in the text, even when it could be cited in multiple places.