Short MG Research Interests

From GersteinInfo


Latest revision as of 14:10, 13 May 2013

Research Summary: Protein Bioinformatics

The biological sciences are being transformed by the advent of large-scale data; the sequencing of the human genome is one of the most dramatic examples. At the same time, computers and computation have transformed the way information is handled, stored, and mined, in biology as in many other facets of life. The goal of my lab is to connect these two developments, harnessing computational advances for the analysis of large-scale data, principally by carrying out integrative surveys and systematic data mining.

Specifically, we focus on protein bioinformatics: understanding the structure, function, and evolution of proteins by analyzing populations of them in databases and in whole-genome experiments. Overall, we have four research foci, summarized below.


Genomics: Mining Intergenic Regions, especially in relation to Pseudogenes

We are involved in a number of large-scale collaborations to probe the activity of intergenic regions with tiling array technology. The overall conclusion from this work has been that many of the intergenic regions of the human genome appear to be active, both transcriptionally and in terms of protein binding. In connection with the tiling-array experiments, we have carried out extensive intergenic annotation, with a particular focus on mining intergenic regions for pseudogenes (protein fossils). Collectively, our studies enable us to determine the common "pseudofamilies" in various genomes and to address important evolutionary questions about the proteins that were present earlier in the history of an organism.


Proteomics: Using Networks to Understand Protein Function

After the main elements of the human genome are identified, one needs to characterize their function. We are trying to characterize gene function through molecular networks. We systematically integrate many weak functional genomic features with data mining techniques to predict protein networks (comprising protein interactions and other functional linkages). In addition, we have studied the structure of protein networks, both on a large scale in terms of global statistics (e.g., the diameter) and on a small scale in terms of local network motifs (e.g., hubs).
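The two kinds of network statistics mentioned above can be illustrated on a toy example. The following is only a minimal sketch, not the lab's actual tools: the edge list is invented, the helper names (`build_graph`, `diameter`, `hubs`) are hypothetical, and a "hub" is assumed here to mean simply a node whose degree meets a chosen threshold.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Global statistic: the longest shortest path between reachable nodes."""
    return max(max(bfs_distances(graph, node).values()) for node in graph)


def hubs(graph, min_degree):
    """Local statistic: nodes with at least min_degree neighbors (assumed hub definition)."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) >= min_degree)


# Invented toy "interaction" network, for illustration only.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
toy = build_graph(toy_edges)
print("diameter:", diameter(toy))
print("hubs:", hubs(toy, min_degree=3))
```

Breadth-first search from every node is adequate for small examples; real interaction networks would normally be analyzed with a dedicated graph library.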


Structural Genomics: Analysis of Folds, Families and Functions on a Large Scale

Another area of research in our lab is structural genomics. Here, we conceptualize proteins not purely as character sequences or abstract network nodes, but more in terms of their molecular structure. We have examined the large-scale relationships between sequence, structure and function in order to understand the extent to which structural and functional annotation can reliably be transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. We have also related the occurrence of protein folds and families to phylogeny and deep evolutionary history.


Computational Biophysics: Relating Motions & Packing

The final area of focus in the lab is analyzing small populations of structures in terms of their detailed 3D geometry and physical properties. Here, we try to interpret macromolecular motions in terms of packing. We have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains.
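As a rough illustration of what interpolating between two structural conformations means, the sketch below generates intermediate frames by straight-line interpolation of Cartesian coordinates. This is a deliberately simplified stand-in and not necessarily the method used by the database's simulation tools; it assumes the two conformations are already superimposed and list the same atoms in the same order, and the function name and coordinates are hypothetical.

```python
def interpolate_conformations(start, end, n_frames):
    """Return n_frames coordinate sets moving linearly from start to end.

    start, end: lists of (x, y, z) tuples with matching atom order
    (assumed to be pre-superimposed). The first frame equals start
    and the last frame equals end.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (the two end states)")

    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # interpolation fraction, from 0.0 to 1.0
        frame = [
            tuple(s + t * (e - s) for s, e in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Invented two-atom example: the second atom moves 2 units along y.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
frames = interpolate_conformations(start, end, 3)
for frame in frames:
    print(frame)
```

Straight-line interpolation can distort bond lengths and produce steric clashes in the intermediate frames, which is why practical morphing tools typically add restraints or energy minimization at each step.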


References

[1] Relating three-dimensional structures to protein networks provides evolutionary insights. PM Kim, LJ Lu, Y Xia, MB Gerstein (2006) Science 314: 1938-41.

[2] The real life of pseudogenes. M Gerstein, D Zheng (2006) Sci Am 295: 48-55.

[3] Genomic analysis of regulatory network dynamics reveals large topological changes. NM Luscombe, MM Babu, H Yu, M Snyder, SA Teichmann, M Gerstein (2004) Nature 431: 308-12.

[4] Genomics. Defining genes in the genomics era. M Snyder, M Gerstein (2003) Science 300: 258-60.

[5] Simulating water and the molecules of life. M Gerstein, M Levitt (1998) Sci Am 279: 100-5.
