Current Research Interests for Mark Gerstein

From GersteinInfo

(Difference between revisions)
(Analysis of Diverse Networks)
 
(27 intermediate revisions not shown)
Line 1: Line 1:
-
<big>'''Mining Personal Genomes'''</big>
+
Soon, sequencing one’s genome may become as commonplace as getting an X-ray. Consequently, personal genomes will increasingly serve as the lenses through which the public views biology. Addressing this, the focus of the Gerstein Lab is interpreting personal genomes, particularly in relation to disorders, such as cancer. This endeavor has a number of related aspects described below. Moreover, the approaches we take have broad connections to a variety of data-intensive fields, within the emerging discipline of data science.
-
The number of personal genomes sequenced is expected to increase exponentially in the next few years. Soon, sequencing one’s own genome may become as commonplace in medical care as getting an X-ray. Moreover, an individual’s window into biological science is going to increasingly be through the lens of his or her own genome. Addressing this, the thrust of my laboratory is aimed at integrating personal genomes with other biological data and developing methods to assist in their interpretation. These endeavors have a number of aspects.
+
 +
====Personal Genome Variation: SVs====
 +
 +
We are involved in finding variants in personal genomes. We focus on particular types of variants, which involve the re-arrangement of large blocks of the genome (structural variation). It is believed that structural variants involve as many nucleotides in the genome as the better-known SNPs. Moreover, re-arrangements are very prevalent in genomic diseases such as cancer, and we have developed tools for identifying them (e.g. using split reads and fusion genes).
 +
See: [http://papers.gersteinlab.org/subject/sv SV papers].
-
<big>'''Human Genome Variation'''</big>
+
====Human Genome Annotation: Processing Next-Gen Sequencing Data====
 +
 +
After one has determined all of the variants in an individual’s genome, the next step is understanding what they mean. This involves genome annotation, where one places each base within a biochemical context. Our focus has been on transcription-factor binding sites and non-coding RNAs (ncRNAs). We have carried out this effort by processing next-generation sequencing data (i.e. RNA-seq and ChIP-seq). We have developed tools to identify ncRNAs and regions of intragenic transcription. We also have developed methods for finding transcription-factor binding sites by processing ChIP-seq reads and using the level of this binding to predict statistically the expression of target genes.
 +
See: [http://papers.gersteinlab.org/subject/ngtools Next-Gen] and [http://papers.gersteinlab.org/subject/rnaseq RNAseq papers].
-
First, we are very involved in the search for variants in personal genomes. We focus on a particular type of variant, structural variation, which involves the re-arrangement of blocks in the genome. It is believed that structural variants involve as many nucleotides in the genome as the better-known single-nucleotide polymorphisms, or SNPs (Mills et al., 2011; Korbel et al., 2008). We have developed a number of approaches for identifying structural variants in genomes. These involve looking at the consistency of read coverage over the genome (read depth), searching for special reads that split over breakpoints (split reads), analyzing unusual pair separations in paired end reads (PEM) (Abyzov et al., 2011a,b; Korbel et al., 2009; Lam et al., 2010), and identifying and studying instances of fusion genes (Sboner et al., 2010b). Much of this work has taken place in the context of large international consortia, such as the 1000 Genomes Project, as well as in disease-focused programs such as those related to prostate cancer (Sboner et al., 2010a; Berger et al., 2011; Lin et al., 2013).
+
====Comparative Genomics: Pseudogenes as Molecular Fossils====
 +
Pseudogenes provide a contrasting annotation to binding sites and ncRNAs in being derived from comparative rather than functional genomics data.  They provide information about human molecular history. We have developed methods for identifying them. We were one of the first groups to perform comprehensive surveys, illustrating the different pseudogene repertoires in different organisms. Moreover, we have found hints that some supposedly "dead" pseudogenes may actually harbor biochemical activity.
 +
See: [http://papers.gersteinlab.org/subject/pseudogenes pseudogene papers].
-
<big>'''Human Genome Annotation'''</big>
 
-
After one has all of the variants in a personal genome, the next step is to attempt to understand what they mean. This often takes place through genome annotation, which provides biochemical and evolutionary context to each base. We are very involved in the international genome annotation efforts carried out by the ENCODE Consortium. We focus on a number of annotations in the genome, principally transcription-factor binding sites, non-coding RNAs, and pseudogenes. 
+
====Protein Structure and Function: Macromolecular Motions====
-
In relation to the latter, we have developed numerous methods for identifying pseudogenes in the genome (Zhang et al., 2006). We consider these as genomic fossils that provide abundant details about human molecular history; much more so than our genes, particularly when they are compared to pseudogenes in other organisms (Gerstein & Zheng, 2006). We were one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, illustrating the very different pseudogene complements in different organisms (Zhang et al., 2002a,b, 2003, 2004; Harrison et al., 2001, 2002a,c, 2003a,b; Zhang & Gerstein, 2003c,e; Liu et al., 2004a; Lam et al., 2008; Pseudogene.org). Moreover, we have found hints that some of the supposedly "dead" pseudogenes may actually harbour biochemical activity (Zheng et al., 2005, 2007a,b; Harrison et al., 2005, Pei et al., 2012; Sasidharan & Gerstein, 2008).
+
-
In recent years, we have increasingly worked on non-coding RNAs (ncRNAs). Selective constraints and evolutionary properties of non-coding segments of the genome (including ncRNA) have been analyzed in the context of data generated as part of The 1000 Genomes Project (Mu et al., 2011). In addition, evolutionary and physical features evaluated from C. elegans data have been used to predict ncRNA within the genome (Lu et al., 2011). Over the last few years, we have also developed a number of tools to process tiling arrays and next-generation sequencing data to identify regions of intragenic transcription, which are often called transcriptionally active regions. Among the more popular suites of tools for processing RNA-seq data is our RSEQtools (Habegger et al., 2011). In addition, we have developed our tool set by integrating features beyond next-generation sequencing to find well-characterized and unusual non-coding RNA (lncRNA).
+
While non-coding regions play an important, if underappreciated, role in genome function and disease, we also work on characterizing coding sequences, drilling deep into their protein products. We have a particular focus on loss-of-function mutations. Moreover, by analyzing protein motions we can better predict how a mutation affects function. This effort involves devising a system for characterizing motions in a standardized fashion in terms of key statistics, such as the degree of rotation about hinges. It is guided by the fact that protein mobility is highly restricted by tight packing. We have developed tools for measuring packing efficiency using specialized geometric constructions (e.g. Voronoi polyhedra).
-
In relation to transcription-factor binding sites, we have developed methods for finding these elements by processing ChIP sequencing data, using the level of binding to statistically predict the expression of target genes, and putting the results into the framework of a network (see below, Zhang et al., 2008; Rozowsky et al., 2009; Yip et al., 2012; Gerstein et al., 2010, 2012; Cheng et al., 2011a,b, 2012).
+
See: [http://papers.gersteinlab.org/subject/motions molecular motion] and [http://papers.gersteinlab.org/subject/volumes structure papers].
 +
====Analysis of Diverse Networks====
 +
 +
Networks are a way of tying together much of our research. Network representations can be applied consistently to many different types of biological data; thus, we have developed tools to build and analyze regulatory networks, protein-protein interactions and metabolic pathways, identifying key nodes such as hubs and bottlenecks. Moreover, because they are a generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic gene-expression data (identifying transient hubs), 3D-protein structures, and even satellite imagery. Finally, as people have more intuition for commonplace networks, such as those in social and computer systems, we have found cross-disciplinary comparisons helpful in elucidating system-level properties of biological networks, such as the association of greater connectivity with more evolutionary constraint.
 +
See: [http://papers.gersteinlab.org/subject/regnet networks papers].
-
<big>'''Analysis of Networks'''</big>
+
====Genomics at the Forefront of Data Science====
 +
 +
Overall, the Gerstein lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and statistics to bear on practical questions and large-scale data in molecular biology. In particular, we have focused on applying technical approaches in simulation, machine learning, and knowledgebase design. Often, we carry out our work in multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include [http://kbase.us KBase], [http://www.brainspan.org/ Brainspan], [http://encodeproject.org/ENCODE/ ENCODE], [http://www.modencode.org/ modENCODE], [http://www.1000genomes.org/ 1000 Genomes], [http://pancancer.info PCAWG], the [http://exrna.org exRNA Consortium] and the [http://mendelian.org/ Centers for Mendelian Genomics].
-
This leads into the next research topic in my laboratory, the interpretation of networks. Here we try to determine how many genes can act together as a unified system. One first step is identifying key points such as hubs and bottlenecks (Yu et al., 2004b, 2006, 2007). One of the most powerful aspects of the network representation is the fact that it can be applied to many different types of data, whether that data is biological or not. Thus, in addition to looking at transcription factor regulatory networks, we have also investigated protein-protein interactions and metabolic pathways. Moreover, as people have much more intuition for commonplace networks, such as those in social and computer systems, we have found that cross-disciplinary comparisons can help to elucidate system-level properties of biological networks (Yan et al., 2010; Bhardwaj et al., 2010, 2011a). Furthermore, we have developed a number of generic tools to build and analyze networks derived from genes and other forms of data in a consistent fashion (Douglas et al., 2005; Xia et al., 2004; Yu et al., 2004b, 2006; Yip et al., 2006; tYNA.gersteinlab.org, PubNet.gersteinlab.org).
 
-
Because they are a fairly generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic expression data, 3D-protein structures, and even satellite imagery. In particular, using expression data, we have identified the transient nature of hubs and systematic patterns of connectivity rewiring in the regulatory network (Luscombe et al., 2004). We have connected interaction networks to 3-D structures, conceptualizing them in terms of physical interaction surfaces (Kim et al., 2006; Kim et al., 2008a; Bhardwaj et al., 2011b). Finally, we have shown how the usage of metabolic pathways in ocean metagenomic sequencing correlates with environmental variables gleaned from satellite imagery, potentially allowing them to be used as biosensors (Patel et al., 2010; Gianoulis et al., 2009). 
 
 +
As a discipline, genomics is an exemplar for using big data to construct a resource and answer questions. Consequently, it is at the forefront of the emerging field of data science and provides an ideal training ground for future data scientists.
-
<big>'''Macromolecular Motions & Packing'''</big>
 
-
We have investigated the molecular structure of many genes within networks. In particular, we have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains (Krebs & Gerstein, 1998, 2000; Alexandrov et al., 2005; Flores et al., 2005, 2006; Goh et al., 2004a; Gerstein & Echols, 2004; Echols et al., 2003; Krebs et al., 2002; MolMovDB.org). Part of this project involves devising a system for characterizing motions in a highly standardized fashion in terms of key statistics, such as the location of hinges and the degree of rotations about these. Our classification of motions is based on the interdigitated packing at internal interfaces (Gerstein et al., 1994b; Gerstein & Chothia, 1999). This classification scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and the tight packing can greatly constrain a protein's mobility. We have developed tools for measuring and comparing the packing efficiency at different interfaces (e.g., inter-domain, protein surface, helix-helix, protein vs. RNA) using specialized geometric constructions (e.g. Voronoi polyhedra) (Voss & Gerstein, 2005, 2010; Tsai et al., 1999, 2001; Tsai & Gerstein, 2002; 3vee.molmovdb.org).
+
Personal genomics also acts as a bridge connecting the biological sciences to larger issues facing other big-data disciplines. For instance, data mining generally poses questions related to privacy. We study the fundamental privacy implications of mining personal genomes, which contain immutable information shared amongst relatives that will be increasingly revealing in generations to come. Also, we have examined how general knowledge-representation issues associated with publishing and digital libraries relate to biological databases. We envision a future of structured literature, with less distinction between databases and journals.
 +
===References===
-
<big>'''Genomics as a Big Data Discipline'''</big>
+
See [http://papers.gersteinlab.org papers.gersteinlab.org] -- in particular,
 +
[http://papers.gersteinlab.org/subject/best Best Papers],
 +
[http://papers.gersteinlab.org/subject/best-revs Best Reviews],
 +
[http://papers.gersteinlab.org/subject/intro-to-lab Intro to the Lab],
 +
and
 +
[http://papers.gersteinlab.org/subject/intro-cs Intro to the Lab with a CS focus]
-
Personal genomics is just one prominent example of the big-data revolution transforming the biological sciences. Simultaneously, with this increase in biological data, computers and computation have had a transformative effect on the way information is handled, stored, and mined. These computational advances, of course, apply to many facets of life. My lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and applied math to bear on real questions and data in molecular biology. In particular, we have extensively applied simulation, machine learning, and database design. Often, we have engaged in experimental collaborations, in which we function as part of multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include DOE KBase, Brainspan, 1000 Genomes, ENCODE, and the Centers for Mendelian Genomics.
+
===[http://info.gersteinlab.org/MBG-Profile More Information on Research Interests]===
-
As a discipline, genomics is an exemplar for how to use big data to both construct a resource and also answer questions. Consequently, it is one of the forefront application areas for the emerging field of big data science. Perhaps genomics even provides lessons for other big data disciplines, such as web analytics and particle physics (Gerstein, 2012). Specifically, it is one of the main academic disciplines that has freely available large-scale datasets to organize and mine. Consequently, it provides an ideal training ground for future data scientists. Moreover, personal genomics will be one of the main bridges connecting the biological sciences to issues facing other big data disciplines. For instance, mining big data poses many questions related to personal privacy. One fails to appreciate the subtle conclusions that emerge from mining personal genomes -- since they contain immutable personal information shared amongst relatives that can potentially be much more revealing in generations to come (Greenbaum et al., 2008, 2012; Greenbaum & Gerstein, 2009). We have also examined how general issues associated with publishing and digital libraries relate to biomedical databases, and how various legal and security concerns significantly impact their interoperation (Smith et al., 2005; Greenbaum et al., 2004; Greenbaum & Gerstein, 2003; Gerstein & Junker, 2002; Gerstein, 1999a,b,c; Gerstein, 2000). We envision a future in which there will be less distinction between databases and journals. One will be able to both find understandable prose in database entries and to apply computation directly on specially constructed parts of journal articles (Seringhaus & Gerstein, 2007; Gerstein et al., 2007; Cheung et al., 2010). Such a scenario will help overcome many of the problems now facing biological databases, including quality control, attribution of credit, and error correction.
+
__NOTOC__
-
 
+
-
 
+
-
<big>'''Future Directions'''</big>
+
-
 
+
-
In the future I would like to continue along the research directions outlined above. I will emphasize topics in the emerging world of data science and also the analysis of networks. I would like to apply the tools and techniques developed for analyzing the personal genomes of healthy individuals to disease genomes, particularly related to cancer.
+
-
 
+
-
 
+
-
<big>'''Notes on References'''</big>
+
-
 
+
-
This document is closely coupled to my publication list (papers.gersteinlab.org) in the following fashion: many publications since the lab opened in 1/97 up to the present time (March of 2013) are referenced. The references are in the “Jones et al., 2002” format. However, if there is more than one paper matching this citation, a letter (e.g. a, b, c, etc) is appended to the citation in the order that the reference occurs in the publication list.
+
-
 
+
-
Note, to keep things simple:
+
-
 
+
-
(i) No attempt has been made to refer to the scientific literature generally, and this document should not be construed as a balanced review of the field.
+
-
 
+
-
(ii) Each paper and URL is only cited once in the text, even when it could potentially be referred at multiple places in the text.
+

Latest revision as of 02:20, 30 November 2016

Soon, sequencing one’s genome may become as commonplace as getting an X-ray. Consequently, personal genomes will increasingly serve as the lenses through which the public views biology. Addressing this, the focus of the Gerstein Lab is interpreting personal genomes, particularly in relation to disorders, such as cancer. This endeavor has a number of related aspects described below. Moreover, the approaches we take have broad connections to a variety of data-intensive fields, within the emerging discipline of data science.


Personal Genome Variation: SVs

We are involved in finding variants in personal genomes. We focus on particular types of variants, which involve the re-arrangement of large blocks of the genome (structural variation). It is believed that structural variants involve as many nucleotides in the genome as the better-known SNPs. Moreover, re-arrangements are very prevalent in genomic diseases such as cancer, and we have developed tools for identifying them (e.g. using split reads and fusion genes). See: SV papers.


Human Genome Annotation: Processing Next-Gen Sequencing Data

After one has determined all of the variants in an individual’s genome, the next step is understanding what they mean. This involves genome annotation, where one places each base within a biochemical context. Our focus has been on transcription-factor binding sites and non-coding RNAs (ncRNAs). We have carried out this effort by processing next-generation sequencing data (i.e. RNA-seq and ChIP-seq). We have developed tools to identify ncRNAs and regions of intragenic transcription. We also have developed methods for finding transcription-factor binding sites by processing ChIP-seq reads and using the level of this binding to predict statistically the expression of target genes. See: Next-Gen and RNAseq papers.

Comparative Genomics: Pseudogenes as Molecular Fossils

Pseudogenes provide a contrasting annotation to binding sites and ncRNAs in being derived from comparative rather than functional genomics data. They provide information about human molecular history. We have developed methods for identifying them. We were one of the first groups to perform comprehensive surveys, illustrating the different pseudogene repertoires in different organisms. Moreover, we have found hints that some supposedly "dead" pseudogenes may actually harbor biochemical activity. See: pseudogene papers.


Protein Structure and Function: Macromolecular Motions

While non-coding regions play an important, if underappreciated, role in genome function and disease, we also work on characterizing coding sequences, drilling deep into their protein products. We have a particular focus on loss-of-function mutations. Moreover, by analyzing protein motions we can better predict how a mutation affects function. This effort involves devising a system for characterizing motions in a standardized fashion in terms of key statistics, such as the degree of rotation about hinges. It is guided by the fact that protein mobility is highly restricted by tight packing. We have developed tools for measuring packing efficiency using specialized geometric constructions (e.g. Voronoi polyhedra). See: molecular motion and structure papers.
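
As an illustration of what a statistic such as "degree of rotation about hinges" can look like in practice, the sketch below recovers a rotation angle and axis direction from a 3x3 rotation matrix. It is not the lab's software: it assumes that a structural superposition of one domain in two conformations has already produced the rotation matrix, and the function name `rotation_angle_and_axis` is hypothetical. It gives only the axis direction, not the hinge location, which would also require the translation component.

```python
import math


def rotation_angle_and_axis(R):
    """Return (angle in degrees, unit axis) for a proper 3x3 rotation matrix R.

    R is a nested list such as [[r00, r01, r02], ...]. The axis is None when
    the angle is 0 or 180 degrees, where this simple formula is ill-conditioned.
    """
    # For a rotation matrix, trace(R) = 1 + 2*cos(theta).
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    theta = math.acos(cos_theta)
    s = math.sin(theta)
    if abs(s) < 1e-8:
        return math.degrees(theta), None
    # The antisymmetric part of R encodes the axis direction.
    axis = (
        (R[2][1] - R[1][2]) / (2.0 * s),
        (R[0][2] - R[2][0]) / (2.0 * s),
        (R[1][0] - R[0][1]) / (2.0 * s),
    )
    return math.degrees(theta), axis


# Toy input (an assumption for illustration): a 30-degree rotation about z.
c, s = math.cos(math.radians(30.0)), math.sin(math.radians(30.0))
R_example = [[c, -s, 0.0],
             [s, c, 0.0],
             [0.0, 0.0, 1.0]]
angle_deg, axis = rotation_angle_and_axis(R_example)
print(round(angle_deg, 3), tuple(round(a, 3) for a in axis))
```

The trace-based formula is a standard property of rotation matrices; a production tool would also handle the 180-degree case and report where the axis passes through the structure.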

Analysis of Diverse Networks

Networks are a way of tying together much of our research. Network representations can be applied consistently to many different types of biological data; thus, we have developed tools to build and analyze regulatory networks, protein-protein interactions and metabolic pathways, identifying key nodes such as hubs and bottlenecks. Moreover, because they are a generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic gene-expression data (identifying transient hubs), 3D-protein structures, and even satellite imagery. Finally, as people have more intuition for commonplace networks, such as those in social and computer systems, we have found cross-disciplinary comparisons helpful in elucidating system-level properties of biological networks, such as the association of greater connectivity with more evolutionary constraint. See: networks papers.
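
To make "hubs and bottlenecks" concrete, here is a minimal sketch under two assumptions that the text above does not spell out: a hub is taken to be a node of high degree, and a bottleneck a node of high betweenness (many shortest paths pass through it). The toy edge list and the helper names are hypothetical, and the code is not the lab's tooling. Betweenness is computed with Brandes' algorithm for an unweighted, undirected graph.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj):
    """Number of neighbours per node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, BFS version)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}   # dependency of s on each node
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in score.items()}


# Toy network (an assumption): two 4-node cliques joined through node X.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("E", "F"), ("E", "G"), ("E", "H"), ("F", "G"), ("F", "H"), ("G", "H"),
         ("D", "X"), ("X", "E")]
adj = build_adjacency(edges)
deg = degree(adj)
btw = betweenness(adj)

hubs = sorted(v for v in deg if deg[v] == max(deg.values()))
bottleneck = max(btw, key=btw.get)
print("hubs:", hubs, "bottleneck:", bottleneck)
```

In this toy graph the highest-degree nodes (D and E) are not the same as the highest-betweenness node (X), which has only two neighbours; this is the kind of distinction the two measures are meant to separate.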

Genomics at the Forefront of Data Science

Overall, the Gerstein lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and statistics to bear on practical questions and large-scale data in molecular biology. In particular, we have focused on applying technical approaches in simulation, machine learning, and knowledgebase design. Often, we carry out our work in multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include KBase, Brainspan, ENCODE, modENCODE, 1000 Genomes, PCAWG, the exRNA Consortium and the Centers for Mendelian Genomics.


As a discipline, genomics is an exemplar for using big data to construct a resource and answer questions. Consequently, it is at the forefront of the emerging field of data science and provides an ideal training ground for future data scientists.


Personal genomics also acts as a bridge connecting the biological sciences to larger issues facing other big-data disciplines. For instance, data mining generally poses questions related to privacy. We study the fundamental privacy implications of mining personal genomes, which contain immutable information shared amongst relatives that will be increasingly revealing in generations to come. Also, we have examined how general knowledge-representation issues associated with publishing and digital libraries relate to biological databases. We envision a future of structured literature, with less distinction between databases and journals.

References

See papers.gersteinlab.org -- in particular, Best Papers, Best Reviews, Intro to the Lab, and Intro to the Lab with a CS focus

More Information on Research Interests
