Current Research Interests for Mark Gerstein

From GersteinInfo

(Difference between revisions)
(Analysis of Diverse Networks)
 
(37 intermediate revisions not shown)
Line 1: Line 1:
-
Gerstein Lab Research Program: Mining Personal Genomes
+
Soon, sequencing one’s genome may become as commonplace as getting an X-ray. Consequently, personal genomes will increasingly serve as the lenses through which the public views biology. Addressing this, the focus of the Gerstein Lab is interpreting personal genomes, particularly in relation to disorders, such as cancer. This endeavor has a number of related aspects described below. Moreover, the approaches we take have broad connections to a variety of data-intensive fields, within the emerging discipline of data science.  
-
The number of sequenced personal genomes is expected to increase exponentially over the next few years. Soon, sequencing one’s own genome may become as routine and commonplace in medicine as X-rays. Moreover, an individual’s window into biological science will increasingly be viewed through the lens of his or her own genome. In light of these trends, the thrust of my laboratory is aimed at integrating personal genomes with other biological data, as well as developing tools and methods to assist in their interpretation. These endeavors are carried out on a number of frontiers, as outlined below.
+
-
Human Genetic Variation
+
-
First, we work extensively on searching for those variants in personal genomes that differ between individuals. In particular, we focus on structural variation, a type of variant which results from re-arrangements of blocks within the genome. It is believed that structural variants involve as many nucleotides in the genome as the better-known single-nucleotide polymorphisms, or SNPs (Mills et al., 2011; Korbel et al., 2008). We have developed a number of approaches for identifying structural variants in genomes. These include evaluating the consistency of the read coverage over the genome (read depth), searching for special reads that split breakpoints (split reads), and analyzing unusual pair separations in paired-end reads (Abyzov et al., 2011a,b; Korbel et al., 2009; Lam et al., 2010). Much of this work has been performed as part of our participation in large international consortia, such as The 1000 Genomes Project, as well as disease-focused programs such as those with a focus on prostate cancer.
+
-
Human Genome Annotation
+
-
Once all the variants of a personal genome are identified, we work to understand their consequences and implications. This is generally the objective of genome annotation, which provides biochemical and evolutionary context for each base. Thus, we are very active participants in the international genome annotation efforts carried out by the ENCODE Consortium. We focus on annotating a number of genomic elements, principally transcription-factor binding sites, non-coding RNAs, and pseudogenes.  
+
-
Along these lines, we have developed numerous methods for identifying pseudogenes (Zhang et al., 2006). We consider pseudogenes to be genomic fossils that provide a rich window into human molecular history; human pseudogenes provide much more detail than protein-coding genes, particularly when they are compared to pseudogenes in other organisms (Gerstein & Zheng, 2006). We were one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, thus illustrating the very different pseudogene complements in different organisms (Zhang et al., 2002a,b, 2003, 2004; Harrison et al., 2001, 2002a,c, 2003a,b; Zhang & Gerstein, 2003c,e; Liu et al., 2004a; Lam et al., 2008; Pseudogene.org). Moreover, we have uncovered hints that some pseudogenes, which are supposedly "dead", may actually confer biochemical functionality (Zheng et al., 2005, 2007a,b; Harrison et al., 2005; Pei et al., 2012; Sasidharan & Gerstein, 2008).
+
 +
 +
====Personal Genome Variation: SVs====
 +
 +
We are involved in finding variants in personal genomes. We focus on particular types of variants, which involve the re-arrangement of large blocks of the genome (structural variation). It is believed that structural variants involve as many nucleotides in the genome as the better-known SNPs. Moreover, re-arrangements are very prevalent in genomic diseases such as cancer, and we have developed tools for identifying them (e.g. using split reads and fusion genes).
 +
See: [http://papers.gersteinlab.org/subject/sv SV papers].
-
[http://archive.gersteinlab.org/meetings/2013/03.24/2pg-research-summary-1jan13.doc Research Summary]
+
====Human Genome Annotation: Processing Next-Gen Sequencing Data====
 +
 +
After one has determined all of the variants in an individual’s genome, the next step is understanding what they mean. This involves genome annotation, where one places each base within a biochemical context. Our focus has been on transcription-factor binding sites and non-coding RNAs (ncRNAs). We have carried out this effort by processing next-generation sequencing data (i.e. RNA-seq and ChIP-seq). We have developed tools to identify ncRNAs and regions of intragenic transcription. We also have developed methods for finding transcription-factor binding sites by processing ChIP-seq reads and using the level of this binding to predict statistically the expression of target genes.
 +
See: [http://papers.gersteinlab.org/subject/ngtools Next-Gen] and [http://papers.gersteinlab.org/subject/rnaseq RNAseq papers].
 +
 
 +
====Comparative Genomics: Pseudogenes as Molecular Fossils====
 +
 
 +
Pseudogenes provide a contrasting annotation to binding sites and ncRNAs in being derived from comparative rather than functional genomics data.  They provide information about human molecular history. We have developed methods for identifying them. We were one of the first groups to perform comprehensive surveys, illustrating the different pseudogene repertoires in different organisms. Moreover, we have found hints that some supposedly "dead" pseudogenes may actually harbor biochemical activity.
 +
See: [http://papers.gersteinlab.org/subject/pseudogenes pseudogene papers].
 +
 
 +
 
 +
====Protein Structure and Function: Macromolecular Motions====
 +
 +
While non-coding regions play an important, if underappreciated, role in genome function and disease, we also work on characterizing coding sequences, drilling deep into their protein products. We have a particular focus on loss-of-function mutations. Moreover, by analyzing protein motions we can better predict how a mutation affects function. This effort involves devising a system for characterizing motions in standardized fashion in terms of key statistics, such as the degree of rotation about hinges. It is guided by the fact that protein mobility is highly restricted by tight packing. We have developed tools for measuring packing efficiency using specialized geometric constructions (e.g. Voronoi polyhedra).
 +
See: [http://papers.gersteinlab.org/subject/motions molecular motion] and [http://papers.gersteinlab.org/subject/volumes structure papers].
 +
 
 +
====Analysis of Diverse Networks====
 +
 +
Networks are a way of tying together much of our research. Network representations can be applied consistently to many different types of biological data; thus, we have developed tools to build and analyze regulatory networks, protein-protein interactions and metabolic pathways, identifying key nodes such as hubs and bottlenecks. Moreover, because they are a generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic gene-expression data (identifying transient hubs), 3D protein structures, and even satellite imagery. Finally, as people have more intuition for commonplace networks, such as those in social and computer systems, we have found cross-disciplinary comparisons helpful in elucidating system-level properties of biological networks, such as the association of greater connectivity with more evolutionary constraint.
 +
See: [http://papers.gersteinlab.org/subject/regnet networks papers].
 +
 
 +
====Genomics at the Forefront of Data Science====
 +
 +
Overall, the Gerstein Lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and statistics to bear on practical questions and large-scale data in molecular biology. In particular, we have focused on applying technical approaches in simulation, machine learning, and knowledgebase design. Often, we carry out our work in multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include [http://kbase.us KBase], [http://www.brainspan.org/ Brainspan], [http://encodeproject.org/ENCODE/ ENCODE], [http://www.modencode.org/ modENCODE], [http://www.1000genomes.org/ 1000 Genomes], [http://pancancer.info PCAWG], the [http://exrna.org exRNA Consortium] and the [http://mendelian.org/ Centers for Mendelian Genomics].
 +
 
 +
 
 +
As a discipline, genomics is an exemplar for using big data to construct a resource and answer questions. Consequently, it is at the forefront in the emerging field of data science and provides an ideal training for future data scientists.
 +
 
 +
 
 +
Personal genomics also acts as a bridge connecting the biological sciences to larger issues facing other big-data disciplines. For instance, data mining generally poses questions related to privacy. We study the fundamental privacy implications of mining personal genomes, which contain immutable information, shared amongst relatives, that will be increasingly revealing in generations to come. Also, we have examined how general knowledge-representation issues associated with publishing and digital libraries relate to biological databases. We envision a future of structured literature, with less distinction between databases and journals.
 +
 
 +
===References===
 +
 
 +
See [http://papers.gersteinlab.org papers.gersteinlab.org] -- in particular,
 +
[http://papers.gersteinlab.org/subject/best Best Papers],
 +
[http://papers.gersteinlab.org/subject/best-revs Best Reviews],
 +
[http://papers.gersteinlab.org/subject/intro-to-lab Intro to the Lab],
 +
and
 +
[http://papers.gersteinlab.org/subject/intro-cs Intro to the Lab with a CS focus]
 +
 
 +
===[http://info.gersteinlab.org/MBG-Profile More Information on Research Interests]===
 +
__NOTOC__

Latest revision as of 02:20, 30 November 2016

Soon, sequencing one’s genome may become as commonplace as getting an X-ray. Consequently, personal genomes will increasingly serve as the lenses through which the public views biology. Addressing this, the focus of the Gerstein Lab is interpreting personal genomes, particularly in relation to disorders, such as cancer. This endeavor has a number of related aspects described below. Moreover, the approaches we take have broad connections to a variety of data-intensive fields, within the emerging discipline of data science.


Personal Genome Variation: SVs

We are involved in finding variants in personal genomes. We focus on particular types of variants, which involve the re-arrangement of large blocks of the genome (structural variation). It is believed that structural variants involve as many nucleotides in the genome as the better-known SNPs. Moreover, re-arrangements are very prevalent in genomic diseases such as cancer, and we have developed tools for identifying them (e.g. using split reads and fusion genes). See: SV papers.


Human Genome Annotation: Processing Next-Gen Sequencing Data

After one has determined all of the variants in an individual’s genome, the next step is understanding what they mean. This involves genome annotation, where one places each base within a biochemical context. Our focus has been on transcription-factor binding sites and non-coding RNAs (ncRNAs). We have carried out this effort by processing next-generation sequencing data (i.e. RNA-seq and ChIP-seq). We have developed tools to identify ncRNAs and regions of intragenic transcription. We also have developed methods for finding transcription-factor binding sites by processing ChIP-seq reads and using the level of this binding to predict statistically the expression of target genes. See: Next-Gen and RNAseq papers.

Comparative Genomics: Pseudogenes as Molecular Fossils

Pseudogenes provide a contrasting annotation to binding sites and ncRNAs in being derived from comparative rather than functional genomics data. They provide information about human molecular history. We have developed methods for identifying them. We were one of the first groups to perform comprehensive surveys, illustrating the different pseudogene repertoires in different organisms. Moreover, we have found hints that some supposedly "dead" pseudogenes may actually harbor biochemical activity. See: pseudogene papers.


Protein Structure and Function: Macromolecular Motions

While non-coding regions play an important, if underappreciated, role in genome function and disease, we also work on characterizing coding sequences, drilling deep into their protein products. We have a particular focus on loss-of-function mutations. Moreover, by analyzing protein motions we can better predict how a mutation affects function. This effort involves devising a system for characterizing motions in standardized fashion in terms of key statistics, such as the degree of rotation about hinges. It is guided by the fact that protein mobility is highly restricted by tight packing. We have developed tools for measuring packing efficiency using specialized geometric constructions (e.g. Voronoi polyhedra). See: molecular motion and structure papers.
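As a rough illustration of one such statistic, the degree of rotation about a hinge can be read off a rigid-body rotation matrix. The sketch below is not the lab's tooling; it assumes a 3x3 rotation matrix relating the two conformations of a domain is already available from a structural superposition, and the function names `rotation_angle_deg` and `rot_z` are hypothetical. It uses the standard identity trace(R) = 1 + 2 cos(theta).

```python
import math

def rotation_angle_deg(R):
    """Rotation angle (degrees) encoded by a 3x3 rotation matrix R.

    Uses trace(R) = 1 + 2*cos(theta); the clamp guards against
    round-off pushing the cosine slightly outside [-1, 1].
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def rot_z(deg):
    """Hypothetical helper: rotation by `deg` degrees about the z axis."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

# Toy input: a domain that swings 30 degrees about a hinge axis along z.
hinge_angle = rotation_angle_deg(rot_z(30.0))
```

The angle alone does not locate the hinge axis; recovering the axis would be a separate step that is not shown here.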

Analysis of Diverse Networks

Networks are a way of tying together much of our research. Network representations can be applied consistently to many different types of biological data; thus, we have developed tools to build and analyze regulatory networks, protein-protein interactions and metabolic pathways, identifying key nodes such as hubs and bottlenecks. Moreover, because they are a generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic gene-expression data (identifying transient hubs), 3D protein structures, and even satellite imagery. Finally, as people have more intuition for commonplace networks, such as those in social and computer systems, we have found cross-disciplinary comparisons helpful in elucidating system-level properties of biological networks, such as the association of greater connectivity with more evolutionary constraint. See: networks papers.
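To make the idea of key nodes concrete, here is a minimal sketch that takes hubs to be high-degree nodes and bottlenecks to be nodes with high betweenness centrality. That reading is a common one but is an assumption here, as are the toy edge list and the function names; this is not the lab's software. Betweenness is computed with Brandes' algorithm for an unweighted, undirected graph.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency lists from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def degree(adj):
    """Number of neighbours of each node."""
    return {n: len(nbrs) for n, nbrs in adj.items()}

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm)."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        stack = []                      # nodes in order of increasing distance
        preds = {n: [] for n in adj}    # predecessors on shortest paths
        sigma = {n: 0 for n in adj}     # number of shortest paths from s
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        while stack:                    # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {n: b / 2.0 for n, b in score.items()}

# Toy network (invented for illustration): two triangles joined through D.
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("C", "D"), ("D", "E"),
         ("E", "F"), ("E", "G"), ("F", "G")]
adj = build_adjacency(edges)
deg = degree(adj)
btw = betweenness(adj)
bottleneck = max(btw, key=btw.get)
```

In this toy graph the node with the highest betweenness, D, has only two neighbours, which shows that a bottleneck need not be a hub.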

Genomics at the Forefront of Data Science

Overall, the Gerstein Lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and statistics to bear on practical questions and large-scale data in molecular biology. In particular, we have focused on applying technical approaches in simulation, machine learning, and knowledgebase design. Often, we carry out our work in multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include KBase, Brainspan, ENCODE, modENCODE, 1000 Genomes, PCAWG, the exRNA Consortium and the Centers for Mendelian Genomics.


As a discipline, genomics is an exemplar for using big data to construct a resource and answer questions. Consequently, it is at the forefront in the emerging field of data science and provides an ideal training for future data scientists.


Personal genomics also acts as a bridge connecting the biological sciences to larger issues facing other big-data disciplines. For instance, data mining generally poses questions related to privacy. We study the fundamental privacy implications of mining personal genomes, which contain immutable information, shared amongst relatives, that will be increasingly revealing in generations to come. Also, we have examined how general knowledge-representation issues associated with publishing and digital libraries relate to biological databases. We envision a future of structured literature, with less distinction between databases and journals.

References

See papers.gersteinlab.org -- in particular, Best Papers, Best Reviews, Intro to the Lab, and Intro to the Lab with a CS focus

More Information on Research Interests
