Bioinformatics: Practical Application of Simulation and Data Mining (Spring 2011). This site is OLD. Current Class Website


CBB 752

Course Information

Course Description

Bioinformatics encompasses the analysis of gene sequences, macromolecular structures, and functional genomics data on a large scale. It represents a major practical application for modern techniques in data mining and simulation. Specific topics to be covered include sequence alignment, large-scale processing, next-generation sequencing data, comparative genomics, phylogenetics, biological database design, geometric analysis of protein structure, molecular-dynamics simulation, biological networks, normalization of microarray data, mining of functional genomics data sets, and machine learning approaches for data integration.

Concise undergraduate course description

Techniques in data mining and simulation applied to bioinformatics, the computational analysis of gene sequences, macromolecular structures, and functional genomics data on a large scale. Sequence alignment, comparative genomics and phylogenetics, biological databases, geometric analysis of protein structure, molecular-dynamics simulation, biological networks, microarray normalization, and machine-learning approaches to data integration.

See entry from undergraduate catalog: http://students.yale.edu/oci/resultDetail.jsp?course=22881&term=201101 , viz:

MB&B 452 01 (22881) /MCDB452/MB&B752/CB&B752/MCDB752/CPSC752
Bioinformatics: Practical Application of Simulation and Data Mining 
Mark Gerstein
MW 1.00-2.15 BASS 305
Spring 2011 
No regular final examination
Areas Sc
Prerequisites: MB&B 301b and MATH 115a or b, or permission of instructor.
MCDB 120a or 200b is a prerequisite for courses numbered MCDB 202 and above.

Quizzes and Final Project

There will be approximately four short quizzes during the semester and a take-home final project. For CBB and CS sections, the final project will be a programming assignment. For MB&B, the final project will be a paper. Further details will be announced at a later date.

Literature discussion section

One session of 60 minutes per week, time to be arranged. Student presentations of recent research papers relevant to the topics of the course. Led by Pedro Alves (Bass, Rm 437; 432-5405; pedro.alves@yale.edu) and Jia Kang (?; jia.kang@yale.edu).

Programming Projects/Problem Sets

Students taking this course listed under Computational Biology and Bioinformatics or Computer Science will be required to complete several short programming assignments. Further details will be discussed in the literature discussion section and during class.

Grade Categories

CBB and CPSC Sections:

Quizzes - 33%
Final Project - 33%
Discussion Section - 8.25%
Programming Assignments - 24.75%

MBB and MCDB Sections:

Quizzes - 33%
Final Project - 33%
Discussion Section - 16.5%
Problem Sets - 16.5%

Differences Between Class Sections

In general, the graduate-level CPSC/CBB course differs significantly from the MBB/MCDB course (graduate and undergraduate) in several ways. Although the lectures are the same for each section, the graduate-level CPSC/CBB course has programming assignments in addition to the work completed by the MBB students. Homework for the MBB section centers on the completion of several problem sets without a programming component. The CPSC/CBB section forgoes these problem sets and instead requires that students implement several of the algorithms discussed in class. Also, the final project for CPSC/CBB MUST be a programming assignment rather than the final paper required for the MBB section. Due to the distinct course requirements, category weightings for final grades are also different.

Timing & location

Class: Meeting from 1:00-2:15 pm on Monday and Wednesday, in 305 BASS. (First meeting will be on 10 Jan.)

Discussion section: TBA



Mark Gerstein, 432A BASS, Phone 203 432-6105, e-mail mark.gerstein(at)yale.edu


Corey O'Hern, Mason Laboratory, e-mail corey.ohern(at)yale.edu, Office Hours: M 2:15-3:15 PM

Others to be listed

Teaching Fellows

Pedro Alves, Bass Rm 437, (203) 432-5405

Jia Kang, 300 George Street, Rm 503, (203) 785-3711


Class Schedule (including a list of topics and quiz dates)

Discussion Sections

Session 1

Metzker ML. "Sequencing technologies - the next generation," Nature Reviews Genetics. 11 (2010) PDF

Wheeler DA et al. "The complete genome of an individual by massively parallel DNA sequencing," Nature. 452:872-876 (2008) PDF

Session 2

Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, Mortensen P, Mann M. (2006) Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell. 2006 Nov 3;127(3):635-48. PDF

Nevan J. Krogan et al (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae Nature 440, 637-643 (30 March 2006) PDF

Session 3

T.F. Smith and M.S. Waterman. (1981) Identification of common molecular subsequences. Journal of Molecular Biology,147(1): 195-7. PMID: 7265238. PDF

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-10. PMID: 2231712. PDF

Session 4

Bailey TL, Williams N, Misleh C, Li WW. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucl Acids Res. 34:W369-373 PDF

Garnier J, Gibrat JF, Robson B. (1996) GOR method for predicting protein secondary structure from amino acid sequence. Methods in Enzymology, 266: 540-53. PMID: 8743705. PDF

Session 5

Laura J. van 't Veer et al. Gene expression profiling predicts clinical outcome of breast cancer Nature 415, 530-536 (31 January 2002) | doi:10.1038/415530a; Received 24 August 2001; Accepted 22 November 2001 TEXT

Kwang-Il Goh, Michael E. Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási (2007) The human disease network. Proc Natl Acad Sci U S A. 2007 May 22;104(21):8685-90. Epub 2007 May 14. PDF

Session 6

Antezana E, Egaña M, Blondé W, Illarramendi A, Bilbao I, De Baets B, Stevens R, Mironov V, Kuiper M. (2009) The Cell Cycle Ontology: an application ontology for the representation and integrated analysis of the cell cycle process. Genome Biol. 2009;10(5):R58. Epub 2009 May 29. PDF

Session 7

Perelson AS. Modelling viral and immune system dynamics. Nat Rev Immunol. 2002 Jan;2(1):28-36. PDF

Session 8

ML Connolly. (1983) Solvent-accessible surfaces of proteins and nucleic acids. Science, 221(4612): 709-13. PMID: 6879170. PDF

Martin Karplus and J. Andrew McCammon. (2002) Molecular dynamics simulations of biomolecules. Nature Structural Biology, 9, 646-52. PMID: 12198485. PDF

Session 9

Dill KA, Ozkan SB, Shell MS, Weikl TR. (2008) The Protein Folding Problem. Annu Rev Biophys, 37:289-316. PMID: 2443096. PDF

Bowman GR, Beauchamp KA, Boxer G, Pande VS. “Progress and challenges in the automated construction of Markov state models for full protein systems,” J. Chem. Phys. 131 (2009) 124101 PDF

Paper for Monday April 4th


Papers for Wednesday April 6th

W. F. van Gunsteren and H. J. C. Berendsen, "Algorithms for Brownian dynamics," Molecular Physics 45 (1982) 637.

C. D. Snow, H. Nguyen, V. S. Pande, and M. Gruebele, "Absolute comparison of simulated and experimental protein-folding dynamics," Nature 420 (2002) 102.

Papers for Dr. O'Hern's Lectures:

J. D. Honeycutt and D. Thirumalai, “The nature of folded states of globular proteins,” Biopolymers 32 (1992) 695 PDF

W. C. Swope and J. W. Pitera, “Describing protein folding kinetics By molecular dynamics simulations. 1. Theory,” J. Phys. Chem. B 108 (2004) 6571 PDF

W. C. Swope, J. W. Pitera, et al., "Describing protein folding kinetics by Molecular Dynamics Simulations. 2. Example applications to Alanine Dipeptide and beta-hairpin peptide," J. Phys. Chem. B 108 (2004) 6582 PDF

D. Bratko, T. Cellmer, J. M. Prausnitz, and H. W. Blanch, “Molecular Simulation of protein aggregation,” Biotechnology and Bioengineering 96 (2007) 1 PDF

Homework 3

This is the homework assignment. The first file is the actual assignment and the following two files are needed for one of the problems.




Non-CBB Final Project


Final Project


Humoral immune responses are mediated through antibodies. About 10^10 to 10^12 different antigen binding sites, called paratopes, are generated by genomic recombination. These antibodies are capable of binding to a variety of structures ranging from small molecules to protein complexes, including any posttranslational modification thereof. When studying protein-antibody interactions, two types of epitopes (the regions that paratopes interact with) are to be distinguished from each other: i) conformational and ii) linear epitopes. All potential linear epitopes of a protein can be represented by short peptides derived from the primary amino acid sequence. These peptides can be synthesized and arrayed on solid supports, e.g. glass slides (see Lorenz et al., 2009 [1]). By incubating these peptide arrays with antibody mixtures such as human serum or plasma, one can determine which peptides interact with antibodies in a specific fashion. The training set of this challenge comprises sequences of peptides that either bind intravenous immunoglobulin (IVIg) antibodies with high affinity/avidity (positive training set) or do not (negative training set). The challenge consists of determining for each peptide within the test set whether its reactivity with antibodies is strong or weak. Any approach that predicts the specificity score of each peptide can in principle be applied to stratify the peptides in the test set into binders (to antibodies) and non-binders. Any publicly accessible information available for studying protein-protein interactions, as well as any approach enabling the determination of rule sets for predicting peptide-antibody affinities, may be applied.


Antibody-protein interactions play a major role in various medical disciplines (infectious diseases, autoimmune diseases, oncology, vaccination and therapeutic interventions). Antibodies present in human blood interact with peptide sequences in a sequence-specific manner. Ideally, one specific antibody (monoclonal antibody) might exclusively bind one specific sequence. However, experimental data indicate that many antibodies bind to a panel of related or even distinct peptides and do so with different affinities. The open question is whether rules exist that enable the prediction of common peptide/epitope sequences that can be recognized by human antibodies. The binding site covered by an antibody typically includes a stretch of 8 to 10 amino acids. If peptides of 15 amino acids in length are incubated with one monospecific antibody, that antibody will bind to its epitope independently of the physical position of the binding motif within the peptide. For a 10-residue motif, the possible locations run from positions 1 to 10 through positions 6 to 15. This uncertainty makes it difficult to determine consensus binding sites as well as meaningful position weight matrices (PWM). Individual amino acids within epitope binding sites may differ in their impact on antibody recognition, not only because of the nature of the amino acids involved in binding (physicochemical properties) but also because of the specific position of the amino acid within the whole peptide sequence (context). In the experimental work leading to this challenge, 75,534 peptides were incubated with commercially available intravenous immunoglobulin (IVIg) fractions. IVIg is a mixture of naturally occurring human antibodies isolated from up to 100,000 healthy individuals. From this dataset, high-confidence negative and positive pools of peptides were determined. The training and test datasets for this Challenge were assembled from these peptide pools.
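The positional uncertainty described above can be made concrete with a short sketch. It lists every window of a given width inside a peptide, using 1-based inclusive positions as in the text. The function name `candidate_windows`, the default width of 10, and the example peptide are illustrative assumptions, not part of the challenge materials.

```python
def candidate_windows(peptide, width=10):
    """Return (start, end, subsequence) for every window of `width` residues.

    Positions are 1-based and inclusive, matching the description in the text.
    A peptide shorter than `width` yields an empty list.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return [
        (i + 1, i + width, peptide[i:i + width])
        for i in range(len(peptide) - width + 1)
    ]


# A made-up 15-residue peptide, used only for illustration.
example = "ACDEFGHIKLMNPQR"
windows = candidate_windows(example)
for start, end, motif in windows:
    print(start, end, motif)
```

For a 15-residue peptide and a width of 10 this gives six windows, from positions 1 to 10 through positions 6 to 15. Counting residues per window position is one possible starting point for a position weight matrix, though the text notes why such matrices are hard to make meaningful here.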

The Challenge

From the collection of all the peptides incubated with human IVIg, a pool of 6,841 epitope containing peptide sequences reactive with human immunoglobulins was experimentally identified. This will be called the positive set. From the same original collection of peptides 20,437 peptides were identified that showed no antibody binding activity in any of the triplicate assays. This peptide set will be called the negative set. The training set was formed by picking 3,420 peptides from the positive set and 10,218 peptides from the negative set. The training set thus created contained 13,638 peptides and their respective binding reactivities. The test set was created by joining together the remaining 3,421 peptides from the positive set and the remaining 10,219 peptides from the negative set, for a total of 13,640 peptides. The epitope-antibody recognition challenge consists of determining whether each peptide in the test set belongs to the positive or negative set. Any accessible specificity information on amino acids and protein-protein-interactions available in the scientific community can be used.

The Data file contains the training set data. This file contains a tab-separated two-column table. The first column contains the peptide sequences. Most of these sequences are 15 amino acids long, but there are also some other sequence lengths (several sequences of 13 amino acids and a few of 16, 18, and 21 amino acids). The second column contains a measure of the reactivity of the peptide to the IVIg antibodies. The data, sorted in descending order according to the second column, is represented below:












The second column ranges from 1 to 65,423 (covering nearly all of the possible dynamic range of 1 to 65,536 of the original peptide microarray signal intensities). The peptides whose signals range from 10,000 to 65,536 were deemed to belong to the pool of peptides reacting with the antibodies, and are located in the first 3,420 rows. At the other extreme, the peptides whose signal lies between 1 and 1,000 were deemed to belong to the non-reactive peptides and correspond to the last 10,218 rows in the training set. This binarization of the data into a reactive positive set and a non-reactive negative set is made for clarity in the scoring of the submissions, but is otherwise arbitrary. The training set therefore contains a total of 13,638 rows, of which the first 3,420 rows constitute the pool of positive peptides and the last 10,218 rows constitute the pool of negative peptide sequences.
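A minimal sketch of reading such a table and applying the binarization follows. It assumes the file has exactly the two tab-separated columns described and skips any line whose second field is not a number (for example, a possible header row). The names `read_peptides`, `label`, and `POSITIVE_THRESHOLD`, and the two sample rows, are hypothetical.

```python
import csv
import io

POSITIVE_THRESHOLD = 10_000  # signals at or above this count as reactive


def read_peptides(handle):
    """Parse a tab-separated two-column table into (sequence, signal) pairs."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            signal = float(fields[1])
        except ValueError:
            continue  # skip lines without a numeric signal, e.g. a header
        rows.append((fields[0].strip(), signal))
    return rows


def label(signal):
    """Return 1 for a reactive peptide, 0 otherwise."""
    return 1 if signal >= POSITIVE_THRESHOLD else 0


# Two made-up rows standing in for the real training file.
sample = io.StringIO("ACDEFGHIKLMNPQR\t65423\nRQPNMLKIHGFEDCA\t12\n")
peptides = read_peptides(sample)
labels = [label(signal) for _, signal in peptides]
print(labels)
```

To use it on the real data, pass an open file handle instead of the `io.StringIO` sample.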

The Task

Your task is to create features that will be used to make these predictions. Based on the amino acid sequences provided in the training file, you need to think about which features will be useful. It will be helpful to look over the different bioinformatics programs and ideas covered in the class, because most of them can inspire features. The machine learning part will be automated and performed by the TFs. You will need to turn in: (1) a program that takes the input file mentioned above and outputs a feature file in the format described below; (2) the feature files for the training set and the test set; and (3) a 1 to 2 page paper describing the features used and why you chose them. At any point before the deadline you may submit a feature file and I will email you back how well you are performing (AUC). The top performer will earn 5 percentage points towards their final grade.
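Since submissions are reported as AUC, it can help to estimate it locally before submitting. The sketch below computes the area under the ROC curve from the rank-sum (Mann-Whitney U) identity, with tied scores sharing an average rank. It is only a self-check under the assumption that you have some classifier score per peptide; the exact procedure the TFs use is not specified here, and the function name `auc` is illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    `scores` are classifier outputs (higher means more likely positive) and
    `labels` are 0/1. Tied scores receive the average of the ranks they span.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Sort indices by ascending score, then assign 1-based average ranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.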

Feature File

The first line is:

@relation name

(put your own name in place of "name"). The next lines give the feature names:

@attribute x numeric
@attribute y {red,green,blue}
@attribute class {0,1}

You can have as many attributes as you want. Each line will state the name of the attribute (you can call them whatever you want though a meaningful name is preferred). After the name you need to have either the word numeric (which means that real numbers will be used in the attribute) or if the attribute is nominal (meaning that a set number of categories are used) you need to list the categories as shown above for attribute y. After all your features (attributes) you need the class attribute which is a 0 or 1 depending on the score of the peptide which is given by the input file. If the peptide has a score of 10,000 or greater assign it a 1 otherwise a 0.

Next in the output file is the line:

@data

followed by:

3,green,1
6,blue,0

In this example there are two instances (in your case, peptides). Each instance has its own line, listing the values of all the attributes, separated by commas, in the same order in which the attributes are declared at the top of the file. These lines should be in the same order as the peptides in the input file.
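The layout above can be produced with a few lines of code. The sketch below assumes the format is Weka's ARFF, in which the instance lines are introduced by an `@data` line. The two features (peptide length and fraction of hydrophobic residues) are placeholders chosen only to show the mechanics, and the `HYDROPHOBIC` grouping, function names, and sample peptides are all hypothetical.

```python
import io

HYDROPHOBIC = set("AVILMFWC")  # illustrative grouping, an assumption


def features(peptide):
    """Two placeholder features: length and hydrophobic fraction."""
    if not peptide:
        return [0, 0.0]
    hydrophobic = sum(1 for aa in peptide if aa in HYDROPHOBIC)
    return [len(peptide), round(hydrophobic / len(peptide), 4)]


def write_feature_file(out, relation, peptides, threshold=10_000):
    """Write the header, then one comma-separated line per peptide."""
    out.write(f"@relation {relation}\n")
    out.write("@attribute length numeric\n")
    out.write("@attribute hydrophobic_fraction numeric\n")
    out.write("@attribute class {0,1}\n")
    out.write("@data\n")
    for sequence, signal in peptides:
        label = 1 if signal >= threshold else 0  # class from the score
        values = features(sequence) + [label]
        out.write(",".join(str(v) for v in values) + "\n")


# Two made-up (sequence, signal) pairs, kept in input order.
buffer = io.StringIO()
write_feature_file(
    buffer, "name", [("ACDEFGHIKLMNPQR", 65423), ("RQPNMLKIHGFEDCA", 12)]
)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Writing to an open file instead of `io.StringIO` gives the file to submit. Real submissions would replace `features` with attributes motivated by the course material, keeping the class attribute last.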

This is the train file: http://archive.gersteinlab.org/docs/2011/03.28/train.txt. Use it to create the output file that you need to submit.

This is the test file.




















Class Requirements

Discussion Section / Readings

Papers will be assigned throughout the course. These papers will be presented and discussed in weekly sections with the TAs. A brief summary (a half-page per article) should be submitted at the beginning of the discussion session.

Bioinformatics quizzes

There will be approximately three short quizzes (25 minutes) in class comprising SIMPLE questions that you should be able to answer from the lectures plus the main readings.

Programming Assignments (CBB and CS)

There will be several short programming assignments required for CBB and CS students taking this course. Acceptable languages and submission requirements will be discussed prior to the first assignment. These assignments are NOT required for students not taking the CBB or CS sections of the course.


The course is keyed towards CBB graduate students as well as advanced MB&B undergraduates and graduate students wishing to learn about types of large-scale quantitative analyses that whole-genome sequencing will make possible. It would also be suitable for students from other fields such as computer science or physics wanting to learn about an important new biological application for computation.

Students should have:

A basic knowledge of biochemistry and molecular biology.

A knowledge of basic quantitative concepts, such as single-variable calculus, some probability and statistics, and basic programming skills.

These can be fulfilled by the following prerequisites statement: "Prerequisites: MBB 200 and Mathematics 115 or permission of the instructor."

Relevant Yale College Regulations

Students may have questions concerning end-of-term matters. Links to further information about these regulations can be found below:



Pages from previous years

2010, 2009 and earlier

Research Opportunities

If you're really motivated, take a look at http://bioinfo.mbb.yale.edu/jobs/.
