Cbb752b14
From GersteinInfo
Latest revision as of 00:32, 23 April 2014
Bioinformatics: Practical Application of Data Mining & Simulation
17th iteration at Yale, with material from all previous years available! (GersteinLab.org/courses/452)
News
In-class poll on 3 March: which of these lectures did you like most? [INTRODUCTION] [ALIGNMENT] [UNSUPERVISED MINING] [SUPERVISED MINING] [NETWORK TOPOLOGY] [FUNSEQ APPLICATION] [NETWORK PREDICTION]
Quiz 2 is on Wednesday, 26 Feb, and will cover all of the material up through Monday, 24 Feb.
Quiz 1 is on Wednesday, 12 Feb, and will cover all of the material up through slide 31 of lecture 7 (3 Feb).
Discussion sections start this week (week of 27 Jan)! Both sections will be held in Bass 405 (directly above our lecture classroom). One will be Wed 2:30-3:30 pm, and the other will be Fri from 4:30-5:30pm. See readings. Please write a 1-2 paragraph summary of each paper, to be turned in before section.
If you are still not receiving class emails, please contact Michael at michael.rutenbergschoenberg (at) yale.edu.
Schedule
Class Schedule (including a list of topics and quiz dates)
Course Information
Course Description
Bioinformatics encompasses the analysis of gene sequences, macromolecular structures, and functional genomics data on a large scale. It represents a major practical application for modern techniques in data mining and simulation. Specific topics to be covered include sequence alignment, large-scale processing, next-generation sequencing data, comparative genomics, phylogenetics, biological database design, geometric analysis of protein structure, molecular-dynamics simulation, biological networks, normalization of microarray data, mining of functional genomics data sets, and machine learning approaches for data integration.
Concise undergraduate course description
Techniques in data mining and simulation applied to bioinformatics, the computational analysis of gene sequences, macromolecular structures, and functional genomics data on a large scale. Sequence alignment, comparative genomics and phylogenetics, biological databases, geometric analysis of protein structure, molecular-dynamics simulation, biological networks, microarray normalization, and machine-learning approaches to data integration.
See entry from undergraduate catalog: http://students.yale.edu/oci/resultDetail.jsp?course=23441&term=201401, viz:
MB&B 452 01 (23441) /MCDB452/CB&B752/MCDB752/CPSC752/MB&B452 Bioinformatics:Mining&Simulatn Mark Gerstein MW 1.00-2.15 BASS 305
Fall 2014 No regular final examination Areas Sc Prerequisites: MB&B 301b and MATH 115a or b, or permission of instructor. MCDB 120a or 200b is a prerequisite for courses numbered MCDB 202 and above.
Different headings for this class
MB&B452/MCDB452
This version of the course consists of lectures, written problem sets, and a final project (a semi-computational section and a literature survey).
MB&B752/MCDB752
This version of the course consists of lectures, written problem sets, and a final project (a semi-computational section and a literature survey).
CB&B752/CPSC752
This version of the course consists of lectures, programming assignments, and a final programming project.
For graduate students the course can be broken up into two "modules" (each counting 0.5 credit towards MB&B course requirement):
MB&B 753a3, Bioinformatics: Practical Application of Data Mining (1st half of term)
MB&B 754a4, Bioinformatics: Practical Application of Simulation (2nd half of term)
Each module consists of lectures, written problem sets, and a final, graduate level written project that is half the length of the full course's final project.
For the grade weighting schemes of each course version, see Class Requirements section.
Prerequisites
The course is keyed towards CBB graduate students as well as advanced MB&B undergraduates and graduate students wishing to learn about types of large-scale quantitative analyses that whole-genome sequencing will make possible. It would also be suitable for students from other fields such as computer science or physics wanting to learn about an important new biological application for computation.
Students should have:
A basic knowledge of biochemistry and molecular biology. A knowledge of basic quantitative concepts, such as single variable calculus, some probability and statistics, and basic programming skills. These can be fulfilled by the following prerequisites statement: "Prerequisites: MBB 200 and Mathematics 115 or permission of the instructor."
Timing & location
Class: Meets 1:00-2:15 pm on Mondays and Wednesdays in Bass 305. The first meeting will be on Mon, 13 Jan 2014. The third meeting will be on Fri, 17 Jan 2014, as part of Yale's compensation for canceling classes on Mon, 20 Jan 2014, in observance of MLK Day. See Course Schedule for details.
Discussion section:
Section 1 (Michael): Wednesdays 2:30-3:30pm, starting 29 Jan 2014.
Section 2 (Cong): Fridays 4:30-5:30pm, starting 31 Jan 2014.
Instructors
Consultation is available UPON REQUEST or according to times stipulated by the individual instructors. Email cbb752(at)gersteinlab.org to reach the instructor and the TFs.
Instructor-in-Charge
Name | Office | Email |
---|---|---|
Mark Gerstein | Bass 432A | mark.gerstein *at* yale.edu |
Guest Instructors
Name | Office | Email |
---|---|---|
Corey O'Hern | Mason Laboratory | corey.ohern(at)yale.edu |
Jesse Rinehart | 300 George St | jesse.rinehart(at)yale.edu |
James Noonan | 333 Cedar St | james.noonan(at)yale.edu |
Kei Cheung | 300 George St | kei.cheung(at)yale.edu |
Steven Kleinstein | 300 George St | steven.kleinstein(at)yale.edu |
Teaching Fellows
Name | Office | Email |
---|---|---|
Michael Rutenberg Schoenberg | Bass 437 | michael.rutenbergschoenberg(at)yale.edu |
Cong Li | 300 George, Suite 503 | cong.li(at)yale.edu |
Discussion Section
Section 1 (Michael): Wednesdays 2:30-3:30pm, starting 29 Jan 2014.
Section 2 (Cong): Fridays 4:30-5:30pm, starting 31 Jan 2014. Email Cong at cong.li (at) yale.edu if you want to attend this section.
Each section will include discussion of the papers assigned below. Students are expected to submit 1-2 paragraph summaries of each paper before the section. In Section 1, students will give 15-20 min presentations of the papers. The second section will likely be much smaller and will have a discussion format. The written assignment will be the same for both sections, and students will be graded on a combination of the written assignments and their participation in discussions.
Discussion Section Readings
Session 1: Next Gen Sequencing (Experimental)
Metzker ML. "Sequencing technologies - the next generation," Nature Reviews Genetics. 11 (2010) PDF
Wheeler DA et al. "The complete genome of an individual by massively parallel DNA sequencing," Nature. 452:872-876 (2008) PDF
Session 2: Proteomics/Sequence Alignment
T.F. Smith and M.S. Waterman. (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147(1): 195-7. PMID: 7265238. PDF
Nevan J. Krogan et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637-643 (30 March 2006). PDF
Additional readings suggested by Professor Rinehart
Session 3: Sequence Alignment/Machine learning
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-10. PMID: 2231712. PDF
Yip, KY, Cheng, C, Gerstein, M (2013). Machine learning and genome annotation: a match meant to be? Genome Biol., 14, 5:205. PDF
Session 4: Bioinformatics for Next-Gen Sequencing
Rozowsky, J, Euskirchen, G, Auerbach, RK, Zhang, ZD, Gibson, T, Bjornson, R, Carriero, N, Snyder, M, Gerstein, MB (2009). PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat. Biotechnol., 27, 1:66-75 PDF
Cooper, GM, Shendure, J (2011). Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet., 12, 9:628-40 PDF
Session 5: Bioinformatics for Next-Gen Sequencing 2
Lior Pachter. Models for Transcript Quantifications from RNA-Seq (2011) arXiv PDF
Session 6: Networks
Ekman D, Light S, Björklund AK, Elofsson A. (2006) What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol. 2006;7(6):R45. PDF
Barabási, AL, Oltvai, ZN (2004). Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5, 2:101-13. PDF
Session 7: Immunological Modeling/Semantic Web
Perelson AS. Modelling viral and immune system dynamics. Nat Rev Immunol. 2002 Jan;2(1):28-36. PDF
Antezana E, Egaña M, Blondé W, Illarramendi A, Bilbao I, De Baets B, Stevens R, Mironov V, Kuiper M. (2009) The Cell Cycle Ontology: an application ontology for the representation and integrated analysis of the cell cycle process. Genome Biol. 2009;10(5):R58. Epub 2009 May 29. PDF
Session 8: Protein Simulation 1
Martin Karplus and J. Andrew McCammon. (2002) Molecular dynamics simulations of biomolecules. Nature Structural Biology, 9, 646-52. PMID: 12198485. PDF
Zhou, AQ, O'Hern, CS, Regan, L (2011). Revisiting the Ramachandran plot from a new angle. Protein Sci., 20, 7:1166-71 PDF
Session 9: Protein Simulation 2
Dill KA, Ozkan SB, Shell MS, Weikl TR. (2008) The Protein Folding Problem. Annu Rev Biophys, 37:289-316. PMID: 2443096. PDF
Bowman GR, Beauchamp KA, Boxer G, Pande VS. “Progress and challenges in the automated construction of Markov state models for full protein systems,” J. Chem. Phys. 131 (2009) 124101 PDF
Class Requirements
Discussion Section / Readings
Papers will be assigned throughout the course. These papers will be presented and discussed in weekly 60-minute sections with the TFs. A brief summary (a half-page per article) should be submitted at the beginning of the discussion session.
Bioinformatics quizzes
There will be four short quizzes (25 minutes) in class comprising SIMPLE questions that you should be able to answer from the lectures plus the main readings.
Answer keys to Quizzes 1-4 from cbb752a12 can be found here.
Programming Assignments (CBB and CS) and Programming issues
There will be several short programming assignments required for CBB and CS students taking this course. Acceptable languages and submission requirements will be discussed prior to the first assignment. These assignments are NOT required for students not taking the CBB or CS sections of the course.
These are the programming languages that we permit in the programming assignments and final project: Perl, Python, C, C++, MATLAB and R. If you really feel more comfortable with other languages, please email the TFs to discuss. Also, packages such as BioPerl and BioPython are not allowed in the assignments and final project. If in doubt, please consult the TFs.
We recommend the use of Perl for most of the programming. A useful resource is the following book: Programming Perl, 3rd Edition, in the O'Reilly series, by Larry Wall, Tom Christiansen, and Jon Orwant. The Yale Library also has older editions, which would work too. We would also recommend the following online resources: http://www.perlmonks.org/ and http://stackoverflow.com/. Otherwise, Google is your best friend.
Assignment postings
Assignment 1 DUE: 3 March 2014
Files for programming assignment: Download here
Test input and output files: Download here
Assignment 2 DUE: 7 April 2014
Assignment 3 (updated with reference containing values for model constants) DUE: 24 April 2014
supplementary file for programming assignment
Assignment 3 (non-programming)
Final Project
Final Project "Final version" - updated 4 April 2014 (filepaths of project materials corrected) DUE: 30 Apr 2014 11:59pm
Files for MBB/MCDB pseudocomputational section
Grade Categories
The following are the approximate grading systems:
CBB and CPSC Sections:
Category | % of Total Grade |
---|---|
Quizzes | 33% |
Final Project | 33% |
Discussion Section | 9% |
Programming Assignments | 25% |
MBB and MCDB Sections:
Category | % of Total Grade |
---|---|
Quizzes | 33% |
Final Project | 33% |
Discussion Section | 17% |
Problem Sets | 17% |
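As a rough illustration of how the approximate weights in the two tables above combine into a single percentage, here is a minimal Python sketch. The function name `weighted_total`, the `WEIGHTS` dictionary, and the example scores are hypothetical and are not part of the course materials; the actual grading procedure may differ, since the weights are stated to be approximate.

```python
# Approximate category weights copied from the two tables above
# (percent of total grade). Keys and names here are illustrative only.
WEIGHTS = {
    "CBB/CPSC": {
        "Quizzes": 33,
        "Final Project": 33,
        "Discussion Section": 9,
        "Programming Assignments": 25,
    },
    "MBB/MCDB": {
        "Quizzes": 33,
        "Final Project": 33,
        "Discussion Section": 17,
        "Problem Sets": 17,
    },
}


def weighted_total(section, scores):
    """Combine per-category scores (each on a 0-100 scale) into one percentage."""
    weights = WEIGHTS[section]
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Weighted average; dividing by the weight sum keeps the result on 0-100.
    return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())


# Hypothetical scores, for illustration only.
example = {
    "Quizzes": 80,
    "Final Project": 90,
    "Discussion Section": 100,
    "Programming Assignments": 70,
}
print(round(weighted_total("CBB/CPSC", example), 2))  # prints 82.6
```

Dividing by the sum of the weights rather than a hard-coded 100 keeps the sketch correct even if the approximate percentages are later adjusted.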
Relevant Yale College Regulations
Students may have questions concerning end-of-term matters. Links to further information about these regulations can be found below:
http://yalecollege.yale.edu/content/reading-period-and-final-examination-period
http://yalecollege.yale.edu/content/completion-course-work
Brief presentation on how to cite correctly: http://archive.gersteinlab.org/mark/out/log/2012/06.12/cbb752b12/cbb752_cite.ppt
Plagiarism
Below is a message from Dean Mary Miller of Yale College about citing your references and sources of information and plagiarism:
"
You need to cite all sources used for papers, including drafts of papers, and repeat the reference each time you use the source in your written work.
You need to place quotation marks around any cited or cut-and-pasted materials, IN ADDITION TO footnoting or otherwise marking the source.
If you do not quote directly – that is, if you paraphrase – you still need to mark your source each time you use borrowed material.
Otherwise you have plagiarized.
It is also advisable that you list all sources consulted for the draft or paper in the closing materials, such as a bibliography or roster of sources consulted.
You may not submit the same paper, or substantially the same paper, in more than one course.
If topics for two courses coincide, you need written permission from both instructors before either combining work on two papers or revising an earlier paper for submission to a new course.
It is the policy of Yale College that all cases of academic dishonesty be reported to the chair of the Executive Committee.... "
Also, this recent article regarding academic dishonesty may be of interest.
Misc
Permissions on using website material
Graphic for course homepage
If you're really motivated, take a look at http://gersteinlab.org/jobs for further Research Opportunities
Polls
Poll for students' sign up and good times for the weekly discussion section
Poll 2 Section sign-up
Pages from previous years
2014 is the 17th time Bioinformatics has been taught at Yale. Pages for the 16 previous iterations of the class are available. Look at how things evolve!
2012 fall, 2012 spring (quizzes), 2011, 2010, 2009 and earlier (12 years of classes, starting in '98)