Machine Learning for Biology [[DEMO]]

Course description

This course is designed to give an overview of machine learning techniques, including data preparation, supervised vs. unsupervised machine learning algorithms, and application considerations. We will review major software tools in informatics, including R, git, and LaTeX/Markdown. Example data sets will be drawn from genetics and analytical chemistry.

Learning goals

Construct data pipelines for reproducible analysis.
Apply and interpret different machine learning algorithms with appropriate pre-processing and validation, including:
- supervised machine learning: random forest, k-nearest neighbor, support vector machines
- unsupervised machine learning: k-means clustering, partitioning around medoids, truncated singular value decomposition
Interpret published studies utilizing machine learning and understand the strengths, weaknesses, and caveats of each approach.

Course resources

Office hours: TBD
Git repository with lecture notes and assignments
Textbook: TBD
Library access for primary cited literature

Course outline

Week	Topic
1	Tools: R, git, LaTeX, and Markdown
2	Data cleaning, regular expressions, and outliers
3	Eigenvectors, eigenvalues, and distance
4	Supervised machine learning: k-nearest neighbor
5	Supervised machine learning: Random forest
6	Review and midterm
7	Supervised machine learning: Support vector machines
8	Unsupervised machine learning: k-means clustering
9	Unsupervised machine learning: Partitioning around medoids
10	Unsupervised machine learning: Truncated singular value decomposition
11	Training and validation
12	Review

Homework:

Assignments will be due on Mondays and structured around topics covered the previous week.
Homework assignments are expected to be formatted in either LaTeX or R Markdown. If there is a code element (as in most homeworks), executable R Markdown files are expected. Handwritten assignments and documents written in Microsoft Word or Excel will not be accepted. Jupyter notebooks and other executable formats may be accepted; please contact the instructor.
Homework solutions should include a copy of the problem statement the homework is answering.
Assignments will be assessed on:
- 10% assignment attempted and problem statements included
- 40% readability of code and prose
- 40% method correctness and executable code
- 10% correct answer
Late assignments are accepted at the discretion of the instructor. Poor planning on the part of the student is not sufficient reason for acceptance of a late assignment.

Authorship and attribution

You are expected to do your own work and cite/credit others appropriately. Cases of plagiarism will be brought to the appropriate university officials.
Group projects will have an Author Attribution paragraph where individual contributions to the project will be clearly identified.
While you are encouraged to work through homework assignments in study groups, they should be written up individually and reflect your own understanding of the material.

Final project

Final projects will be selected in the middle of the semester. Students will select a study from the research literature that uses one of the techniques covered in class and reproduce the analysis. Students can also analyze their own data with a method discussed, or related to those discussed, in class. This analysis must be completely automated, including data download, analysis, and any relevant figure generation. Each student is expected to analyze a different data set and to specify both the data and the analysis they will be working with by week 8.

Grade weights:

30% homework assignments (lowest homework thrown out)
20% midterm
20% final project
30% final
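
As an illustration of how these weights combine, here is a minimal sketch in Python (the course itself uses R; Python is used here only for illustration). It assumes every score is on a 0-100 scale, that homework assignments are weighted equally, and that exactly one lowest homework score is dropped. The function name `course_grade` and the example scores are hypothetical.

```python
def course_grade(homework, midterm, final_project, final_exam):
    """Return the weighted course grade on a 0-100 scale.

    Assumes all scores are on a 0-100 scale and that homework
    assignments carry equal weight (an assumption, not stated policy).
    """
    if len(homework) < 2:
        raise ValueError("need at least two homework scores to drop the lowest")
    kept = sorted(homework)[1:]  # drop the single lowest homework score
    homework_avg = sum(kept) / len(kept)
    return (0.30 * homework_avg      # 30% homework assignments
            + 0.20 * midterm         # 20% midterm
            + 0.20 * final_project   # 20% final project
            + 0.30 * final_exam)     # 30% final


# Hypothetical scores: the homework score of 50 is dropped before averaging.
print(course_grade([90, 80, 50, 100], midterm=80, final_project=90, final_exam=70))
```

With these hypothetical scores, the homework average is taken over 80, 90, and 100 only, so one weak assignment does not lower the homework component.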

ktoddbrown/demo_BioML