Machine Learning in Medical Bioinformatics

This repository contains preparatory tasks, reading material and exercises for the Machine Learning in Medical Bioinformatics course.

Preparatory tasks (To be finished before the Workshop)

Below is a list of tasks and reading material that should be finished before coming to the Workshop. There is quite a lot of material, so it is completely alright if you don't understand every single detail. The workload for the preparatory tasks should be approximately one week, and you are expected to spend this amount of time preparing for the physical meeting. How much time you spend on the various parts is up to you and depends on your background and interests, but you should come prepared and contribute to the workshop. You are encouraged to bring any questions that have come up for discussion at the workshop.

After finishing the preparatory exercises, post at least three questions and/or discussion points as answers to the pre-course assignment in Canvas, no later than the day before the course starts. We will spend some time on the first day discussing these questions, so ideally they should be open-ended, for example: "I have data X in my research project; how can I apply machine learning to it to learn trait Y?" They can also be simple, such as "Explain concepts X and Y".

Good luck, and if you have any questions, do not hesitate to contact me.

1. Setup

Many of the exercises will use Jupyter notebooks, an interactive Python environment that makes it possible to combine documentation with code. It is also possible to run Python and R code together. Below are some instructions on how to set everything up.

Read all of the instructions before starting. Some tasks are practical, some are reading, and some may overlap.
Install Anaconda on your laptop. This will install a special version of Python that includes the Jupyter Notebook and basically all Python modules needed (deep-learning modules will be installed separately).

If you are unfamiliar with Jupyter notebooks you can learn more using the following tutorial: https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook.
Git clone this repo: https://github.com/bjornwallner/ML_medbioinfo and make sure you can open the notebook in notebooks/intro.ipynb using the following commands:

git clone https://github.com/bjornwallner/ML_medbioinfo
cd ML_medbioinfo/notebooks
jupyter-notebook intro.ipynb

You can use this notebook or create your own when doing the next exercise.
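
Once the notebook opens, a quick way to confirm that the environment is usable is to check that the core modules can be imported. The sketch below is only an illustration: the module list and the `check_modules` helper are assumptions made for this example and are not part of the repository.

```python
import importlib
import importlib.util

# Import names (not package names) of modules the exercises are assumed to need.
REQUIRED = ["numpy", "scipy", "matplotlib", "sklearn"]


def check_modules(names):
    """Return {module name: version string, or None if the module is not installed}."""
    found = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            found[name] = None
            continue
        module = importlib.import_module(name)
        # Not every module exposes __version__, so fall back to "unknown".
        found[name] = getattr(module, "__version__", "unknown")
    return found


for name, version in check_modules(REQUIRED).items():
    print(f"{name:12s} {version or 'MISSING'}")
```

If any line reports MISSING, the module can usually be installed from the Anaconda prompt before continuing.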

2. PREP: Machine Learning Introduction

We will use the scikit-learn module to do machine learning in Python. It is built on NumPy, SciPy, and matplotlib, is fairly easy to use, and contains all the basic functions for regular supervised and unsupervised learning. It contains a neural network module as well, but it is fairly limited, so for neural nets we will use TensorFlow and the Keras API.

Use a Jupyter notebook to do the first two tutorials on scikit-learn, focusing on the key concepts outlined below. The goals of this preparatory exercise are:
- Understand the key concepts
- Familiarize yourself with Jupyter notebooks
Actions:
- Write down descriptions for the concepts
- Think about cases where ML can be applied to your particular area (this can be used in the project later as well).

Hint: you can hide the prompt and output of code blocks by clicking the top right corner (see below)

An introduction to machine learning with scikit-learn
- Key concepts:
  - Training set
  - Testing set
  - Samples
  - Features
  - Target
  - Classification
  - Regression
  - Model fit
  - Model predict
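
To make the concepts above concrete, here is a minimal sketch of a classification workflow in scikit-learn. The choice of the bundled iris data set, the k-nearest-neighbours classifier and the 25% test split are assumptions made for illustration; the tutorial uses its own examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Samples are rows, features are columns, and the target is what we predict.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)

# Hold out 25% of the samples as a testing set; the rest is the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Classification: the target is a discrete class label.
# (In regression the target would be a continuous value instead.)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)        # model fit: learn from the training set
y_pred = model.predict(X_test)     # model predict: label unseen samples

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The same `fit` / `predict` pattern applies to nearly every scikit-learn estimator, which is what makes it easy to swap one method for another.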
A tutorial on statistical-learning for scientific data processing, stop at "Putting it all together"
- Key concepts:
  - Supervised learning, incl. examples of methods
  - Unsupervised learning, incl. examples of methods
  - Model selection
  - Model estimator
  - Model parameters
  - Score
  - Cross-validation
  - Grid-search
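
Cross-validation and grid search from the list above can be sketched as follows. The data set, the SVM estimator and the parameter grid are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: score one estimator with fixed parameters on 5 folds.
estimator = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(estimator, X, y, cv=5)
print("accuracy per fold:", scores.round(2))

# Grid search: cross-validate every parameter combination and keep the best.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV score: {search.best_score_:.2f}")
```

Note that `best_score_` is selected on the same folds used for tuning, so it tends to be optimistic; a separate held-out test set gives a fairer estimate of the final model.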

3. PREP: Statistical principles in supervised machine learning: overfitting, regularization and all that

Read one of these:

For those with a maths/stats background who want to go slightly deeper into the topic:
- Hastie et al. (2009). The Elements of Statistical Learning. Springer. https://web.stanford.edu/~hastie/ElemStatLearn/. Chapters 2 and 3
For those with other backgrounds:
- James et al. (2013). An Introduction to Statistical Learning – with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/. Chapters 2 and 6

4. PREP: Deep Learning

Browse or read: Michael Nielsen, Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/, chapters 1-4.

5. PREP: Project

Read the project description below
Think about use cases of ML in your problem domain.

Don't forget that after finishing the preparatory exercises you should post at least three questions and/or discussion points as answers to the pre-course assignment in Canvas, at 23:59 the day before the course starts at the latest.

Project

The project is the examining part of the course. Together with participation at the workshop, it is compulsory for gaining the course credits. The workload is expected to be about a week. The project is your chance to learn a bit more about some particular ML methods. It consists of applying some ML methods to a particular dataset or datasets and then comparing the results. The results should be compiled in a written report including:

- Description of the chosen methods. In order to compare performance you need either to choose two (or more) different methods or, in the case of deep learning, you could compare different architectures.
- What parameters are important to optimize for the chosen ML methods?
- Which performance measures will be used: correlation, PPV, F1 or AUC? Does it matter?
- Description of your data set.
- Description of how cross-validation was performed. How was the data split to avoid similar examples in training and validation?
- Results from parameter optimizations, plots or tables.
  - What parameters are optimal?
- Conclusions on the difference between ML methods, performance, sensitivity to parameter choices, ease-of-use etc.
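
One possible way to keep similar examples out of both training and validation is to assign each sample to a group and split by group. The sketch below is only an illustration under stated assumptions: the data is synthetic, the `groups` labels are hypothetical (in practice they might be a cluster id for related proteins or a patient id), and the two methods compared are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_validate

# Synthetic stand-in for a real data set: 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical group labels: samples sharing a label are considered "similar".
rng = np.random.default_rng(0)
groups = rng.integers(0, 20, size=len(y))

# GroupKFold never places the same group in both training and validation.
cv = GroupKFold(n_splits=5)
for train_idx, val_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

# Compare two methods with the same folds and the same performance measures.
methods = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scoring = ["precision", "f1", "roc_auc"]  # precision is the same as PPV

results = {}
for name, model in methods.items():
    out = cross_validate(model, X, y, groups=groups, cv=cv, scoring=scoring)
    results[name] = {metric: out[f"test_{metric}"].mean() for metric in scoring}
    print(name, {k: round(v, 2) for k, v in results[name].items()})
```

Reporting several measures side by side helps answer the "does it matter?" question, since methods can rank differently under PPV, F1 and AUC.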

If you cannot find a suitable ML project within your particular domain, you can use data from ProQDock (paper: https://academic.oup.com/bioinformatics/article/32/12/i262/2288786). Alternatively, you can choose a data set from the Machine Learning Repository to study. Make sure it has a good balance between the number of examples (# Instances) and the number of features (# Attributes).

For example:

Upload a PDF of your report as your answer to the Project assignment in Canvas.
