Machine Learning for Genomics Course.

Machine Learning For Genomics

Introduction

The volume of genomic data generated by researchers has grown exponentially. This growth demands better tools for deriving insights from the data, including combining it with other data for better inference and decision-making. Machine learning, deep learning, and artificial intelligence have matured into powerful tools that can be applied in genomics. However, in Africa, there is still a skills gap among bioinformatics students in these technologies. In this course, we introduce the basics of machine learning, including practical skills in transforming genomic data for machine learning modelling. Although the MSc Bioinformatics curricula contain the course, it is not being taught, which puts students at a disadvantage as bioinformatics leans more towards data science.

Competencies

In this short course, we intend to impart knowledge and skills in the following competencies (see ISCB Competencies):

  1. Knowledge and skills: Details of the scientific discovery process and the role of bioinformatics in it.
  2. Knowledge, comprehension, and application: Statistical, machine learning, and data science research methods in the context of molecular biology, genomics, medical or population genetics research.
  3. Knowledge and application: Command line and scripting-based computing skills appropriate to the discipline.
  4. Knowledge and skills: Data management.

Learning Objectives

To attain the above competencies, the workshop participants should be able to:

  1. Describe the application of machine learning in genomics
  2. Explain the various machine learning principles and how they can be applied to genomics
  3. Explain the research design approaches as applied to machine learning for genomics
  4. Know the various open science tools (Jupyter Notebooks, Pandas, Conda) and how they support reproducible bioinformatics research
  5. Know the various machine learning frameworks in Python

Learning Outcomes

From the above objectives, the workshop participant should acquire the following skills:

  1. Be able to set up Jupyter and Conda environments for a genomics machine learning project to ensure reproducibility
  2. Be able to transform genomic data for machine learning modelling
  3. Be able to perform exploratory analysis on genomic data, feature engineering, and parameter selection
  4. Be able to develop and validate machine learning models using genomic data
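
As one small illustration of the second outcome, the sketch below one-hot encodes a DNA string using only the Python standard library. The function name `one_hot_encode`, the A, C, G, T column order, and the all-zero row for ambiguous bases such as N are assumptions made for this example; the course notebooks may encode sequences differently.

```python
# Illustrative sketch only: one-hot encoding is one common way to turn DNA
# into numbers; it is not necessarily the encoding used in the course notebooks.

BASES = "ACGT"  # assumed column order: A, C, G, T


def one_hot_encode(sequence):
    """Return one row per base, with a single 1 in the column of that base.

    Any symbol outside A, C, G, T (for example N) becomes an all-zero row.
    """
    encoded = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        index = BASES.find(base)
        if index != -1:
            row[index] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

Each sequence of length n becomes an n x 4 table of zeros and ones, which can then be flattened or stacked into the numeric arrays that most machine learning libraries expect.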

Instructors

  1. Caleb Kibet

Who should attend?

EANBiT Fellows

Contents

This course is broken up into several notebooks (lectures).

Session 1

Session 2

Session 3

Session 4

  • Notebook_06 Machine Learning Using VCF output: Dimensionality Reduction
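
Before dimensionality reduction can be applied, the genotype calls in a VCF file have to become numbers. A minimal sketch of that step, assuming a small uncompressed VCF held in a string and counting alternate alleles per sample (0, 1, or 2 for a diploid call), might look like this. The helper names and the example records are hypothetical and are not taken from Notebook_06; real projects often rely on a dedicated VCF parsing library instead.

```python
# Hypothetical sketch: the helper names and EXAMPLE_VCF are made up for
# illustration and are not taken from Notebook_06.


def genotype_to_count(genotype):
    """Count alternate alleles in a GT value such as '0/1' or '1|1'.

    Returns None when any allele is missing ('.').
    """
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def genotype_matrix(vcf_text):
    """Return (sample_names, rows) with one row of allele counts per variant."""
    samples, rows = [], []
    for line in vcf_text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            continue
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        rows.append(
            [genotype_to_count(cell.split(":")[gt_index]) for cell in fields[9:]]
        )
    return samples, rows


EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "s1", "s2", "s3"]),
    "\t".join(["1", "100", ".", "A", "G", ".", "PASS", ".", "GT",
               "0/0", "0/1", "1/1"]),
    "\t".join(["1", "200", ".", "C", "T", ".", "PASS", ".", "GT:DP",
               "0|1:12", "./.:0", "1|1:9"]),
])

if __name__ == "__main__":
    samples, rows = genotype_matrix(EXAMPLE_VCF)
    print(samples)
    print(rows)
```

For the example records this yields the rows `[0, 1, 2]` and `[1, None, 2]`. Missing calls (`None`) would need to be imputed or filtered, and the matrix transposed so that each sample is a row, before a method such as PCA is applied.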

Quick Introduction to Jupyter Notebooks

Throughout this course, we will be using Jupyter Notebooks.

Introduction

The Jupyter Notebook is an interactive computing environment that enables users to author notebooks, which contain a complete and self-contained record of a computation. This makes the notebooks easy to share with others. The notebooks may contain:

  • Live code
  • Interactive widgets
  • Plots
  • Narrative text
  • Equations
  • Images
  • Video

Note that "Jupyter" is a loose acronym for Julia, Python, and R, the primary languages supported by Jupyter.

Notebooks allow a computational researcher to create reproducible documentation of their research. As bioinformatics is data-centric, the use of Jupyter Notebooks increases research transparency, hence promoting open science.

Pre-requisites

Machine learning for genomics assumes familiarity with Python and Pandas. Please have a look at the Python4Bioinformatics training materials for a refresher.

First Steps

Installation

  1. Download Miniconda for your specific OS to your home directory
    • Linux: wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    • Mac: curl -LO https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
  2. Run the installer you downloaded:
    • Linux: bash Miniconda3-latest-Linux-x86_64.sh
    • Mac: bash Miniconda3-latest-MacOSX-x86_64.sh
  3. Follow all the prompts: if unsure, accept defaults
  4. Close and re-open your terminal
  5. If the installation is successful, you should see a list of installed packages with
    • conda list

If the conda command cannot be found, you can add the Miniconda bin directory to your PATH using: export PATH=~/miniconda3/bin:$PATH

For reproducible analysis, you can create a conda environment that contains all the Python packages you need.

`conda create --name ml_genomics python jupyter`

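To activate the conda environment (on conda 4.4 or later, `conda activate ml_genomics` is the recommended equivalent of the command below):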

`source activate ml_genomics`

Having set up the conda environment, you can install JupyterLab using conda:

conda install -c conda-forge jupyterlab

or by using pip:

pip3 install jupyterlab
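
To confirm that the activated environment actually contains the tools the course relies on, a short check like the following can be run with the environment's Python interpreter. It is only a sketch: the function name `check_packages` and the package list (`jupyterlab`, `pandas`) are assumptions for illustration, so adjust the list to whatever you installed.

```python
import importlib.util
import sys


def check_packages(names):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The interpreter path shows which environment is currently active.
    print("Python executable:", sys.executable)
    # Assumed package list for illustration; edit to match your environment.
    for name, found in check_packages(["jupyterlab", "pandas"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, check that the `ml_genomics` environment is active and install the package into it before launching the notebooks.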

How to learn from this resource?

Download all the notebooks from MachineLearning4Genomics. The easiest way to do that is to get a copy of the GitHub repository in your working directory, either by cloning it or by downloading it as a zip archive:

git clone https://github.com/mbbu/MachineLearning4Genomics.git

or

wget https://github.com/mbbu/MachineLearning4Genomics/archive/master.zip

unzip master.zip

rm master.zip

cd MachineLearning4Genomics-master

Then you can launch JupyterLab using:

jupyter lab

NB: We will use JupyterLab for training. A Jupyter notebook is made up of many cells. Each cell can contain Python code. You can execute a cell by clicking on it and pressing Shift-Enter (run and move to the next cell) or Ctrl-Enter (run and stay on the same cell).

Resources to use:

  1. Encoding DNA

  2. Machine Learning in Bioinformatics: Genome Geography: From raw sequencing reads to a machine learning model that infers an individual's geographical origin based on their genomic variation.

  3. Deep Learning for Genomics

  4. Machine Learning for Genomics. How to transform your genomics data to fit into machine learning models.

  5. Machine Learning For Good

  6. Machine Learning in Bioinformatics

  7. Feature Engineering in Genomics - Variant calling

  8. Machine learning for genomic classification

  9. Support Vector Machines

  10. Mathematics For Machine Learning

  11. Deep Learning Book - Machine Learning Chapter

To find datasets and take your learning even further, use Kaggle.

How to Contribute

To contribute, fork the repository, make some updates and send me a pull request.

Alternatively, you can open an issue.

License

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/