Bioinformatics-training-collection

A collection of resources for learning about tools and languages related to Bioinformatics

This page contains a curated list of resources for learning bioinformatics tools and the programming languages used alongside them.

Programming

Python

Python is one of the most versatile, powerful and easy-to-use programming languages. It is very well suited for Bioinformatics because of its versatile, lightweight nature and its extensibility through (Bioinformatics-related) packages. These packages extend the capabilities of the language, adding new functionality.

To program in Python, you will first need to download a Python distribution and then (preferably) an Integrated Development Environment (IDE) to write code in. Alternatively, if you use conda (see Environment and package management systems), Python comes pre-installed.

Python IDE

When learning and applying Python, code should be developed within an Integrated Development Environment (IDE). This is dedicated software for editing and running code.

Some of the more popular dedicated IDEs include:

Aside from these options, a more recently popular way to edit and run code is Jupyter Notebook or JupyterLab. These differ from the dedicated IDEs mentioned above in that they create an interactive 'notebook' environment in which code can be run in separate cells, at separate moments. This makes them highly recommended for data science, and a useful tool for (exploratory) data analysis in Bioinformatics. They are less suited to general programming or to larger projects.

Python Courses

There are numerous sources to learn from, both free and paid. In this section, I will recommend several tutorials or MOOCs (Massive Open Online Courses).

Python Books

In this section, several e-books (or hardcovers, depending on your preference) will be listed that could be useful references for learning Python. Personally, I prefer online courses for learning and keep books as quick references. It is worth noting that Humble Bundle periodically offers a programming e-book bundle at a low price. These bundles often include books from reputable publishers such as O'Reilly or Wiley, and they can span many programming languages or be language-specific.

Python Packages

These packages extend the capabilities of the language, adding new functionality. In this section, I will first list some of the more specific and useful bioinformatics-related packages, and then link to a general list of great packages.

  • NumPy - This package provides a way to handle vectorized data, and lends itself to very fast code for scientific projects.

  • Pandas - This package provides a way to handle tabular data in a very efficient manner.

  • Scikit-Learn - This is the primary package for machine learning in Python, and it also provides several statistical functions.

  • Statsmodels - This package provides a range of important statistical functionality.

  • Matplotlib - This package provides extensive plotting and visualisation functionality, and it can be beautified with Seaborn.

  • Biopython - Biopython is a comprehensive set/repository of freely available packages for biological computation written in Python by an international team of developers.

  • Compilation of more specific Python packages:

R

R is a language developed for statistical computation and graphical visualisation, although it is currently being expanded into a more versatile programming language. It is gaining popularity in Bioinformatics due to its capabilities for data science.

To program in R, you will first need to download an R distribution and then (preferably) an Integrated Development Environment (IDE) to write code in. Alternatively, if you use conda (see Environment and package management systems), you can install R through it.

R IDE

When learning and applying R, code should be developed within an IDE. This is dedicated software for editing and running code.

Note that RStudio is the only dedicated IDE recommended here, since its capabilities are extensive and it is backed by a strong team.

It is, however, also possible to use R in Jupyter Notebooks by installing IRkernel.

R Courses

There are numerous sources to learn from, both free and paid. In this section, I will recommend several tutorials or MOOCs (Massive Open Online Courses).

Note that Udemy's courses are mostly paid and are not directly linked to academic institutions. This independence is an advantage over edX and Coursera, as it gives instructors the freedom to design their own courses instead of being bound to academic regulations on the order and content of teaching. Instructors on Udemy often come from academia, however, and often have a PhD and experience in the field.

R Books

In this section, several e-books (or hardcovers, depending on your preference) will be listed that could be useful references for learning R. Personally, I prefer online courses for learning and keep books as quick references. It is worth noting that Humble Bundle periodically offers a programming e-book bundle at a low price. These bundles often include books from reputable publishers such as O'Reilly or Wiley, and they can span many programming languages or be language-specific.

The RStudio team lists several books in their resources section. Out of these books, the following are highly recommended.

The Extending R book could be of use for advanced R users.

The R Workflow book by Frank Harrell provides a great overview of how to structure a workflow in R, focused on data analysis. See #General-Computational-Skills for more information.

The Big Book of R by Oscar Baruffa is a comprehensive repository of free books for R, divided into different subjects and fields (thanks for the suggestion, Mikhael Dito Manurung).

In addition, these bioinformatics domain-specific books could be of use:

R Packages

These packages extend the capabilities of the language, adding new functionality. In this section, I will first list some of the more specific and useful bioinformatics-related packages, and then link to a general list of great packages.

  • The Tidyverse is a collection of packages that makes working in R considerably easier by providing consistent functions for data science. It is maintained by the RStudio team, which also provides several cheatsheets that make learning these packages easier. Although included in the Tidyverse, ggplot2 deserves a special mention, as it provides data visualisation and graphical plots that can be easily customized. Several extensions to this package exist. For example, ggsci provides color palettes inspired by scientific journals. It is by no means the only extension, and I recommend you look up more. Additionally, it is worth mentioning reticulate, a package from the same team that enables combining R and Python in RStudio.
  • Similar to Biopython, there is also an extensive repository for open-source computational biology software packages curated by the Bioconductor initiative.
  • naniar. This package is very useful for exploring your data, and checking missingness of your data.
  • patchwork. This package allows combining and arranging separate ggplots in an easy manner.
  • The Shiny package provides easy web application development, mostly in dashboard format, without the need for prior HTML/CSS/JavaScript knowledge. It will handle every aspect of web development: the layout, structure and graphics using Shiny code, and the computation using R code.
  • The ggpubr package provides some easy-to-use functions for creating and customizing ggplot2-based, publication-ready plots.
  • The targets package helps maintain a reproducible workflow. See #General-Computational-Skills for further explanation.
  • The here package simplifies file referencing in a project-based workflow. See #General-Computational-Skills for further explanation.
  • The rio package is a Swiss Army knife for data import and export.
  • Compilation of more specific R packages:

Bash

Although some will hate me for this, I usually refer to Bash as both a Unix/Linux terminal and a 'form' of programming language, or rather, a 'command language'. So first things first: what is a terminal? A terminal, or shell, is a program that takes commands and passes them to the operating system to execute. If you're familiar with Microsoft Windows CMD or PowerShell, these are examples of shells, although arguably less powerful and less easy to use than Bash. Bash (Bourne Again Shell) is the default shell for most Linux/Unix-based operating systems, and it is very powerful. Some alternatives, such as Zsh, are gaining popularity (thanks to the many plugins available for it), but Bash is still the most commonly used.

Most Bioinformatics or Computational Biology programs are written to be run from a command line interface (the terminal, or shell), usually in Bash. In addition, systems that ship Bash usually also ship command-line tools that can be considered command languages in themselves (awk and sed, for example), which makes the environment very powerful. Furthermore, its functionality can be extended with other software packages.

Most Linux/Unix distributions come pre-installed with Bash, and macOS used to have Bash as its default shell as well, although it has since switched to Zsh. Microsoft Windows, however, is an exception: it does not come pre-installed with Bash at all. In fact, until quite recently, users had to find workarounds to use it (dual booting, for example). Currently, however, the Microsoft-supported Windows Subsystem for Linux feature enables Microsoft Windows users to run a Linux environment that supports Bash.

To install Bash on a Microsoft Windows OS, users first need to install Windows Subsystem for Linux. Next, they need to install the Ubuntu terminal from the Microsoft Store. Interested readers can check out this small guide. Once Ubuntu is installed and set up, users are free to use the Ubuntu terminal, which runs Bash by default.

Bash Courses

There are numerous sources to learn from, both free and paid. In this section, I will recommend several tutorials or MOOCs (Massive Open Online Courses).

Note that Udemy's courses are mostly paid and are not directly linked to academic institutions. This independence is an advantage, as it gives instructors the freedom to design their own courses instead of being bound to academic regulations on the order and content of teaching. Instructors on Udemy often come from academia, however, and often have a PhD and experience in the field.

Bash Books

In this section, several e-books (or hardcovers, depending on your preference) will be listed that could be useful references for learning Bash. Personally, I prefer online courses for learning and keep books as quick references. It is worth noting that Humble Bundle periodically offers a programming e-book bundle at a low price. These bundles often include books from reputable publishers such as O'Reilly or Wiley, and they can span many programming languages or be language-specific.

Good Bash or shell scripting books include:

Small resource for learning Awk:

Web Development

Disclaimer: I'm not very well-versed in web development, and only have experience with R Shiny, so take this section with a grain of salt. Web development is split into a front-end and a back-end. The front-end covers layout and styling: what the user sees in their browser. The back-end handles the computation behind the website. For front-end development, HTML and CSS are the most popular languages. HTML provides the structure and layout and is the backbone of the website, while CSS handles the styling. Another popular tool is JavaScript, which adds interactivity and extends the functionality enormously. For back-end development, more general-purpose programming languages can be used (R, Python, Java, etc.). SQL databases are often used as well.
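
To make the front-end/back-end split a bit more concrete, here is a minimal sketch in plain Python (standard library only). The function names gc_content and render_page are hypothetical and only serve as an illustration: one function does the computation (back-end), the other wraps the result in HTML and CSS (front-end). A real web application would serve this page through a framework rather than print it.

```python
from html import escape


def gc_content(sequence):
    """Back-end: compute the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def render_page(sequence):
    """Front-end: wrap the result in HTML (structure) and CSS (styling)."""
    percent = gc_content(sequence) * 100
    return (
        "<html><head><style>p { font-family: sans-serif; }</style></head>"
        f"<body><p>GC content of {escape(sequence)}: {percent:.1f}%</p></body></html>"
    )


if __name__ == "__main__":
    # Print the generated page; a browser would render this string.
    print(render_page("ATGCGC"))
```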

Currently, several methods exist to reduce the need for front-end development knowledge by using back-end languages to develop the front-end. These include:

  • R Shiny - This R package will allow the usage of R code to develop HTML+CSS+JavaScript elements, while the R code handles the back-end computations.
  • Dash - This Python package is similar to Shiny, and enables users to build a dashboard app without too much HTML+CSS+JS knowledge. Thanks to heyyyjude for adding.
  • Streamlit - Another Python package to build a dashboard app (similar to Shiny). Thanks to heyyyjude for adding.
  • Django - This Python web framework can fill a similar role to Shiny, although it requires more HTML+CSS+JavaScript knowledge. It is, however, more scalable.
  • Flask - This Python web framework, although similar in purpose to Django, is more lightweight and easier to use. It is, however, not as scalable.

Web Dev IDE

When developing websites using HTML and CSS, code is written inside a text editor or Integrated Development Environment (IDE).

  • Atom
  • VSCode
  • For R Shiny, code can be written in RStudio (see above). Similarly, Python web apps can be developed in Python IDEs.

Web Dev Courses

Web Dev Books

SQL

SQL (Structured Query Language) is a language for querying and managing relational (tabular) databases.

The Whodunnit game provides a fun interactive way to learn SQL.
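
To get a first feel for SQL without installing a database server, here is a minimal sketch that uses the sqlite3 module from the Python standard library. The table name, column names and rows are made up for illustration.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical examples.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE genes (name TEXT, chromosome TEXT, length INTEGER)"
)
connection.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [("geneA", "chr1", 1200), ("geneB", "chr1", 800), ("geneC", "chr2", 1500)],
)

# Count the genes and average their length per chromosome.
rows = connection.execute(
    "SELECT chromosome, COUNT(*), AVG(length) FROM genes "
    "GROUP BY chromosome ORDER BY chromosome"
).fetchall()
connection.close()

print(rows)
```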

Other resources:

Machine Learning

Machine Learning in R

Machine Learning in Python

General Computational Skills

Aside from just learning how to program, there are a couple of general tools (often grouped under DevOps) that make the life of a programmer easier. These tools often focus on reproducibility of results and encourage best practices in coding. They range from workflow languages that automate input/output across different programming languages to easy-to-use package management systems.

While this repo will introduce you to several tools and concepts, for further reading I encourage you to take a look at the following resources:

Aside from these resources, there is a highly recommended course that covers some of these computational skills:

A recent course by Dirk Eddelbuettel of the Department of Statistics, University of Illinois provides an R-focused dive into some computational skills:

Project-based workflows

Every coding project will of course be saved in a directory, often divided into multiple subdirectories. However, unless you structure it properly, it can get quite messy. Luckily, there are several recommendations for project structure. You are of course free to adapt and improvise as you see fit. This is often in your best interest, as it helps you find important files again easily.
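
As an illustration, a project skeleton can be created with a few lines of Python. This is only a sketch: the subdirectory names below are one hypothetical layout and not an official recommendation, so adapt them to your own project.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; rename or extend these to suit your project.
SUBDIRECTORIES = ["data/raw", "data/processed", "scripts", "results", "docs"]


def create_project(root):
    """Create the project root, its subdirectories and an empty README."""
    created = []
    for subdirectory in SUBDIRECTORIES:
        path = Path(root) / subdirectory
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    (Path(root) / "README.md").touch()
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project(Path(tmp) / "my_project"):
            print(path)
```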

Recommendations

Templates

Reproducibility and automated workflows

In addition to working in a project-based workflow, a reproducible workflow language could be of benefit. This means creating reproducible and automated pipelines with languages such as Make, Snakemake, Nextflow, etc.
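
The core idea behind tools such as Make can be sketched in a few lines of Python: a step is only rerun when its output is missing or older than its input. This is a simplified illustration of the concept and not how Make, Snakemake or Nextflow are implemented. The step count_lines and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(target, sources):
    """Return True if the target is missing or older than any source."""
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(source.stat().st_mtime > target_time for source in sources)


def count_lines(source, target):
    """Hypothetical pipeline step; returns True only if it actually ran."""
    if not needs_rebuild(target, [source]):
        return False
    number_of_lines = len(source.read_text().splitlines())
    target.write_text(f"{number_of_lines}\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "reads.txt"
        target = Path(tmp) / "line_count.txt"
        source.write_text("ACGT\nTTGA\n")
        print(count_lines(source, target))  # output missing, so the step runs
        print(count_lines(source, target))  # output is up to date, so it is skipped
        # Simulate a later edit of the input by moving its timestamp forward.
        later = target.stat().st_mtime + 10
        os.utime(source, (later, later))
        print(count_lines(source, target))  # input is newer, so the step reruns
```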

Environment and package management systems

  • conda. This is a 'package and environment management system', which basically means it lets you create 'virtual environments'. These let you easily switch between versions of installed packages, or between versions of Python, R, or any other supported language. In addition, it helps manage version conflicts: if program A depends on one version of program B while program C depends on a different version of program B, you can create two environments (one containing program A and its version of B, one containing program C and its version of B). The conda website itself hosts a short tutorial on how to get started with conda. Conda comes in two distributions:
    • Miniconda. This is a minimal install, including only the bare bones of conda (conda, Python and their dependencies).
    • Anaconda. Anaconda includes conda, Python, several open-source scientific packages that are included in the Anaconda Repository and also the Anaconda Navigator (a Graphical User Interface), enabling you to interact with conda without a command line interface.
  • renv. Unlike conda, this is not a system-wide package and environment manager, and it does not include any Python or R distribution. Instead, renv is specifically focused on being a package dependency manager for the R programming language. Similar to conda, it creates virtual environments, but it does so only for R and on a per-project basis.
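
The version conflict described above can be illustrated with a toy check in Python. This is not how conda resolves dependencies; it only shows why two programs that pin different versions of the same dependency cannot share one environment, while two separate environments are each conflict-free. All program names and version numbers are hypothetical.

```python
def find_conflicts(requirements):
    """Return the dependencies for which more than one version is requested.

    `requirements` maps each program to a dict of {dependency: pinned version}.
    """
    requested = {}
    for pins in requirements.values():
        for dependency, version in pins.items():
            requested.setdefault(dependency, set()).add(version)
    return {
        dependency: versions
        for dependency, versions in requested.items()
        if len(versions) > 1
    }


# One shared environment: program_a and program_c disagree about program_b.
shared_environment = {
    "program_a": {"program_b": "1.0"},
    "program_c": {"program_b": "2.0"},
}

# Two separate environments: each one is internally consistent.
environment_one = {"program_a": {"program_b": "1.0"}}
environment_two = {"program_c": {"program_b": "2.0"}}

if __name__ == "__main__":
    print(find_conflicts(shared_environment))
    print(find_conflicts(environment_one))
    print(find_conflicts(environment_two))
```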

Version control

Have you ever dealt with the problem of updating features in a script and having to save it as a new file called script_v2.py, only to end up at version 9001 without knowing what changed in each version? Or have you ever broken a function in a script and could not figure out how to restore it? That's where version control systems come in. In brief, you can ask a version control system to take snapshots of a repository (a project folder you want to track for updates). These snapshots are saved, so you can easily restore and move between versions of a file. Think of this as the Microsoft Office or Google Docs version history, but on steroids. For a better explanation, I refer the reader to Atlassian.

Although there are multiple version control systems, the most popular one is definitely Git. This should not be confused with GitHub (on which this repository is hosted). Git is the version control system, and GitHub is basically just an online host for Git repositories. Think of GitHub as OneDrive or Google Drive, but for Git repositories.
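
To illustrate the snapshot idea, here is a toy sketch in Python. It is a deliberate simplification and not how Git or any real version control system is implemented: it only stores each version of a text under an identifier, so that an older version can be restored later. The class name SnapshotStore and its methods are hypothetical.

```python
import hashlib


class SnapshotStore:
    """Toy version history for the text of a single file."""

    def __init__(self):
        self._snapshots = {}  # snapshot id -> stored content
        self.history = []     # list of (snapshot id, message), oldest first

    def commit(self, content, message):
        """Store a snapshot of the content and return its identifier."""
        snapshot_id = hashlib.sha1(content.encode("utf-8")).hexdigest()[:8]
        self._snapshots[snapshot_id] = content
        self.history.append((snapshot_id, message))
        return snapshot_id

    def restore(self, snapshot_id):
        """Return the content exactly as it was when the snapshot was taken."""
        return self._snapshots[snapshot_id]


if __name__ == "__main__":
    store = SnapshotStore()
    first = store.commit("print('hello')\n", "Initial version")
    store.commit("print('hello, world')\n", "Change greeting")
    for snapshot_id, message in store.history:
        print(snapshot_id, message)
    # Go back to the first version without needing a script_v1.py copy.
    print(store.restore(first))
```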

Some resources on learning how to use Git are:

Containerization

Briefly, containerization tools allow you to package (and run) a script (or software tool) in a container (duh). This container is an isolated sandbox that runs on your host (local) operating system (OS). The container holds not only the application itself, but also all of its dependencies, including a (stripped-down) operating system. There are several benefits to this. First, it enables you to run software without installing it or its dependencies on the host. Second, it allows you to transfer applications across operating systems, and makes your script or application reproducible on machines other than your own.

The most popular containerization tool is Docker.

Some resources on learning how to use Docker are:

Cloud Computing for Bioinformatics

This section was added per suggestion by Lynn Langit.

Bioinformatics Tools and Resources

While it would be an impossible(!) task to introduce all Bioinformatics tools in this section, I will refer the reader to some tutorials for common Bioinformatics applications.

Resources:

Statistics

Of course, statistics is a necessary tool for any researcher, bioinformatician or not. This section provides some recommendations for statistical learning.

Statistics Courses

There are numerous sources to learn from, both free and paid. In this section, I will recommend several tutorials or MOOCs (Massive Open Online Courses).

Statistics Books

In this section, several e-books (or hardcovers, depending on your preference) will be listed that could be useful references for learning Statistics.

Useful statistics packages

In this subsection, I will provide a few programming packages I find useful for Statistics.

  • Python
  • R
    • dabestr. Package to create estimation plots.
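
Estimation plots show an effect size together with a confidence interval instead of only a p-value. The idea can be sketched in plain Python with a simple percentile bootstrap of the difference in means. This is an illustration of the concept only, it is not the dabestr API, and dabestr's own bootstrap procedure may differ from this simplified version. The function name and the measurements are made up.

```python
import random
import statistics


def bootstrap_mean_difference(control, treatment, n_resamples=2000, seed=0):
    """Return (observed difference, lower bound, upper bound) of a 95% interval.

    Uses a plain percentile bootstrap: resample each group with replacement
    and recompute the difference in means many times.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = statistics.mean(treatment) - statistics.mean(control)
    differences = []
    for _ in range(n_resamples):
        control_sample = rng.choices(control, k=len(control))
        treatment_sample = rng.choices(treatment, k=len(treatment))
        differences.append(
            statistics.mean(treatment_sample) - statistics.mean(control_sample)
        )
    differences.sort()
    lower = differences[int(0.025 * n_resamples)]
    upper = differences[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Hypothetical measurements for two groups.
    control = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]
    treatment = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8]
    observed, lower, upper = bootstrap_mean_difference(control, treatment)
    print(f"difference in means: {observed:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```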