Recommended learning resources

Below is a first, non-exhaustive list of frequently recommended learning resources and readings in data science and machine learning for newcomers to the Rausell lab. We will update it regularly over the coming months. You may follow us on Twitter for further news: https://twitter.com/AntonioRausell

================================

ML Programming

Python

How to Think Like a Computer Scientist: Learning with Python 3 https://buildmedia.readthedocs.org/media/pdf/howtothink/latest/howtothink.pdf

scikit-learn: Machine Learning in Python https://scikit-learn.org/stable/index.html

PyTorch: Python-based scientific computing package https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py

TensorFlow https://github.com/tensorflow/tensorflow

ggplot: Grammar of Graphics in Python with Plotnine https://towardsdatascience.com/ggplot-grammar-of-graphics-in-python-with-plotnine-2e97edd4dacf

TensorSensor: library to visualize Python code indicating the shape of tensor variables (works with TensorFlow, PyTorch, NumPy, Keras and fastai) https://explained.ai/tensor-sensor/index.html

Spark

O’Reilly’s new Learning Spark, 2nd Edition https://databricks.com/p/ebook/learning-spark-from-oreilly

MLlib: Apache Spark's scalable machine learning library. https://spark.apache.org/docs/latest/ml-guide.html

Databricks documentation https://docs.databricks.com/

Hail: Python-based data analysis tool with additional data types and methods for working with genomic data https://github.com/hail-is/hail

VariantSpark: https://doi.org/10.1093/gigascience/giaa077 https://github.com/aehrc/VariantSpark

Glow: an open-source toolkit natively built on Apache Spark for working with genomic data at biobank-scale. https://glow.readthedocs.io/en/latest/

Hadoop

Hadoop: The Definitive Guide, 4th Edition https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/

Jupyter notebooks

https://jupyter.readthedocs.io/en/latest/content-quickstart.html

R

R for Statistical Learning - David Dalpiaz https://daviddalpiaz.github.io/r4sl/

Data Analysis and Prediction Algorithms with R - Rafael A. Irizarry https://rafalab.github.io/dsbook/index.html

R interface to TensorFlow https://tensorflow.rstudio.com/

==============================

Managing conda environments

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html

Installing Anaconda and managing environments in a server for multiple users: https://medium.com/@pjptech/installing-anaconda-for-multiple-users-650b2a6666c6

==============================

Database management systems for Big Data:

MongoDB

MongoDB 4.4 Manual https://docs.mongodb.com/manual/

Graphing data from MongoDB: how to connect to a MongoDB instance from a Jupyter notebook and make plots with Python. https://towardsdatascience.com/graphing-data-from-mongodb-99c3722650da

MongoDB - Spark connector https://www.mongodb.com/products/spark-connector https://github.com/mongodb/mongo-spark

HBase

Apache HBase for Hadoop and HDFS: http://hbase.apache.org/book.html

Spark SQL

https://spark.apache.org/docs/latest/sql-programming-guide.html

PostgreSQL

Prototyping of SQL queries: db-fiddle: https://www.db-fiddle.com/

pgAdmin developer tools: https://www.pgadmin.org/docs/pgadmin4/development/developer_tools.html

==============================

Open-source libraries for graph neural networks:

Graph Nets: DeepMind's library for building graph networks in TensorFlow and Sonnet. https://github.com/deepmind/graph_nets

PyTorch Geometric: geometric deep learning extension library for PyTorch. https://github.com/rusty1s/pytorch_geometric

Deep Graph Library (DGL): Python package for deep learning on graphs https://github.com/dmlc/dgl

================================

Machine learning on graphs

Representation Learning on Graphs: Methods and Applications https://arxiv.org/abs/1709.05584

Graph representation learning. William L. Hamilton https://www.cs.mcgill.ca/~wlh/grl_book/files/GRL_Book.pdf

Thomas Kipf's PhD thesis "Deep Learning with Graph-Structured Representations" https://hdl.handle.net/11245.1/1b63b965-24c4-4bcd-aabb-b849056fa76d

Machine Learning for Graphs and Sequential Data (Prof. Günnemann, TUM) https://www.in.tum.de/daml/teaching/mlgs/

The graph neural network model: https://persagen.com/files/misc/scarselli2009graph.pdf

Modeling Relational Data with Graph Convolutional Networks https://arxiv.org/pdf/1703.06103.pdf

Semi-Supervised Classification with Graph Convolutional Networks https://arxiv.org/abs/1609.02907

AAAI Tutorial Forum. 2019. Tutorial on Graph Representation Learning. William L. Hamilton and Jian Tang https://cs.mcgill.ca/~wlh/files/AAAI19_GRLTutorial.zip

AAAI 2020 Tutorial: Graph Neural Networks: Models and Applications http://cse.msu.edu/~mayao4/tutorials/aaai2020/

A Tutorial of Graph Representation https://link.springer.com/chapter/10.1007%2F978-3-030-24274-9_33

Directional Graph Networks https://arxiv.org/abs/2010.02863

Graph representation learning in biomedicine and healthcare https://pubmed.ncbi.nlm.nih.gov/36316368/

Machine Learning on Graphs: A Model and Comprehensive Taxonomy https://arxiv.org/abs/2005.03675

================================

Basic ML and deep learning concepts

Basics of Deep Learning, course by Marc Lelarge: https://mlelarge.github.io/dataflowr-web/plutonai.html

Lectures for UC Berkeley CS 182: Deep Learning. https://www.youtube.com/playlist?list=PL_iWQOsE6TfVmKkQHucjPAoRtIJYt8a5A

Calculating Gradient Descent Manually https://towardsdatascience.com/calculating-gradient-descent-manually-6d9bee09aa0b

================================

MOOCs in basic bioinformatics concepts:

MOOC Bioinformatique pour la Génétique Médicale (en français): https://www.fun-mooc.fr/courses/course-v1:USPC+37028+session01/about

Bioinformatics: Genomes and Algorithms: https://www.fun-mooc.fr/courses/course-v1:inria+41007+archiveouvert/about

================================

Single-cell data analysis: tutorials and online courses

Current best practices in single-cell RNA-seq analysis: a tutorial: https://www.embopress.org/doi/10.15252/msb.20188746 https://github.com/theislab/single-cell-tutorial

Orchestrating Single-Cell Analysis with Bioconductor http://osca.bioconductor.org

Complete course on single-cell RNA-seq data analysis, University of Cambridge (2018) http://hemberg-lab.github.io/scRNA.seq.course/index.html

Bioinformatics Training channel on YouTube http://goo.gl/uaG8ce

Roscoff single-cell transcriptomics & epigenomics workshop 2019 (slides & scripts) from our French working group on single-cell data analysis: http://goo.gl/m1q1Rs

A step-by-step workflow for low-level analysis of single-cell RNA-seq data https://f1000research.com/articles/5-2122/v2

“All-in-one” environments (I): Seurat R toolkit for single-cell genomics https://satijalab.org/seurat/get_started.html

“All-in-one” environments (II): Scanpy, Single-Cell Analysis in Python https://scanpy.readthedocs.io/en/latest/index.html

Single-Cell Workshop 2014: RNA-seq, Harvard http://pklab.med.harvard.edu/scw2014/