/Training_Bioinformatics

A full list of skill requirements for a new bioinformatics student

Primary language: TeX. License: MIT.

Training Resources

Onboarding Steps

  • There are a few steps for incoming lab members. If you are a full-time lab member, be sure to run through this checklist
  • If you are a part-time member and/or student, follow this checklist

Crash course into the lab

  • Subscribe to the lab calendar. This is where all meetings and events are organized. To do so, select this link, then subscribe by either:

    • Selecting the + Google Calendar button at the bottom of the Im Lab Calendar, which will take you to your own Google Calendar and ask if you would like to add it, or
    • Logging into your Google Calendar, selecting Add Calendar on the left side and then From URL, and pasting in the URL of the Im Lab Calendar.
  • Go through the RStudio primers 1 to 6 (if they are too basic, skip all except for the reproducibility section)

  • GitHub intro: click here

    • When going through the tutorial, skip the section on setting up ssh
    • Fill out this form
  • TODO: post your first note following the instructions here

  • TODO: run your first GWAS, QC included, following these instructions

  • TODO: run imputed transcriptome association, colocalization, and Mendelian Randomization following this lab

    • Begin in the optional items section and first set up your system for the lab
    • If working on a lab desktop, you may need to install or update miniconda. Install it from bash using the .sh installer file; then, wherever the code in the lab calls for conda, enter the file path ./miniconda3/bin/conda instead.
  • TODO: read the following papers and write a short post for internal-notes.hakyimlab.org with a graphical summary of them

    • A brief history of human disease genetics link
    • PrediXcan paper link
    • GTEx GWAS paper link
    • S-PrediXcan link

Training Resources

We work with many different tools on many different projects. The training resources are organized into functional groups below. You may want to skip reading the material in some groups, and it may be worthwhile to spend a longer time with other groups.

GitHub

We use GitHub to store and organize our code. There is an introduction here. If you are curious about when one would use certain GitHub features, look at this link, which describes 'GitHub flow'.

The lab's main GitHub page can be found at https://github.com/hakyimlab. If you have been added to lab-members and you are logged in, you can see the lab's private repositories as well.

GitHub has stopped accepting passwords in the terminal and RStudio, so be sure to set up your token. Instructions on how to do so are here.

Introduction to Data Science

Machine Learning and Statistics

  • An introduction to machine learning problems and model metrics: link
  • We work fairly heavily with the generalized linear model, so it may be good to brush up on it:

Python

  • This is a Python course for data science that also covers running commands in the shell: link
  • SQLite in Python link

R

  • Introduction to Data Analysis with R link
  • Another data science course in R: link
  • R Studio's cheatsheets: link
  • Hadley's R Style link
  • R tools for reporting data analyses in a reproducible manner link
R Packages
  • Some basics on tidyverse and ggplot2
  • This course introduces ggplot2, plyr, dplyr, tidyr, and knitr for data analysis link
  • Our lab does a lot of work with SQLite databases using the RSQLite package
  • Data Manipulation in R with dplyr link
  • Data Visualization in R with ggplot2 link1, link2
  • A machine learning package for R, mlr link
  • Docker is not really an R package, but this presentation gives a good overview of use cases for Docker, and how to integrate with R link
R Cheatsheets

Unix

CRI Gardner, RCC midway, and most of the Bionimbus virtual machines all run on Linux, so we use the command line a lot.

  • If you haven't used a bash command line before, here is a good place to start: link
  • This lesson covers more commands: link
  • This is a great cheatsheet for using the command line and shell scripting, including flow control and function declaration: link
  • Knowledge of a few bash commands can go a long way. Comfort with grep, awk, sed, and xargs is especially useful.
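
As a rough illustration of the four tools named above, here is a minimal sketch that runs each of them on a tiny made-up table. The directory name, file names, gene names, and values are all hypothetical and exist only for this example; they are not lab data.

```shell
# Toy data for illustration only: a small tab-separated table of made-up values.
demo_dir="${TMPDIR:-/tmp}/unix_tools_demo"
mkdir -p "$demo_dir"
printf 'gene\tchr\tpvalue\n'     >  "$demo_dir/results.tsv"
printf 'GENE_A\tchr1\t0.20\n'    >> "$demo_dir/results.tsv"
printf 'GENE_B\tchr2\t0.001\n'   >> "$demo_dir/results.tsv"
printf 'GENE_C\tchr1\t0.03\n'    >> "$demo_dir/results.tsv"

# grep: keep only lines containing chr1 as a whole word (chr10 would not match).
grep -w 'chr1' "$demo_dir/results.tsv" > "$demo_dir/chr1.tsv"

# awk: skip the header (NR > 1) and print gene names whose third column is below 0.05.
awk -F'\t' 'NR > 1 && $3 < 0.05 { print $1 }' "$demo_dir/results.tsv" > "$demo_dir/significant.txt"

# sed: on every line except the header, remove the first "chr" (the chromosome prefix).
sed '1!s/chr//' "$demo_dir/results.tsv" > "$demo_dir/nochr.tsv"

# xargs: pass every .tsv file found to a single wc -l call to count lines.
find "$demo_dir" -name '*.tsv' -print0 | xargs -0 wc -l > "$demo_dir/line_counts.txt"

cat "$demo_dir/significant.txt"
```

The same pattern (filter with grep, select columns with awk, rewrite with sed, fan out with xargs) carries over to larger files; only the patterns and column numbers change.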

SQLite

Some knowledge of SQLite will be useful. See the vignette here

Databases in R

  • On how to use databases in R here
  • Using dplyr to query dbs here

Genomics

Introduction to Genomics

  • UCLA Big Bio: intro to genomics videos. These are very helpful to understand the field of genomics at a high level.
  • The New Genetics is an NIH publication surveying what we know about the biological mechanisms of genetics.

Lab-Specific Genomics Papers

For more background, the projects the lab is currently working on are similar to the ones in these papers.

Other Useful Reading

GTEx Consortium: The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 2015, 348:648–660.

The 1000 Genomes Consortium: A global reference for human genetic variation link

Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.

Li YI, van de Geijn B, Raj A, Knowles DA, Petti AA, Golan D, et al. RNA splicing is a primary link between genetic variation and disease. Science. 2016;352:600–4.

Albert FW, Kruglyak L: The role of regulatory variation in complex traits and disease. Nat Rev Genet 2015, 16:197–212.

Das S, Abecasis GR, Browning BL: Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet 2018;19:73-96.

Im HK, Gamazon ER, Nicolae DL, Cox NJ: On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. Am J Hum Genet 2012, 90:591–598.

Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P.-R., et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics 2015, 47:1228-1235.

Finucane HK, Reshef YA, Anttila V, Slowikowski K, Gusev A, Byrnes A, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics 2018, 50:621-629.

Visscher PM: Human Complex Trait Genetics in the 21st Century. Genetics 2016, 202:377–379.

Computational Resources

CRI Gardner

Gardner is a large, high-performance computing cluster and data storage system. We use it to run computations and store data. The lab's group folder is located at /gpfs/data/im-lab/

  • UChicago CRI Workshop Tutorials: CRI does a seminar series each academic year. You can find the schedule here: link
  • Intro to Gardner: this is a good explanation of what Gardner does, and why a high-performance computing cluster is important to bioinformatics: link
Job submission and management
  • Gardner uses Torque as its job scheduler, which means that jobs are submitted as PBS files.
  • A short, incomplete list of commands that may help when using PBS:
    • To submit a job, run qsub <path to whatever job file>. It prints the job_id to the console, which is often useful for searching the queue and finding logs.
    • To view the status of your jobs, qstat
    • To delete a job, qdel <job_id>
    • Gardner has a few different queues to which you can submit jobs. Knowing the resources allotted to jobs in each queue can help. Jobs will generally start sooner if you request fewer resources. You can use qstat -q to list all queues with current usage statistics, and you can use qstat -Qf <queue name> for details on the resources.
    • qstat | grep Q will list queued jobs (it matches any line containing a capital Q, so check the output), and if you're submitting a bunch of them, qstat | grep Q | wc -l will count the jobs in the queue.
    • Hopefully this doesn't happen, but if you need to cancel all of your queued jobs, run qselect -s Q | xargs qdel.
  • If you need to run a long submission process, like a Python script that submits jobs for hours, you can use screen so that the process keeps running without a terminal window open. Here are the steps I used:
    $ ssh gardner
    [cri-gardner-in001] $ screen
    [cri-gardner-in001] $ <the command you wanted to run>
    
    • The important thing is to detach from the screen with ctrl+a followed by d. You should then see the message [detached]. To return to the session later, run screen -r.
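
For the job submission commands listed above, a PBS job file might look roughly like the sketch below. The job name, resource requests, and output file names are illustrative placeholders rather than lab defaults; check qstat -Qf <queue name> for the limits that actually apply.

```shell
# Write a minimal Torque/PBS job file. All values in the #PBS lines are
# hypothetical examples and should be adapted to the job at hand.
job_file="${TMPDIR:-/tmp}/example_job.pbs"
cat > "$job_file" <<'EOF'
#!/bin/bash
#PBS -N example_job
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -o example_job.out
#PBS -e example_job.err

# Torque starts jobs in the home directory; move to where qsub was run.
cd "$PBS_O_WORKDIR"

echo "Running on $(hostname)"
EOF

# Submit only where qsub is available (i.e. on the cluster itself).
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_file"
else
    echo "qsub not found; job file written to $job_file"
fi
```

The lines starting with #PBS are read by the scheduler as options, so the same settings could instead be passed on the qsub command line.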
Mounting Gardner File System
  • On macOS: from Finder, click 'Go', then 'Connect to Server', then connect to smb://prfs.cri.uchicago.edu/im-lab
  • On Linux: mounting via sftp://cri-syncmon.cri.uchicago.edu/gpfs/data/im-lab has worked for us in the past.

Bionimbus PDC

Bionimbus Protected Data Cloud is a storage/computation resource where the lab is allotted a certain number of processors and amount of storage, and we store and compute on virtual machines. If you'll be working on Bionimbus, make sure to begin your application(s) quickly because the process has multiple steps.

Easy ssh Access

For both Gardner and Bionimbus, you'll be working through ssh tunnels a good deal, so it will pay off to configure your ssh settings once and not have to fill in passwords all the time.

First, to avoid having to enter a password at each login, generate and forward an ssh keypair. To create an RSA keypair, open a terminal and type

$ ssh-keygen -t rsa

Press Enter when you are prompted to enter a file in which to save the key; this accepts the default location. Then type and confirm a password.

Your private key will be generated using the default filename (for example, id_rsa) or the filename you specified, and stored on your computer in a .ssh directory off your home directory (for example, ~/.ssh/id_rsa ).

Your public key will be generated using the same filename (but with a .pub extension added) and stored in the same location (for example, ~/.ssh/id_rsa.pub). Do not share your private key. Only share your public one.

Once you have your RSA keypair, you will copy and paste your public key into ~/.ssh/authorized_keys on the host you are trying to access.

If your account doesn't already contain a ~/.ssh/authorized_keys file, create one

mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys

Append your public key to it. For example, if you have copied id_rsa.pub into your home directory on the host (~/id_rsa.pub), use

cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
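
If key-based login is still refused after this step, file permissions are a common cause: OpenSSH's StrictModes check, which is on by default, can ignore keys when ~/.ssh or authorized_keys is writable by other users. The sketch below sets the usual restrictive permissions; it assumes you run it on the remote host and that its server uses the default settings.

```shell
# Run on the remote host. Make sure the directory and key file exist,
# then restrict them to the owner only.
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Show the resulting permissions for a quick visual check.
ls -ld ~/.ssh ~/.ssh/authorized_keys
```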

Create and configure your SSH config file

touch ~/.ssh/config
chmod 600 ~/.ssh/config
emacs ~/.ssh/config

Enter the following

Host gardner
 HostName gardner.cri.uchicago.edu
 IdentityFile ~/.ssh/username
 User yourusername
Host midway2
 HostName midway2.rcc.uchicago.edu
 IdentityFile ~/.ssh/username
 User yourusername
Host bionimbus
 HostName bionimbus-pdc.opensciencedatacloud.org
 IdentityFile ~/.ssh/username
 User yourusername
Host argonne
 HostName login.mcs.anl.gov
 User yourusername
 IdentityFile ~/.ssh/username
Host washington
 HostName washington.cels.anl.gov
 User yourusername
 IdentityFile ~/.ssh/username
 ProxyCommand ssh -q -A argonne -W %h:%p

Now you should be able to directly ssh into any of the above hosts.
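
To check how ssh will resolve an alias without actually connecting, one option is ssh -G, which prints the effective settings for a host and exits (it needs a reasonably recent OpenSSH). The sketch below copies the gardner entry from above into a throwaway file so it can be tried safely; the temporary file name is an assumption made for this example.

```shell
# Write a throwaway config containing one host entry from the listing above.
demo_config="${TMPDIR:-/tmp}/ssh_config_demo"
cat > "$demo_config" <<'EOF'
Host gardner
 HostName gardner.cri.uchicago.edu
 IdentityFile ~/.ssh/username
 User yourusername
EOF

# -F reads this file instead of ~/.ssh/config; -G prints the resolved
# options for the alias and exits without opening a connection.
if command -v ssh >/dev/null 2>&1; then
    ssh -F "$demo_config" -G gardner | grep -i -E '^(hostname|user|identityfile) '
fi
```

Against your real configuration, ssh -G gardner does the same check, and ssh gardner then opens the connection.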

If you want to be able to log in with your RSA key pair instead of a password, you need to add your public key to the authorized_keys file on the remote host. For example, if you want to log in directly to gardner, ssh into gardner and then run

cd ~/.ssh
vi authorized_keys

and paste in your public key.

BigQuery

  • BigQuery tutorials link
  • Google Cloud training document link
  • To do uploads from CRI to BigQuery, you will need to install the Google Cloud SDK link

Miscellaneous

Cloning issues

  • Install git (brew install git) or upgrade git (brew upgrade git)
  • Install git-lfs (brew install git-lfs)

Heather Wheeler's tutorials

https://www.notion.so/Heather-Wheeler-s-tutorials-f2e3a612d3d040a08db1becc139449b4