Training Resources

Onboarding Steps

There are a few steps for incoming lab members. If you are a full-time lab member, be sure to run through this checklist.
If you are a part-time member and/or student, follow this checklist.

Crash course into the lab

Subscribe to the lab calendar. This is where all meetings and events are organized. To do so, select this link, then subscribe in one of two ways:
- Select the + Google Calendar button at the bottom of the Im Lab Calendar, which will take you to your own Google Calendar and ask if you would like to add it. Or,
- Log in to your Google Calendar. On the left side, select Add Calendar, and then From URL. Copy and paste the URL from the Im Lab Calendar.
Go through the RStudio primers 1 to 6 (if they are too basic, skip all except for the reproducibility section)
- After finishing each of the following tutorials, fill out this form.
- The Basics
- Work with Data
- Visualize Data
- Tidy Your Data
- Iterate
- Write Functions
- Report Reproducibility
GitHub intro: click here
- When going through the tutorial, skip the section on setting up SSH
- Fill out this form
TODO: post your first note following the instructions here
TODO: run your first GWAS, QC included, following these instructions
TODO: run imputed transcriptome association, colocalization, and Mendelian Randomization following this lab
- Begin in the optional items section and first set up your system for the lab
- If you are working on a lab desktop, you may need to update or install Miniconda. Install it from bash with the .sh installer file. Where the code in the lab calls for conda, enter the file path ./miniconda3/bin/conda instead.
TODO: read the following papers and write a short post on internal-notes.hakyimlab.org with a graphical summary of them
- A brief history of human disease genetics link
- PrediXcan paper link
- GTEx GWAS paper link
- S-PrediXcan link

Training Resources

We work with many different tools on many different projects. The training resources are organized into functional groups below. You may want to skip reading the material in some groups, and it may be worthwhile to spend a longer time with other groups.

GitHub
Introduction to Data Science
Blogdown
Genomics
- Introduction to Genomics
- Lab-Specific Genomics Papers
Computational Resources
Miscellaneous
Hands on training

GitHub

We use GitHub to store and organize our code. There is an introduction here. If you are curious about when one would use certain GitHub features, look at this link, which describes 'GitHub flow'.

The lab's main GitHub page can be found at https://github.com/hakyimlab. If you have been added to lab-members and you are logged in, you can see the lab's private repositories as well.

GitHub has stopped accepting passwords in the terminal and RStudio, so be sure to set up your token. Instructions on how to do so are here.

Introduction to Data Science

Machine Learning and Statistics

An introduction to machine learning problems and model metrics: link
We work fairly heavily with the generalized linear model, so it may be good to brush up on it:
- Generalized Linear Models
- Wikipedia

Python

This Python course for data science also covers running commands in the shell: link
SQLite in Python link

R

Introduction to Data Analysis with R link
Another data science course in R: link
R Studio's cheatsheets: link
Hadley's R Style link
R tools for reporting data analyses in a reproducible manner link

R Packages

Some basics on tidyverse and ggplot2
This course introduces ggplot2, plyr, dplyr, tidyr, and knitr for data analysis link
Our lab does a lot of work with SQLite databases using the RSQLite package
- An intro: link
- Applying dplyr: link
Data Manipulation in R with dplyr link
Data Visualization in R with ggplot2 link1, link2

A machine learning package for R, mlr link
Docker is not really an R package, but this presentation gives a good overview of use cases for Docker, and how to integrate with R link

R Cheatsheets

Data Wrangling download pdf
R Markdown download pdf
Data visualization download pdf

Unix

CRI Gardner, RCC Midway, and most of the Bionimbus virtual machines run on Linux, so we use the command line a lot.

If you haven't used a bash command line before, here is a good place to start: link
This lesson covers more commands: link
This is a great cheatsheet for using the command line and shell scripting, including flow control and function declaration: link
Knowledge of some bash commands can go a long way. Comfort with grep, awk, sed, and xargs is especially useful.
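
As a small illustration of grep, awk, sed, and xargs, the sketch below builds a throwaway tab-separated file and runs each command on it once. The file name, gene names, and column layout are made up for this example and are not lab data.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing in your own folders is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical toy table (tab-separated): gene, chromosome, p-value.
printf 'gene\tchr\tpvalue\n'   >  toy.tsv
printf 'GENE_A\tchr1\t0.04\n'  >> toy.tsv
printf 'GENE_B\tchr2\t0.20\n'  >> toy.tsv
printf 'GENE_C\tchr1\t0.001\n' >> toy.tsv

# grep: keep only the rows that contain the whole word "chr1".
chr1_rows=$(grep -w 'chr1' toy.tsv)

# awk: skip the header (NR > 1) and print genes with a p-value below 0.05.
significant=$(awk -F'\t' 'NR > 1 && $3 < 0.05 { print $1 }' toy.tsv)

# sed: drop the header line, then strip the first "chr" on each remaining line.
no_prefix=$(sed -e '1d' -e 's/chr//' toy.tsv)

# xargs: run one echo per gene name read from standard input.
labelled=$(printf '%s\n' "$significant" | xargs -n 1 echo "hit:")

printf '%s\n' "$chr1_rows" "$significant" "$no_prefix" "$labelled"
```

Each command reads the same file and leaves it unchanged, so the steps can be tried in any order or chained with pipes.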

SQLite

Some knowledge of SQLite will be useful. See the vignette here.

Databases in R

How to use databases in R: here
Using dplyr to query databases: here

Genomics

Introduction to Genomics

UCLA Big Bio: intro to genomics videos. These are very helpful for understanding the field of genomics at a high level.
The New Genetics is an NIH publication surveying what we know about the biological mechanisms of genetics.

Lab-Specific Genomics Papers

For more background, the projects the lab is currently working on are similar to the ones in these papers.

Computational Resources

CRI Gardner

Gardner is a large, high-performance computing cluster and data storage system. We use it to run computations and store data. The lab's group folder is located at /gpfs/data/im-lab/

UChicago CRI Workshop Tutorials: CRI does a seminar series each academic year. You can find the schedule here: link
Intro to Gardner: this is a good explanation of what Gardner does, and why a high-performance computing cluster is important to bioinformatics: link

Job submission and management

Gardner uses Torque as its job scheduler, which means that jobs are submitted as PBS files.
A short, incomplete list of commands that may help when using PBS:
- To submit a job, run qsub <path to job file>. It prints the job_id to the console, which is often useful for searching the queue and finding logs.
- To view the status of your jobs, qstat
- To delete a job, qdel <job_id>
- Gardner has a few different queues to which you can submit jobs. Knowing the resources allotted to jobs in each queue can help. Jobs will generally start sooner if you request fewer resources. You can use qstat -q to list all queues with current usage statistics, and you can use qstat -Qf <queue name> for details on the resources.
- qstat | grep Q will list only queued jobs, and if you're submitting a bunch of them, qstat | grep Q | wc -l will count the jobs in the queue.
- Hopefully this doesn't happen, but if you need to cancel all of your queued jobs, run qselect -s Q | xargs qdel.
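
To make the commands above concrete, here is a minimal sketch of a Torque/PBS job file and how it might be submitted. The job name, resource requests, and file names are placeholders chosen for illustration, not lab defaults, so check the limits of the queue you plan to use. The script only calls qsub when it is available, which means it can be dry-run on a laptop.

```shell
#!/bin/sh
set -eu

job_file="hello_job.pbs"   # hypothetical file name, written to the current directory

# Write a small job file. Lines starting with "#PBS" are read by the scheduler as options.
cat > "$job_file" <<'EOF'
#!/bin/bash
#PBS -N hello_job
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -o hello_job.out
#PBS -e hello_job.err

# Torque starts jobs in the home directory, so move to where qsub was called.
cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the job_id, which qstat accepts as an argument.
    job_id=$(qsub "$job_file")
    echo "Submitted $job_id"
    qstat "$job_id"
else
    echo "qsub not found; wrote $job_file without submitting"
fi
```

Keeping the resource requests in the job file, rather than on the qsub command line, leaves a record of what each job asked for next to its logs.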
If you need to run a long submission process, like a Python script that submits jobs for hours, you can use screen so that you don't have to keep a terminal window open. Here are the steps I used:
```
$ ssh gardner
[cri-gardner-in001] $ screen
[cri-gardner-in001] $ <the command you wanted to run>
```
- The important thing is to detach from the screen session with ctrl+a followed by d. You should then see the message [detached]. You can reattach later with screen -r.

Mounting Gardner File System

On macOS: from Finder, click 'Go', then 'Connect to Server', then connect to smb://prfs.cri.uchicago.edu/im-lab
On Linux: mounting via sftp://cri-syncmon.cri.uchicago.edu/gpfs/data/im-lab has worked for us in the past.

Bionimbus PDC

Bionimbus Protected Data Cloud is a storage/computation resource where the lab is allotted a certain number of processors and amount of storage, and we store and compute on virtual machines. If you'll be working on Bionimbus, make sure to begin your application(s) quickly because the process has multiple steps.

Documentation

Easy ssh Access

For both Gardner and Bionimbus, you'll be working through ssh tunnels a good deal, so it will pay off to configure your ssh settings once and not have to fill in passwords all the time.

First, to avoid having to enter a password at each login, generate an ssh keypair and add its public key to the host you want to access. To create an RSA keypair, open a terminal and type

$ ssh-keygen -t rsa

Press Enter when you are prompted with "Enter file in which to save the key" to accept the default location. Then type and confirm a passphrase when prompted.

Your private key will be generated using the default filename (for example, id_rsa) or the filename you specified, and stored on your computer in the .ssh directory inside your home directory (for example, ~/.ssh/id_rsa).

Your public key will be generated using the same filename (but with a .pub extension added) and stored in the same location (for example, ~/.ssh/id_rsa.pub). Do not share your private key. Only share your public one.

Once you have your RSA keypair, you will copy and paste your public key into ~/.ssh/authorized_keys on the host you are trying to access.

If your account doesn't already contain a ~/.ssh/authorized_keys file, create one

mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys

Append your public key to that file. For example, if you have copied your public key file to ~/id_rsa.pub on that host, use

cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
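
The manual steps above can be wrapped in a small helper. This is a sketch under assumptions: the function name install_pubkey and its arguments are invented for this example. It also tightens permissions, because sshd with default settings typically refuses keys when ~/.ssh or authorized_keys is writable by other users. The demo runs in a scratch directory with a dummy key line, so it does not touch your real ~/.ssh.

```shell
#!/bin/sh
set -eu

# install_pubkey PUBKEY_FILE [HOME_DIR]   (hypothetical helper)
# Appends the key in PUBKEY_FILE to HOME_DIR/.ssh/authorized_keys
# unless an identical line is already there.
install_pubkey() {
    pubkey_file=$1
    home_dir=${2:-$HOME}
    ssh_dir="$home_dir/.ssh"
    auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    touch "$auth_file"

    # -F fixed strings, -x whole-line match, -f read patterns from file, -q quiet.
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi

    # Only the owner may read or write the directory and the key list.
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Demo in a scratch directory with a dummy key line (not a real key).
demo_home=$(mktemp -d)
printf 'ssh-rsa AAAA-not-a-real-key user@laptop\n' > "$demo_home/id_rsa.pub"
install_pubkey "$demo_home/id_rsa.pub" "$demo_home"
install_pubkey "$demo_home/id_rsa.pub" "$demo_home"   # second call adds nothing
wc -l < "$demo_home/.ssh/authorized_keys"
```

On a real host you would call install_pubkey ~/id_rsa.pub with no second argument. Where it is installed, ssh-copy-id does a similar job from your local machine.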

Create and configure your SSH config file

touch ~/.ssh/config
chmod 600 ~/.ssh/config
emacs ~/.ssh/config

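Enter the following, replacing yourusername with your username on each host and the IdentityFile path with the path to your private key (for example, ~/.ssh/id_rsa)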

Host gardner
 HostName gardner.cri.uchicago.edu
 IdentityFile ~/.ssh/username
 User yourusername
Host midway2
 HostName midway2.rcc.uchicago.edu
 IdentityFile ~/.ssh/username
 User yourusername
Host bionimbus
 HostName bionimbus-pdc.opensciencedatacloud.org
 IdentityFile ~/.ssh/username
 User yourusername
Host argonne
 HostName login.mcs.anl.gov
 User yourusername
 IdentityFile ~/.ssh/username
Host washington
 HostName washington.cels.anl.gov
 User yourusername
 IdentityFile ~/.ssh/username
 ProxyCommand ssh -q -A argonne -W %h:%p

Now you should be able to directly ssh into any of the above hosts.

If you want to be able to log in with your RSA key pair instead of a password, you need to add your public key to the authorized_keys file on the remote host. For example, to log in directly to gardner, log in to gardner and run

cd ~/.ssh
vi authorized_keys

and paste in your public key.

BigQuery

BigQuery tutorials link
Google Cloud training document link
To do uploads from CRI to BigQuery, you will need to install the Google Cloud SDK link

Miscellaneous

This is another great collection of tools / intros for genomics and computational biology. It's like this training page, but has even more resources.
Read the genomic data user code of conduct
Reproducible Research link
Get CITI training link
- Basics of Health Privacy
- Responsible Conduct of Research (RCR) Basic
- Human Subjects Research – Biomedical
- Basics of Information Security
- Conflict of Interest
Enloc-coloc comparison
Jeff Leek's Github page

Cloning issues

install git (brew install git) or
upgrade git (brew upgrade git)
install git-lfs (brew install git-lfs)

Heather Wheeler's tutorials

https://www.notion.so/Heather-Wheeler-s-tutorials-f2e3a612d3d040a08db1becc139449b4

yangchuhua/Training_Bioinformatics
