MSc_TT2020

Welcome!

Here is a constantly updated set of resources that I have found useful and that should help prime you for working with 'high-dimensional data'. I will keep updating the table as you go along, but please add to the list and put in your own notes if you recommend something you have found. This field is constantly developing, and the techniques described are used across a spectrum of disciplines, so you may find answers to your questions on websites not immediately obvious to you!

  • Please sign up for an Amazon Web Services account using your educational email address to access their 'free tier', where you can experiment. Then join 'AWS Educate' to obtain free credits for your account and access to more powerful hardware. On AWS, always use the 'Europe - London (eu-west-2)' region. Set up a t2.micro (free-tier) 'EC2' instance with 30 GB (free-tier) of 'EBS' storage. If this is confusing, I agree! Google helps a lot.

  • Please sign up for a GitHub account.

  • Note: Do not install R version 4.0.0! Stay on 3.6.1. Do a quick google to find out why, and we can discuss it if you are interested!

Resources

| Topic | Link(s) | Notes |
| --- | --- | --- |
| Next-generation sequencing | StatQuest Intro to RNA-Seq | NGS is getting cheaper every day, and the volume of data being output is huge. It is vital to understand the principles of NGS and the basics of how an 'Illumina Sequencer' works. Find a video/article that works for you and we will talk about it. |
| Learn the command line | Codecademy | A large amount of pre-processing has to be done to our raw data before we visualise and analyse it in software such as R or Python. The tools we use to pre-process raw data are designed to be run 'from the command line', specifically the 'Bash' command line. Codecademy provides a great primer on how to use some of its functions, which will be essential when you need to handle the raw data files. |
| Fast-what? | Fasta/Fastq/SAM | A great StackOverflow answer offering advice on the different file formats you will encounter in genetic data. |
| PCA and many other guides | PCA StatQuest | StatQuest often provides a good basic introduction to many statistical concepts. PCA is integral to our work, as we will see later on. I recommend watching other StatQuest videos, and then exploring other videos/online resources if you are interested. |
| Learn Base R | Codecademy | This site provides an environment to learn the very basics of R in a short space of time. |
| Learn R the 'tidy' way | R4DS | Follow this guide's instructions on installing RStudio and the tidyverse package, and use the example datasets it provides to test out some of R's analysis and plotting features. We will use these features with our own data, once we have pre-processed it. |
| StackExchange/Overflow | StackOverflow | I highly encourage getting in the habit of googling the errors you encounter; you will find the answers on sites such as StackOverflow 99% of the time. Googling problems and finding solutions is a valid learning method! |
| Biostars | Biostars | A forum providing discussion topics on many aspects of genetic data analysis. |
| Bioconductor | Bioconductor | Data analysis packages for R are often submitted to 'Bioconductor', a repository of packages that must be maintained and documented to a certain standard so that the public can use them properly. An example package is 'DESeq2', below. While you are on Bioconductor, check out the list of top packages to see what people are using, and think about why you might want to use them. They are going to help you! |
| Differential Expression Analysis | DESeq2 Vignette | DESeq2 is often the best R package for differential gene expression analysis. This link will take you to the user guide, which is referred to as a 'vignette' for Bioconductor packages. Start to notice the formats of data that DESeq2 accepts and the importance of how DESeq2 normalises samples, which is plainly explained in the StatQuest video below. |
| DESeq2 Analysis Workflow | Vignette | The first author of DESeq2, Mike Love, also maintains a workflow vignette that you may find easier to interpret in the early stages. This workflow uses an example RNA-seq dataset, 'airway', which can be installed in R as a package (google this!). Notice how in the workflow he refers to the PubMed accession and GEO accession for 'metadata' and 'raw data': these are important sources of information about the experimental details. |
| Library normalisation | StatQuest | The first video of a series on normalisation. This is one of many resources you will find on the internet about the importance of normalisation in data analysis. You may find that watching the FPKM/TPM video alongside it helps. |
| Visualising data with Shiny | Shiny Gallery | Getting familiar with different ways of plotting data helps you understand your data better and share your findings. R provides 'Shiny', a framework whereby your analysis results can be turned into a great-looking website that you can share. Check out the gallery and then google some examples of RNA-seq analyses in R Shiny format. |

Papers

Here are some of the papers from our lab that concern RNA-seq data. Make a note of the methods used to produce and analyse the data, and think of the challenges that may be present in analysing it. We will be using these datasets by first replicating the analysis, then looking to compare independent studies (GBA & LRRK2 - Tara) or investigate splicing (LRRK2 transcriptome-wide - Eugenio, LRRK2 specifically - Guusje).

Analysis package papers/examples of interesting analysis methods

RNA-Seq/Statistics papers

Packages/commands often used

This is a growing list of packages/commands that I use very often

- ls, cd, mv, cp, rm, htop, ssh, screen, cut, sort, uniq, >, |, for i in *, parallel, echo (a few of these are combined in the sketch after this list)
- biomaRt
- Tidyverse (ggplot2, dplyr, tidyr...)
- prcomp
- kallisto
- tximport
- DESeq2
- fastqc
- STAR
- multiqc
- trim_galore
- Cluster Window Manager for Google Chrome: A Godsend
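A couple of the commands above, combined into a minimal sketch (the SAM file name is illustrative):

# Count how many reads align to each reference sequence: column 3 of a SAM record is the chromosome name
grep -v '^@' sample.sam | cut -f3 | sort | uniq -c | sort -rn > chrom_counts.txt
# A basic for loop over files, here just printing each name
for f in *.fastq.gz ; do echo "Processing $f" ; done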

Week 1 Task list - Updated Wednesday

  • Learn command line essentials (cd, ls, mv, cp, rm, wget, echo)
  • Set up an AWS t2.micro instance with 30 GB EBS storage and ssh into this instance
  • Install conda on your AWS instance
  • Install fastqc in a conda environment named 'week_1' on your AWS instance
  • Download some fastq files from an RNA-seq project (GBA bulk, LRRK2 neurons, etc.) using wget
  • Install 'filezilla'/WinSCP, or any program capable of 'sftp' and download your fastqc report(s) to your computer
  • Evaluate the quality of a fastq file from the report generated by fastqc
  • Practice using tab autocompletion on the command line
  • Practice using the up and down arrows to cycle through command history
  • Copy your command 'history' to a GitHub document for reference later
  • Start using 'screen' to run programs in the background (screen -S name to start a named session, Ctrl-a d to detach, screen -r name to reattach, Ctrl-a k to kill the current window); a short sketch follows this list
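A minimal 'screen' workflow, assuming fastqc is already installed (the session and file names are illustrative):

screen -S week_1_qc          # start a named session
fastqc sample_1.fastq.gz     # launch the long-running job inside it
# press Ctrl-a then d to detach; the job keeps running
screen -r week_1_qc          # reattach later to check on it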

Wednesday: Guide

  1. Find a paper with bulk RNA sequencing. I recommend this airway paper, as we may be using this paper's data for more practice later on.
  2. Locate the paper's project page on the EBI 'European Nucleotide Archive' (ENA). Here is the page for the above paper
  3. Use 'Select columns' to choose which columns you need. We need the 'fastq ftp' column, and a column that helps us identify which sample is which
  4. Download the 'TSV' (tab-separated values) file and open it in Excel.
  5. Use some of the 'fastq ftp' links to 'wget' the fastq files to our AWS instance
  6. Run fastqc on each fastq file (steps 5 and 6 are sketched after this list)
  7. Install Filezilla, WinSCP or equivalent on your local machine
  8. Add a new 'site': Protocol is SFTP, Host is the 'public DNS' of your AWS instance, Port is 22, User is ubuntu and key file is the '.pem' file you made yesterday.
  9. Connect and download all the 'fastqc' output files
  10. Open the '.html' files in your browser and take a look at the information. Google these quality checks and read the 'Babraham' guidance on them: Example
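For steps 5 and 6, a minimal sketch. It assumes the report is saved as filereport.tsv and that 'fastq_ftp' is column 2 (check your own file); ENA lists paired-end links separated by ';' and without the ftp:// prefix:

cut -f2 filereport.tsv | tail -n +2 | tr ';' '\n' > fastq_links.txt   # one URL per line, header row dropped
while read url ; do wget "ftp://$url" ; done < fastq_links.txt        # download each fastq file
fastqc *.fastq.gz                                                     # QC every downloaded file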

Thursday: Guide

  1. Access the EC2 instance (ec2-3-11-80-159.eu-west-2.compute.amazonaws.com) (password: trinity)
  2. Install miniconda in your home folder
  3. Install STAR in a conda environment
  4. Download the fastq files from the airway project (perhaps divide up the task of downloading)
  5. Download the Gencode primary assembly FASTA and GTF for human
  6. Generate a genome index using STAR: Follow section 2.1 of the manual (a sketch follows the loop example below)
  7. Use a for loop to map your fastq reads to the genome index in STAR: See section 3.1 of the manual
Example for loop (echo prints each command so you can check it; remove the echo to actually run them):
for file in *1.fastq.gz ; do echo "STAR --runThreadN 4 --genomeDir folder_where_genome_index_is --readFilesIn $file ${file%1.fastq.gz}2.fastq.gz --readFilesCommand zcat" ; done
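And for step 6, a minimal index-generation sketch (the GENCODE file names are placeholders, and --sjdbOverhang should be your read length minus 1; 99 assumes 100 bp reads):

STAR --runMode genomeGenerate --runThreadN 4 \
  --genomeDir star_index \
  --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
  --sjdbGTFfile gencode.annotation.gtf \
  --sjdbOverhang 99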

Friday: Guide

  1. Our objective is to complete mapping the airway fastq files to the human genome reference (GRCh38) from GENCODE.
  2. Your AWS instance is accessible using: ssh -i "msc_tt2020.pem" ubuntu@ec2-18-132-67-54.eu-west-2.compute.amazonaws.com from the directory in which your .pem file is stored.
  3. One person should install miniconda, then notify everyone when it is installed.
  4. Each person creates their own environment, with their initials in the name.
  5. Thursday's work is stored in the /home/ubuntu/thursday folder
  6. Friday's space is /home/ubuntu/friday
  7. To see how much space is free, type df -h: you will see that the volume containing the thursday folder is currently almost full.
  8. To see how much space each folder in a given directory takes up, type du -chd 1
  9. Using these commands, try to decide whether you can optimise the amount of space you are using in the thursday folder (you could do this while waiting for STAR to generate an index in the friday folder)
  10. One way to save space is to avoid having duplicates of large files. You could agree upon a single folder to store the fastq files, reference genome files, and STAR index, and then create symbolic links to your own working folders (or just reference this single shared folder in each of your commands).
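For step 10, a minimal symlink sketch (folder names are illustrative):

mkdir -p /home/ubuntu/friday/shared                  # one agreed-upon folder for the large files
mv *.fastq.gz /home/ubuntu/friday/shared/            # move rather than copy, to avoid duplicates
ln -s /home/ubuntu/friday/shared ~/my_work/shared    # the link takes up almost no space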

Monday 11th May: Guide

  1. Sync your personal AWS S3 folder to a personal folder within the Monday folder on the instance.
  2. Set up miniconda + an environment with samtools, picard, rseqc and subread.
  3. Convert your SAM files to BAM files, sort them, index them.
  4. Run Picard CollectRnaSeqMetrics on each BAM file
  5. Run any RSeQC modules you see as informative (e.g. junction_saturation, inner_distance, read_distribution, read_duplication...)
  6. Compile the QC reports from fastqc, STAR, Picard and RSeQC into a MultiQC report (see the MultiQC documentation)
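A minimal sketch for steps 1, 3 and 6, assuming your STAR output is in SAM format (the S3 bucket and folder names are illustrative):

aws s3 sync s3://your-bucket/your-folder monday/your-name   # step 1: pull your S3 folder onto the instance
for sam in *.sam ; do
  samtools view -b "$sam" > "${sam%.sam}.bam"                   # SAM -> BAM
  samtools sort -o "${sam%.sam}.sorted.bam" "${sam%.sam}.bam"   # coordinate-sort
  samtools index "${sam%.sam}.sorted.bam"                       # creates the .bai index
done
multiqc .   # step 6: multiqc picks up the fastqc/STAR/Picard/RSeQC outputs it finds in the current directory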

Wednesday 13th May: Guide

  • Merge/join featureCounts counts tables (a bash sketch follows this list).
  • Convert sample names to just the SRR accession number
  • Create sample metadata/coldata table with relevant biological and technical groups labelled
  • Import count data and metadata into DESeq2 in R
  • Run a treatment group comparison in DESeq2, by expressing a 'design' (e.g. ~ treatment_group)
  • Produce the DESeq2 results table and filter for genes with an adjusted P value below 0.01 and an absolute log2FoldChange greater than 1 (i.e. both up- and downregulated)
  • Plot a gene's expression between treatment groups
  • Export a normalised counts table from DESeq2
  • Use a normalised counts table to run PCA with prcomp
  • Plot a PCA biplot and label the samples according to their biological or technical groups.
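A bash sketch of the merge step. It assumes one featureCounts table per sample, that every table lists genes in the same order (true if the same GTF was used), and that the leading '#' comment line has been removed; in default featureCounts output, column 1 is Geneid and column 7 holds the counts:

cut -f1 sample1.counts.txt > merged_counts.txt   # start with the gene IDs
for f in *.counts.txt ; do
  cut -f7 "$f" | paste merged_counts.txt - > tmp && mv tmp merged_counts.txt   # append each sample's counts column
done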

Monday 18th May: Info

  • Google drive link to RWM data

  • Go to the 'experiments' folder and download the fastq files for gba_bulk and lrrk2

  • Apply the fastqc/multiqc approach

  • Acquaint yourself with trim_galore

  • Acquaint yourself with kallisto

  • For kallisto, you will need a transcriptome reference, not a primary genome reference. Beware of this when downloading a reference Fasta from Gencode/Ensembl.

  • Kallisto does not produce a log file, but it prints useful information to the terminal. MultiQC can use this output, so redirect it to a file (see the sketch after this list)

  • You will need for loops for some of this

  • Guusje: Read the Snaptron paper and Snaptron User Guide.

  • We should be able to query LRRK2 splicing using the Snaptron client for their web service. We will then want all the metadata possible for this.
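For the trim_galore and kallisto steps above, a minimal sketch (file names are illustrative; the _val_ names follow trim_galore's default paired-end output convention, and the 2> captures the run summary that kallisto prints to standard error):

trim_galore --paired sample_1.fastq.gz sample_2.fastq.gz         # adapter and quality trimming
kallisto index -i transcripts.idx gencode.transcripts.fa.gz      # build the transcriptome index once
kallisto quant -i transcripts.idx -o sample_out sample_1_val_1.fq.gz sample_2_val_2.fq.gz 2> sample_kallisto.log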

Adding directories to the PATH variable

  • If we remember, the PATH variable is where our shell looks for applications. We can see what it is currently equal to by running echo $PATH:
/Users/peterkilfeather/miniconda3/bin:/Users/peterkilfeather/miniconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
  • If we want to add a folder containing an application, or "binary", we have to add it to the PATH. You can do this with a command, but often the simplest way is to edit the file where the PATH variable is set: on Linux this is ~/.bashrc, on Mac it is ~/.bash_profile. In that file, you will see the PATH variable.
  • Add a line at the bottom: export PATH="/path/to/new/binaries:$PATH"
  • Save and load the new bash: source ~/.bashrc
  • Test it: echo $PATH
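Putting it together (the folder ~/tools/bin is illustrative):

echo 'export PATH="$HOME/tools/bin:$PATH"' >> ~/.bashrc   # append the export line (Linux; use ~/.bash_profile on Mac)
source ~/.bashrc                                          # reload so the current shell picks up the change
echo $PATH                                                # the new folder should now appear first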