Here is a regularly updated set of resources that I have found useful and that should help prepare you for working with 'high dimensional data'. I will keep updating the table as you go along, but please add to the list and put your own notes in if you recommend something you have found. This field is constantly developing, and the techniques described are used across a spectrum of disciplines, so you may find answers to your questions on websites that are not immediately obvious to you!
- Please sign up for an Amazon Web Services account using your educational email address, to access their 'free tier', where you can experiment. Please then join 'AWS Educate', so you can obtain free credits for your account, to use more powerful hardware. On AWS, always use the 'Europe (London)' region (eu-west-2). Set up a t2.micro (free-tier) 'EC2' instance, with 30 GB (free-tier) of 'EBS' storage. If this is confusing, I agree! Google helps a lot.
- Please sign up for a GitHub account.
- Note: Do not install R version 4.0.0! Stay on 3.6.1. Do a quick Google search to find out why, and we can discuss it if you are interested!
Topic | Link(s) | Notes |
---|---|---|
Next-generation sequencing | StatQuest Intro to RNA-Seq | NGS is getting cheaper every day, and the volume of data being output is huge. It is vital to understand the principles of NGS and to understand the basics of how an 'Illumina Sequencer' works. Find a video/article that works for you and we will talk about it |
Learn the command line | Codecademy | A large amount of pre-processing has to be done on our raw data before we visualise and analyse it in software such as R or Python. The tools we use to pre-process raw data are designed to be run 'from the command line', specifically the 'Bash' command line. Codecademy provides a great primer on how to use some of its functions, which will be essential when you need to handle the raw data files |
Fast-what? | Fasta/Fastq/SAM | An example of a great StackOverflow answer discussing advice about different file formats you will encounter in genetic data |
PCA and many other guides | PCA StatQuest | StatQuest often provides a good basic introduction to many statistical concepts. PCA is integral to our work, as we will see later on. I recommend watching other StatQuest videos, and then exploring other videos/online resources if you are interested |
Learn Base R | Codecademy | This site provides an environment to learn the very basics of R in a short space of time. |
Learn R the 'tidy' way | R4DS | Follow this guide's instructions on installing RStudio, the tidyverse package, and use the example datasets it provides to test out some of R's analysis and plotting features. We will use these features with our own data, once we have pre-processed it |
StackExchange/Overflow | StackOverflow | I highly encourage getting into the habit of googling errors you encounter, as you will find the answer to these errors on sites such as StackOverflow 99% of the time. Googling problems and finding solutions is a valid learning method! |
Biostars | Biostars | A forum providing discussion topics on many aspects of genetic data analysis. |
Bioconductor | Bioconductor | Data analysis packages for R are often submitted to 'Bioconductor', a repository of packages that are required to be maintained and documented to a certain standard, to ensure the public can use the package properly. An example package is 'DESeq2', below. While you are on Bioconductor, check out the list of top packages to see what packages people are using, and think about why you might want to use them. They are going to help you! |
Differential Expression Analysis | DESeq2 Vignette | DESeq2 is often the best R package for differential gene expression analysis. This link will take you to the user guide, which is referred to as a 'vignette' for Bioconductor packages. Start to notice the formats of data that DESeq2 accepts and the importance of how DESeq2 normalises samples, which is plainly explained in the StatQuest video below. |
DESeq2 Analysis Workflow | Vignette | The lead author of DESeq2, Mike Love, also maintains a workflow vignette that you may find easier to interpret in the early stages. This workflow uses an example RNA-seq dataset, 'airway'. This dataset can be installed in R as a package (google this!). Notice how, in the workflow, he refers to the PubMed accession and GEO accession for 'metadata' and 'raw data': these are important sources of information about the experimental details. |
Library normalisation | StatQuest | The first video of a series on normalisation. This is one of many resources you will find on the internet about the importance of normalisation in data analysis. You may find that watching the FPKM/TPM video alongside helps. |
Visualising data with Shiny | Shiny Gallery | Getting familiar with different ways of plotting data helps you understand your data better and share your findings. 'Shiny' is an R package that lets you turn your analysis results into a great-looking interactive web app that you can share. Check out the gallery and then google some examples of RNA-seq analyses that are in R Shiny format. |
Here are some of the papers from our lab that concern RNA-seq data. Make a note of the methods used to produce and analyse the data, and think of the challenges that may be present in analysing it. We will be using these datasets by first replicating the analysis, then looking to compare independent studies (GBA & LRRK2 - Tara), or investigate splicing (LRRK2 transcriptome-wide - Eugenio, LRRK2 specifically - Guusje).
- Single-Cell Sequencing of iPSC-Dopamine Neurons Reconstructs Disease Progression and Identifies HDAC4 as a Regulator of Parkinson Cell Phenotypes.
- RNA sequencing reveals MMP2 and TGFB1 downregulation in LRRK2 G2019S Parkinson's iPSC-derived astrocytes.
- An integrated transcriptomics and proteomics analysis reveals functional endocytic dysregulation caused by mutations in LRRK2
- Transcriptomic profiling of purified patient-derived dopamine neurons identifies convergent perturbations and therapeutics for Parkinson’s disease.
- Cellular α-synuclein pathology is associated with bioenergetic dysfunction in Parkinson’s iPSC-derived dopamine neurons
Analysis package papers/examples of interesting analysis methods
- Annotation-free quantification of RNA splicing using LeafCutter
- Integrative transcriptome analyses of the aging brain implicate altered splicing in Alzheimer’s disease susceptibility
RNA-Seq/Statistics papers
This is a growing list of packages/commands that I use very often
- ls, cd, mv, cp, rm, htop, ssh, screen, cut, sort, uniq, >, |, for i in *, parallel, echo
- biomaRt
- Tidyverse (ggplot2, dplyr, tidyr...)
- prcomp
- kallisto
- tximport
- DESeq2
- fastqc
- STAR
- multiqc
- trim_galore
- Cluster Window Manager for Google Chrome: A Godsend
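As a small illustration of how several of the commands in this list chain together with a pipe, the sketch below counts how many samples fall in each group of a metadata table. The table contents and the `meta` and `counts` variable names are made up for the example.

```shell
set -eu

# Made-up sample sheet: one row per sample, tab-separated
meta="$(mktemp)"
printf 'sample\tgroup\n' > "$meta"
printf 's1\tcontrol\ns2\ttreated\ns3\tcontrol\n' >> "$meta"

# tail skips the header | cut keeps column 2 |
# sort puts duplicates next to each other | uniq -c counts them
counts="$(tail -n +2 "$meta" | cut -f 2 | sort | uniq -c)"
echo "$counts"
```

The same pattern works on any tab-separated file; change the number after `cut -f` to pick a different column.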
- Learn command line essentials (cd, ls, mv, cp, rm, wget, echo)
- Set up an AWS t2.micro instance with 30 GB EBS storage and ssh into this instance
- Install conda on your AWS instance
- Install fastqc in a conda environment named 'week_1' on your AWS instance
- Download some fastq files from an RNA-seq project (GBA bulk, LRRK2 neurons, etc.) using wget
- Install 'filezilla'/WinSCP, or any program capable of 'sftp' and download your fastqc report(s) to your computer
- Evaluate the quality of a fastq file from the report generated by fastqc
- Practice using tab autocompletion on the command line
- Practice using the up and down arrows to cycle through command history
- Copy your command 'history' to a GitHub document for reference later
- Start using 'screen' to run programs in the background (screen -S name to start a named session, Ctrl-a then d to detach, screen -r name to reattach, Ctrl-a then k to kill the current window)
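One quick sanity check after downloading a fastq file, before running fastqc, is to count its reads: each fastq record occupies four lines, so the read count is the line count divided by four. The sketch below uses a made-up two-read file; the file name and the `workdir`, `lines` and `reads` variables are only for illustration.

```shell
set -eu

# Made-up two-read fastq file, compressed like a real download
workdir="$(mktemp -d)"
fq="$workdir/example_1.fastq"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$fq"
gzip "$fq"    # produces example_1.fastq.gz

# Each fastq record is 4 lines: header, sequence, '+', qualities
lines="$(gzip -dc "$fq.gz" | wc -l | tr -d ' ')"
reads=$(( lines / 4 ))
echo "$fq.gz: $reads reads"
```

If the line count is not a multiple of four, the download was probably interrupted.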
- Find a paper with bulk RNA sequencing. I recommend this airway paper, as we may be using this paper's data for more practice later on.
- Locate the EBI 'Nucleotide Sequencing' Project page for the paper. Here is the page for the above paper
- Use 'Select columns' to choose which columns you need. We need the 'fastq ftp' column, and a column that helps us identify which sample is which
- Download the 'TSV' (tab-separated values) file and open it in Excel.
- Use some of the 'fastq ftp' links to 'wget' the fastq files to our AWS instance
- Run fastqc on each fastq file
- Install Filezilla, WinSCP or equivalent on your local machine
- Add a new 'site': Protocol is SFTP, Host is the 'public DNS' of your AWS instance, Port is 22, User is ubuntu and key file is the '.pem' file you made yesterday.
- Connect and download all the 'fastqc' output files
- Open the '.html' files in your browser and take a look at the information. Google these quality checks and read the 'Babraham' guidance on them: Example
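The TSV-to-wget step above can also be done on the command line rather than by copying links out of a spreadsheet. This is a minimal sketch under some assumptions: the 'fastq ftp' column is the second column, paired-end files are separated by `;` within it, and the links lack the `ftp://` prefix. The TSV content here is made up (`ftp.example.org` is a placeholder host), so check these assumptions against your own file.

```shell
set -eu

# Made-up stand-in for the TSV downloaded from the project page
tsv="$(mktemp)"
printf 'run_accession\tfastq_ftp\tsample_title\n' > "$tsv"
printf 'RUN1\tftp.example.org/RUN1_1.fastq.gz;ftp.example.org/RUN1_2.fastq.gz\tcontrol\n' >> "$tsv"
printf 'RUN2\tftp.example.org/RUN2_1.fastq.gz;ftp.example.org/RUN2_2.fastq.gz\ttreated\n' >> "$tsv"

# Skip the header, keep the 'fastq ftp' column (column 2 here),
# and split paired-end entries on ';' so there is one URL per line
urls="$(tail -n +2 "$tsv" | cut -f 2 | tr ';' '\n')"

# Dry run: print one wget command per URL instead of downloading
for url in $urls; do
  echo "wget ftp://$url"
done
```

Removing the `echo` (and the quotes around the command) would start the downloads, ideally inside a `screen` session.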
- Access the EC2 instance (ec2-3-11-80-159.eu-west-2.compute.amazonaws.com) (password: trinity)
- Install miniconda in your home folder
- Install STAR in a conda environment
- Download the fastq files from the airway project (perhaps divide up the task of downloading)
- Download the Gencode primary assembly FASTA and GTF for human
- Generate a genome index using STAR: Follow section 2.1 of the manual
- Use a for loop to map your fastq reads to the genome index in STAR: See section 3.1 of the manual
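Example for loop (it only prints each command with echo; remove the echo to actually run STAR):
for file in *1.fastq.gz ; do echo "STAR --runThreadN 4 --genomeDir folder_where_genome_index_is --readFilesCommand zcat --readFilesIn $file ${file%1.fastq.gz}2.fastq.gz --outFileNamePrefix ${file%1.fastq.gz}" ; done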
- Our objective is to complete the mapping of the airway fastq files to the GRCh38 human reference genome from GENCODE.
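A slightly more defensive version of the mapping loop could check that each read-1 file has its read-2 mate before building the STAR command. The sketch below is a dry run on empty placeholder files; the sample names, the `star_commands.txt` file and the `_1`/`_2` naming scheme are assumptions for illustration, and the thread count is just the value used in the example loop.

```shell
set -eu

# Empty placeholder files standing in for real paired-end downloads
workdir="$(mktemp -d)"
cd "$workdir"
touch sampleA_1.fastq.gz sampleA_2.fastq.gz sampleB_1.fastq.gz   # sampleB has no mate

genome_dir="folder_where_genome_index_is"

for r1 in *_1.fastq.gz; do
  sample="${r1%_1.fastq.gz}"        # strip the suffix to get the sample name
  r2="${sample}_2.fastq.gz"
  if [ ! -f "$r2" ]; then
    echo "Skipping $sample: $r2 not found" >&2
    continue
  fi
  # Dry run: print the command; remove 'echo' to actually run STAR
  echo STAR --runThreadN 4 --genomeDir "$genome_dir" \
    --readFilesCommand zcat --readFilesIn "$r1" "$r2" \
    --outFileNamePrefix "${sample}_"
done > star_commands.txt

cat star_commands.txt
```

Giving each sample its own `--outFileNamePrefix` stops one run from overwriting the output of the previous one.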
- Your AWS instance is accessible using `ssh -i "msc_tt2020.pem" ubuntu@ec2-18-132-67-54.eu-west-2.compute.amazonaws.com` from the directory in which your `.pem` file is stored.
- One person installs miniconda, then notifies everyone when it is installed.
- Each person creates their own environment, including their initials in its name.
- Thursday's work is stored in the `/home/ubuntu/thursday` folder.
- Friday's space is `/home/ubuntu/friday`.
- To see how much space is free, type `df -h`: you will see that the `thursday` folder is currently almost full.
- To see how much space each folder in a given directory takes up, type `du -chd 1`.
- Using these commands, try to decide whether you can optimise the amount of space you are using in the `thursday` folder (you could do this while waiting for STAR to generate an index in the `friday` folder).
- One way to save space is to avoid having duplicates of large files. You could agree upon a single folder to store the fastq files, reference genome files, and STAR index, and then create symbolic links to it from your own working folders (or just reference this single shared folder in each of your commands).
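The shared-folder idea can be tried out safely in a temporary directory first. In this sketch the `shared` and `pk_work` folder names and the placeholder file are hypothetical; on the instance you would use the folder your group agreed on.

```shell
set -eu

# Temporary stand-ins for a shared data folder and one personal working folder
workdir="$(mktemp -d)"
mkdir -p "$workdir/shared/fastq" "$workdir/pk_work"
printf 'placeholder\n' > "$workdir/shared/fastq/reads_1.fastq.gz"   # not a real fastq

# Link the shared folder into the working folder instead of copying it
ln -s "$workdir/shared/fastq" "$workdir/pk_work/fastq"

# The file is reachable through the link, but stored only once
ls -l "$workdir/pk_work"
cat "$workdir/pk_work/fastq/reads_1.fastq.gz"

# Free space on the file system, then the size of each folder one level down
df -h "$workdir"
du -chd 1 "$workdir"
```

In the `du` output the working folder stays tiny, because the link itself takes almost no space.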
- Today's AWS address: `ec2-18-130-119-110.eu-west-2.compute.amazonaws.com`
- Samtools tutorial
- Airway paper, see RNA seq methods section
- Picard, CollectRnaSeqMetrics (note: Picard can be fussy about how the command is written!)
- RSeQC
- MultiQC: Check out the modules section for information on what it needs to compile information from each tool
- featureCounts
- Sync your personal AWS S3 folder to a personal folder within the `Monday` folder on the instance.
- Set up miniconda + an environment with samtools, picard, rseqc and subread.
- Convert your SAM files to BAM files, sort them, index them.
- Run Picard CollectRnaSeqMetrics on each BAM file
- Run any RSeQC modules you see as informative (e.g. junction_saturation, inner_distance, read_distribution, read_duplication...)
- Compile QC reports from fastqc, STAR, Picard and RSeQC into a MultiQC report (see MultiQC documentation)
- Merge/join featureCounts count tables.
- Convert sample names to just SRR accession number
- Create sample metadata/coldata table with relevant biological and technical groups labelled
- Import count data and metadata into DESeq2 in R
- Run a treatment group comparison in DESeq2, by expressing a 'design' (e.g. ~ treatment_group)
- Produce the DESeq2 results table and filter for genes with an adjusted p-value below 0.01 and an absolute log2 fold change greater than 1 (i.e. both up- and down-regulated genes)
- Plot a gene's expression between treatment groups
- Export a normalised counts table from DESeq2
- Use a normalised counts table to run PCA with prcomp
- Plot a PCA biplot and label the samples according to their biological or technical groups.
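For the merge and rename steps in the list above, one possible approach uses only `cut`, `paste` and `sed`. This is a sketch, assuming featureCounts was run once per BAM file (so the counts sit in column 7 after a `#` comment line) and that every table lists the genes in the same order. The mock tables, run names and counts below are made up.

```shell
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# Write a mock featureCounts table (made-up counts) for one run
make_mock() {   # usage: make_mock <run> <count_gene1> <count_gene2>
  {
    printf '# Program:featureCounts (mock comment line)\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\t%s.bam\n' "$1"
    printf 'GENE1\tchr1\t1\t100\t+\t100\t%s\n' "$2"
    printf 'GENE2\tchr1\t200\t400\t-\t201\t%s\n' "$3"
  } > "$1.counts.txt"
}
make_mock RUN1 10 0
make_mock RUN2 7 3

# Start the merged table with the gene IDs from the first file
set -- *.counts.txt
grep -v '^#' "$1" | cut -f 1 > merged.tsv

for f in *.counts.txt; do
  run="${f%.counts.txt}"
  # Column 7 holds the counts; swap its header (the BAM name) for the run name
  grep -v '^#' "$f" | cut -f 7 | sed "1s/.*/$run/" > "$run.col"
  paste merged.tsv "$run.col" > merged.tmp
  mv merged.tmp merged.tsv
done

cat merged.tsv
```

The resulting table has one gene column and one count column per run, which is the shape DESeq2 expects once the gene IDs are used as row names.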
- Go to the 'experiments' folder and download the fastq files for gba_bulk and lrrk2
- Apply the FastQC/MultiQC approach
- Acquaint yourself with trim_galore
- Acquaint yourself with kallisto
- For kallisto, you will need a transcriptome reference, not a primary genome reference. Beware of this when downloading a reference FASTA from Gencode/Ensembl.
- Kallisto does not produce a log file, but outputs useful information to the terminal. MultiQC can use this output, so redirect it to a file
- You will need for loops for some of this
- Guusje: Read the Snaptron paper and Snaptron User Guide.
- We should be able to query LRRK2 splicing using the Snaptron client to their web service. We then want all the metadata possible for this.
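On the kallisto logging point above, the sketch below shows one way to capture everything a tool prints to the terminal in a per-sample log file. `noisy_tool` is a hypothetical stand-in, not kallisto, so that the example runs anywhere; the commented kallisto line shows how the same redirection might be applied, with `transcripts.idx` and the file names as assumed placeholders.

```shell
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical stand-in for a tool that reports only to the terminal.
# It is NOT kallisto; it just writes a message to stderr.
noisy_tool() {
  echo "[quant] finished sample $1" >&2
}

for sample in s1 s2; do
  # '> file 2>&1' sends both stdout and stderr to the log file
  noisy_tool "$sample" > "${sample}.log" 2>&1
  # With kallisto installed, the same pattern would look like:
  #   kallisto quant -i transcripts.idx -o "${sample}_out" \
  #     "${sample}_1.fastq.gz" "${sample}_2.fastq.gz" > "${sample}.log" 2>&1
done

cat s1.log s2.log
```

Capturing both streams avoids having to know which one the tool writes to.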
- If we remember, the PATH variable is where our shell looks for applications. We can see what it is currently set to by running `echo $PATH`: `/Users/peterkilfeather/miniconda3/bin:/Users/peterkilfeather/miniconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`
- If we want the shell to find an application, or "binary", in a new folder, we have to add that folder to the PATH. You can do this with a command, but often the simplest way is to edit the file where the PATH variable is set. On Linux, this is `~/.bashrc`; on Mac, this is `~/.bash_profile`. In that file, you will see the PATH variable.
- Add a line at the bottom: `export PATH="/path/to/new/binaries:$PATH"`
- Save the file and reload it: `source ~/.bashrc`
- Test it: `echo $PATH`
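To see the effect of the PATH change without touching your startup file, you can try it in the current shell first. In this sketch `my_tool` is a made-up one-line script in a temporary folder; the change lasts only until you close the shell.

```shell
set -eu

# A temporary folder holding a tiny made-up executable
bindir="$(mktemp -d)"
printf '#!/bin/sh\necho "hello from my_tool"\n' > "$bindir/my_tool"
chmod +x "$bindir/my_tool"

# Put the new folder at the front of PATH for this shell session only
export PATH="$bindir:$PATH"

# The shell now finds the tool by name
command -v my_tool
my_tool
```

Because the new folder is at the front of PATH, it takes priority over any tool of the same name elsewhere on the system.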