This workshop is intended to introduce the command-line tools and file types used in a simple bioinformatics workflow for mutation detection with Patient-Derived Xenograft (PDX) Whole Exome Sequencing (WES) data. It was written for the Humanized Mice in Biomedicine workshop (2019) at EMBL, Heidelberg.
- Basic scientific background of PDX models and mutations
- You should have Docker and ORCA installed (done for you)
- Part 1: This will introduce you to the command line
- Part 2: You will use ORCA, a Docker-based bioinformatics environment, to align some FASTQ files from the NCI Patient-Derived Models Repository (PDMR)
- Part 3: You will use `freebayes` to call genomic variants
Time plan:
- 14:30 Part 1
- 15:00 Part 2
- 15:30 Part 3
- Become familiar with the Command-Line Interface
- Understand the common file formats used in genomic variant detection
- Perform sequencing read alignment
- Run a variant caller to identify Single Nucleotide Variations (SNVs) and insertion/deletion mutations (indels)
For a lot of scientists, bioinformatics is a mysterious and daunting black box. My aim for this workshop is not necessarily to get you performing variant analysis in an hour (though there's content here ready if you want to!), but rather to lift up the curtain and give you an idea of what's actually happening in that black box, and convince you that you can do it too.
There is a lot to cover even with a simplified variant calling pipeline, and it is unlikely we will get through everything in the 1.5 hours we have. However, the tutorial should give you the starting tools to tackle the rest of it. You are encouraged to finish the workshop in your own time should you wish. I will respond to any issues or questions you have on the Humanised Mice Slack, the GitHub repository, or by email if you prefer.
Already comfortable with the command line? Here's what you can expect to add to your bioinformatics repertoire:
- A troubleshooting strategy for most situations
- Modern best practices for reproducible analyses: using and creating your own Docker images
- An introduction to using R for downstream analysis and presentation
The workshop will continue to be updated, so please feel free to submit questions or suggestions. Let's make a useful resource for other PDX scientists interested in analysing their own data!
When working with bioinformatics, it is impossible to avoid working with the Command Line Interface (CLI). You will use it to access all of the tools of the trade. At first it will be daunting and getting things done will take a while, but soon enough you will become more comfortable with it.
First, complete the exercises in Amber Wright's Command-line Tutorials:
- Introduction to the Command-Line
- Beginner's guide to pagers
- Familiarising yourself with the unfamiliar
- Reading files again
Some files for analysis have already been downloaded to your computer in the `/data` directory. Using your newfound command-line skills, we are going to investigate them and show you what some of these files actually look like.
These FASTQ files are from the sample OT-FWCP10 from the PDMR database. In the interests of time, I have isolated only those reads that should map to p13.1 of chromosome 17 (chr17:6,600,000-10,600,000, around 4Mb worth). The reference genome (build hg38) is from Illumina's iGenomes.
This will build on what you learned in Part 1.
- Change directory into `/data`
- List the files in the directory
- Let's organise things a bit:
  - Make a `fastqs` directory
  - Move the FASTQ files into it. You can use `mv` to do this:

    ```
    mv r1.fastq fastqs/
    mv r2.fastq fastqs/
    ```

- List the files that are in `fastqs/`
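Put together, the steps above look like the short shell session below. This sketch uses a throwaway directory and empty stand-in files so it can be run anywhere; in the workshop itself you would work in `/data` with the real `r1.fastq` and `r2.fastq`.

```shell
# Sketch of the steps above. /tmp/data and the empty files are stand-ins
# for the workshop's /data directory and real FASTQ files.
mkdir -p /tmp/data && cd /tmp/data
touch r1.fastq r2.fastq       # stand-ins for the real FASTQ files
ls                            # list the files
mkdir -p fastqs               # make a fastqs directory
mv r1.fastq r2.fastq fastqs/  # move both FASTQ files in one command
ls fastqs/                    # check they arrived
```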
We are going to use ORCA for these exercises. You can think of it as a tiny, self-contained Linux computer working inside your current machine. There are benefits to using such an environment, such as ease of program installation and reproducibility.
To start working in ORCA, use the following command:
```
docker run --interactive --tty --volume /data:/Users/training --workdir /Users/training bcgsc/orca
```
There are a number of options here that we've called `docker` with:

- `--interactive` allows you to interact with the bash interface of the container
- `--tty` allocates a terminal (tty) session for the container
- `--volume <OS directory>:<docker directory>` mounts a volume into the container. This allows you to access the FASTQ files on your machine from within the Docker environment
- `--workdir` sets the current working directory for when you start the container
This will spin up your virtual Linux operating system, and you will see your prompt within ORCA's shell:

```
root@b4601473c56e:/Users/training#
```

All paths from now on will be relative to your `/data` directory.

You can use `exit` to close the container and return to your system shell at any time.
SAM files describe where your sequencing reads map to on a reference genome. Here is an example of a line that represents an alignment:

```
D00748:170:CD31MANXX:5:1101:1136:89194 99 chr17 10450470 255 124M = 10450535 189 CTCTGTGCGCTNGATGGCGTCCGTCTCGTACTTGGTCCNCCACTGGGCAACCTCACTGTTNNCCTTGGACATTCCCCTCTGCAGCTCAGCCTTGGCTTCCTGCTCCTCCTCATACTGTTCCCGC BBBBBFFFFFF#<BFBFFFFFFFFBFFFFFFFFFFFBB#<<FBFF/FBFF/BFFFFFFFF##<<<FF/<F/FFFBFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBBFFFFFFFBBFFBFBBBBB XA:i:1 MD:Z:11G26T21G0G62 NM:i:4 XM:i:2
```

Most of these fields do not convey much information at a glance, but the following make intuitive sense:

- `RNAME` (`chr17`): the reference chromosome that the read was mapped to
- `POS` (`10450470`): the coordinate of the leftmost read base
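You can pull these fields out yourself without any special tools, because SAM columns are tab-separated. A minimal sketch using `awk` (the short alignment line below is an invented stand-in, not real workshop data):

```shell
# Extract RNAME (column 3) and POS (column 4) from a SAM alignment line.
# SAM is tab-separated; header lines start with '@' and are skipped here.
# The read below is an invented stand-in, not real data.
printf 'read1\t99\tchr17\t10450470\t255\t124M\t=\t10450535\t189\t*\t*\n' \
  | awk -F'\t' '!/^@/ {print $3, $4}'
# prints: chr17 10450470
```

On a real file you would run the same `awk` command with the SAM file name as its last argument.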
Align the reads in the FASTQ files to a reference genome using a CLI tool called `bowtie`. We will use the command like so:

```
bowtie --sam <reference> -1 <R1.fastq> -2 <R2.fastq> <output.sam>
```

```
bowtie --sam --chunkmbs 200 Homo_sapiens/UCSC/hg38/Sequence/BowtieIndex/genome -1 fastqs/r1.fastq -2 fastqs/r2.fastq aligned.sam
```
The options we pass to `bowtie` are:

- `--sam` indicates we want the output to be in SAM format
- `--chunkmbs 200` is required to prevent our machines running out of memory while trying to find the alignment for each read
- `Homo_sapiens/UCSC/hg38/Sequence/BowtieIndex/genome` is the first unnamed option we use. It points to our reference genome index, as described above
- `-1` points to the location of our first FASTQ file - this contains the first mates of each paired-end read
- `-2` likewise points to the location of our second FASTQ file
- `aligned.sam` is our second unnamed option. It is the name that we want the output file to have
We need to do a couple of operations on our SAM file before it is ready for mutation detection: we need to compress and sort the data.

- Compress the file. We are going to use the binary SAM (BAM) format:

  ```
  samtools view -b aligned.sam > aligned.bam
  ```

- Sort the file. We can do this by using samtools again:

  ```
  samtools sort aligned.bam -o aligned.sorted.bam
  ```

Now that we have aligned our reads to the reference genome, we can perform variant calling - we can identify where there are mismatches between the two, which suggest a mutation in the experimental sample.

- Call variants using `freebayes`:

  ```
  freebayes -f Homo_sapiens/UCSC/hg38/Sequence/BowtieIndex/genome.fa aligned.sorted.bam > test_mutations.vcf
  ```

- If we call `head test_mutations.vcf` then we can preview the contents of the file. VCF files are tab-delimited. You can think of them like Excel spreadsheets, except they use tabs to separate columns and newlines to separate rows. You can even open one in Excel or another spreadsheet program.
- Take a VCF that has been generated from the full read dataset here and upload it to CRAVAT.us
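If you would rather poke at a VCF on the command line than in a spreadsheet: header lines start with `#`, and every other line is one variant record. A minimal sketch (the two records below are invented for illustration, not from the workshop data):

```shell
# Build a tiny stand-in VCF: two '#' header lines, two variant records.
printf '##fileformat=VCFv4.2\n' >  mini.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> mini.vcf
printf 'chr17\t6610000\t.\tA\tG\t50\t.\t.\n' >> mini.vcf
printf 'chr17\t7577120\t.\tC\tT\t99\t.\t.\n' >> mini.vcf

grep -vc '^#' mini.vcf   # count variant records (non-header lines): 2
# Print chromosome, position, and REF>ALT change for each record:
awk -F'\t' '!/^#/ {print $1, $2, $4">"$5}' mini.vcf
```

The same two commands work unchanged on `test_mutations.vcf` once you have generated it.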
We will use the `vcfR` package for R. To use it, we will run RStudio (via Rocker), which is a nice editor for working in R.
```
docker run -e PASSWORD=<YOUR_PASSWORD> -p 8787:8787 rocker/rstudio:3.6.1
```
- Navigate your web browser to localhost:8787
- Log into the RStudio interface with username `rstudio` and password `<YOUR_PASSWORD>`
- Install the package with `install.packages("vcfR")`
Whoops, everything looked like it was going fine, but it says that the install terminated somewhere! 🙊 We are now going to cover an essential programming skill: troubleshooting. Every programmer and bioinformatician - beginner or seasoned veteran - has to do this, and you'll have to learn sooner or later too.
Simply, we are going to use a search engine with two pieces of information: 1) the most relevant part of the error message, and 2) the context in which we're encountering it. In this case I searched for:
vcfr zlib.h: No such file or directory #include <zlib.h>
- We find a similar question asked by someone on Stack Overflow, and a couple of answers. This answer seems to be the key to solving it.
- According to the discussion, it appears that we're missing a dependency: a `C` library called `zlib`. The answer details how we can install this dependency and fix our installation. Run:

  ```
  # Note: Run in container
  sudo apt-get update
  sudo apt-get install libz-dev
  ```
- Now that that's installed, we can return to RStudio and run the `install.packages` command again to complete the install and load the library:

  ```
  install.packages("vcfR")
  library("vcfR")
  ```
- Now we are ready to follow the vignette for `vcfR`
Docker is a great tool that can be used to tackle one of the long-standing problems of bioinformatics analyses: reproducibility. For an example of why, see the recent story about Python in the news (Neupane et al. 2019).
Any feedback would be very welcome! Three short questions, no required answers, completely anonymous!
Build upon and acknowledge the work of others, and don't repeat yourself when you don't have to.
- A huge thank you to Amber Wright for the use of her excellent Command-line Tutorials
- Thanks to the authors of this tutorial by Melbourne Bioinformatics. I have based this workshop around their excellent content. I encourage you to check it out if you would like an approach using the GUI bioinformatics tool, Galaxy
We will introduce four frequently used bioinformatic file formats in this tutorial:
| Filetype | Contains | Description |
|---|---|---|
| FASTA | Sequences | A collection of DNA sequences. We store reference genomes in this format. |
| FASTQ | Reads | A collection of signals ("reads") that represent DNA sequences, generated from a high-throughput sequencer. Contains quality scores for each read. |
| SAM | Alignments | A collection of reads that have been aligned to a reference genome. |
| VCF | Mutations | A list of variations (i.e. mutations) found in an alignment. |
From the descriptions, you might guess that these build on each other. We will start with FASTA and FASTQ files, and generate SAM alignments from them. From the SAM files, we will generate a VCF that will contain processed information about the mutations we have found in our sample.
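As a concrete example of the first link in that chain, a FASTQ file stores each read as exactly four lines: an `@` identifier line, the bases, a `+` separator, and one quality character per base. A minimal sketch (the read itself is invented for illustration):

```shell
# One FASTQ record = four lines: '@' identifier, bases, '+' separator,
# and a quality string with one character per base. This read is made up.
printf '@read1\nGATTACA\n+\nFFFFFFF\n' > example.fastq
cat example.fastq
wc -l < example.fastq   # 4 lines = 1 read
```

Dividing a FASTQ file's line count by four is a quick way to count how many reads it contains.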