Tutorial for The London Bioinformatics Frontiers Hackathon 2019.
In this tutorial you will learn:
- Nextflow - how to build parallelisable & scalable computational pipelines
- Docker - how to build & run containers to bundle dependencies
- FlowCraft - how to build & use modular, extensible and flexible components for Nextflow pipelines
- Deploit - how to scale your analyses over the cloud
The following are required for the hackathon:
- Java 8 or later
- Docker engine 1.10.x (or higher)
- Git
- Python3
If you have them installed, that's great! Don't worry if not; we will help you install them & other software throughout the tutorial.
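A quick way to check what is already installed is to ask each tool for its version:

```bash
# Print the installed version of each prerequisite (all standard CLI flags)
java -version
docker -v
git --version
python3 --version
```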
What is Nextflow? Why use it? See the About Nextflow slides.
Main outcome: During the first session you will build a FastQC & MultiQC pipeline to learn the basics of Nextflow including:
- Parameters
- Processes (inputs, outputs & scripts)
- Channels
- Operators
- Configuration
You will need to have Java 8 or later installed for Nextflow to work. You can check your version of Java by entering the following command:
```bash
java -version
```
To install Nextflow open a terminal & enter the following command:
```bash
curl -fsSL get.nextflow.io | bash
```
This will create a `nextflow` executable file in your current directory. To complete the installation so that you can run Nextflow from anywhere, you may want to move it to a directory in your `$PATH`, e.g.:

```bash
mv nextflow /usr/local/bin
```
You can then test your installation of Nextflow with:
```bash
nextflow run hello
```
To check if you have Docker installed you can type:
```bash
docker -v
```
If you need to install Docker, you can do so by following the instructions here. Be sure to select your OS:
Now that we have Nextflow & Docker installed, we're ready to run our first script.
- Create a file `main.nf` & open it in your favourite code/text editor, e.g. VSCode or vim
- In this file write the following:

```nextflow
// main.nf
params.reads = false

println "My reads: ${params.reads}"
```
The first line initialises a new variable (`params.reads`) & sets it to `false`. The second line prints the value of this variable when the pipeline is executed.
We can now run this script & set the value of `params.reads` to one of our FASTQ files in the testdata folder with the following command:

```bash
nextflow run main.nf --reads testdata/test.20k_reads_1.fastq.gz
```
This should print the value you passed on the command line.

Here we learnt how to define parameters & pass command-line arguments to them in Nextflow.
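As a minimal sketch of how this scales, parameters can also carry non-false defaults & any of them can be overridden from the command line (the `params.outdir` parameter & file name below are hypothetical, purely for illustration):

```nextflow
// params_example.nf (hypothetical file name)
params.reads  = false       // expected to be supplied with --reads
params.outdir = "results"   // hypothetical parameter with a default value

println "My reads: ${params.reads}"
println "Outdir:   ${params.outdir}"
```

Running `nextflow run params_example.nf --reads foo.fastq.gz` would print the overridden reads value & the default `results` for outdir.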
Nextflow allows the execution of any command or user script by using a `process` definition. A process is defined by providing three main declarations: the process inputs, the process outputs and finally the command script.
In our main script we want to add the following:
```nextflow
// main.nf
reads = file(params.reads)

process fastqc {
    publishDir "results", mode: 'copy'

    input:
    file(reads) from reads

    output:
    file "*_fastqc.{zip,html}" into fastqc_results

    script:
    """
    fastqc $reads
    """
}
```
Here we created the variable `reads` which is a `file` from the command line input.

We can then create the process `fastqc` including:
- the directive `publishDir` to specify which folder to copy the output files to
- the input, where we declare a `file` `reads` from our variable `reads`
- the output, which is anything ending in `_fastqc.zip` or `_fastqc.html` and will go into a `fastqc_results` channel
- the script, where we run the `fastqc` command on our `reads` variable
We can then run our script with the following command:

```bash
nextflow run main.nf --reads testdata/test.20k_reads_1.fastq.gz -with-docker flowcraft/fastqc:0.11.7-1
```
By running Nextflow with the `-with-docker` flag we can specify a Docker container to execute this command in. This is beneficial because it means we do not need to have `fastqc` installed locally on our laptop. We just need to specify a Docker container that has `fastqc` installed.
Channels are the preferred method of transferring data in Nextflow & can connect two processes or operators.
Here we will use the method `fromFilePairs` to create a channel to load paired-end FASTQ data, rather than just a single FASTQ file.
To do this we will replace the code from 1c with the following:
```nextflow
// main.nf
reads = Channel.fromFilePairs(params.reads, size: 2)

process fastqc {
    tag "$name"
    publishDir "results", mode: 'copy'
    container 'flowcraft/fastqc:0.11.7-1'

    input:
    set val(name), file(reads) from reads

    output:
    file "*_fastqc.{zip,html}" into fastqc_results

    script:
    """
    fastqc $reads
    """
}
```
The `reads` variable is now equal to a channel which contains the reads prefix & paired-end FASTQ data. Therefore, the input declaration has also changed to reflect this by declaring the value `name`. This `name` can be used as a tag when the pipeline is run. Also, as we are now declaring two inputs, the `set` keyword has to be used. Finally, we can specify the container name within the process as a directive.
To run the pipeline:
nextflow run main.nf --reads "testdata/test.20k_reads_{1,2}.fastq.gz" -with-docker flowcraft/fastqc:0.11.7-1
Here we learnt how to use the `fromFilePairs` method to generate a channel for our input data.
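If you want to see exactly what `fromFilePairs` emits, the `view` operator prints every item in a channel. A small sketch using this tutorial's test data:

```nextflow
// inspect_pairs.nf (hypothetical file name)
Channel
    .fromFilePairs("testdata/*_{1,2}.fastq.gz", size: 2)
    .view()

// Each item is a tuple of the common prefix & the two files, e.g.:
// [test.20k_reads, [/path/to/test.20k_reads_1.fastq.gz, /path/to/test.20k_reads_2.fastq.gz]]
```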
Operators are methods that allow you to manipulate & connect channels.
Here we will add a new process `multiqc` & use the `.collect()` operator. Add the following process after `fastqc`:
```nextflow
// main.nf
process multiqc {
    publishDir "results", mode: 'copy'
    container 'ewels/multiqc:v1.7'

    input:
    file (fastqc:'fastqc/*') from fastqc_results.collect()

    output:
    file "*multiqc_report.html" into multiqc_report
    file "*_data"

    script:
    """
    multiqc . -m fastqc
    """
}
```
Here we have added another process, `multiqc`. We have used the `collect` operator so that if `fastqc` ran for more than two pairs of files, `multiqc` would collect all of the files & run only once.
The pipeline can be run with the following:
```bash
nextflow run main.nf --reads "testdata/test.20k_reads_{1,2}.fastq.gz" -with-docker flowcraft/fastqc:0.11.7-1
```
Here we learnt how to use operators such as `collect` & connect processes via channels.
Configuration, such as parameters, containers & resources, e.g. memory, can be set in config files such as `nextflow.config`. For example, our `nextflow.config` file might look like this:
```groovy
docker.enabled = true
params.reads = false

process {
    cpus = 2
    memory = 2.GB

    withName: fastqc {
        container = "flowcraft/fastqc:0.11.7-1"
    }
    withName: multiqc {
        container = "ewels/multiqc:v1.7"
    }
}
```
Here we have enabled Docker by default, initialised parameters, and set resources & containers. It is best practice to keep these in the config file so that they can more easily be set or removed. The `container` directives & `params.reads` can then be removed from `main.nf`.
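With the config above saved as `nextflow.config` next to `main.nf`, Nextflow picks it up automatically, so the same pipeline can now be run without the `-with-docker` flag:

```bash
nextflow run main.nf --reads "testdata/test.20k_reads_{1,2}.fastq.gz"
```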
Here we learnt how to use configuration files to set parameters, resources & containers
What is Docker? Why use it? See the About Docker slides.
Main outcome: During this session, you will learn how to build & run your own Docker container to bundle dependencies for FastQC & MultiQC
Running a container is as easy as using the following command:
```bash
docker run <container-name>
```
For example:
```bash
docker run hello-world
```
Launching a BASH shell in the container allows you to operate in an interactive mode in the containerised operating system. For example:
```bash
docker run -it flowcraft/fastqc:0.11.7-1 bash
```
Once the container is launched you will notice that it's running as root (!). Use the usual commands to navigate the file system. To exit the container, stop the BASH session with the `exit` command.
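You can also run a single command in a container without an interactive shell. A sketch, assuming you are in the tutorial directory: mounting the current directory with `-v` & setting it as the working directory with `-w` lets the containerised tool see your local files:

```bash
docker run -v $PWD:$PWD -w $PWD flowcraft/fastqc:0.11.7-1 \
    fastqc testdata/test.20k_reads_1.fastq.gz
```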
Docker images are created by using a so-called `Dockerfile`, i.e. a simple text file containing a list of commands to be executed to assemble and configure the image with the software packages required.
In this step, you will create a Docker image containing the FastQC & MultiQC tools.
Warning: the Docker build process automatically copies all files that are located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big/many files exist. For this reason, it's important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively, you can use a `.dockerignore` file to select the paths to exclude from the build.
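For example, a hypothetical `.dockerignore` for this tutorial's directory might exclude the bulky test data & results folders:

```
# .dockerignore (illustrative): paths excluded from the build context
testdata/
results/
*.fastq.gz
```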
Then use your favourite editor, e.g. vim, to create a file named `Dockerfile` and copy the following content:
```docker
FROM nfcore/base
LABEL authors="phil@lifebit.ai" \
      description="Docker image containing fastqc & multiqc for LBF hackathon tutorial"

RUN conda install -c bioconda fastqc=0.11.8 && \
    conda install -c bioconda multiqc=1.7
```
When done, save the file.
Build the Docker image by using the following command:
```bash
docker build -t my-image .
```
Note: don't miss the dot in the above command. When it completes, verify that the image has been created by listing all available images:

```bash
docker images
```
With the `Dockerfile` from above you might want to run:

```bash
docker build -t lifebitai/lbf-hack .
```
And then you can enter the container to check everything is working:

```bash
docker run -it lifebitai/lbf-hack:latest bash
```
This container can be used in our Nextflow pipeline to replace the two different containers we currently have, because it has both `fastqc` & `multiqc` installed.
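For example, assuming the image was tagged `lifebitai/lbf-hack` as above, the two `withName` blocks in `nextflow.config` could be collapsed into a single default container (a sketch, not the only way to organise this):

```groovy
// nextflow.config
docker.enabled = true
params.reads = false

process {
    cpus = 2
    memory = 2.GB
    container = "lifebitai/lbf-hack:latest"   // one image with both fastqc & multiqc
}
```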
Publish your container in Docker Hub to share it with other people.
Create an account on the https://hub.docker.com website. Then from your shell terminal run the following command, entering the user name and password you specified when registering on the Hub:

```bash
docker login
```
Tag the image with your Docker username:

```bash
docker tag my-image <user-name>/my-image
```
Finally, push it to the Docker Hub:
```bash
docker push <user-name>/my-image
```
After that anyone will be able to download it by using the command:
```bash
docker pull <user-name>/my-image
```
Note how after a pull and push operation, Docker prints the container digest number, e.g.:

```
Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest
```
This is a unique and immutable identifier that can be used to reference a container image unambiguously. For example:

```bash
docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
```
What is FlowCraft? Why use it? See the FlowCraft slides.
Main outcome: During this session, you will learn how to build your own FastQC FlowCraft component & a GATK pipeline.
FlowCraft is available to install via both Conda & Pip. However, as we are going to build components, we want to install the development version. This can be done with the following commands:
```bash
git clone https://github.com/assemblerflow/flowcraft.git
cd flowcraft
python3 setup.py install
```
FlowCraft allows you to build pipelines from components. In order to create a new Component, two files are required. These are the template & the class.
Inside the `flowcraft` directory, create & open a new file `flowcraft/generator/templates/fastqc2.nf` in your favourite code editor:
```nextflow
process fastqc2_{{ pid }} {
    {% include "post.txt" ignore missing %}

    tag { sample_id }
    publishDir "results/fastqc2_{{ pid }}", mode: 'copy'

    input:
    set sample_id, file(fastq_pair) from {{ input_channel }}

    output:
    file "*_fastqc.{zip,html}" into {{ output_channel }}
    {% with task_name="fastqc2" %}
    {%- include "compiler_channels.txt" ignore missing -%}
    {% endwith %}

    script:
    """
    fastqc $fastq_pair
    """
}

{{ forks }}
```
This is standard Nextflow code which is used as a template. Any code in the double curly brackets `{{ }}` is FlowCraft code which will be replaced when building pipelines.
Inside the `flowcraft` directory, open the file `flowcraft/generator/components/reads_quality_control.py` in your favourite code editor & add the following:
```python
class Fastqc2(Process):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.input_type = "fastq"
        self.output_type = "fastq"

        self.directives = {"fastqc2": {
            "cpus": 2,
            "memory": "'4GB'",
            "container": "flowcraft/fastqc",
            "version": "0.11.7-1"
        }}

        self.status_channels = [
            "fastqc2"
        ]
```
Here we set the following:
- the inputs & outputs, which allow processes to be connected
- the parameters required by the process (none in this case)
- the directives for the process, including the Docker container we want to use. Here the `version` is the `tag` of the Docker container
- the status channels for the process to log its status
Now if we add the directory containing `flowcraft.py` to our path, we can build a pipeline from any directory, e.g.:

```bash
export PATH=$PATH:/path/to/flowcraft/flowcraft
```
Now we can test the component we have built with the command:

```bash
flowcraft.py build -t "fastqc2" -o fastqc.nf
```

This will create a Nextflow script `fastqc.nf`.
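You can then run the generated script like any other Nextflow pipeline. Pipelines built by FlowCraft from a `fastq` input type should expose a `--fastq` parameter (the glob below assumes this tutorial's test data):

```bash
nextflow run fastqc.nf --fastq "testdata/*_{1,2}.fastq.gz"
```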
More complex pipelines, such as a GATK pipeline, can be built with a single command:

```bash
flowcraft.py build -t "bwa mark_duplicates haplotypecaller" -o main.nf --merge-params
```

Here the `--merge-params` flag is used to merge all parameters with the same name into a single parameter.
Main outcome: During this session, you will learn how to scale the GATK pipeline you built in the previous session to run an analysis on the Cloud using the Deploit platform.
Deploit is a bioinformatics platform, developed by Lifebit, where you can run your analysis over the Cloud/AWS.
First, create an account/log in here. You will get $10 of free credits. If you prefer, you can connect & use your own AWS account/credentials.
We are able to import the GATK pipeline we created with FlowCraft from the previous section (Session 3) on Deploit. This will enable us to scale our analyses. All we need to import a pipeline is the URL from GitHub. For simplicity, we have already created a GitHub repository for the pipeline here: https://github.com/lifebit-ai/gatk-flowcraft
To import the pipeline we must first navigate to the pipelines page. This can be found in the navigation bar on the left-hand-side:
To then import the pipeline you need to:
- Click the green `New` button
- Select the GitHub icon to import the Nextflow pipeline from GitHub
- Paste the URL of our pipeline: https://github.com/lifebit-ai/gatk-flowcraft
- Name the pipeline, e.g. `gatk-flowcraft`
- (Optional) enter a pipeline description
- Click `Next` & `Create pipeline` 🎉
Pipelines can be run in three simple steps:
- Select the pipeline
- Select data & parameters
- Run the analysis
Once the pipeline is imported it will automatically be selected.
Alternatively, you can navigate to the pipelines page, where you can find the imported pipeline under `MY PIPELINES & TOOLS`. To select the pipeline, click its card.
The pipeline requires three parameters to be set. These are:
- `fastq` - paired-end reads to be analysed, in `fastq.gz` format
- `reference` - name of the reference genome `fasta`, `fai` & `dict` files
- `intervals` - an `interval_list` file to specify the regions to call variants in
To select the data & parameters you must:
- Click the green plus to add lines for the two additional parameters
- Specify the parameter names `fastq`, `reference` & `intervals`
- Import the test data. This has already been added to the AWS S3 bucket `s3://lifebit-featured-datasets/hackathon/gatk-flowcraft` (although you can also upload files from your local machine via the web interface)
- Once the test data has been imported, specify the values for each parameter:
  - For `fastq`, use the blue plus button to `Choose` the imported folder, click `+Regex` & type `*{1,2}.fastq.gz`
  - For `reference`, you can also use a string to specify the location. Set the reference to `s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/human_g1k_v37_decoy`
  - For `intervals`, click the blue plus again & select the `GRCh37WholeGenome.interval_list` file within the imported folder
- Finally, click `Next`
See below for all of the steps:
Select a project & instance:

Before running the job you must:
- Select the project (which is like a folder used to group multiple analyses/jobs). You can select the already created `Demo` project
- Choose the instance to set the compute resources such as CPUs & memory. Here you can select `Dedicated Instances` > 16 CPUs > `c4.4xlarge`
- Finally, click `Run job`
To monitor jobs you can click on the row for any given job. Immediately after running a job, its status will be 'initialising'. This is when AWS is launching the instance, which normally takes ~5 mins before you are able to view the progress of the job.
Once on the job monitor page, you can see the progress of the job update in real time. Information such as the resources used, i.e. memory & CPUs, is displayed. Once the job has finished, the results can be found in the results tab, as well as reports for select pipelines.
You can view a successfully completed example job here:
Well done you survived! You’ve made it to the end of the hackathon tutorial. You’ve learned about the magic of Nextflow, Docker, Flowcraft & Deploit. You can now go out & analyse all the things.
Hope you enjoyed the conference & let us know if you have any feedback or questions.
Credit to Lifebit & The Francis Crick Institute for organising & hosting the event
Many thanks to everyone who helped out along the way, including (but not limited to): @ODiogoSilva, @cgpu, @clairealix, @cimendes & @pprieto
Thanks to everyone involved in the nf-hack17-tutorial which was heavily used as inspiration for this tutorial