Multi-service Docker Bioinformatics Pipeline for Assembly and Genomic Annotation

Project `biopipeline-novnc` developed by extreemedev & adriIT


Attention! The repository accetto/xubuntu-vnc-novnc is retired and archived. It will not be developed any further and the related images on Docker Hub will not be rebuilt any more. They will phase out and they will be deleted after becoming too old.

Overview

The goal of this project is to provide a user-friendly environment in which anyone can easily run the pipeline. It is built around a Docker Compose file that manages all of the software tools and microservices. Every 5 seconds, the host system checks whether the working directory contains the files needed to begin the pipeline and, if so, starts it automatically through a Python script. The main program then creates folders in the working directory, each named after a tool, to hold the output files that tool generates.
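
The polling behaviour described above can be sketched roughly as follows. This is only an illustration, not the project's actual script: the function names, the `*.fastq` patterns and the `max_checks` limit are assumptions made for the example; only the 5-second interval comes from the description above.

```python
import time
from pathlib import Path


def find_inputs(workdir, patterns=("*.fastq", "*.fastq.gz")):
    """Return the sorted files in workdir that match any pattern.

    The patterns are an assumption; the real pipeline may look for other files.
    """
    root = Path(workdir)
    found = set()
    for pattern in patterns:
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)


def wait_for_inputs(workdir, interval=5.0, max_checks=None):
    """Poll workdir every `interval` seconds until input files appear.

    Returns the files found, or an empty list if `max_checks` polls pass
    without finding anything (max_checks=None keeps polling forever).
    """
    checks = 0
    while True:
        files = find_inputs(workdir)
        if files:
            return files
        checks += 1
        if max_checks is not None and checks >= max_checks:
            return []
        time.sleep(interval)
```

A caller would typically run `wait_for_inputs("scripts/")` and hand the returned files to the first pipeline step.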

Dockerfile image `xubuntu-novnc-biotech:latest`

The image used by the monitor service is built from our Dockerfile and is based on the published accetto/xubuntu-vnc-novnc image. This custom image adds several JRE and Python installations, plus some custom environment settings intended to make the noVNC system easier to use. It also includes an interactive FastQC installation. For debugging purposes only, these installations can be found under the path /otp/bioprograms, but it is highly recommended not to work inside this directory.

The base images already include the commonly used utilities ping, wget, zip, unzip, sudo, curl and git, as well as the current version of the jq JSON processor. Because sudo is supported, the user can easily add further components and applications.

Docker compose structure

This project repository contains resources for building various Docker images based on Ubuntu with the Xfce desktop environment and VNC/noVNC servers for headless use. The resources for the individual images and their variations are stored in the subfolders of the GitHub repository, and the image features are described in the individual README files. Additional descriptions can be found in the common project Wiki. All images are part of a growing image hierarchy.

The docker-compose.yml file defines multiple services, one per bioinformatics tool, each running in its own container. Some of these services use bind mounts, which give them access to files and scripts on the host system. The containers are set to restart automatically, and they all belong to the same network, called bionet. Every container is bound to the same working directory, generally scripts/, where each process writes its expected outputs. The main container, the monitor service, runs a desktop environment and exposes ports 25901 and 26901 for remote access. It runs the previously built Dockerfile image xubuntu-novnc-biotech:latest and provides a graphical user interface in which the user can work easily. Inside this container the user has root privileges, so that file operations are possible.
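
One way to confirm that the compose file defines the services you expect is to compare them with the output of `docker compose config --services`, which prints one service name per line. The sketch below assumes the service names match the container names listed in this README; the function names are hypothetical and the real compose file may differ.

```python
import subprocess

# Service names as listed in this README; the actual compose file may differ.
EXPECTED_SERVICES = {
    "monitor", "biopython", "busco", "cdhit", "corset", "fastqc",
    "hisat", "spades", "transdecoder", "trimmomatic",
}


def parse_services(output):
    """Turn one-name-per-line text into a set of service names."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def missing_services(output, expected=EXPECTED_SERVICES):
    """Return the expected services that are absent from the output, sorted."""
    return sorted(set(expected) - parse_services(output))


def list_compose_services(compose_dir="."):
    """Run `docker compose config --services` in compose_dir (requires Docker)."""
    result = subprocess.run(
        ["docker", "compose", "config", "--services"],
        cwd=compose_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With Docker installed, `missing_services(list_compose_services("path/to/compose/dir"))` returns an empty list when every expected service is defined.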

Docker containers and images

Here is a list of the container names used in docker-compose.yml, each associated with a Docker image retrievable from https://hub.docker.com. The list is in alphabetical order, except for monitor (the main service), which comes first:

monitor: This provides a fully functional Xubuntu desktop environment accessible through a web browser via the noVNC client. The container includes a web-based VNC viewer and a lightweight window manager, as well as various tools and applications commonly used in a Linux environment. The container is designed to be easily customizable and supports several configuration options, such as enabling clipboard sharing and mounting external volumes. By running the container, users can access a virtual Linux desktop environment from anywhere with a web browser, without the need for a local VNC client or additional software installations.
biopython: This container is used to run fetchClusterSeqs.py, a Python script that retrieves DNA sequences from a FASTA file based on a list of cluster IDs. The script takes as input a FASTA file and a text file containing the cluster IDs, and outputs a new FASTA file containing only the sequences whose IDs match the cluster IDs in the input file. The script also allows sequences to be filtered by length and can output them in reverse-complement form if desired. This service is provided by Adam Taranto.
busco: BUSCO is a bioinformatics tool used to evaluate the completeness and quality of a genome assembly, transcriptome assembly or gene set. It searches the sequences for a set of near-universal single-copy orthologs and reports which of them are complete, duplicated, fragmented or missing.
cdhit: CD-HIT-EST is the program of the CD-HIT suite used for clustering and comparing nucleotide sequences (CD-HIT itself handles protein sequences). It can be used to reduce the complexity of a large sequence dataset by clustering sequences that are highly similar, thereby speeding up subsequent analyses.
corset: Corset is a bioinformatics tool used for clustering the contigs of de novo transcriptome assemblies from RNA-Seq data. It groups transcripts that represent putative genes, based on shared read alignments and expression patterns, and produces gene-level read counts.
fastqc: FastQC is a bioinformatics tool used for quality control of high-throughput sequencing data. It provides a graphical interface for the visualization and assessment of sequence quality, adapter contamination, GC content, and other metrics.
hisat: HISAT2 is a fast, splice-aware aligner that maps sequencing reads, RNA-Seq reads in particular, to a reference genome. Its alignments are the basis for downstream steps such as identifying the transcripts expressed in a specific experimental condition, quantifying gene expression and discovering novel splicing variants. It can efficiently report multiple alignments for a read.
spades: SPAdes is a de novo assembler. It was originally designed for genome assembly, and its rnaSPAdes mode assembles transcriptomes from RNA sequencing data without a reference genome. It can handle various types of RNA sequencing data, including stranded, non-stranded, paired-end and single-end reads, and aims to reconstruct complex transcript structures, including alternatively spliced isoforms.
transdecoder: TransDecoder is a software package used for the identification of coding regions within de novo transcriptome assemblies, such as those generated from RNA-Seq data. It can predict the likely coding regions of the transcripts, including those for full-length proteins, as well as the locations of start and stop codons. TransDecoder can also identify potential coding regions from genomic sequences that lack annotation, using evidence from expressed sequences.
trimmomatic: Trimmomatic is a software tool used for the quality control and preprocessing of high-throughput sequencing data, particularly for Illumina data. It can perform various trimming tasks, such as removing low quality bases, adapter sequences, and contaminant sequences, and can also trim reads based on length and quality. Trimmomatic can improve the accuracy of downstream analyses and reduce errors caused by sequencing artifacts and low-quality reads.
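
The ID-based filtering described for the biopython service can be illustrated with a short standard-library sketch. This is not fetchClusterSeqs.py itself: the function names are hypothetical, and the rule that a record's ID is the first whitespace-delimited token of its header is an assumption made for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_ids(lines, wanted_ids, min_length=0):
    """Keep records whose ID is in wanted_ids and whose length >= min_length."""
    wanted = set(wanted_ids)
    kept = []
    for header, seq in read_fasta(lines):
        parts = header.split()
        seq_id = parts[0] if parts else ""  # assumed ID convention
        if seq_id in wanted and len(seq) >= min_length:
            kept.append((header, seq))
    return kept
```

With real data, an open file handle can be passed as `lines`, and the IDs can be read from the cluster ID text file, one per line.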

Working directory

The working directory is shared between the host system, where it is located at scripts/, and the containers; in the monitor filesystem it is visible at /home/headless/Desktop/Biotech. The user can simply copy any file to be processed into the host folder. Once the service is enabled, this directory is checked for the required files and, when they are present, the pipeline starts.

Python Package Utils

Can be found here: /utils/

Contains utilities that make building the images more convenient and help the user perform a clean installation and uninstallation, plus various settings:

utils/pipeManager/

Includes every file needed for the first installation and the initial setup. It is strongly recommended not to touch or modify any of these files.
utils/pipePackage/

Includes every file or extension needed for the pipeline to work properly. It is strongly recommended not to touch or modify any of these files.
utils/util-hdx.sh

Displays the file head and executes the chosen line, removing the first occurrence of '#' and trimming the line from the left first. Providing the line number as an argument skips the interaction and executes the given line directly. The comment lines at the top of the included Dockerfiles are intended for this utility. The utility displays the help if started with the -h or --help argument. It was developed with the utilities utility-argbash-init.sh and utility-argbash.sh, contained in the accetto/argbash-docker GitHub repository, from which the accetto/argbash-docker Docker image is built.
utils/util-refresh-readme.sh

This script can be used for updating the version sticker badges in README files. It is intended for local use before publishing the repository. The script does not include any help, because it takes only a single argument - the path where to start searching for files (default is ../docker).

Installation

Attention! To install this full pipeline service, you'll need to be root or a sudoer user.

To install the entire service, move into utils/pipeManager/ and run the Python installation script with the following command:

python3 pipeInstall.py

All needed dependencies will be installed and the following directory tree will be created on your operating system (starting from the root):

/opt/
  └─ pipeline/
       ├─ bin/
       ├─ etc/
       ├─ lib/
       ├─ log/
       ├─ opt/
       └─ var/
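
After installation, the layout above can be checked with a few lines of Python. This is a hedged sketch rather than part of the installer: the function name is hypothetical, and the base path and subdirectory names are simply taken from the tree shown above.

```python
from pathlib import Path

# Subdirectories shown in the tree above.
EXPECTED_SUBDIRS = ("bin", "etc", "lib", "log", "opt", "var")


def missing_dirs(base="/opt/pipeline", subdirs=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that do not exist under base."""
    root = Path(base)
    return [name for name in subdirs if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```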

Issues

If you have found a problem or you just have a question, please check the Issues and the Wiki first. Please do not overlook the closed issues.

If you do not find a solution, you can file a new issue. The better you describe the problem, the bigger the chance it'll be solved soon.

noVNC Web Access

Watch out! To access this web page, you first have to host the service by following the next step:

Please move into the directory /docker/xubuntu-novnc-biotech/ and run this command in the terminal:

docker compose up --build -d

Alternatively, in VS Code only, you can right-click docker-compose.yml and then click Compose Up, thanks to the Visual Studio Code Docker extension.

After running compose up on the docker-compose.yml file, you can open the web page at http://localhost:26901/vnc.html?password=headless and connect, locally or remotely, to the service running on the host machine.
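
If the page does not load, it can help to check whether the noVNC port is reachable at all. The following sketch builds the URL given above and tests the TCP port; the function names are hypothetical, and the default host, port and password are the ones quoted in this README.

```python
import socket
from urllib.parse import urlencode, urlunsplit


def novnc_url(host="localhost", port=26901, password="headless"):
    """Build the noVNC URL given in this README for a host and port."""
    query = urlencode({"password": password})
    return urlunsplit(("http", f"{host}:{port}", "/vnc.html", query, ""))


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = novnc_url()
    state = "reachable" if port_is_open("localhost", 26901) else "not reachable"
    print(f"{url} is {state}")
```

For remote access, replace `localhost` with the address of the host machine.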

Reminder: once you have executed the previous command, Docker Compose will automatically restart any container that runs into problems. Moreover, the compose services will start again at every system boot or restart. To avoid this, you can simply run this command in the terminal:

docker compose down

Running the service and Usage

Watch out! Before running and using the service you'll need to perform the previous step.

Systemd is composed of a set of daemons, libraries, and tools that allow the administration and configuration of the system and interact with the GNU/Linux kernel. You are now ready to run the service and immediately execute the pipeline. Every time you need to execute the pipeline, open a terminal and type the following command:

If your Linux host system supports systemctl:

sudo systemctl start pipeline.service

Otherwise, if systemctl is not available, use this:

sudo service pipeline start
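
The choice between the two commands can also be scripted. The sketch below is an assumption-laden illustration, not part of the project: the function names are hypothetical, it uses the standard `systemctl start <unit>` form, and it treats the presence of a `systemctl` binary on the PATH as a rough sign that systemd is in use, which is a heuristic rather than a guarantee.

```python
import shutil
import subprocess


def start_command(has_systemctl, service="pipeline"):
    """Return the command that starts the service on this kind of host."""
    if has_systemctl:
        return ["sudo", "systemctl", "start", f"{service}.service"]
    return ["sudo", "service", service, "start"]


def start_pipeline(service="pipeline"):
    """Detect systemctl on the PATH and run the matching start command."""
    cmd = start_command(shutil.which("systemctl") is not None, service)
    return subprocess.run(cmd, check=True)
```

Calling `start_pipeline()` then behaves like typing whichever of the two commands above fits the host.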

Uninstallation

Attention! To uninstall this full pipeline service, you'll need to be root or a sudoer user. Be aware that uninstalling this service will also delete your cloned repository.

If you need to remove this service, or you are having trouble with filesystem conflicts or anything else, please use our one-step uninstall script. Move into utils/pipeManager/ and use the following command:

python3 pipeUninstall.py

The service will be removed from /etc/init.d, all files will be deleted and the directory tree will be purged. If you would like to reinstall it, you'll have to clone this repository again and repeat the installation process.

Credits

Credit goes to all the people who contributed and provided this large collection of Docker images and resources, and particularly to:

Professor Tiziana Castrignanò
PhD Bachelor Doctor Pietro Libro
Adam Taranto

Optimization ideas

Optimize logger stdout lines
Pipeline loading script
Resolve compose down uninstall issue