INSaFLU

INSaFLU (“INSide the FLU”) is a free, web-based, influenza-oriented bioinformatics platform for effective and timely influenza laboratory surveillance based on whole-genome sequencing.

INSaFLU is freely available at https://insaflu.insa.pt. Documentation (latest) for each INSaFLU module is provided at http://insaflu.readthedocs.io/

Synopsis

INSaFLU (“INSide the FLU”) is a free, web-based bioinformatics suite that takes primary NGS data (reads) and automatically generates the outputs that are the core first-line “genetic requests” for effective and timely influenza laboratory surveillance (e.g., type and subtype, gene and whole-genome consensus sequences, variant annotation, alignments and phylogenetic trees).

Main features

Highlights / Main advantages

* open to all, free of charge, with user-restricted accounts
* applicable to NGS data collected from any amplicon-based schema
* allows advanced, multi-step, software-intensive analyses in a user-friendly manner, without previous advanced training in bioinformatics
* allows integrating data cumulatively, thus fitting the analytical dynamics underlying continuous epidemiological surveillance during flu epidemics
* outputs are provided in nomenclature-stable and standardized formats and can be explored in situ or through multiple compatible downstream applications for data analysis and visualization

Main outputs INSaFLU yields:

* influenza type and subtype/lineage
* gene and whole-genome consensus sequences
* annotation of variants and intra-host minor variants
* gene, protein and genome alignments
* gene- and genome-scale phylogenetic trees

Other features: INSaFLU also automatically provides:

* raw NGS data quality analysis and improvement
* a rapid snapshot of the whole-genome backbone of each virus (draft assembled contigs are assigned to each viral segment and to closely related reference influenza viruses)
* coverage statistics
* detection of putative mixed infections

How to cite

If you use INSaFLU in your work, please cite Borges V, Pinheiro M et al. Genome Medicine (2018) 10:46, https://doi.org/10.1186/s13073-018-0555-0

Bioinformatics pipeline

Authors

Miguel Pinheiro, Vitor Borges

Installation

This installation guide targets Ubuntu Server 16.04 and CentOS 7.X. There are several steps and packages to install, so please be patient. First install and configure all the bioinformatics software, then the database, the batch-queuing system and, finally, the web site.

The user "flu_user" is used in all operations and is also the user that runs the Apache web server.

General packages

### Some general packages to install in Ubuntu 16.X:

$ sudo apt install binutils libproj-dev gdal-bin
$ sudo apt install postgis*
$ sudo apt install bioperl
$ sudo apt install python3
$ sudo apt install libdatetime-perl libxml-simple-perl libdigest-md5-perl git default-jre bioperl

### Some general packages to install in CentOS 7.X:

$ sudo yum install gdal gdal-devel 
$ sudo yum install postgis
$ sudo yum install python3
$ sudo yum install perl-Time-Piece perl-XML-Simple perl-Digest-MD5 git java perl-CPAN perl-Module-Build
$ sudo cpan -i Bio::Perl

Bioinformatics software

The software can be installed in the directory "/usr/local/software/insaflu". If you choose another directory, edit the file "constants/software_names.py" and set the variable "DIR_SOFTWARE" accordingly.

$ sudo mkdir -p /usr/local/software/insaflu
$ sudo chown flu_user:flu_user /usr/local/software/insaflu

Software to install:

IGVTools 2.3.98
SPAdes 3.11.1
Abricate 0.8-dev
FastQC 0.11.5
Trimmomatic 0.27
Bamtools 2.5
Prokka 1.2
Mauve 2.4.0, Feb 13 2015
Mafft 7.313
seqret (EMBOSS) 6.6.0.0
FastTreeDbl 2.1.10 Double precision
freebayes v1.1.0-54-g49413aa - Some scripts distributed with freebayes are also needed
Snippy 3.2-dev
- samtools 1.3
- bgzip 1.3
- tabix 1.3
- snpEff 4.1l - Important, it's necessary to use this version. Recent versions have a problem when variants involve more than one base.
- freebayes v1.1.0-54-g49413aa

Some scripts to install:

convertAlignment.pl
- this script needs to be installed in <SoftwareNames.DIR_SOFTWARE>/scripts/convertAlignment.pl
Fastq-tools 0.8

⚠️ Important: copy the file bin/snippy-vcf_to_tab to bin/snippy-vcf_to_tab_add_freq and make the following change:

$ cd /usr/local/software/insaflu/snippy/bin
$ cp snippy-vcf_to_tab snippy-vcf_to_tab_add_freq
$ vi snippy-vcf_to_tab_add_freq

and change line 57 from:
print join("\t", qw(CHROM POS TYPE REF ALT EVIDENCE), @ANNO), "\n";
to
print join("\t", qw(CHROM POS TYPE REF ALT FREQ), @ANNO), "\n";
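The same header change can also be applied without opening an editor. The sketch below is only an illustration: `patch_header` is a hypothetical helper name, and it matches the line by its content instead of relying on line 57, on the assumption that the header text appears exactly as shown above.

```shell
# patch_header FILE  (hypothetical helper, not part of snippy)
# Rewrites the header column EVIDENCE to FREQ in the copied script.
patch_header() {
    tmp="$1.tmp.$$"
    # Replace only the exact qw(...) column list shown above.
    sed 's/qw(CHROM POS TYPE REF ALT EVIDENCE)/qw(CHROM POS TYPE REF ALT FREQ)/' "$1" > "$tmp" || return 1
    # Write back through the original file so its executable bit is kept.
    cat "$tmp" > "$1" && rm -f "$tmp"
}

# Usage: patch_header /usr/local/software/insaflu/snippy/bin/snippy-vcf_to_tab_add_freq
```

Check the result with `grep -n FREQ snippy-vcf_to_tab_add_freq` before relying on it.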

⚠️ Important: change the snippy script so that it accepts snpEff version 4.1:

#xpto@brazil:/usr/local/software/insaflu/snippy/bin$ diff snippy snippy~
90c90
< parse_version( 'snpEff -version',     4.1, qr/(\d+\.\d+)/           );
---
> parse_version( 'snpEff -version',     4.3, qr/(\d+\.\d+)/           );

Database PostgreSQL

* postgresql 9.X
	* create a database and a user, then set these names in the ".env" file in the root path of the web site.
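
As a rough sketch of the database step, the helper below prints the two SQL statements involved so they can be reviewed and then piped to `psql`. The helper name `db_setup_sql`, the database name `insaflu` and the password are assumptions made for illustration; only the need for a database and a user comes from this guide.

```shell
# db_setup_sql DB_NAME DB_USER DB_PASSWORD  (hypothetical helper)
# Prints the SQL that creates a role and a database owned by it.
db_setup_sql() {
    printf "CREATE USER %s WITH PASSWORD '%s';\n" "$2" "$3"
    printf "CREATE DATABASE %s OWNER %s;\n" "$1" "$2"
}

# Usage (names and password are placeholders, choose your own):
#   db_setup_sql insaflu flu_user 'change-me' | sudo -u postgres psql
```

Whatever names are chosen here are the ones to copy into the ".env" file.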

Sun Grid Engine/Open Grid Engine

Software:
* gzip
* [Sun Grid Engine/Open Grid Engine](https://arc.liv.ac.uk/downloads/SGE/releases)
	* [download 8.1.9 version](https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9.tar.xz)
	* queues that will be created:
		* all.q - generic queue
		* fast.q - to run quick processes
		* queue_1.q and queue_2.q - to run slow processes

Install SGE/OGE tips

$ sudo mkdir /opt/sge
$ sudo groupadd -g 58 gridware
$ sudo useradd -u 63 -g 58 -d /opt/sge sgeadmin
$ cd ~
$ mkdir sge; cd sge
$ wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9.tar.xz
$ tar -xJvf sge_8.1.9.tar.xz
$ cd sge-8.1.9/source
$ scripts/bootstrap.sh

### centos version
$ sudo yum install hwloc-devel openssl-devel
### ubuntu
$ sudo apt-get install libhwloc-dev libssl-dev

$ ./aimk -no-java -no-jni
$ sudo su
# export SGE_ROOT=/opt/sge
# scripts/distinst -local -allall -noexit
# chown -R sgeadmin:gridware /opt/sge
# cd $SGE_ROOT
# ./install_qmaster
# . /opt/sge/default/common/settings.sh
# ./install_execd

### create a file to set the environment variables to SGE
$ sudo vi /etc/profile.d/sun-grid-engine.sh
## add the following line to the file
 . /opt/sge/default/common/settings.sh

Configure queues with this help.

After the OGE/SGE configuration you need to have these queue names in your system.

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@brazil               BIP   0/0/2          1.19     lx26-amd64    
---------------------------------------------------------------------------------
fast.q@brazil              BIP   0/0/1          1.19     lx26-amd64        
---------------------------------------------------------------------------------
queue_1.q@brazil           BIP   0/0/1          1.19     lx26-amd64    
---------------------------------------------------------------------------------
queue_2.q@brazil           BIP   0/0/1          1.19     lx26-amd64
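
A listing like the one above is what `qstat -f` prints. As a small convenience, and not something INSaFLU ships, the sketch below reads such a listing on standard input and prints any of the four expected queue names that is missing. The function name `missing_queues` is hypothetical.

```shell
# missing_queues  (hypothetical helper)
# Reads `qstat -f` style output on stdin and prints each required queue
# that has no "<queue>@<host>" line.
missing_queues() {
    listing=$(cat)
    for q in all.q fast.q queue_1.q queue_2.q; do
        # index() does a literal match, so the dots in the names are not wildcards.
        if ! printf '%s\n' "$listing" |
            awk -v q="$q@" 'index($0, q) == 1 { found = 1 } END { exit found ? 0 : 1 }'; then
            echo "$q"
        fi
    done
}

# Usage: qstat -f | missing_queues
```

No output means all four queues are present.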

⚠️ brazil is the name of the computer where this installation was done; yours will certainly be different. For SGE to work properly, the computer name needs to be in /etc/hosts with its IP address, not with localhost. Example:

$ cat /etc/hosts
127.0.0.1	localhost
::1     ip6-localhost ip6-loopback

192.168.1.14	brazil

Your IP address will of course be different from '192.168.1.14'.
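
The /etc/hosts requirement can be checked with a few lines of shell. This is a minimal sketch under the assumption that the hosts file uses the usual "address name [aliases]" layout; `host_has_real_ip` is a hypothetical helper name.

```shell
# host_has_real_ip HOSTNAME [HOSTS_FILE]  (hypothetical helper)
# Succeeds only if HOSTNAME is listed on a line whose address is not a loopback one.
host_has_real_ip() {
    awk -v h="$1" '
        /^[[:space:]]*#/ { next }                # skip comment lines
        $1 ~ /^127\./ || $1 == "::1" { next }    # ignore loopback addresses
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/hosts}"
}

# Usage: host_has_real_ip "$(hostname)" || echo "add this host to /etc/hosts with its IP address"
```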

INSaFLU website


$ sudo mkdir -p /usr/local/web_site
$ sudo mkdir -p /var/log/insaflu
$ sudo chown flu_user:flu_user /usr/local/web_site
$ sudo chown flu_user:flu_user /var/log/insaflu
$ cd /usr/local/web_site
$ git clone https://github.com/INSaFLU/INSaFLU.git
$ sudo pip3 install -r requirements.txt
$ cp .env_model .env

Edit the ".env" file and configure all variables. Also define an email backend; a Postfix server was used in this setup.

To create the database

$ python3 manage.py migrate

To create a superuser, which will be the administrator account

$ python3 manage.py createsuperuser

To collect into the "static_all" path all the static files needed to run the web site

$ python3 manage.py collectstatic

Test if all bioinformatics tools are installed

$ cd /usr/local/web_site
$ python3 manage.py test constants.tests_software_names

Test everything

$ cd /usr/local/web_site
$ python3 manage.py test

⚠️ All tests must pass; otherwise, something is not working properly.

If all tests passed, you can immediately check that the web site is working:

$ cd /usr/local/web_site
$ python3 manage.py runserver

Open your web browser and go to the IP address of the computer where the web site is installed, followed by ":8000". If it is the same computer, use "localhost:8000". If it works, proceed to install it behind an Apache web server. If you prefer, an Nginx web server can be used instead.

Apache web server

### Configure Apache (httpd) in CentOS 7.X:

Add the apache user to the flu_user group and add insaflu.conf to the Apache configuration.

$ sudo usermod -a -G flu_user apache
## From IUS repo
$ sudo yum install python3<minor version of your python>u-mod_wsgi
$ sudo vi /etc/httpd/conf.d/insaflu.conf

<VirtualHost *:80>

	# General setup for the virtual host, inherited from global configuration

	ServerName insaflu.pt

        Alias /media /usr/local/web_site/media
        Alias /static /usr/local/web_site/static_all
        <Directory "/usr/local/web_site/static_all">
                Require all granted
        </Directory>
        <Directory "/usr/local/web_site/media">
                Options FollowSymLinks
                AllowOverride None
                Require all granted
        </Directory>

        #### for log files
        <Directory "/var/log/insaflu">
                Require all granted
        </Directory>

        <Directory "/usr/local/web_site/insaflu">
            <Files "wsgi.py">
                Require all granted
            </Files>
        </Directory>
	
	WSGIDaemonProcess flu_user.insa.pt user=flu_user group=flu_user python-path=/usr/local/web_site/insaflu:/usr/lib/python3.<minor version of your python>/site-packages
        WSGIProcessGroup flu_user.insa.pt
        WSGIScriptAlias / /usr/local/web_site/insaflu/wsgi.py

# Use separate log files for the SSL virtual host; note that LogLevel
# is not inherited from httpd.conf.
ErrorLog /var/log/httpd/insaflu_error.log
TransferLog /var/log/httpd/insaflu_transfer.log
LogLevel warn

</VirtualHost> 

## Files in /etc/httpd/conf.d are loaded automatically, so a2ensite is not needed on CentOS
$ sudo systemctl restart httpd
$ sudo systemctl status httpd

### Configure apache2 in Ubuntu 16.X:

Add the Apache user (www-data on Ubuntu) to the flu_user group and add insaflu.conf to apache2.

$ sudo usermod -a -G flu_user www-data
$ sudo apt install libapache2-mod-wsgi-py3
$ sudo vi /etc/apache2/sites-available/insaflu.conf

<VirtualHost *:80>

	# General setup for the virtual host, inherited from global configuration

	ServerName insaflu.pt

        Alias /media /usr/local/web_site/media
        Alias /static /usr/local/web_site/static_all
        <Directory "/usr/local/web_site/static_all">
                Require all granted
        </Directory>
        <Directory "/usr/local/web_site/media">
                Options FollowSymLinks
                AllowOverride None
                Require all granted
        </Directory>

        #### for log files
        <Directory "/var/log/insaflu">
                Require all granted
        </Directory>

        <Directory "/usr/local/web_site/insaflu">
            <Files "wsgi.py">
                Require all granted
            </Files>
        </Directory>
	
	WSGIDaemonProcess flu_user.insa.pt user=flu_user group=flu_user python-path=/usr/local/web_site/insaflu:/usr/lib/python3.<minor version of your python>/site-packages
	WSGIProcessGroup flu_user.insa.pt
	WSGIScriptAlias / /usr/local/web_site/insaflu/wsgi.py

# Use separate log files for the SSL virtual host; note that LogLevel
# is not inherited from httpd.conf.
ErrorLog /var/log/apache2/insaflu_error.log
TransferLog /var/log/apache2/insaflu_transfer.log
LogLevel warn


</VirtualHost> 

$ sudo a2ensite insaflu.conf
$ sudo systemctl restart apache2
$ sudo systemctl status apache2
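
Before restarting Apache, it can help to confirm that every directory named in the virtual host file exists, since paths are case-sensitive on Linux. The sketch below is an optional sanity check and not part of INSaFLU; `missing_dirs` is a hypothetical helper that only understands `<Directory "...">` lines written with double quotes, as in the configuration above.

```shell
# missing_dirs CONF_FILE  (hypothetical helper)
# Prints every path named in a <Directory "..."> line that is not an existing directory.
missing_dirs() {
    sed -n 's/^[[:space:]]*<Directory "\(.*\)">[[:space:]]*$/\1/p' "$1" |
    while IFS= read -r dir; do
        [ -d "$dir" ] || echo "$dir"
    done
}

# Usage: missing_dirs /etc/apache2/sites-available/insaflu.conf
```

An empty result means all listed directories were found.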

Create users without access to INSaFLU web page

Open your web browser and go to http://127.0.0.1:8000/admin/. Authenticate with your superuser credentials; new accounts can then be created under AUTHENTICATION AND AUTHORIZATION.

Remove from the file system the files deleted by users on the web site

You can remove the original fastq.gz files from the system because they are no longer used; the fastq files produced by Trimmomatic are the ones used from then on. You can also remove files belonging to samples, references, batch uploads and project samples that users deleted on the web site. This operation can free several GB of disk space.

⚠️ By default, files are removed from the file system only 10 days after they were removed on the web site. ⚠️ The original fastq.gz files are removed 10 days after being processed by Trimmomatic.

To identify the files that can be removed:

$ cd <where your INSaFLU is installed>
$ python3 manage.py run_remove_files --only_identify_files true

A log file will be created with this information in /var/log/insaflu/remove_files.log

To remove the files permanently from the file system (⚠️ the files can't be recovered):

$ cd <where your INSaFLU is installed>
$ python3 manage.py run_remove_files --only_identify_files false

Tip:

You can create a cron job to run this task every week.
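
One possible way to set up such a job is sketched below. The schedule (Sundays at 03:00), the helper name `cron_line` and the log file name `remove_files_cron.log` are assumptions chosen for illustration; the `run_remove_files` command is the one shown above.

```shell
# cron_line INSTALL_DIR  (hypothetical helper)
# Prints a crontab entry that runs the clean-up every Sunday at 03:00.
cron_line() {
    printf '0 3 * * 0 cd %s && python3 manage.py run_remove_files --only_identify_files false >> /var/log/insaflu/remove_files_cron.log 2>&1\n' "$1"
}

# Usage: append the entry to the current user's crontab, keeping existing entries.
#   (crontab -l 2>/dev/null; cron_line /usr/local/web_site) | crontab -
```

Run it as flu_user so the job has the same file permissions as the web site.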
