Managing and maintaining a genomics infrastructure requires a multi-layered approach. Typically, this involves a Laboratory Information Management System (LIMS), the sequencing instrumentation, the computational infrastructure to store and process the data, as well as additional layers such as project management software (e.g. JIRA). In a production environment, all of these layers need to communicate and integrate with each other efficiently, which can be a complicated task. In addition, the approach needs to be descriptive and yet flexible enough to accommodate new additions to the existing layers. Our solution, GURU (Genomics seqUencing Run aUtomation), addresses this need. GURU is implemented using Apache Airflow, which allows for the authoring, scheduling and monitoring of workflows, or DAGs (Directed Acyclic Graphs). Specific DAGs have been implemented to handle various sequencing runs, from single-cell applications to RNA/DNA sequencing, short reads vs. long reads, archiving of sequencing runs, and initiating specific types of analysis (QC/WGS/WES/RNA-seq/ATAC/ChIP etc.), as well as automatically communicating with end-users regarding the status of their samples/runs. GURU has been containerized using Docker to enable easy deployment across various configurations.
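As an illustration of how runs are modelled, below is a minimal sketch of a sequencing-run DAG. The DAG id, task names and the work_dir parameter are hypothetical, and the sketch assumes the apache-airflow-providers-ssh package plus the "guru_ssh" connection defined later in this guide; the actual GURU DAGs live in the dags/ folder of this repository.

# Minimal sketch of a sequencing-run DAG; the DAG id, task names and the
# work_dir parameter are hypothetical, not GURU's actual code.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="example_sequence_run",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,                   # triggered manually from the UI plugin
    catchup=False,
) as dag:
    # Each step runs on the remote cluster over the "guru_ssh" connection,
    # so GURU itself can live outside the data-processing infrastructure.
    demultiplex = SSHOperator(
        task_id="demultiplex",
        ssh_conn_id="guru_ssh",
        command="echo demultiplexing {{ dag_run.conf.get('work_dir') }}",
    )
    qc = SSHOperator(
        task_id="qc",
        ssh_conn_id="guru_ssh",
        command="echo running QC",
    )
    demultiplex >> qc                # directed acyclic ordering of tasks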
- Customized user interface plugins built with the Flask web framework, ensuring DAGs are bound to their respective plugins.
- Airflow models (DagBag) are used for communication with the Airflow plugins, avoiding the need for REST APIs (see the sketch after this list).
- Commands are launched directly on the computational environment (server/HPC/cloud) and their status is monitored, meaning that GURU can be deployed anywhere; it does not have to run on the same computational infrastructure where the data processing happens.
- Integration with JIRA (or other issue-tracking systems) is achieved using a REST API. The same approach can be implemented for any other additional layers of communication.
- Integration with bioinformatics workflow management systems such as BioSAILs ensures that more complex analyses can be supported (other WMSs, e.g. Nextflow, CWL or Snakemake, can also be supported).
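As a rough sketch of the DagBag-based approach (illustrative names, not GURU's actual plugin code), a Flask plugin view could trigger a DAG through Airflow's internals rather than a REST call:

# Rough sketch: trigger a DAG from a Flask plugin via Airflow internals
# instead of the REST API (illustrative names, not GURU's actual code).
from airflow.api.common.trigger_dag import trigger_dag
from airflow.models import DagBag

def submit_run(form_data: dict) -> None:
    # Look the DAG up through the DagBag rather than over HTTP.
    dag = DagBag().get_dag("example_sequence_run")  # hypothetical DAG id
    if dag is None:
        raise ValueError("DAG not found")
    # Hand the validated form fields to the new DAG run as its configuration.
    trigger_dag(dag_id=dag.dag_id, conf=form_data)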
- Apache Airflow (v2.5.3)
- Docker
- Python
- Shell scripting
- Python Flask WTForms
- JIRA (v6.3.12)
- The JIRA version we are using is outdated, and the integration is based on a REST API with OAuth 1.0 (see the connection sketch below). If you require assistance with other project management software or a later version of JIRA, please contact us to see if we can help.
- We use MISO as our LIMS. If you require assistance integrating GURU with your existing LIMS, then please contact us to see if we can help.
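For reference, connecting to JIRA with OAuth 1.0 and a private key via the Python jira package looks roughly like the sketch below; the environment variable names mirror the .env file described later, and the key path is an assumption.

# Rough sketch of an OAuth 1.0 JIRA connection using the Python "jira"
# package; env variable names mirror the .env file, key path is an example.
import os

from jira import JIRA

with open("keys/jira.pem") as f:  # private key placed in the keys folder
    key_cert = f.read()

jira = JIRA(
    server=os.environ["JIRA_SERVER"],
    oauth={
        "access_token": os.environ["OAUTH_TOKEN"],
        "access_token_secret": os.environ["OAUTH_TOKEN_SECRET"],
        "consumer_key": os.environ["CONSUMER_KEY"],
        "key_cert": key_cert,
    },
)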
You can install GURU using pip or Docker. The simplest way to get GURU up and running quickly is to use Docker.
- Docker (latest version, v23.0.2+)
- MISO LIMS (optional: iSkyLIMS or other LIMS tools)
- JIRA (optional: Redmine or other project management software)
git clone https://github.com/nyuad-corebio/guru
cd guru
Define the environment variables.
Note:-
- JIRA is optional; if you are not using JIRA, customize the .env file accordingly.
- Since the JIRA integration is based on OAuth 1.0 with a private key, place a key file called "jira.pem" in the "keys" folder.
cat .env
### Airflow variables
AIRFLOW_UID=<Airflow USER ID - Default is 50000>
AIRFLOW_URL=<IP Address or Hostname of the host machine>
AIRFLOW_PORT=8080
### QC Work directory Path
QC_WORKDIR=<Path-to-workflow-directory>
### JIRA variables
CONSUMER_KEY=<key_name>
JIRA_SERVER=<URL or IP Address>
OAUTH_TOKEN=<token>
OAUTH_TOKEN_SECRET=<token_secret>
Update the ownership of the directory (e.g. we run Airflow as UID 50000).
chown -R 50000 ./guru
Update the Airflow connection parameters for SSH, SMTP and MySQL.
Note:-
- Modify the SSH key path, user credentials, etc.
vim scripts/airflow_conn.sh
#!/bin/bash
#############################################################################
# Arguments for Airflow initialization
# The SSH, MySQL and SMTP connections below are needed to connect to the cluster.
############################################################################
## Defining SSH connection
airflow connections add guru_ssh --conn-type ssh --conn-host <hostname or IP address> --conn-login user --conn-port 22 --conn-extra '{"key_file": "/home/airflow/.ssh/id_rsa", "missing_host_key_policy": "AutoAddPolicy"}'
## Defining SMTP connection
airflow connections add guru_email --conn-type email --conn-host smtp.gmail.com --conn-login <emailID> --conn-password <email-pass> --conn-port <port-num>
## Defining MySQL connection
airflow connections add guru_mysql --conn-type mysql --conn-login <user> --conn-password <pass> --conn-host <hostname or IP address> --conn-port <port> --conn-schema <database name> --conn-extra '{"ssl_mode": "DISABLED"}'
To invoke the Docker-based installation, issue the commands below.
docker compose up --build -d
docker compose restart
Verify the services
docker compose ps
Inspect the logs
docker compose logs -f
To access the Airflow user interface, open http://IP-address:8080 in a web browser and use the credentials airflow/airflow.
Note:-
- If you run this service on a server, specify (IP-address or hostname):8080 in the browser.
- If you run this service on a standalone machine (e.g. a laptop), specify localhost:8080 in the browser.
Installation using pip:-
git clone https://github.com/nyuad-corebio/guru
cd guru
Define the environment variables in the startup file of your favourite shell (e.g. .bashrc or .zshrc).
Note:-
- JIRA is optional; if you are not using JIRA, customize the environment file accordingly (e.g. .bashrc / .zshrc).
### Airflow variables
export AIRFLOW_HOME=<Path-to-airflow-home>
export AIRFLOW_URL=<IP Address or Hostname of the host machine>
export AIRFLOW_PORT=8080
### QC Work directory Path
export QC_WORKDIR=<Path-to-workflow-directory>
### JIRA env variables
export CONSUMER_KEY=<key_name>
export JIRA_SERVER=<URL or IP Address>
export OAUTH_TOKEN=<token>
export OAUTH_TOKEN_SECRET=<token_secret>
Install the prerequisite Python packages using the command below.
pip3 install -r pip-requirements.txt
Install the JIRA module (optional)
Note:-
- Since the JIRA integration is based on OAuth 1.0 with a private key, place a key file called "jira.pem" in the ~/.ssh/ directory.
cd pkgs
sh pip_jira.sh
Initialize the Airflow database; this will create airflow.cfg in the AIRFLOW_HOME directory (defined above).
airflow db init
Update the Airflow connection parameters for SSH, SMTP and MySQL, then execute the script.
Note:-
- Modify the SSH key path, user credentials, etc.
vim scripts/airflow_conn_pip.sh
#!/bin/bash
#############################################################################
# Arguments for Airflow initialization
# The SSH, MySQL and SMTP connections below are needed to connect to the cluster.
############################################################################
## Defining SSH connection
airflow connections add guru_ssh --conn-type ssh --conn-host <hostname or IP address> --conn-login user --conn-port 22 --conn-extra '{"key_file": "<Path-to-SSH-private-key>", "missing_host_key_policy": "AutoAddPolicy"}'
## Defining SMTP connection
airflow connections add guru_email --conn-type email --conn-host smtp.gmail.com --conn-login <emailID> --conn-password <email-pass> --conn-port <port-num>
## Defining MySQL connection
airflow connections add guru_mysql --conn-type mysql --conn-login <user> --conn-password <pass> --conn-host <hostname or IP address> --conn-port <port> --conn-schema <database name> --conn-extra '{"ssl_mode": "DISABLED"}'
Create a user account named airflow with admin privileges.
airflow users create \
--username airflow \
--password airflow \
--firstname <first-name> \
--lastname <lastname> \
--role Admin \
--email <specify-email>
To start the Airflow instance (following the Airflow documentation), open two terminal windows (these can be two separate tabs). In the first tab (or window), launch the command below to start the Airflow scheduler service.
airflow scheduler
Once the above command has started (and there are no errors), launch the command below in the second tab (or window).
airflow webserver
To access the Airflow user interface, open a web browser, go to http://IP-address:8080 and use the credentials airflow/airflow.
Note:-
- If you run this service on a server, specify (IP-address or hostname):8080 in the browser.
- If you run this service on a standalone machine (e.g. a laptop), specify localhost:8080 in the browser.
To launch a "standard" paired-end, dual-index (with reverse-complemented i7) DAG:
- DAGs: Once you have successfully logged into the Airflow UI, the available Airflow DAGs should be visible.
- User Interface: Navigate to the "Demultiplex Runs" tab to see the appropriate custom UI input for the Airflow DAGs.
- Sequence Run: Select "Default Sequence Run" from the "Demultiplex Runs" tab.
- Sequence Run Summary: Below is just a default "standard" sequencing run setup.
Below is a description of the fields.
- Project Name: (optional) a brief description of the project, e.g. "10x Single Cell RNA-seq for Marcus Lab".
- MISO ID: the unique sequencing run ID defined in our MISO LIMS. GURU uses it to automatically gather the required SampleSheet information for the sequenced samples, which is needed to generate a demultiplexing sheet.
- Reverse Complement: defines whether or not to reverse complement the second index (i7). This only applies to "standard" sequencing run setups (DAGs).
- Email Address: the email address(es) of one or more users to be notified about progress. The addresses should be comma-separated (no whitespace); email validation is enabled.
- Jira Ticket: we use JIRA for logging the run status. We currently run an older version of JIRA (v6.3.12), integrated with Airflow using OAuth keys.
- Workflow: we use BioSAILs as our workflow management system (WMS) for processing the raw sequencing reads (QC/QT); you can choose any other WMS such as Snakemake or Nextflow. This field also specifies the location of the YAML-formatted QC/QT workflow file that BioSAILs will use.
- Adapter Sequence 1: the adapter sequence to be trimmed from read 1.
- Adapter Sequence 2: the adapter sequence to be trimmed from read 2.
- Working Directory: the full path to the sequencing run folder.
- Run Status: Clicking the "Check Run Status" button shows you the status (progress) of the specific DAG run.
- 10X Run: To launch a "10x sequencing run", select the appropriate DAG. Please note that under the "10x Sequence Run" DAG, you will have multiple options depending on the nature of your run (e.g. scRNA -> cellranger, scATAC -> cellranger_atac, etc.).
- 10X Workflow: choose the 10X workflow by selecting the appropriate radio button.
- 10X Run Summary: similarly, once you fill in and submit the "10X Sequence Run" form, you can see the status of its DAG runs.
If you need to contact us for feedback or queries, or to raise an issue, you can do so via our GitHub issues page.
If you use GURU in your research or publications, please cite our GitHub page.
This work was supported by Tamkeen under the NYU Abu Dhabi Research Institute Award to the NYUAD Center for Genomics and Systems Biology (ADHPG-CGSB). We would also like to acknowledge the High Performance Computing department at NYU Abu Dhabi for the computational resources and their continuous support.
- CGSB webpage: for news and updates
- BioSAILs
- Airflow
- MISO LIMS
- Docker