`run_dna_cron.py` and `run_rna_cron.py` scripts should be set as cron jobs to periodically perform the following operations:

- check newly added ChIP-Seq / RNA-Seq experiments, generate and export JSON job files;
- check the state of all running ChIP-Seq / RNA-Seq experiments, update their states in the BioWardrobe DB;
- when an experiment finishes successfully, upload the generated data to the BioWardrobe DB.

Usage:

```
  -c CONFIG, --config CONFIG    BioWardrobe configuration file
  -j JOBS, --jobs JOBS          Folder to export generated jobs
```

- Install biowardrobe-analysis from source

  ```bash
  $ git clone https://github.com/Barski-lab/biowardrobe-analysis.git
  $ cd biowardrobe-analysis
  $ pip install .
  ```

  This will install `run-dna-cron` and `run-rna-cron` into `/usr/local/bin/`.

To make the configuration process easier we assume that:

- your home directory is `/home/biowardrobe/`
- you have already installed and configured:
  - BioWardrobe
    - the BioWardrobe configuration file is saved as `/etc/wardrobe/wardrobe` and has the following structure (the order of the first five non-commented lines is mandatory; a parsing sketch is given after this list)

      ```
      #MySQL host to connect
      127.0.0.1
      #MySQL User (Pay attention, the user should also have read access to Airflow DB)
      username
      #MySQL password
      userpassword
      #Wardrobe DB
      ems
      #MySQL port
      3306
      #Custom additional configuration data
      ```
    - the user who runs the `run_dna_cron.py` and `run_rna_cron.py` scripts has read access to the BioWardrobe configuration file
    - the BioWardrobe DB `ems.settings` table includes

      ```
      +-------------+-----------+-------------------------------------------------------------------+
      | key         | value     | description                                                       |
      +-------------+-----------+-------------------------------------------------------------------+
      | indices     | /indices  | Relative path to the directory for mapping software indices files |
      | preliminary | /RAW-DATA | Relative path where fastq and all preliminary results are stored  |
      | wardrobe    | /wardrobe | Absolute path to the Wardrobe directory                           |
      +-------------+-----------+-------------------------------------------------------------------+
      ```
  - cwl-airflow
    - the Airflow DB with the name `airflow` is saved on the same MySQL server as the BioWardrobe DB and is accessible by the user set in the BioWardrobe configuration file
    - the Airflow configuration file `airflow.cfg` includes the fields

      ```
      cwl_jobs = /home/biowardrobe/cwl/jobs
      cwl_workflows = /home/biowardrobe/cwl/workflows
      ```
    - the directory set as `cwl_jobs` in `airflow.cfg` has the following structure

      ```
      /home/biowardrobe/cwl/jobs
      ├── fail
      ├── new
      ├── running
      └── success
      ```
  - biowardrobe-analysis
    - `constants.py` includes the following constants:

      ```python
      BOWTIE_INDICES = "bowtie"
      RIBO_SUFFIX = "_ribo"
      STAR_INDICES = "STAR"
      ANNOTATIONS = "annotations"
      JOBS_NEW = 'new'
      JOBS_SUCCESS = 'success'
      JOBS_FAIL = 'fail'
      JOBS_RUNNING = 'running'
      CHR_LENGTH_GENERIC_TSV = "chrNameLength.txt"
      ANNOTATION_GENERIC_TSV = "refgene.tsv"
      ```
- you have cloned the latest Workflows into `/home/biowardrobe/cwl/workflows` (currently it's recommended to use the `v1.0.2` branch instead of `master`)
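
The parsing sketch referenced in the BioWardrobe configuration file assumption above is given here. It is purely illustrative: the function name and the returned keys are assumptions, not the actual biowardrobe-analysis code; it only shows how the five mandatory non-commented lines are ordered.

```python
# Illustrative sketch only: reads /etc/wardrobe/wardrobe assuming the layout
# shown above (lines starting with '#' are comments; the first five
# non-commented lines are host, user, password, database and port, in order).
def read_wardrobe_config(path="/etc/wardrobe/wardrobe"):
    values = []
    with open(path) as conf:
        for line in conf:
            line = line.strip()
            if line and not line.startswith("#"):
                values.append(line)
    host, user, password, db, port = values[:5]
    return {"host": host, "user": user, "passwd": password,
            "db": db, "port": int(port)}
```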

- To allow the `run_dna_cron.py` and `run_rna_cron.py` scripts to find the Airflow DB, the following record should be added into the `ems.settings` table

  ```sql
  INSERT INTO ems.settings VALUES ('airflowdb','airflow','Database name to be used by Airflow', 0, 3);
  ```

  where `airflowdb` is the key by which the name of the Airflow DB `airflow` is returned. The Airflow DB is used to check the state of the running workflows and their steps (it performs SELECT queries on the `dag_run` and `task_instance` tables).
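
  As an illustration of that check (the function name and query shape are assumptions, not the actual script code), the state of the latest matching DAG run could be fetched like this:

  ```python
  # Hypothetical sketch: return the state of the most recent DAG run for one
  # experiment UID, using an already opened DB-API cursor (e.g. from MySQLdb).
  def get_dag_run_state(cursor, airflow_db, uid):
      cursor.execute(
          "SELECT dag_id, state FROM `{0}`.`dag_run` "
          "WHERE dag_id LIKE %s "
          "ORDER BY execution_date DESC LIMIT 1".format(airflow_db),
          ("%" + uid + "%",))
      return cursor.fetchone()  # e.g. ('...<uid>...', 'running') or None
  ```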

- Create the /wardrobe/indices/bowtie folder

  This folder name is formed as `ems.settings[wardrobe] + ems.settings[indices] + constants.py[BOWTIE_INDICES]`
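
  A minimal sketch of that concatenation (the literal values simply mirror the `ems.settings` rows and `constants.py` shown above):

  ```python
  import os

  WARDROBE = "/wardrobe"       # ems.settings['wardrobe']
  INDICES = "/indices"         # ems.settings['indices']
  BOWTIE_INDICES = "bowtie"    # constants.py

  bowtie_dir = os.path.join(WARDROBE + INDICES, BOWTIE_INDICES)
  if not os.path.exists(bowtie_dir):
      os.makedirs(bowtie_dir)  # creates /wardrobe/indices/bowtie
  ```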

- Get the genome types list as `SELECT findex FROM ems.genome`. For each genome type create a subfolder within /wardrobe/indices/bowtie. The subfolder name should be equal to the genome type returned by the SELECT query.

  For example, if the `SELECT` query returned `hg19`, `mm10`, `dm3`, your directories should look like:

  ```
  /wardrobe/indices/bowtie/hg19
  /wardrobe/indices/bowtie/mm10
  /wardrobe/indices/bowtie/dm3
  ```
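
  A sketch of this step (the MySQLdb driver and the hard-coded credentials are placeholders taken from the configuration example above, not the actual script code):

  ```python
  import os
  import MySQLdb  # assumption: any DB-API compatible MySQL driver works similarly

  # Hypothetical sketch: one bowtie index subfolder per genome type
  conn = MySQLdb.connect(host="127.0.0.1", user="username",
                         passwd="userpassword", db="ems", port=3306)
  cursor = conn.cursor()
  cursor.execute("SELECT findex FROM ems.genome")
  for (genome,) in cursor.fetchall():                 # e.g. 'hg19', 'mm10', 'dm3'
      folder = os.path.join("/wardrobe/indices/bowtie", genome)
      if not os.path.exists(folder):
          os.makedirs(folder)
  ```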

- In each subfolder created in the previous step put the Bowtie indices corresponding to that genome type and the TAB-delimited chromosome length file chrNameLength.txt

  The name of the chromosome length file should be equal to `CHR_LENGTH_GENERIC_TSV` from `constants.py`
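
  A quick, purely illustrative check that every genome folder received its chromosome length file (the constant value mirrors `constants.py`):

  ```python
  import os

  CHR_LENGTH_GENERIC_TSV = "chrNameLength.txt"  # constants.py
  bowtie_root = "/wardrobe/indices/bowtie"

  # Print every genome folder that is still missing chrNameLength.txt
  for genome in sorted(os.listdir(bowtie_root)):
      genome_dir = os.path.join(bowtie_root, genome)
      expected = os.path.join(genome_dir, CHR_LENGTH_GENERIC_TSV)
      if os.path.isdir(genome_dir) and not os.path.isfile(expected):
          print("missing: " + expected)
  ```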

- For running RNA-Seq analysis the ribosomal Bowtie indices should be added too. For each of the genome type folders in /wardrobe/indices/bowtie create an additional folder with the suffix _ribo

  The suffix `_ribo` should be equal to `RIBO_SUFFIX` from `constants.py`

  For example, if you already have the directories `hg19`, `mm10`, `dm3` in the /wardrobe/indices/bowtie/ folder, you should add:

  ```
  /wardrobe/indices/bowtie/hg19_ribo
  /wardrobe/indices/bowtie/mm10_ribo
  /wardrobe/indices/bowtie/dm3_ribo
  ```
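
  Continuing the illustrative sketch from the step above (the genome list is just the example one):

  ```python
  import os

  RIBO_SUFFIX = "_ribo"  # constants.py
  bowtie_root = "/wardrobe/indices/bowtie"

  # Add a ribosomal index folder next to every existing genome folder
  for genome in ("hg19", "mm10", "dm3"):   # or the list from SELECT findex FROM ems.genome
      ribo_dir = os.path.join(bowtie_root, genome + RIBO_SUFFIX)
      if not os.path.exists(ribo_dir):
          os.makedirs(ribo_dir)
  ```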

- In each subfolder created in the previous step put the ribosomal Bowtie indices corresponding to that genome type

- Create the /wardrobe/indices/annotations folder

  This folder name is formed as `ems.settings[wardrobe] + ems.settings[indices] + constants.py[ANNOTATIONS]`

- Get the genome types list as `SELECT findex FROM ems.genome` (you should already have this list from an earlier step). For each genome type create a subfolder within /wardrobe/indices/annotations. The subfolder name should be equal to the genome type returned by the SELECT query.

  For example, if the `SELECT` query returned `hg19`, `mm10`, `dm3`, your directories should look like:

  ```
  /wardrobe/indices/annotations/hg19
  /wardrobe/indices/annotations/mm10
  /wardrobe/indices/annotations/dm3
  ```

- In each subfolder created in the previous step put the TAB-delimited annotation file refgene.tsv corresponding to that genome type. This file does not have to be sorted.

  The TAB-delimited annotation file name should be equal to `ANNOTATION_GENERIC_TSV` from `constants.py`
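
  For illustration only, the location at which each annotation file is expected to end up (the literal values mirror `ems.settings` and `constants.py` above):

  ```python
  import os

  ANNOTATIONS = "annotations"               # constants.py
  ANNOTATION_GENERIC_TSV = "refgene.tsv"    # constants.py
  annotations_root = os.path.join("/wardrobe" + "/indices", ANNOTATIONS)

  # Expected annotation file location for each genome type
  for genome in ("hg19", "mm10", "dm3"):    # or the list from SELECT findex FROM ems.genome
      print(os.path.join(annotations_root, genome, ANNOTATION_GENERIC_TSV))
      # e.g. /wardrobe/indices/annotations/hg19/refgene.tsv
  ```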

- To make the Genome Browser display genome coverage tracks from bigWig files, apply the patches from biowardrobe_patched_view

- To run basic analysis the `ems.experimenttype` table should be updated with the script experimenttype_patch.sql. If the columns `workflow` or `template` already exist in the table, delete them before running the script.
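
  Purely as an illustration of that clean-up (these statements are an assumption and are not part of experimenttype_patch.sql; the hard-coded credentials are the placeholders from the configuration example above):

  ```python
  import MySQLdb  # assumption: same MySQL driver as in the other sketches

  conn = MySQLdb.connect(host="127.0.0.1", user="username",
                         passwd="userpassword", db="ems", port=3306)
  cursor = conn.cursor()
  # Remove the pre-existing columns before running experimenttype_patch.sql
  cursor.execute("ALTER TABLE ems.experimenttype DROP COLUMN workflow;")
  cursor.execute("ALTER TABLE ems.experimenttype DROP COLUMN template;")
  ```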

- After applying the abovementioned SQL scripts, to make BioWardrobe display genome coverage tracks (the old `bedGraph` and the new `bigWig`), the function `addGB(tab)` from Experiment.js should be updated to fetch not only `_wtrack` (old genome coverage in `bedGraph` format), but also `_multi_f_wtrack` and `_f_wtrack` for `bigWig` tracks.

  ```javascript
  /***********************************************************************
   * Add Genome Browser Tab
   ***********************************************************************/
  addGB: function (tab) {
      // This is how it was before. Works only for the bedGraph track
      // var gtbl = this.UID.replace(/-/g, '_') + '_wtrack';

      // This part was added for RNA-Seq genome coverage tracks
      var gtbl = this.UID.replace(/-/g, '_') + '_multi_f_wtrack';   // multiwig
      gtbl += '=full&' + this.UID.replace(/-/g, '_') + '_wtrack';   // bedGraph
      gtbl += '=full&' + this.UID.replace(/-/g, '_') + '_f_wtrack'; // bigWig
  ```

- Because the new status `"JOB_CREATED": 1010` was added into `LIBSTATUS` from `constants.py`, the app.css file from BioWardrobe should be updated to display the correct icon

  ```css
  .gear-1-10 {
      background-image: url(images/gear_new.png) !important;
      width: 16px;
      height: 16px;
  }
  ```

  Basically, you should change `gear_warning.png` to `gear_new.png` for `.gear-1-10`

- To drop from the BioWardrobe DB all of the tables created by biowardrobe-analysis, as well as all of the tables from the Airflow DB related to the experiment to be restarted, update the original ForceRun.py with the following commands

  ```python
  # Airflow specific tables
  settings.cursor.execute("DROP TABLE IF EXISTS `" + DB[0] + "`.`" + string.replace(UID, "-", "_") + "_f_wtrack`;")
  settings.cursor.execute("DROP TABLE IF EXISTS `" + DB[0] + "`.`" + string.replace(UID, "-", "_") + "_upstream_f_wtrack`;")
  settings.cursor.execute("DROP TABLE IF EXISTS `" + DB[0] + "`.`" + string.replace(UID, "-", "_") + "_downstream_f_wtrack`;")

  # Clean up airflowdb
  airflowDB = settings.settings["airflowdb"]
  settings.cursor.execute("DELETE FROM `{0}`.`xcom` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`task_instance` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`task_fail` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`sla_miss` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`log` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`job` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`dag_run` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  settings.cursor.execute("DELETE FROM `{0}`.`dag` WHERE dag_id LIKE '%{1}%';".format(airflowDB, UID))
  ```

  The location at which to insert these commands can be checked in the updated ForceRun.py

- Update the crontab job

  ```
  # For ChIP-Seq analysis
  */1 * * * * . ~/.profile && run-dna-cron -c /etc/wardrobe/wardrobe -j /home/biowardrobe/cwl/jobs >> /wardrobe/tmp/RunAirflowDNA.log 2>&1
  # For RNA-Seq analysis
  */1 * * * * . ~/.profile && run-rna-cron -c /etc/wardrobe/wardrobe -j /home/biowardrobe/cwl/jobs >> /wardrobe/tmp/RunAirflowDNA.log 2>&1
  ```

Both the `run_dna_cron.py` and `run_rna_cron.py` scripts use the BioWardrobe configuration file set via the `--config` / `-c` argument (`/etc/wardrobe/wardrobe` by default). This file is used to get access to the BioWardrobe DB. Make sure that the scripts have read access to this configuration file.