bionode-watermill for dummies!

Objective

This tutorial is intended for those assembling a bioinformatics pipeline with bionode-watermill for the first time.

First things first

This tutorial assumes that you have npm, git and node installed. Node.js version 7 or higher is required for the full tutorial.

To setup and test the scripts within this tutorial follow these simple steps:

  • git clone https://github.com/bionode/bionode-watermill-tutorial.git
  • cd bionode-watermill-tutorial
  • npm install bionode-watermill

Defining a task

Watermill is a tool that lets you orchestrate tasks. So, let's first understand how to define a task.

To define a task we first need to require bionode-watermill:

const watermill = require('bionode-watermill')
const task = watermill.task /* we have to extract task because the watermill
  object exposes other properties as well */

Afterwards, we can use the task variable to define a given task:

  • Using standard javascript style:
// this is a kiss example of how tasks work with shell
const simpleTask = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt',  //defines parameters to be passed to the
    // task function
  name: 'This is the task name' //defines the name of the task
}, function(resolvedProps) {
    const params = resolvedProps.params
    return 'touch ' + params
  }
)
  • Or you can do something like the following in ES6 syntax, using arrow functions:
// this is a kiss example of how tasks work with shell
const simpleTask = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt',  /*defines parameters to be passed to the
     task function*/
  name: 'This is the task name' //defines the name of the task
}, ({ params }) => `touch ${params}`
)

Note: Template literals are very useful since they allow you to include placeholders (${ }) within strings. Template literals are enclosed by backticks (` `), as exemplified above.
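For instance, here is a quick sketch of the difference between plain string concatenation and a template literal:

const fileName = 'test_file.txt'
console.log('touch ' + fileName) // string concatenation: touch test_file.txt
console.log(`touch ${fileName}`) // template literal: touch test_file.txt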

Then after defining the task, it may be executed like this:

// runs the task and returns a promise (a callback can also be used)
simpleTask()

This task will create a new empty file inside a directory named "data/<uid>/". You may also notice that a fair amount of text was printed to the terminal; this output can be useful for debugging your pipelines.

The above example is available here. You can test it by running: node simple_task.js
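Since invoking a task returns a promise, you can also chain a handler to run code once the task has finished. Here is a minimal sketch (the exact shape of the resolved value depends on bionode-watermill's internals, so it is simply logged here):

simpleTask()
  .then((results) => console.log('task finished:', results)) // resolved results
  .catch((err) => console.error('task failed:', err)) // any error raised by the task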

Input/output

Although already discussed elsewhere in the bionode-watermill documentation, in this tutorial I intend to explain how input and output are managed by bionode-watermill. First, you can either hardcode the input to something like:

{ input: 'ERR1229296.sra' }

or instead you can specify glob patterns, which are in fact better explained here. Basically, what you need to know is that you can set the input to something like:

{ input: '*.sra' }

This tells bionode-watermill to crawl the data directory in search of the first hit that matches this pattern. So pay attention when specifying these glob patterns if this folder contains multiple .sra files, or if tasks other than your target task (the last one that generated a .sra file, in this example) also produce them. To circumvent this you can use file names that are easy to tell apart. For instance, if you have one file named ERR1229296.sra and another named ERR1229297.sra and you want just the first one, you can pass the input as follows:

{ input: '*6.sra' }

or of course hardcode it.
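As an illustrative sketch (the file names and the shell command here are hypothetical), a task that picks up ERR1229296.sra through the *6.sra glob could look like this:

const countBytes = task({
  input: '*6.sra',   // matches ERR1229296.sra but not ERR1229297.sra
  output: '*.count', // checks if output file matches the specified pattern
  name: 'Count bytes of one .sra file'
}, ({ input }) => `wc -c ${input} > reads.count` // input is the resolved file path
)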

Output works in a very similar way; however, there are a few specificities that the user must be aware of:

  • The output object is not the output filename; it is only used to match the expected result of the task against a pattern. It is therefore necessary for the task to resolve properly, but it does not set the output file name:
// this won't work!!!
{ output: 'myfile.txt' }

// rather you should provide this as follows:
{ 
  output: '*.txt',
  params: { output: 'myfile.txt' }
}

Remember, task.output is used to match the output file pattern; if you want to give the output a specific filename, you need to use the task.params.output object instead, where you can freely specify the output file name.
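Putting the two together, here is a minimal sketch of a task that writes to a specific filename (the names are only illustrative):

const writeNamedFile = task({
  output: '*.txt',                  // pattern used to resolve the task output
  params: { output: 'myfile.txt' }, // the concrete filename we want
  name: 'Write to a named file'
}, ({ params }) => `echo "hello" > ${params.output}`
)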

Using orchestrators

What are orchestrators?

Orchestrators are the operators that bionode-watermill provides to combine tasks into a pipeline. There are three of them, covered below: join, junction and fork.

  • Join

Join is an operator that lets you run a number of tasks in a given order. For instance, suppose we are interested in creating a file and then writing to it, in two separate steps. Let's first define a new task to run after the task we called simpleTask:

const writeToFile = task({
  input: '*.txt', // specifies the pattern of the expected input
  output: '*.txt', // checks if output file matches the specified pattern
  name: 'Write to file' //defines the name of the task
}, ({ input }) => `echo "some string" >> ${input}`
)

So, the writeToFile task writes "some string" to the file we just created in simpleTask. However, the file needs to be created first, and only then can we write to it. In order to achieve this we use join:

Before assembling the pipeline, we first need to require join:

// === WATERMILL ===
const {
  task,
  join
} = require('bionode-watermill')

And then,

// this is a kiss example of how join works
const pipeline = join(simpleTask, writeToFile)

//executes the join itself
pipeline()

This operation will generate two directories inside the data folder: one for the first task (simpleTask), which creates a new file called test_file.txt, and one for the second task (writeToFile), which creates a symlink to test_file.txt and writes to it, since we indicated that we would like to write to the same file we take as input. Note that, once again, files will be inside a directory named "data/<uid>/" (but in this case you will have two directories with distinct uids).

The above example is available here. You can test the above example by running: node simple_join.js

  • Junction

Unlike join, junction allows you to run multiple tasks in parallel.

However, we will have to create a new task: if we simply replaced join with junction in the previous pipeline, we would end up with a file named test_file.txt with nothing written inside. Since both tasks would run at the same time, the file would still be created, but the write to it would fail.

But first, don't forget to:

// === WATERMILL ===
const {
  task,
  join,
  junction
} = require('bionode-watermill')

And only then:

// this will not produce the file with text in it!
const pipeline = junction(simpleTask, writeToFile)

So, we will define a new simple task:

const writeAnotherFile = task({
  output: '*.file', // checks if output file matches the specified pattern
  params: 'another_test_file.file', /* defines parameters to be passed to
    the task function */
  name: 'Yet another task'
}, ({ params }) => `touch ${params} | echo "some new string" >> ${params}`
)

And then execute the new pipeline:

// this is a kiss example of how junction works
const pipeline = junction(
  join(simpleTask, writeToFile), /* these joined tasks will be executed at
  the same time as the task below */
  writeAnotherFile
)

//executes the pipeline itself
pipeline()

This new pipeline consists of creating two files and writing text to them. Note that the writeAnotherFile task uses a shell pipe ("|") along with the shell commands touch and echo; that is a feature bionode-watermill also supports. Of course, these are simple tasks that can be performed with shell commands alone (they are merely illustrative). Instead, as mentioned above, you can use JavaScript callback functions or promises as the final return of a task.
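For example, here is a minimal sketch of a task whose function returns a promise instead of a shell command string (the body is only illustrative; what you resolve with depends on what downstream tasks expect):

const promiseTask = task({
  input: '*.txt',
  output: '*.txt',
  name: 'Promise-based task'
}, ({ input }) => new Promise((resolve, reject) => {
  // ... do some asynchronous work with `input` here ...
  resolve(input) // resolve once the work is done
})
)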

Nevertheless, if you browse the data folder, you should now have three folders (because you have three tasks): one with the text file generated by the first task, another with a symlink to it (used by the second task to write to this file), and finally a third one containing the file generated and written by the third task (named another_test_file.file).

The above example is available here. You can test the above example by running: node simple_junction.js

  • Fork

While junction runs two or more tasks at the same time, fork allows you to pass the output of two or more different tasks to the next task. Imagine you have two different files being generated by two different tasks, and you want to process both with the same task in the next step. In this case bionode-watermill uses fork to split the pipeline into two distinct branches, which are then processed independently.

If you have something like:

join(
  taskA,
  fork(taskB, taskC),
  taskD
)

This will result in something like: taskA -> taskB -> taskD' and taskA -> taskC -> taskD'', with two distinct final outputs for the pipeline. This is quite a useful feature for benchmarking programs, or for running multiple programs that perform the same type of analysis and comparing their results.

Importantly, the same type of pipeline with junction instead of fork,

join(
  taskA,
  junction(taskB, taskC),
  taskD
)

would result in the following workflow: taskA -> taskB, taskC -> taskD, where taskD has only one final result.

But enough talk, let's get to work!

First:

// === WATERMILL ===
const {
  task,
  join,
  fork
} = require('bionode-watermill')

For the fork tutorial, two tasks will be defined. Each one creates a file and writes to it:

const simpleTask1 = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', /* defines parameters to be passed to the
    task function */
  name: 'task 1: creating file 1' //defines the name of the task
}, ({ params }) => `touch ${params} | echo "this is a string from first file" >> ${params}`
)

const simpleTask2 = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'another_test_file.txt', /* defines parameters to be passed to the
    task function */
  name: 'task 2: creating file 2'
}, ({ params }) => `touch ${params} | echo "this is a string from second file" >> ${params}`
)

Then, a task to be performed after the fork, which will add the same text to these files:

const appendFiles = task({
    input: '*.txt', // specifies the pattern of the expected input
    output: '*.txt', // checks if output file matches the specified pattern
    name: 'Write to files' //defines the name of the task
  }, ({ input }) => `echo "after fork string" >> ${input}`
)

And finally our pipeline execution:

// this is a kiss example of how fork works
const pipeline = join(
  fork(simpleTask1, simpleTask2),
  appendFiles
)

//executes the pipeline itself
pipeline()

This should result in four output directories in our data folder. Notice that, contrary to junction, where three tasks would render three output directories, with fork the result of our pipeline is four output directories, because the outputs from simpleTask1 and simpleTask2 were each processed by the appendFiles task.

The above example is available here. You can test the above example by running: node simple_fork.js

Useful links