/p4

Process and Pipe Pipeline Panacea

25/4/14 (kl2)

The viv.pl script executes a set of commands and manages the flow of data between them, currently by using a combination of files
and named pipes (FIFOs). It is configured with a JSON file which describes a connected directed graph. The format of the config
is described below and there are some sample config files in the examples directory for reference.

Usage
=====

  viv.pl [-s] [-x] [-v <verbose_level>] [-o <logname>] <config.json>

  Flags:
    -s : strict; that is, fail if any of the executed commands exit with a non-zero status
    -x : execute; by default, the script will just parse the config file and report in the log what processes it would have created
    -v <verbose_level> : specify how chatty the log messages should be. Currently, verbosity levels range from 0 to 3
    -o <logname> : specify the log file name (default stdout)

  config.json : a JSON formatted file specifying a directed graph. The config is a hash array with two keys:
         1) "nodes" - a list of nodes, which are hash arrays with keys:
              "id" - a unique identifier for the node, used in the edges to specify the "from" and "to" nodes
              "type" - possible values INFILE, OUTFILE, RAFILE and EXEC (see below for more detail)
              "name" - for a file node {INFILE, OUTFILE, RAFILE}, this specifies the name of the file.
              "cmd" - for an EXEC node, specifies the command to be executed. Port names can be specified (currently these are arbitrary
                        embedded strings which are replaced by FIFO names generated at execution time)
         2) "edges" - a list of edges, which are hash arrays with keys:
              "id" - a unique identifier for the edge
              "from" - contains an id value for a node, followed by an optional ':' and a port name. When a port name is specified,
                         occurrences of that port name in the EXEC node are replaced by the names of FIFOs generated by the script
                         to direct I/O between the nodes
              "to" - contains an id value for a node, followed by an optional ':' and a port name. When a port name is specified,
                         the same substitution process described above (under "from") is applied.
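
Example config (sketch)
-----------------------

  The format above can be illustrated with a small Perl script that builds a config and prints it as JSON.
  This is only a sketch: the node ids, file names, commands and the port string __COUNT_IN__ are made up for
  illustration, and the sample files in the examples directory remain the authoritative reference.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP;    # core module, used here only to serialise the structure

# Hypothetical pipeline: input.txt -> sort -> wc -l -> count.txt
# All ids, names and the __COUNT_IN__ port string are illustrative.
sub build_config {
    return {
        nodes => [
            { id => 'src',   type => 'INFILE',  name => 'input.txt' },
            { id => 'sort',  type => 'EXEC',    cmd  => 'sort' },
            # The port name is an arbitrary string embedded in the command
            { id => 'count', type => 'EXEC',    cmd  => 'wc -l __COUNT_IN__' },
            { id => 'dst',   type => 'OUTFILE', name => 'count.txt' },
        ],
        edges => [
            # No port given: stdin/stdout of the EXEC node is used
            { id => 'e1', from => 'src',   to => 'sort' },
            # Port given: __COUNT_IN__ in the "count" command is replaced
            # by a FIFO name at execution time
            { id => 'e2', from => 'sort',  to => 'count:__COUNT_IN__' },
            { id => 'e3', from => 'count', to => 'dst' },
        ],
    };
}

# Print the config as JSON when run as a script
unless (caller) {
    print JSON::PP->new->pretty->canonical->encode(build_config());
}
```

  Saving the output as, say, config.json and running "viv.pl config.json" (without -x) should only log the
  processes that would be created, which is a cheap way to check a new config.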


Node types, attributes and behaviour
====================================

  INFILE
     specifies a file on the file system for reading

  OUTFILE
     specifies a file on the file system for writing

  RAFILE
     specifies an intermediate file on the file system for reading and writing. Used to move data between EXEC nodes
       when it is decided not to use the default pipe behaviour. When this node type is used, downstream nodes will
       not be launched until execution of any directly upstream nodes has completed. This node can have
       a "subtype" attribute with value "DUMMY" to indicate that the viv script should not take responsibility for
       creating the file (via output redirection), it should just coordinate execution of the connected EXEC nodes.

  EXEC
    specifies a command to be executed. The actual command appears in the "cmd" attribute. Direct data transfer
      between EXEC nodes is done via named pipes (fifos), which are automatically created by the script. Input/output
      defaults to stdin/stdout respectively, unless the "to" or "from" attributes include a port specification.
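
  The FIFO mechanism described for EXEC nodes can be sketched in a few lines of Perl. This is not the code
  used by viv.pl; it is a minimal illustration, assuming a POSIX system, of one process writing into a named
  pipe while another reads from it. The function name send_via_fifo and the FIFO file name are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(mkfifo _exit);
use File::Temp qw(tempdir);

# Send lines from a child process (the "upstream" node) to the parent
# (the "downstream" node) through a named pipe, and return what arrived.
sub send_via_fifo {
    my (@lines) = @_;

    my $dir  = tempdir(CLEANUP => 1);
    my $fifo = "$dir/edge_e1.fifo";    # hypothetical FIFO name
    mkfifo($fifo, 0600) or die "mkfifo $fifo failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: opening for write blocks until the reader opens the FIFO
        open(my $out, '>', $fifo) or _exit(1);
        print {$out} "$_\n" for @lines;
        close($out);                   # flush before leaving
        _exit(0);                      # skip END blocks and temp cleanup
    }

    # Parent: read until the writer closes its end
    open(my $in, '<', $fifo) or die "open $fifo failed: $!";
    chomp(my @got = <$in>);
    close($in);
    waitpid($pid, 0);
    return @got;
}

unless (caller) {
    print "$_\n" for send_via_fifo('alpha', 'beta');
}
```

  Both opens block until the other end of the FIFO is opened, so the two processes synchronise without
  further coordination. The same blocking behaviour is why an EXEC node with no upstream writer can wait forever.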
  

TODO
====
1. Improve logging. There is currently only one log file. It might be useful to have separate execution logs for each EXEC node,
  or at least clearer separation and labelling in the master log file.
2. Validate the graph. Currently there is little checking to see if the graph specification makes sense. For example, if a node
  tries to use an input or output port in another node, no checks are done to see if such a port actually exists. No checks are
  done to ensure that nodes specified in edges actually exist. The id for a node should also be unique for a given graph, but
  this is not checked. There is also no check to make sure that the graph is connected, so it is easy to specify orphan EXEC
  nodes which wait forever for input. So for the time being, be careful.
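
  Until such validation exists in the script, some of the checks listed in item 2 can be run separately on a
  decoded config. The following is a sketch only, not part of viv.pl: validate_graph is a hypothetical helper,
  and it assumes that a port "exists" if its name occurs as a substring of the node's "cmd". It reports
  duplicate node ids, edges that name unknown nodes, unknown ports, and EXEC nodes with no edges at all.
  It does not perform a full connectivity check.

```perl
use strict;
use warnings;

# Return a list of human-readable problems found in a decoded config
# (a hash reference with "nodes" and "edges" lists). Empty list = no issues found.
sub validate_graph {
    my ($cfg) = @_;
    my (@errors, %node, %touched);

    # Node ids must be unique
    for my $n (@{ $cfg->{nodes} || [] }) {
        push @errors, "duplicate node id '$n->{id}'" if exists $node{ $n->{id} };
        $node{ $n->{id} } = $n;
    }

    # Each edge end is "<node id>" or "<node id>:<port name>"
    for my $e (@{ $cfg->{edges} || [] }) {
        for my $end (qw(from to)) {
            my ($id, $port) = split /:/, ($e->{$end} // ''), 2;
            $id //= '';
            my $n = $node{$id};
            if (!$n) {
                push @errors, "edge '$e->{id}': unknown $end node '$id'";
                next;
            }
            $touched{$id} = 1;
            # Assumption: a port is valid if its name appears in the command
            if (defined $port && index($n->{cmd} // '', $port) < 0) {
                push @errors, "edge '$e->{id}': port '$port' not found in node '$id'";
            }
        }
    }

    # EXEC nodes that no edge refers to would never receive input
    for my $id (sort keys %node) {
        next unless ($node{$id}{type} // '') eq 'EXEC';
        push @errors, "EXEC node '$id' has no edges" unless $touched{$id};
    }
    return @errors;
}
```

  A caller could decode the config with JSON::PP, pass the result to validate_graph, and refuse to run viv.pl
  with -x if the returned list is not empty.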