Note: This is still a work in progress.
Implementation of an automatic data processing flow for L200 data, based on Snakemake.
Data processing resources are configured via a single site-dependent (and possibly user-dependent) configuration file, referred to as "config.json" below. You may choose any file name, though.
Use the included templates/config.json as a template and adjust the data base paths as necessary.
When running Snakemake, the path to the config file must be provided via --configfile=path/to/configfile.json. For example, run:
snakemake -j`nproc` --configfile=config.json file_to_generate
Snakemake is controlled by the Snakefile, which specifies the rules to generate each file.
When Snakemake is not run from the directory containing the Snakefile, the path to the Snakefile must be provided via --snakefile path/to/Snakefile.
Data generation is based on key-lists, which are flat text files
(extension ".keylist") containing one entry of the form
{experiment}-{period}-{run}-{datatype}-{timestamp}
per line.
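As an illustration of the key format, the following minimal Python sketch splits one key-list entry into its five fields. It is not part of the dataflow: the names FileKey and parse_key and the example key are hypothetical, and the sketch assumes that no field contains a hyphen itself.

```python
from typing import NamedTuple


class FileKey(NamedTuple):
    experiment: str
    period: str
    run: str
    datatype: str
    timestamp: str


def parse_key(entry: str) -> FileKey:
    """Split one key-list line into its five hyphen-separated fields."""
    fields = entry.strip().split("-")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {entry!r}")
    return FileKey(*fields)


# Hypothetical key; real period, run, datatype and timestamp values will differ.
key = parse_key("l200-myper-myrun-mytype-mytimestamp")
print(key.period, key.run)
```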
Key-lists can be auto-generated based on the available DAQ files using Snakemake targets of the form
all-{experiment}.keylist
all-{experiment}-{period}.keylist
all-{experiment}-{period}-{run}.keylist
all-{experiment}-{period}-{run}-{datatype}.keylist
which generate the list of available file keys for all files of an experiment, for a specific period, for a specific period and run, and so on.
For example:
snakemake -j4 --configfile=config.json all-l200-myper.keylist
will generate a key-list with all files from period myper.
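The four target patterns above differ only in how many leading fields are given. A small Python sketch of that naming rule, under the assumption that fields may only be omitted from the right (the helper keylist_target is hypothetical and not provided by the dataflow):

```python
from typing import Optional


def keylist_target(experiment: str, period: Optional[str] = None,
                   run: Optional[str] = None, datatype: Optional[str] = None) -> str:
    """Build an all-*.keylist target name from the leading fields that are given."""
    parts = [experiment]
    optional = [("period", period), ("run", run), ("datatype", datatype)]
    for i, (name, value) in enumerate(optional):
        if value is None:
            # Fields are positional, so nothing may follow an omitted one.
            later = [n for n, v in optional[i + 1:] if v is not None]
            if later:
                raise ValueError(f"{later[0]} given without {name}")
            break
        parts.append(value)
    return "all-" + "-".join(parts) + ".keylist"


print(keylist_target("l200"))           # all-l200.keylist
print(keylist_target("l200", "myper"))  # all-l200-myper.keylist
```

The resulting string can be passed to snakemake as the file to generate.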
File-lists are flat files listing output files that should be generated,
with one file per line. A file-list will typically be generated for a given
data tier from a key-list, using the Snakemake targets of the form
{label}-{tier}.filelist
(generated from {label}.keylist).
For file-lists based on auto-generated key-lists, such as all-{experiment}-{period}-{tier}.filelist, the corresponding key-list (all-{experiment}-{period}.keylist in this case) will be created automatically if it doesn't exist.
Example:
snakemake -j4 --configfile=config.json all-mydet-mymeas-tier2.filelist
File-lists may of course also be derived from custom key-lists, generated
manually or by other means, e.g. my-dataset-raw.filelist will be
generated from my-dataset.keylist.
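The relation between a file-list target and the key-list it is derived from can be sketched in a few lines of Python. This is only an illustration of the {label}-{tier}.filelist naming rule, assuming the tier is the last hyphen-separated component of the name; the function keylist_for is hypothetical and not part of the dataflow.

```python
def keylist_for(filelist: str) -> str:
    """Return the key-list name that a {label}-{tier}.filelist target is derived from."""
    suffix = ".filelist"
    if not filelist.endswith(suffix):
        raise ValueError(f"not a file-list target: {filelist!r}")
    stem = filelist[: -len(suffix)]
    # Assumption: the tier is everything after the last hyphen.
    label, sep, tier = stem.rpartition("-")
    if not sep or not label or not tier:
        raise ValueError(f"cannot split {filelist!r} into label and tier")
    return f"{label}.keylist"


print(keylist_for("my-dataset-raw.filelist"))          # my-dataset.keylist
print(keylist_for("all-mydet-mymeas-tier2.filelist"))  # all-mydet-mymeas.keylist
```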
Usually, the main output will be determined by a file-list, or by a key-list
and data tier. The special output target {label}-{tier}.gen is used to
generate all files listed in {label}-{tier}.filelist. After the files
are created, the empty file {label}-{tier}.gen will be created to
mark the successful data production.
Snakemake targets like all-{experiment}-{period}-{tier}.gen may be used
to automatically generate key-lists and file-lists (if not already present)
and produce all possible output for the given data tier, based on available
tier0 files which match the target.
Example:
snakemake -j`nproc` --configfile=config.json all-mydet-mymeas-tier2.gen
Targets like my-dataset-raw.gen (derived from a key-list
my-dataset.keylist) are of course allowed as well.
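Because a file-list simply names one output file per line, it can also be used to check by hand which outputs are still missing, for instance after an interrupted run. The Python sketch below is not part of the dataflow: the function name missing_outputs and the demo file names are hypothetical, and it assumes that file-list entries are absolute paths or paths relative to the current working directory.

```python
import tempfile
from pathlib import Path


def missing_outputs(filelist: Path) -> list[str]:
    """Return the entries of a file-list that do not exist on disk yet."""
    entries = [ln.strip() for ln in filelist.read_text().splitlines() if ln.strip()]
    return [entry for entry in entries if not Path(entry).exists()]


# Demo in a temporary directory with one existing and one missing output file.
with tempfile.TemporaryDirectory() as tmp:
    done = Path(tmp) / "file_a.dat"   # hypothetical output that already exists
    todo = Path(tmp) / "file_b.dat"   # hypothetical output not yet produced
    done.touch()
    filelist = Path(tmp) / "my-dataset-raw.filelist"
    filelist.write_text(f"{done}\n{todo}\n")
    missing = missing_outputs(filelist)
    print(missing)
```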
Snakemake supports monitoring by connecting to a panoptes server.
Run (e.g.)
panoptes --port 5000
in the background to run a panoptes server instance, which comes with a GUI that can be accessed with a web browser on the specified port.
Then use the Snakemake option --wms-monitor to instruct Snakemake to push
progress information to the panoptes server:
snakemake --wms-monitor http://127.0.0.1:5000 [...]
This dataflow doesn't use Snakemake's internal Singularity support, but
instead supports Singularity containers via venv environments
for greater control.
To use this, the path to venv and the name of the environment must be set
in "config.json".
This is only relevant when running Snakemake outside of the software container, e.g. when using a batch system (see below). If Snakemake and the whole workflow are run inside a container instance, no container-related settings in "config.json" are required.
A template configuration to run the dataflow on an SGE batch system is
included in templates/snakemake-config.
Copy the configuration into "$HOME/.config/snakemake"
and adjust as
necessary (especially batch-queue selection, number of jobs, etc.).
You should then be able to run data production on the batch system via (e.g.):
snakemake --profile cluster-sge --jobs 20 --configfile=config.json all-l200-myper-dsp.gen