MarioBros is a Python module to configure Spotify-Luigi in a makefile-like manner.
Spotify-Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. The configuration files will called MarioFiles.
Master branch status:
The MarioFiles are structured in several section representing the various tasks.
An example of task definition is:
[task_name] target: source1 source2 source3 task_command
The task name is defined in the square brackets. Target and sources are divided by two points.
In the line below there is the task command.
In the task_command you can call the target name and the sources name with ${TARGET}
and ${SOURCES[i]}
where i
is an index or a python slice.
In the curly brackets you can execute python code with variables defined upper.
You can match parts of the target name with parts of the sources name with python regular expression syntax.
For example:
[task_name] (.*)-(.*)-(.*).txt: \1.txt \2.txt \3.txt task_command --output ${TARGET} --first-input ${SOURCES[0]} --other-inputs ${SOURCES[1:]}
This task match a target like first-second-third.txt
with 3 source files named first.txt
, second.txt
and third.txt
.
\1
represent the first match, \2
represent the second match and \3
the third one.
You can also define task variables and use them below. You have to define them below the task name as follow:
[task_name] variable1 = value1 variable2 = value2 target: source task_command -o ${TARGET} -i ${SOURCES} --par1 ${variable1} --par2 ${variable2}
The default section is the one defined, without specifying the name, at the top of the file. In this section there are definitions of the global variables and the definition of the default task. The default task is executed when no other task is requested:
global_variable1 = value1 global_variable2 = value2 default_target: source1 source2 default_task_command
At the top of the file you can include other MarioFiles with the path in this way:
include mariofile_path.ini
The global variables of the included MarioFile will be added at the top of the MarioFile. The tasks of the included MarioFiles will be added at the end of the MarioFile. If there are variables or tasks with the same names, only the top level one will be included.
Summing up, a typical MarioFile will look like this:
include included_mariofile.ini global_var1 = value1 global_var2 = value2 (.*)-main_target.out: \1-\1-task1 default_task -o ${SOURCES} -i ${TARGET} [task_1] RESOURCES_CPU = 4 (.*)-(.*)-task1: source1 source2 task_1 -o ${TARGET} -i ${SOURCES} -j ${RESOURCES_CPU}
The Mario command line is:
$ mario --help
Usage: mario [OPTIONS] [TARGETS]...
Options:
-f, --file, --mariofile PATH Main configuration file
-p, --port INTEGER Set `luigi.build` scheduler_port parameter.
--workers INTEGER Set the number of workers
--local-scheduler Run local scheduler.
--print-ns Print namespaces: Print the MarioFile with the
included tasks and variables
-n, --dry-run Don't actually run any commands; just print
them.
--help Show this message and exit.
With:
- All the mario options and arguments are optional.
- The default task request is
[DEFAULT]
.- The default mariofile is
mario.ini
.- With
--print-ns
flag it print the whole MarioFile with the included tasks and variables.- With
--dry-run
flag it doesn't actually run any commands; just print them.
First of all you have to run luigid
:
$ luigid
Then you can run mario
script with the command line described above:
$ mario
In this way mario will execute the default target with mario.ini as MarioFile. You can also request a specific target with a MarioFile different from mario.ini as follow:
$ mario -f my_mariofile.ini target.out
You can visualize a scheduler in localhost:8082 address. There will be a list of the tasks and a tree diagram of the processing.
If you don't need to visualize the scheduler you can run mario
without luigid
running and with the local scheduler as follow:
$ mario --local-scheduler
Luigi can store the statistic of the processing and can manage the resources.
You have to configure luigi writing client.cfg file and then launch luigid
from the client.cfg directory.
In client.cfg file you can configure the db path for the statistics. An example of client.cfg is:
[scheduler] record_task_history = True state_path = /path/to/luigi-state.pickle [task_history] db_connection = sqlite:////path/to/db/luigi-task-hist.db
You can also specify the required resources for the single tasks.
You have to define the available resources in the client.cfg file defining the [resources]
section as follow:
[resources] cpus = 64
You can specify required resources for the single tasks setting the local task variables RESOURCES_RESOURCE_NAME
where RESOURCE_NAME
is the name of the resource defined in the client.cfg, for example:
[task_name] RESOURCES_cpus = 4 target: sources task_command -j ${RESOURCES_cpus}
The request resource is 4
for cpus
.
install in the current python environment:
pip install mariobros