nmt-wizard

nmt-wizard is a Docker-based task launcher and monitor for a variety of remote platforms (called services) such as SSH servers, Torque clusters, or EC2 instances. Each service provides access to compute resources. The launcher is meant to be used with nmt-wizard-docker images, but does not strictly depend on them.

The project provides:

a RESTful server that queues incoming requests in a Redis database;
a client to the REST server providing a simple textual visualization interface;
workers that launch and manage tasks on the requested service and update their status.

Once launched, tasks send progress and status updates to the launcher.

Services configuration

Service configurations are provided by the system administrators. A service is declared and configured by a JSON file in the config directory. The REST server and worker automatically discover existing configuration files, provided that their filename ends with .json. A special default.json file defines parameters shared by all services.

The configuration file has the following structure:

{
    "name": "my-service",  // The short name the user will select.
    "description": "My service",  // Display name of the service.
    "module": "services.XXX",  // Name of the Python module managing the service.
    "variables": {
        "key1": [ "value1", "value2" ],
        ...
    },
    "docker": {
        "registries": {  // Docker registries: ECR, Docker Hub, private registry.
            "aws": {
                "type": "aws",
                "credentials": {
                    "AWS_ACCESS_KEY_ID": "XXXXX",
                    "AWS_SECRET_ACCESS_KEY": "XXXXX"
                },
                "uri": "XXXXX.dkr.ecr.eu-west-3.amazonaws.com",
                "region": "eu-west-3"
            },
            "dockerhub": {
                "type": "dockerhub",
                "uri": ""
            },
            "mydockerprivate": {
                "type": "dockerprivate",
                "uri": "",
                "credentials": {
                    "password": "XXXXX",
                    "username": "XXXXX"
                }
            }
        },
        "mount": [  // Volumes to mount when running the Docker image.
            "/home/devling/corpus:/root/corpus",
            "/home/devling/models:/root/models"
        ],
        "envvar": {  // Environment variables to set when running the Docker image.
        }
    },
    "skey1": "svalue1",  // Service specific configurations.
    ...,
    "disabled": 0,  // Boolean field (0 or 1) to disable/enable the service.
    "storages": {  // Storage configuration as described in single-training-docker.
    },
    "callback_url": "http://LAUNCHER_URL",
    "callback_interval": 60
}

Here, variables lists the possible options for the service. The structure of these options is specific to each service. The describe route transforms them into simple key/LIST,FIELDS entries so that a generic UI can offer a selection among multiple variants.

Template files are provided in config/templates and can be used as a basis for configuring services.
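
The discovery rule described above can be illustrated with a short sketch. It loads every .json file from a configuration directory and overlays each service file on default.json. The function name `load_service_configs`, the shallow merge, and the handling of `disabled` are assumptions made for illustration, not the project's actual loader. It also assumes the real files are strict JSON (the `//` comments in the structure above are annotations only).

```python
import json
from pathlib import Path


def load_service_configs(config_dir):
    """Return {service name: config} for every *.json file in config_dir.

    Hypothetical helper: values from default.json are used as a base and
    overridden key by key (shallow merge) by each service file.
    """
    config_dir = Path(config_dir)
    default_path = config_dir / "default.json"
    defaults = {}
    if default_path.is_file():
        defaults = json.loads(default_path.read_text())

    services = {}
    for path in sorted(config_dir.glob("*.json")):
        if path.name == "default.json":
            continue
        config = {**defaults, **json.loads(path.read_text())}
        # Assumption: a truthy "disabled" field means the service is skipped.
        if config.get("disabled"):
            continue
        services[config["name"]] = config
    return services
```

Files whose name does not end with .json are never read, which mirrors the filename rule stated above.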

Server configuration

The Redis database must be configured to enable keyspace event notifications as follows:

redis-cli config set notify-keyspace-events Klgx
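
Each character of the flag string in the command above enables one class of notifications, as documented by Redis. The small helper below spells out what `Klgx` turns on. The names `KEYSPACE_FLAGS` and `describe_flags` are chosen here for illustration and are not part of nmt-wizard.

```python
# Meaning of notify-keyspace-events characters, per the Redis documentation.
KEYSPACE_FLAGS = {
    "K": "keyspace events (published with the __keyspace@<db>__ prefix)",
    "E": "keyevent events (published with the __keyevent@<db>__ prefix)",
    "g": "generic commands such as DEL, EXPIRE, RENAME",
    "$": "string commands",
    "l": "list commands",
    "s": "set commands",
    "h": "hash commands",
    "z": "sorted set commands",
    "x": "expired events",
    "e": "evicted events",
}


def describe_flags(flags):
    """Return one description per flag character, in the order given."""
    return [KEYSPACE_FLAGS.get(char, "unknown flag " + repr(char)) for char in flags]


if __name__ == "__main__":
    for line in describe_flags("Klgx"):
        print(line)
```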

The REST server and worker are configured by settings.ini. The LAUNCHER_MODE environment variable (defaulting to Production) can be set to select a different set of options for development or production.
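
The layout of settings.ini is not detailed here. One common arrangement, assumed in the sketch below, is one section per mode, with LAUNCHER_MODE selecting the section. The option names `redis_host` and `debug` and the function `load_settings` are hypothetical.

```python
import configparser
import os

# Hypothetical file content: the section names follow the modes mentioned
# above, the option names are invented for the example.
EXAMPLE_INI = """
[DEFAULT]
redis_host = localhost

[Production]
debug = false

[Development]
debug = true
"""


def load_settings(ini_text, environ=None):
    """Return (mode, options) for the section selected by LAUNCHER_MODE."""
    environ = os.environ if environ is None else environ
    mode = environ.get("LAUNCHER_MODE", "Production")
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Options from [DEFAULT] are visible in every section.
    return mode, dict(parser[mode])
```

With this layout, running a component with `LAUNCHER_MODE=Development` would pick up the development options without editing the file.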

Using the launcher

Worker

The first component to launch is the worker, which should always be running. It handles:

the launch of tasks
the termination of tasks
the update of active resources

cd server && python worker.py

For performance, multiple workers can run simultaneously. In that case, a longer refresh interval should be defined.

Server

The server has the following HTTP routes:

list_services: returns available services
describe: returns user selectable options for the service
check: checks availability of a given service with provided user options
launch: launches a task on a given service with provided user options
status: checks the status of a task
list_tasks: returns the list of tasks in the database
del: deletes a task from the database
terminate: terminates the process and/or instance associated with a task
beat: provides a beat back to the launcher to notify the task activity and announce the next beat to expect
file: sets or returns a file associated to a task

The server uses Flask. See the Flask documentation to deploy it for production. For development, it can be run as follows (single thread):

cd app && FLASK_APP=main.py flask run [--host=0.0.0.0]

Here are the available routes. Also see the next section.

`GET /list_services`

Lists available services.

Arguments: None
Input: None
Output: A dictionary of service name to description (JSON)
Example:

$ curl -X GET 'http://127.0.0.1:5000/list_services'
{
  "demogpu02": "OVH extra training server",
  "ec2": "Instance on AWS EC2",
  "localhost": "test local environment",
  "ssaling04": "GPU training server"
}

`GET /describe/<service_name>`

Returns possible options for a service as a JSON Form. This can be used to easily implement a GUI to select options for the target service.

Arguments:
- service_name: the service name
Input: None
Output: A JSON form (or an empty dictionary if the service has no possible options).
Example:

$ curl -X GET 'http://127.0.0.1:5000/describe/ec2'
{
  "launchTemplate": {
    "description": "The name of the EC2 launch template to use",
    "enum": [
      "SingleTrainingDev"
    ],
    "title": "EC2 Launch Template",
    "type": "string"
  }
}
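
A client can use the form returned by describe to reject bad selections before calling the server. The sketch below checks a selection against the `enum` lists of a form shaped like the example output above. The helper name `invalid_options` is hypothetical, and only the `enum` constraint is handled.

```python
def invalid_options(form, selected):
    """Return {option: reason} for selections the form does not allow."""
    errors = {}
    for key, value in selected.items():
        if key not in form:
            errors[key] = "unknown option"
        elif "enum" in form[key] and value not in form[key]["enum"]:
            errors[key] = "value not in " + repr(form[key]["enum"])
    return errors


if __name__ == "__main__":
    form = {
        "launchTemplate": {
            "description": "The name of the EC2 launch template to use",
            "enum": ["SingleTrainingDev"],
            "title": "EC2 Launch Template",
            "type": "string",
        }
    }
    print(invalid_options(form, {"launchTemplate": "Other"}))
```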

`GET /check/<service_name>`

Checks if the service is available and can be used with the provided options. In case of success, it returns information about the service and the corresponding resource.

Arguments:
- service_name: the service name
Input: The selected service options (see describe/<service_name>) (JSON)
Output:
- On invalid option, an HTTP 400 code with the error message (JSON)
- On server error, an HTTP 500 code with the error message (JSON)
- On success, an optional message with details about the service (JSON)
Example:

$ curl -X GET http://127.0.0.1:5000/check/ec2
{
  "message": "missing launchTemplateName option"
}
$ curl -X GET -d '{"launchTemplateName": "InvalidLaunchTemplate"}' \
    -H "Content-Type: application/json" 'http://127.0.0.1:5000/check/ec2'
{
  "message": "An error occurred (InvalidLaunchTemplateId.NotFound) when calling the RunInstances operation: LaunchTemplate null not found"
}
$ curl -X GET -d '{"launchTemplateName": "SingleTrainingDev"}' \
    -H "Content-Type: application/json" 'http://127.0.0.1:5000/check/ec2'
{
  "message": ""
}
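
The curl calls above send a JSON body with a GET request. The same request can be sketched with the Python standard library. The function names `build_check_request` and `check_service` are hypothetical, and the sketch assumes the server answers with a JSON body on both success and error, as in the examples above.

```python
import json
import urllib.error
import urllib.request


def build_check_request(base_url, service_name, options=None):
    """Build a GET /check/<service_name> request with an optional JSON body."""
    url = "{}/check/{}".format(base_url.rstrip("/"), service_name)
    data = None
    headers = {}
    if options is not None:
        data = json.dumps(options).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # method="GET" keeps the verb even though a body is attached.
    return urllib.request.Request(url, data=data, headers=headers, method="GET")


def check_service(base_url, service_name, options=None):
    """Return (HTTP code, decoded JSON body) for a check call."""
    request = build_check_request(base_url, service_name, options)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        # 400 and 500 answers carry the error message in their body.
        return error.code, json.loads(error.read().decode("utf-8"))
```

For instance, `check_service("http://127.0.0.1:5000", "ec2", {"launchTemplateName": "SingleTrainingDev"})` would correspond to the last curl call above.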

`POST /launch/<service_name>`

Launches a Docker-based task on the specified service. In case of success, it returns a task identifier that can be used to monitor the task using the status or terminate routes.

Arguments:
- service_name: the service name
Input: either a simple JSON body, or a multi-part request whose content field contains the JSON task configuration. The other fields of a multi-part request are binary files to upload to the remote service at task-launch time.

The task configuration (JSON):

$ cat body.json
{
  "docker": {
    "registry": "dockerhub",
    "image": "opennmt/opennmt-lua",
    "tag": "latest",
    "command": [
      ...
    ]
  },
  "wait_after_launch": 2,
  "trainer_id": "OpenNMT",
  "options": {
    "launchTemplateName": "SingleTrainingDev"
  }
}

docker.tag and wait_after_launch are optional.
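
A client may want to check a task configuration locally before posting it. The sketch below reports missing fields using dotted paths. Treating every field of the example other than docker.tag and wait_after_launch as required is an assumption drawn from the sentence above. The server's actual validation rules are not described here, and the names `REQUIRED_FIELDS` and `missing_fields` are hypothetical.

```python
# Assumption: everything shown in body.json except the two optional
# fields is required by the server.
REQUIRED_FIELDS = (
    "docker.registry",
    "docker.image",
    "docker.command",
    "trainer_id",
    "options",
)


def missing_fields(config, required=REQUIRED_FIELDS):
    """Return the dotted paths from `required` that are absent in config."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing
```

An empty result means the configuration has the same shape as the example; it does not guarantee that the server will accept it.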

Output:
- On invalid task configuration, an HTTP 400 code with an error message (JSON)
- On success, a task identifier (string)
Example:

$ curl -X POST -d @invalid_body.json -H "Content-Type: application/json" \
    http://127.0.0.1:5000/launch/ec2
{
  "message": "missing trainer_id field"
}
$ curl -X POST -d @body.json -H "Content-Type: application/json" \
    'http://127.0.0.1:5000/launch/ec2'
"130d4400-9aad-4654-b124-d258cbe4b1e3"
$ curl -X POST -F content=@body.json -F input.txt=@input.txt 'http://127.0.0.1:5000/launch/ec2'
"1f877e53-5a25-44de-b115-7f6d3e386e70"

`GET /list_tasks/<pattern>`

Lists the tasks matching a pattern.

Arguments:
- pattern: pattern for the tasks to match. See KEYS pattern for syntax.
Input: None
Output: A list of tasks matching the pattern with minimal information (task_id, queued_time, status, service, message)
Example:

$ curl -X GET 'http://127.0.0.1:5000/list_tasks/jean_*'
[
  {
    "message": "completed", 
    "queued_time": "1519652594.957615", 
    "status": "stopped",
    "service": "ec2",
    "task_id": "jean_5af69495-3304-4118-bd6c-37d0e6"
  }, 
  {
    "message": "error", 
    "queued_time": "1519652097.672299", 
    "status": "stopped",
    "service": "mysshgpu", 
    "task_id": "jean_99b822bc-51ac-4049-ba39-980541"
  }
]
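
The list returned by this route is easy to post-process on the client side. The sketch below sorts tasks by queue time and renders the timestamp in a readable form. It assumes queued_time is a Unix timestamp in seconds, which matches the magnitude of the values above but is not stated explicitly. The helper name `summarize_tasks` is hypothetical.

```python
from datetime import datetime, timezone


def summarize_tasks(tasks):
    """Return (task_id, service, status, message, queued at) rows, oldest first."""
    rows = []
    for task in sorted(tasks, key=lambda t: float(t["queued_time"])):
        # Assumption: queued_time counts seconds since the Unix epoch.
        queued = datetime.fromtimestamp(float(task["queued_time"]), tz=timezone.utc)
        rows.append((
            task["task_id"],
            task["service"],
            task["status"],
            task["message"],
            queued.strftime("%Y-%m-%d %H:%M:%S"),
        ))
    return rows
```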

`GET /del_tasks/<pattern>`

Deletes the stopped tasks matching a pattern.

Arguments:
- pattern: pattern for the tasks to match - only stopped tasks will be deleted. See KEYS pattern for syntax.
Input: None
Output: list of deleted tasks
Example:

$ curl -X GET 'http://127.0.0.1:5000/del_tasks/jean_*'
[
  "jean_5af69495-3304-4118-bd6c-37d0e6",
  "jean_99b822bc-51ac-4049-ba39-980541"
]

`GET /status/<task_id>`

Returns the status of a task.

Arguments:
- task_id: the task ID returned by /launch/<service_name>
Input: None
Output:
- On invalid task_id, an HTTP 404 code with an error message (JSON)
- On success, a dictionary with the task status (JSON)
Example:

$ curl -X GET http://127.0.0.1:5000/status/unknown-task-id
{
  "message": "task unknown-task-id unknown"
}
$ curl -X GET http://127.0.0.1:5000/status/130d4400-9aad-4654-b124-d258cbe4b1e3
{
  "allocated_time": "1519148201.9924579",
  "content": "{\"docker\": {\"command\": [], \"registry\": \"dockerhub\", \"image\": \"opennmt/opennmt-lua\", \"tag\": \"latest\"}, \"service\": \"ec2\", \"wait_after_launch\": 2, \"trainer_id\": \"OpenNMT\", \"options\": {\"launchTemplateName\": \"SingleTrainingDev\"}}", 
  "message": "unknown registry",
  "queued_time": "1519148144.483396",
  "resource": "SingleTrainingDev",
  "service": "ec2",
  "status": "stopped",
  "stopped_time": "1519148201.9977396",
  "ttl": null
}

(Here the task was quickly stopped due to an incorrect Docker registry.)

The main fields are:

status: (timestamp for each status can be found in <status>_time)
- queued,
- allocated,
- running,
- terminating,
- stopped (additional information can be found in message field);
service: name of the service the task is running on;
resource: name of the resource the task is using;
content: the actual task definition;
update_time: present if the task is sending beat requests;
ttl: present if a time to live was passed in the beat request.
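
The status values and their `<status>_time` fields lend themselves to a simple polling loop. The sketch below waits until a task reaches `stopped` and extracts the timestamps already recorded. `fetch_status` stands for any callable returning the decoded JSON of the status route. That callable, the helper names, and the polling defaults are assumptions for illustration.

```python
import time

# Status values listed above, in the order a task goes through them.
STATUSES = ("queued", "allocated", "running", "terminating", "stopped")


def status_times(info):
    """Map each status already reached to its <status>_time timestamp."""
    times = {}
    for status in STATUSES:
        value = info.get(status + "_time")
        if value is not None:
            times[status] = float(value)
    return times


def wait_until_stopped(fetch_status, task_id, interval=5.0, max_polls=60,
                       sleep=time.sleep):
    """Poll fetch_status(task_id) until the task status is "stopped"."""
    for _ in range(max_polls):
        info = fetch_status(task_id)
        if info["status"] == "stopped":
            return info
        sleep(interval)
    raise TimeoutError(
        "task {} not stopped after {} polls".format(task_id, max_polls))
```

Once the loop returns, the message field tells whether the task completed or failed, as in the example above.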

`GET /terminate/<task_id>(?phase=status)`

Terminates a task. If the task is already stopped, it does nothing. Otherwise, it changes the status of the task to terminating (actual termination is asynchronous) and returns a success message.

Arguments:
- task_id: the task identifier returned by /launch/<service_name>
- (optional) phase: indicates whether the termination corresponds to an error or to natural completion (completed)
Input: None
Output:
- On invalid task_id, an HTTP 404 code with an error message (JSON)
- On success, an HTTP 200 code with a message (JSON)
Example:

$ curl -X GET http://127.0.0.1:5000/terminate/130d4400-9aad-4654-b124-d258cbe4b1e3
{
  "message": "130d4400-9aad-4654-b124-d258cbe4b1e3 already stopped"
}

`GET /del/<task_id>`

Deletes a task. If the task is not stopped, it does nothing.

Arguments:
- task_id: the task identifier returned by /launch/<service_name>
Input: None
Output:
- On invalid task_id, an HTTP 404 code with an error message (JSON)
- On success, an HTTP 200 code with a message (JSON)

`GET /beat/<task_id>(?duration=XXX&container_id=CID)`

Notifies a beat back to the launcher. Tasks should invoke this route at a regular interval to signal that they are still alive and working. This makes it easier for the launcher to identify and handle dead tasks.

Arguments:
- task_id: the task identifier returned by /launch/<service_name>
- (optional) duration: if no beat is received for this task after this duration the task is assumed to be dead
- (optional) container_id: the ID of the Docker container
Input: None
Output:
- On invalid duration, an HTTP 400 code with an error message (JSON)
- On invalid task_id, an HTTP 404 code with an error message (JSON)
- On success, an HTTP 200 code
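
A task that reports its own activity only needs to build the beat URL and request it periodically. The sketch below covers the URL construction with the two optional query parameters. The helper name `beat_url` is hypothetical, and the scheduling of the calls is left to the caller.

```python
from urllib.parse import quote, urlencode


def beat_url(base_url, task_id, duration=None, container_id=None):
    """Build the /beat/<task_id> URL, adding only the parameters provided."""
    params = {}
    if duration is not None:
        params["duration"] = duration
    if container_id is not None:
        params["container_id"] = container_id
    url = "{}/beat/{}".format(base_url.rstrip("/"), quote(task_id, safe=""))
    if params:
        url += "?" + urlencode(params)
    return url
```

A caller would typically pass a duration somewhat longer than its own beat interval, so that one delayed beat does not mark the task as dead.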

`POST /file/<task_id>/<filename>`

Registers a file for a task. This is typically used for logs or for posting translation output using HTTP storage.

Arguments:
- task_id: the task identifier returned by /launch/<service_name>
- filename: a filename
Input: None
Output:
- On invalid task_id, an HTTP 404 code with an error message (JSON)
- On success, an HTTP 200 code

`GET /file/<task_id>/<filename>`

Retrieves a file attached to a task.

Arguments:
- task_id: the task identifier returned by /launch/<service_name>
- filename: a filename
Input: None
Output:
- On invalid task_id, an HTTP 404 code with an error message (JSON)
- On a missing file, an HTTP 404 code with an error message (JSON)
- On success, the actual file

Launcher

The launcher is a simple client to the REST server. See:

python client/launcher.py -h

Notes:

The address of the launcher REST service is provided either by the LAUNCHER_URL environment variable or by the command line parameter -u URL.
By default, responses are formatted as a text table for better readability; the -j option displays the raw JSON response.
The trainer_id field of the launch command comes either from the --trainer_id option or from the LAUNCHER_TID environment variable. The same environment variable also provides the default value of the prefix parameter of the lt command.
By default, command parameters are expected as inline values. They can also be read from a file, in which case the corresponding option takes the value @FILEPATH.
Files identified as local files are transferred to the launcher using TMP_DIR on the remote server.
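
The first and fourth notes can be illustrated with a small argparse sketch. It is not the code of client/launcher.py. The long option names `--url` and `--json`, the precedence of -u over LAUNCHER_URL, and the helper names are assumptions.

```python
import argparse
import os


def parse_launcher_args(argv, environ=None):
    """Parse -u and -j, falling back to LAUNCHER_URL for the address."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Assumption: the command line value wins over the environment variable.
    parser.add_argument("-u", "--url", default=environ.get("LAUNCHER_URL"))
    parser.add_argument("-j", "--json", action="store_true")
    args = parser.parse_args(argv)
    if args.url is None:
        parser.error("missing launcher URL: set LAUNCHER_URL or pass -u URL")
    return args


def resolve_value(value):
    """Return the content of FILEPATH for '@FILEPATH', else the value itself."""
    if value.startswith("@"):
        with open(value[1:]) as handle:
            return handle.read()
    return value
```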

Development

Redis database

The Redis database contains the following fields:

Field	Type	Description
`active`	list	Active tasks
`beat:<task_id>`	int	Specific ttl-key for a given task
`lock:<resource...,task:…>`	value	Temporary lock on a resource or task
`queued:<service>`	list	Tasks waiting for a resource
`resource:<service>:<resourceid>`	list	Tasks using this resource
`task:<taskid>`	dict	status: [queued, allocated, running, terminating, stopped]; job: JSON of jobid (if status>=waiting); service: the name of the service; resource: the name of the resource, or auto before allocating one; message: error message (if any), 'completed' if successfully finished; container_id: container in which the task runs, sent back by the Docker notifier (queued
`files:<task_id>`	dict	Files associated to a task; "log" is generated when training is complete
`queue:<task_id>`	str	Expirable timestamp on the task, used to regularly check status
`work`	list	Tasks to process
