Add experiments watcher (webserver)
Cadene opened this issue · 0 comments
Goals
Display a list of experiments on a webpage to easily check their status.
This webpage must be frequently updated, must be easy to customize, and must scale to 1000 experiments.
We would like to improve over this POC:

Propositions
Run the webserver
Using default options:
python -m bootstrap.watch --exp.dirs logs/myproject
Using custom options:
python -m bootstrap.watch -o myproject/options/watch.yaml
The YAML file myproject/options/watch.yaml is used to generate a custom webpage. It lists the experiment directories to watch, the filtering rules that select which experiments to display, and the columns to display.
Example:
exp:
  dirs: logs/myproject
filters:
  - table: options
    column: exp.dir
    rule: mnist_resnet_*
  - SQL: SELECT MAX(accuracy) FROM test_epoch WHERE accuracy > 0.3
columns:
  - name: exp.dir
    table: options
    column: exp.dir
  - name: accuracy
    SQL: SELECT MAX(accuracy) FROM test_epoch WHERE accuracy > 0.3
  - name: status
    type: status
ranking:
  - name: status
    order: [crashed, ended, running]
  - name: accuracy
    order: asc
Design of webpage
Core features
As in the POC, the page starts with a header showing experiment statistics, followed by the custom table.
Example:
Ended: 10 | Crashed: 3 | Running: 40 (| Pending: 150)
server | exp.dir | accuracy | status
--- | --- | --- | ---
pascal[3] | lolilol | 0.19 | crashed
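A minimal sketch of how the statistics header and the row order could be computed, assuming each experiment has already been reduced to a dictionary with a `status` and an `accuracy`. The helper names `header_line` and `rank` and the sample rows are hypothetical; the status order and the ascending accuracy follow the `ranking` section of the YAML example.

```python
from collections import Counter

# Hypothetical rows, shaped like the table above (values are illustrative).
rows = [
    {"exp.dir": "exp_a", "accuracy": 0.42, "status": "running"},
    {"exp.dir": "exp_b", "accuracy": 0.19, "status": "crashed"},
    {"exp.dir": "exp_c", "accuracy": 0.35, "status": "ended"},
    {"exp.dir": "exp_d", "accuracy": 0.27, "status": "running"},
]


def header_line(rows, statuses=("ended", "crashed", "running")):
    """Build the 'Ended: N | Crashed: N | Running: N' summary line."""
    counts = Counter(row["status"] for row in rows)
    return " | ".join(f"{s.capitalize()}: {counts[s]}" for s in statuses)


def rank(rows, status_order=("crashed", "ended", "running")):
    """Sort by status (in the given order), then by ascending accuracy."""
    position = {status: i for i, status in enumerate(status_order)}
    return sorted(rows, key=lambda r: (position[r["status"]], r["accuracy"]))


print(header_line(rows))
for row in rank(rows):
    print(row["exp.dir"], row["accuracy"], row["status"])
```

Counting with a `Counter` keeps statuses that have no experiments at zero, so the header always shows the same fields.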
Optional
The header could also list the filtering rules and the columns to display, so that they can be removed or added dynamically. A button could then export the current configuration as a YAML file or update the original YAML file.
Filtering rules
Drop-down panel containing filtering rules, which are based on SQLite data (select('options'), select('env_info'), select('train_epoch'), select('test_epoch'), or a custom SQL query).
Example:
List of positive filters:
- [table: options] [column: exp.dir] [rule: mnist_resnet_*]
- [table: env_info] [column: nodename] [rule: pascal & titan]
- [SQL: SELECT MAX(accuracy) FROM test_epoch WHERE accuracy > 0.3]
- [type: status] [rule: crashed]
- [info: end_datetime] [rule: >2020-05-22 10:00:00]
Display experiments whose directory matches the pattern mnist_resnet_*, that were trained on the pascal and titan servers, that have an accuracy higher than 0.3, and that crashed after the given datetime.
We could have a list of negative filters as well, corresponding to rules to remove experiments from the list.
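The positive and negative filters could be applied roughly as follows. This is only a sketch under several assumptions: the `rule` is treated as a shell-style pattern (matched with `fnmatch`), an SQL filter keeps an experiment when its query returns a non-NULL value, and each experiment is modelled as a dictionary holding its options and an in-memory SQLite connection. The names `make_db`, `passes` and `apply_filters` are hypothetical.

```python
import sqlite3
from fnmatch import fnmatch


def make_db(accuracies):
    """Create an in-memory stand-in for one experiment's SQLite file."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE test_epoch (epoch INTEGER, accuracy REAL)")
    db.executemany("INSERT INTO test_epoch VALUES (?, ?)", enumerate(accuracies))
    return db


def passes(exp, filt):
    """Return True if the experiment satisfies a single filter."""
    if "SQL" in filt:
        # Assumption: a NULL result (no matching row) rejects the experiment.
        return exp["db"].execute(filt["SQL"]).fetchone()[0] is not None
    # Assumption: 'rule' is a shell-style pattern, not a full regexp.
    return fnmatch(str(exp[filt["table"]][filt["column"]]), filt["rule"])


def apply_filters(experiments, positive=(), negative=()):
    """Keep experiments matching all positive and no negative filters."""
    return [
        exp for exp in experiments
        if all(passes(exp, f) for f in positive)
        and not any(passes(exp, f) for f in negative)
    ]


experiments = [
    {"options": {"exp.dir": "mnist_resnet_18"}, "db": make_db([0.2, 0.5])},
    {"options": {"exp.dir": "mnist_resnet_34"}, "db": make_db([0.1, 0.2])},
    {"options": {"exp.dir": "cifar_vgg"}, "db": make_db([0.9])},
]
positive = [
    {"table": "options", "column": "exp.dir", "rule": "mnist_resnet_*"},
    {"SQL": "SELECT MAX(accuracy) FROM test_epoch WHERE accuracy > 0.3"},
]

kept = apply_filters(experiments, positive)
print([exp["options"]["exp.dir"] for exp in kept])
```

Passing the same rule as a negative filter instead, for example `apply_filters(experiments, negative=positive[:1])`, removes the matching experiments from the list rather than selecting them.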
Columns to display
The same interface is used to select the columns to display.
By default: server and GPU IDs, experiment directory, number of epochs done / total number of epochs, creation datetime, status.
Example:
... (default)
- [SQL: SELECT MAX(accuracy) FROM test_epoch]
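As an illustration, one table row could be built from the column specifications like this. The sketch assumes a column is either an SQL query run against the experiment's SQLite data or a lookup into a named table, here modelled as a plain dictionary. The function `build_row`, the `test_epoch` schema and the sample values are hypothetical.

```python
import sqlite3


def build_row(tables, db, columns):
    """Evaluate each column spec and return one table row as a dict."""
    row = {}
    for col in columns:
        # Fall back to the query text when a column has no explicit name.
        name = col.get("name", col.get("SQL"))
        if "SQL" in col:
            row[name] = db.execute(col["SQL"]).fetchone()[0]
        else:
            row[name] = tables[col["table"]][col["column"]]
    return row


# In-memory stand-in for one experiment's SQLite file (assumed schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE test_epoch (epoch INTEGER, accuracy REAL)")
db.executemany("INSERT INTO test_epoch VALUES (?, ?)",
               [(0, 0.21), (1, 0.48), (2, 0.45)])

tables = {"options": {"exp.dir": "mnist_resnet_18"}}
columns = [
    {"name": "exp.dir", "table": "options", "column": "exp.dir"},
    {"name": "accuracy", "SQL": "SELECT MAX(accuracy) FROM test_epoch"},
]

row = build_row(tables, db, columns)
print(" | ".join(str(row[col["name"]]) for col in columns))
```

Keeping the column specs as data means the same function can serve both the defaults and any column added from the YAML file or from the page.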
Implementation
bootstrap/watch.py
bootstrap/watch/css
bootstrap/watch/js
bootstrap/watch/index.html
Use Werkzeug to create the web server (see the Shortly example).
Send all the options to the webpage as JSON in a POST request.
Use plain JavaScript (no ReactJS) to read these options, look for experiment directories, send SQL queries to the SQLite files, and update the HTML. Every x seconds, refresh the list of experiments and, if needed, the experiments themselves.
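A rough sketch of the server side of this exchange. It is written as a plain WSGI callable rather than with Werkzeug's request and response wrappers so that it runs with the standard library alone; Werkzeug's `run_simple` accepts any WSGI application. The `/options` path, the `refresh_seconds` key and the `call` helper are assumptions made for illustration, not part of the proposal.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical options, as they might be loaded from the YAML file.
OPTIONS = {"exp": {"dirs": "logs/myproject"}, "refresh_seconds": 10}


def app(environ, start_response):
    """WSGI app: answer POST /options with the options as JSON."""
    if environ["REQUEST_METHOD"] == "POST" and environ["PATH_INFO"] == "/options":
        status = "200 OK"
        body = json.dumps(OPTIONS).encode("utf-8")
    else:
        status = "404 Not Found"
        body = json.dumps({"error": "not found"}).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(method, path):
    """Invoke the app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = method
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


print(call("POST", "/options"))
```

To serve it, something like `from werkzeug.serving import run_simple` followed by `run_simple("localhost", 8000, app)` should work; the page's JavaScript would then fetch `/options` and poll every `refresh_seconds`.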