lisa-lab/pylearn2

[enhancement] range of values in yaml files

Closed this issue · 2 comments

TNick commented

The things discussed in this issue are implemented in my fork, in train_multi branch:

git clone https://github.com/TNick/pylearn2.git train_multi
cd train_multi
git checkout train_multi

This builds on the proxy model used by yaml_parse. The load() function creates a proxy tree that is later instantiated by a call to the (protected?) function _instantiate().
scripts/train.py loads the yaml file through this mechanism.

I propose enhancing the yaml syntax with a !range "${VARIABLE:start,end,intervals}" that would allow constructs like:

        learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
            init_momentum: !range "${INIT_MOMENTUM:0.05,0.50,10}",
            nesterov_momentum: False
        },

When scripts/train.py is run against a yaml file like this, a default value can be defined (the first value in the interval, or a random value in the interval), or the syntax could be extended to accept a fourth argument (this is not implemented right now). When the new scripts/train_multi.py (implemented here) is run, the same test can be repeated a number of times, each time with a different parameter value from that interval.
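
To make the proposed spec concrete, here is a minimal sketch of how a `"${VARIABLE:start,end,intervals}"` string could be parsed and expanded. It is not the code from the fork: `parse_range` and `range_values` are hypothetical names, and the sketch assumes that `intervals` counts equal sub-intervals, so the grid has `intervals + 1` points with both endpoints included.

```python
import re

# Hypothetical pattern for the proposed "${VARIABLE:start,end,intervals}" spec.
_RANGE_RE = re.compile(
    r"^\$\{(?P<name>\w+):(?P<start>[^,]+),(?P<end>[^,]+),(?P<count>\d+)\}$")


def parse_range(spec):
    """Return (name, start, end, intervals) for a range spec string."""
    match = _RANGE_RE.match(spec.strip())
    if match is None:
        raise ValueError("not a range spec: %r" % (spec,))
    return (match.group("name"),
            float(match.group("start")),
            float(match.group("end")),
            int(match.group("count")))


def range_values(start, end, intervals):
    """Evenly spaced values from start to end, endpoints included.

    Assumption: `intervals` is the number of equal sub-intervals,
    so intervals + 1 values are returned.
    """
    if intervals < 1:
        raise ValueError("intervals must be at least 1")
    step = (end - start) / float(intervals)
    return [start + i * step for i in range(intervals + 1)]


name, start, end, intervals = parse_range("${INIT_MOMENTUM:0.05,0.50,10}")
values = range_values(start, end, intervals)
```

Under this reading, the first element of `values` is the natural default for a plain scripts/train.py run.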

Each run has a tag generated for it (either time based or uuid1 based) that is used to identify the run in the report and to name the generated files: the resulting model, the monitor, and the list of parameters (the variables in the environment, saved to a text file).
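
The two tag styles can be sketched with the standard library alone. `make_tag` is a hypothetical helper, and the timestamp format is an assumption, not the one used in the fork.

```python
import time
import uuid


def make_tag(gen_tag="time"):
    """Return a tag string identifying one run.

    'uuid' uses uuid.uuid1(); 'time' uses the current local time.
    The timestamp layout below is an assumed format.
    """
    if gen_tag == "uuid":
        return uuid.uuid1().hex
    if gen_tag == "time":
        return time.strftime("%Y%m%d_%H%M%S")
    raise ValueError("unknown gen_tag: %r" % (gen_tag,))
```

A time-based tag with one-second resolution can collide when two runs start within the same second, which is one reason to prefer the uuid variant for short experiments.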

After a run, either a single range variable or all range variables may be changed. Values for range variables may be randomly generated or they may be generated in ascending order. The maximum number of runs for a file may be limited or not. The settings are defined inside a !multiseq like this:

multiseq: !multiseq: {
    # [optional, default is once] update mode can be:
    # - one: a single value is changed, then a test is performed
    #        then another (or same) value is changed and so on
    # - all: all values are changed, then a test is performed
    # - once: a single experiment is run (useful with gen_val: 'random')
    grouping: 'one',

    # [optional, default is ordered] how to generate new values
    # - ordered: increment the value on each iteration
    # - random: pick a random value
    gen_val: 'ordered',

    # [optional, default is time] how to generate a unique tag for each run
    # - uuid: obvious
    # - time: use current time
    # The tag is available in ${MULTISEQ_TAG}
    gen_tag: 'uuid',

    # [optional, default is -1] force termination after this many iterations
    # -1 is a special value that means run forever.
    # ${MULTISEQ_ITER} is available as current, 0 based, iteration.
    max_count: 120,

    # [optional] we can generate names here that will be later available
    # as normal environment variables
    names: {
        # assign a different name for uuid/time based tag
        TAG: "${MULTISEQ_TAG}",

        # used below to save the final network
        SAVE_PATH: "${SAVE_BASE}/${MULTISEQ_TAG}.pkl",

        # used below to save the monitoring result
        MONITOR_FILE: "${SAVE_BASE}/${MULTISEQ_TAG}_monitor.pkl",

        # if defined, the parameters are saved inside here for each iteration
        PARAMETERS_FILE: "${SAVE_BASE}/${MULTISEQ_TAG}_parameters.txt",

        # a short report file will be generated if this variable is set
        PYLEARN2_REPORT: "${SAVE_BASE}/report.txt"
    }
},

For this example the SAVE_BASE environment variable needs to be defined.
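
The interaction of grouping, gen_val and max_count can be sketched as a generator that yields one set of values per run. This is an illustration, not the fork's implementation: `iter_assignments` is a hypothetical name, and the details (ordered values wrap around, 'one' with 'ordered' visits variables round-robin in name order) are assumptions.

```python
import random


def iter_assignments(ranges, grouping="once", gen_val="ordered",
                     max_count=-1, rng=None):
    """Yield one {name: value} dict per run.

    `ranges` maps each variable name to its non-empty list of candidate
    values. max_count == -1 means no limit (except for grouping 'once').
    """
    if grouping not in ("one", "all", "once"):
        raise ValueError("unknown grouping: %r" % (grouping,))
    if gen_val not in ("ordered", "random"):
        raise ValueError("unknown gen_val: %r" % (gen_val,))
    rng = rng or random.Random()
    names = sorted(ranges)
    index = dict((name, 0) for name in names)  # current position per variable

    def advance(name):
        count = len(ranges[name])
        if gen_val == "random":
            index[name] = rng.randrange(count)
        else:
            index[name] = (index[name] + 1) % count  # assumed wrap-around

    if gen_val == "random":
        for name in names:  # random starting point
            advance(name)

    iteration = 0
    while max_count < 0 or iteration < max_count:
        yield dict((name, ranges[name][index[name]]) for name in names)
        iteration += 1
        if grouping == "once":
            return
        if grouping == "all":
            for name in names:
                advance(name)
        else:  # 'one': change a single variable before the next run
            if gen_val == "random":
                advance(rng.choice(names))
            else:
                advance(names[(iteration - 1) % len(names)])


runs = list(iter_assignments({"A": [1, 2, 3], "B": [10, 20]},
                             grouping="one", max_count=4))
```

Each yielded dict would be exported to the environment before the proxy tree is instantiated for that run.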


Things to improve

Right now a .yaml file must have this structure:

{
  multiseq: !multiseq: {
      # ...
  },
  train: !obj:pylearn2.train.Train {
      # ...
  }
}

That is, a dictionary with two keys: multiseq and train. All the values in multiseq are optional, so dropping the requirement for that key should not be a problem. Parameters could be passed to train_multi.py.

The interval only allows discrete values. The !range "${VARIABLE:start,end}" could be used to indicate a continuous interval.
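
One possible reading of the two-argument form, sketched below: with no `intervals` the value is drawn uniformly from the closed interval, otherwise from the discrete grid. `sample_range` is a hypothetical helper and the grid convention (`intervals + 1` points) is an assumption.

```python
import random


def sample_range(start, end, intervals=None, rng=random):
    """Draw one value from [start, end].

    intervals=None -> continuous, uniform draw.
    intervals=n    -> one of the n + 1 evenly spaced grid points (assumed).
    """
    if intervals is None:
        return rng.uniform(start, end)
    step = (end - start) / float(intervals)
    return start + rng.randint(0, intervals) * step
```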

A variable is always generated right now. !range "${:start,end}" could be a way to generate anonymous ranges.

Generate a proper yaml file from the parameters used on each run instead of the environment dump done right now.

Internal variables are saved in the environment. I tried to avoid that so that the environment remains clean afterwards but failed to do so.

I'm not satisfied with the names for various components. Will rename at some point. MULTISEQ_TAG and others with predefined meaning should be preceded by PYLEARN2_.

If an exception happens during a run, it is logged in the report. That looks nasty for multi-line exception strings. Also, the names of the columns should be full words.


Contribute ideas

Please share your thoughts. The current form is enough for my purposes, so if you think some form of it would be useful to merge into the project, I could put some time into making it happen.

I feel like this introduces too much special-case naming and syntax for a relatively simple feature. I also don't think we should make the YAML parser such a central feature.

You can implement the same thing with a fairly simple script that doesn't involve any modification to the main parser.
https://github.com/goodfeli/forgetting/tree/master/experiments/random_search_dropout_relu_mnist
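
A minimal sketch of that kind of script, not taken from the linked repository: hyperparameters are sampled in plain Python and substituted into a YAML template with ordinary string formatting, so the parser is untouched. `TEMPLATE`, `sample_params` and `render_jobs` are hypothetical names, and the template is only the fragment from the example above, not a complete experiment.

```python
import random

# Hypothetical template fragment; a real one would hold the whole experiment.
TEMPLATE = """\
learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
    init_momentum: %(init_momentum).4f,
    nesterov_momentum: False
},
"""


def sample_params(rng):
    """Draw one random set of hyperparameters (the range is an example)."""
    return {"init_momentum": rng.uniform(0.05, 0.50)}


def render_jobs(template, num_jobs, seed=0):
    """Return a list of (job_id, params, yaml_text) tuples."""
    rng = random.Random(seed)
    jobs = []
    for job_id in range(num_jobs):
        params = sample_params(rng)
        jobs.append((job_id, params, template % params))
    return jobs


jobs = render_jobs(TEMPLATE, 3)
```

Each rendered string can be written to its own .yaml file and run with scripts/train.py as usual.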

TNick commented

Thanks for the input.
As there's no interest I'm closing this issue.