mir-evaluation/mir_eval

Restructure evaluators and io requirements

Closed this issue · 6 comments

Historically, mir_eval has required that each task have a data loader which reads data from some file type for use in that task. These data loaders live in the io submodule. For example, for beat detection, data has historically been formatted as an ASCII file with one beat location (in seconds) per line, and mir_eval.io.load_events implements loading this format. A few things prompted this requirement:

  1. Each task should have an "evaluator": a Python script, usable as a binary, which takes paths to reference and estimated annotation files and outputs scores for that task.
  2. The mir_eval web service requires that people can upload annotation files for their task in order to receive scores.
  3. The docstring of each evaluation function in each task should have an example for calling the function, including loading in data.
  4. The regression tests require parsing existing data files, and so including input/output functionality has allowed us to keep external test-only dependencies down.
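To make the historical convention concrete, here is a minimal sketch of an evaluator-style script that uses mir_eval.io.load_events to read two beat annotation files and print scores (the script and file arguments are illustrative, not part of mir_eval):

```python
#!/usr/bin/env python
"""Minimal beat "evaluator" sketch: load reference/estimated annotations, print scores."""
import sys

import mir_eval

# Each file is plain ASCII with one beat time (in seconds) per line,
# which is the format that mir_eval.io.load_events parses.
reference_beats = mir_eval.io.load_events(sys.argv[1])
estimated_beats = mir_eval.io.load_events(sys.argv[2])

# Compute all beat metrics at once and print them.
for metric, score in mir_eval.beat.evaluate(reference_beats, estimated_beats).items():
    print('{}: {:.3f}'.format(metric, score))
```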

In a recent survey of mir_eval users, the vast majority (97.4%) said they had used mir_eval by calling it directly from Python code. About 12.8% said they used the evaluators, and no respondent had used the web service. Similarly, 81.6% said they had used a separate data loading procedure; 26.3% said they had used functions in the io submodule, and 7.9% said data was loaded because they were using the evaluator.

Requiring that every task has a data loading function has some drawbacks.

  1. Some tasks don't have a pre-existing data loading format, so we've had to make one up. It seems silly to make up a format just to satisfy the requirements of mir_eval's conventions.
  2. In some cases, the format supported by mir_eval is not obviously the best format for the task. For example, for transcription, MIDI is a pretty nice format. One could argue that JAMS is better than all of the annotation formats we support.
  3. There already exist good (better?) Python data loaders than the ones we implemented. For example, librosa would work well for loading in wav files for source separation evaluation.
  4. Following the principle of "do one thing and do it well", mir_eval should implement evaluation functions, not implement data loading/parsing and a set of scripts for using mir_eval.

Given that the mir_eval web service is not used, the evaluators are sparsely used, and using the io submodule is much less common than using another input pipeline, I propose the following changes:

  1. Deprecate support for the mir_eval web service and move it to a separate repository. Treat it as a "proof of concept" instead of a supported project.
  2. Move the evaluators to a separate repository. Do not require that every task has an evaluator, but support the existing evaluators for the tasks which have them.
  3. Do not require that each task have a data loading function in io. In the future, only implement data loading functions which are genuinely useful because they don't already exist in other software and which are for widely used annotation formats (e.g. those used in MIREX).
  4. Because of the above point, do not require that each example in each docstring includes data loading. Only include an example when it is useful for seeing how data should be loaded or preprocessed (for example, in the beat submodule where beats should be trimmed before being passed to evaluation functions).
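To illustrate the kind of preprocessing referred to in point 4, here is a rough sketch using mir_eval.beat.trim_beats (the beat times below are made up):

```python
import numpy as np

import mir_eval

# Made-up beat times in seconds; the first few beats often fall in an
# ambiguous count-in region, which is why trimming is recommended.
reference_beats = np.array([0.5, 1.0, 1.5, 2.0, 5.5, 6.0, 6.5, 7.0])
estimated_beats = np.array([0.52, 1.01, 1.49, 2.02, 5.51, 6.02, 6.49, 7.01])

# trim_beats discards beats before min_beat_time (5 seconds by default).
reference_beats = mir_eval.beat.trim_beats(reference_beats)
estimated_beats = mir_eval.beat.trim_beats(estimated_beats)

f_measure = mir_eval.beat.f_measure(reference_beats, estimated_beats)
```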

In order to accommodate the new repositories, I propose that we move http://github.com/craffel/mir_eval to http://github.com/mir_eval/mir_eval and create repositories http://github.com/mir_eval/web_service and http://github.com/mir_eval/evaluators for the web service and evaluators respectively.

If there are no major objections, I will start implementing these moves and changes next week.

Some tasks don't have a pre-existing data loading format, so we've had to make one up. It seems silly to make up a format just to satisfy the requirements of mir_eval's conventions.

👍

In some cases, the format supported by mir_eval is not obviously the best format for the task. For example, for transcription, MIDI is a pretty nice format. One could argue that JAMS is better than all of the annotation formats we support.

๐Ÿ‘ to both, though I don't think we want mir_eval to require jams. (JAMS currently requires mir_eval, so that's a non-starter.) I think generally having a clearly defined API for each evaluation / task is the best we can do, and I support getting out of the IO game entirely.

There already exist good (better?) Python data loaders than the ones we implemented. For example, librosa would work well for loading in wav files for source separation evaluation.

I would recommend pysoundfile over librosa for this.
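For reference, a rough sketch of that approach, assuming mono wav files (the file names are hypothetical):

```python
import numpy as np
import soundfile as sf

import mir_eval

# soundfile.read returns (samples, sample_rate); these paths are placeholders.
reference_audio, _ = sf.read('reference_source.wav')
estimated_audio, _ = sf.read('estimated_source.wav')

# bss_eval_sources expects arrays of shape (n_sources, n_samples).
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    np.atleast_2d(reference_audio), np.atleast_2d(estimated_audio))
```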

Following the principle of "do one thing and do it well", mir_eval should implement evaluation functions, not implement data loading/parsing and a set of scripts for using mir_eval.

๐Ÿ‘ ๐Ÿ’ฏ ever-green reference for this

Deprecate support for the mir_eval web service and move it to a separate repository. Treat it as a "proof of concept" instead of a supported project.

👍

Move the evaluators to a separate repository. Do not require that every task has an evaluator, but support the existing evaluators for the tasks which have them.

What's the benefit of putting the scripts in a separate repo?

Do not require that each task have a data loading function in io. In the future, only implement data loading functions which are genuinely useful because they don't already exist in other software and which are for widely used annotation formats (e.g. those used in MIREX).

Are there any such formats?

Because of the above point, do not require that each example in each docstring includes data loading. Only include an example when it is useful for seeing how data should be loaded or preprocessed (for example, in the beat submodule where beats should be trimmed before being passed to evaluation functions).

👍

In order to accommodate the new repositories, I propose that we move

I support this move. I suggest keeping a mir_eval prefix in the repo names, just for downstream convenience (cloned repos will not inherit the org name), but that's a minor quibble.

though I don't think we want mir_eval to require jams.

Agreed, sorry if it was worded unclearly - didn't mean to suggest this. Was just making the point that sometimes there's a better format available.

What's the benefit of putting the scripts in a separate repo?

Delineation between the projects - treating the evaluators as a separate entity, basically an example of how to use mir_eval, rather than a part of mir_eval.

Are there any such formats?

I would argue, for example, that load_labeled_intervals is a nice convenience function for a widely used MIREX format, and the same goes for load_ragged_time_series, load_pattern, etc. They also have some sugar in them to make them user-friendly and, hopefully, give helpful warnings.
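As a concrete example of the convenience these provide, here is a minimal sketch of load_labeled_intervals with MIREX-style chord .lab files (the file names are hypothetical):

```python
import mir_eval

# Each line of a MIREX-style .lab file holds a start time, an end time,
# and a label; load_labeled_intervals returns (intervals, labels).
ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals('reference_chords.lab')
est_intervals, est_labels = mir_eval.io.load_labeled_intervals('estimated_chords.lab')

scores = mir_eval.chord.evaluate(ref_intervals, ref_labels,
                                 est_intervals, est_labels)
```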

I suggest keeping a mir_eval prefix in the repo names, just for downstream convenience (cloned repos will not inherit the org name), but that's a minor quibble.

Good point.

Oh gosh, you can't have underscores in organization names. So the URL would have to be something like
http://github.com/mireval/mir_eval
http://github.com/mir-eval/mir_eval
That is super confusing. I will think about this.

I'm going to scale back some of these requirements a bit.

Deprecate support for the mir_eval web service and move it to a separate repository. Treat it as a "proof of concept" instead of a supported project.

It already is in a different repository; it's just not in a different GitHub organization. Per the comment above, I'm going to punt on the "different organization" thing for now. This TODO basically becomes:

  1. Add a note to the web service that says it is deprecated/just a proof of concept.

Move the evaluators to a separate repository. Do not require that every task has an evaluator, but support the existing evaluators for the tasks which have them.

I will still do this. The documentation will need to be updated.

Do not require that each task have a data loading function in io. In the future, only implement data loading functions which are genuinely useful because they don't already exist in other software and which are for widely used annotation formats (e.g. those used in MIREX).

There's no action item on this right now - it's just a modification of guidelines for future contributions.

Because of the above point, do not require that each example in each docstring includes data loading. Only include an example when it is useful for seeing how data should be loaded or preprocessed (for example, in the beat submodule where beats should be trimmed before being passed to evaluation functions).

Same as above, unless we want to remove examples which aren't very helpful. One such case is the onsets submodule, where the example is

>>> reference_onsets = mir_eval.io.load_events('reference.txt')
>>> estimated_onsets = mir_eval.io.load_events('estimated.txt')
>>> F, P, R = mir_eval.onset.f_measure(reference_onsets,
...                                    estimated_onsets)

I'm not sure it's valuable to just delete this example at this point.

OK, the web service is updated with a nice disclaimer, the evaluators are moved to https://github.com/craffel/mir_evaluators with history intact, and the docs are updated to point people there for the evaluators. So, all TODOs are done; closing this. In the future we won't require task-specific file loading functions when it doesn't make sense.