avinashvarna/audio_alignment

Document data.json format for interface

avinashvarna opened this issue · 3 comments

Document the structure of the data.json or add a "How to add a new set of data" section to the README to make it easy to contribute new aligned audio + text data.

Currently data.json is of the form,


{
    "data": [
        {
            "key": "text-used-to-refer-to-the-file",
            "name": "name displayed in the corpus list",
            "audio_url": "url that will be used as is (can still be relative)",
            "word_alignment": "path to the file containing word alignment",
            "sentence_alignment": "path to the file containing sentence alignment, unused"
        }, ...
    ]
}
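
As a rough illustration of the structure above, here is a minimal sketch of a checker for a parsed data.json. The function name validate_corpus, the choice to treat sentence_alignment as optional (since it is described as unused), the requirement that keys be unique, and the sample file names are all assumptions for illustration, not something the project currently enforces.

```python
import json

# Fields every entry is assumed to need; "sentence_alignment" is left
# optional here because the interface does not use it (assumption).
REQUIRED_KEYS = {"key", "name", "audio_url", "word_alignment"}


def validate_corpus(doc):
    """Return a list of human-readable problems found in a parsed data.json."""
    if not isinstance(doc, dict) or not isinstance(doc.get("data"), list):
        return ["top level must be an object with a 'data' list"]
    problems = []
    seen_keys = set()
    for i, entry in enumerate(doc["data"]):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        key = entry.get("key")
        if isinstance(key, str):
            # Unique keys are an assumption: the key is used to refer to the file.
            if key in seen_keys:
                problems.append(f"entry {i}: duplicate key {key!r}")
            seen_keys.add(key)
    return problems


# Hypothetical sample; the file names are made up for illustration.
sample = json.loads("""
{
    "data": [
        {
            "key": "chapter-1",
            "name": "Chapter 1",
            "audio_url": "audio/chapter1.mp3",
            "word_alignment": "alignment/chapter1_words.json"
        }
    ]
}
""")
print(validate_corpus(sample))
```

Running something like this before opening a PR with new data would catch a missing field early, though the actual server may be more or less strict.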

There's scope for adding corpus details such as name, description, etc. at the top level. Currently the name is "deduced" in the flask file as the name of the parent directory of the data.json, which is a bit clumsy.

Adding new data is basically equivalent to adding a new directory at the top level (alongside existing ones such as ramayana and meghaduta), with a data.json file inside it.
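
A minimal sketch of that step, assuming only what is described above (one directory per corpus, with a data.json inside it). The helper name add_corpus, the corpus directory name, and the entry values are hypothetical; the audio and alignment files themselves would still need to be copied in by hand.

```python
import json
import tempfile
from pathlib import Path


def add_corpus(top_level, corpus_dir, entries):
    """Create <top_level>/<corpus_dir>/data.json holding the given entries."""
    corpus_path = Path(top_level) / corpus_dir
    # Fail loudly rather than overwrite an existing corpus directory.
    corpus_path.mkdir(parents=True, exist_ok=False)
    data_file = corpus_path / "data.json"
    data_file.write_text(
        json.dumps({"data": entries}, indent=4, ensure_ascii=False),
        encoding="utf-8",
    )
    return data_file


# Demo in a throwaway directory standing in for the repository's top level.
with tempfile.TemporaryDirectory() as top_level:
    created = add_corpus(
        top_level,
        "my_new_corpus",  # hypothetical corpus directory name
        [
            {
                "key": "chapter-1",
                "name": "Chapter 1",
                "audio_url": "audio/chapter1.mp3",
                "word_alignment": "alignment/chapter1_words.json",
            }
        ],
    )
    print(created.read_text(encoding="utf-8"))
```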

I'll put this tentative information in a README later.

The entire process can definitely be made smoother, for example:

  • Specifying a data directory that will contain the corpora (instead of taking the "parent directory of the server directory")
  • Corpus name / description (and tweaking the front-end to display them)
  • Perhaps a way to go to the "next" or "previous" corpus
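
To make these ideas concrete, here is a rough sketch of how the server side could look. It is not what PR #4 implements. The function names discover_corpora and neighbours, the optional top-level "name" and "description" fields, and the fallback to the directory name are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def discover_corpora(data_dir):
    """List every <data_dir>/<corpus>/data.json, sorted by directory name."""
    corpora = []
    for data_file in sorted(Path(data_dir).glob("*/data.json")):
        doc = json.loads(data_file.read_text(encoding="utf-8"))
        dir_name = data_file.parent.name
        corpora.append({
            "dir": dir_name,
            # Hypothetical top-level fields; fall back to the directory name.
            "name": doc.get("name", dir_name),
            "description": doc.get("description", ""),
            "entries": doc.get("data", []),
        })
    return corpora


def neighbours(corpora, index):
    """Return the (previous, next) corpus directory names, or None at the ends."""
    prev_dir = corpora[index - 1]["dir"] if index > 0 else None
    next_dir = corpora[index + 1]["dir"] if index + 1 < len(corpora) else None
    return prev_dir, next_dir


# Demo with two throwaway corpora in a temporary data directory.
with tempfile.TemporaryDirectory() as data_dir:
    for dir_name, doc in [
        ("meghaduta", {"name": "Meghaduta", "data": []}),
        ("ramayana", {"data": []}),
    ]:
        corpus_path = Path(data_dir) / dir_name
        corpus_path.mkdir()
        (corpus_path / "data.json").write_text(json.dumps(doc), encoding="utf-8")
    found = discover_corpora(data_dir)
    print([c["name"] for c in found])
    print(neighbours(found, 0))
```

Keeping the list sorted by directory name gives a stable order, which is what makes "next" and "previous" well defined. Any other fixed ordering would work equally well.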

PR #4 handles this for the most part

Closed via PR #4