Common Voice: Methodology

A living document outlining a methodological approach for building read speech sentence corpora.

The goal of this repository/living document is two-fold:

  1. To outline a well-researched and methodological approach for the construction of sentence-based corpora intended for recording, which in turn serve as the basis of read speech corpora.
  2. To outline a well-researched and methodological approach for the collection of recordings of such sentence-based corpora.

This is done by consulting domain experts and integrating conclusions and lessons learned from past research in the field.

The approach taken here is simple:

  • First, the premise is set, detailing the various aspects of corpus construction and collection that require attention, and the possible approaches to each aspect.
  • Then, a set of best practices is suggested, corresponding to the list of aspects detailed in the premise. Where the consulted parties have reached a wide consensus, a single best practice is suggested; otherwise, where opinions differ, several approaches are outlined and, where possible, the implications of choosing between them are detailed.

A methodological approach for the construction of said corpora is required to address several aspects of the construction process:

  • Data source: The data source from which the sentences composing the corpus are extracted.
    • Licensing: The license under which texts from the data source are licensed.
    • Data source type: Newspapers, social network sites, books, articles, conversation transcripts, transcripts of official proceedings, etc.
    • Data source count: The number of different data sources from which sentences are sampled (e.g. the number of different news websites, books, etc.).
    • Register: The register in which the text is written (or in which the transcribed speech was spoken), e.g. formal vs. consultative vs. casual.
    • Time of origin: The time at which the source text was written or the source speech was recorded.
  • Dataset size:
  • Sampling method:
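
The data source aspects listed above can be recorded per source and summarized for a whole corpus. The following is a minimal sketch, not part of the methodology itself: the `DataSource` class, its field names, the `summarize` helper, and the example sources are all hypothetical and chosen only to illustrate how these aspects might be tracked.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Hypothetical record describing one data source of a sentence corpus."""
    name: str            # e.g. a specific book or news website
    license: str         # license under which the source text is released
    source_type: str     # e.g. "newspaper", "book", "conversation transcript"
    register: str        # e.g. "formal", "consultative", "casual"
    year_of_origin: int  # when the text was written or the speech recorded


def summarize(sources):
    """Return the data source count and a breakdown by aspect."""
    return {
        "data_source_count": len({s.name for s in sources}),
        "by_license": Counter(s.license for s in sources),
        "by_type": Counter(s.source_type for s in sources),
        "by_register": Counter(s.register for s in sources),
    }


# Made-up sources, for illustration only.
sources = [
    DataSource("Example Daily", "CC BY-SA 4.0", "newspaper", "formal", 2018),
    DataSource("Example Novel", "Public Domain", "book", "formal", 1920),
    DataSource("Example Forum", "CC0 1.0", "social network site", "casual", 2019),
]
summary = summarize(sources)
print(summary["data_source_count"])
print(dict(summary["by_register"]))
```

A summary like this makes it easy to see at a glance whether a corpus leans heavily on one register or one type of source.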

A methodological approach for the collection of said speech corpora is required to address several aspects of the collection process:

  • Recording equipment:
  • Recording environment:
  • Collected metadata:
  • Number of recordings per sentence:
  • Validation method:
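
As a rough illustration of how the collection aspects above might be tracked, the sketch below counts recordings per sentence. The `Recording` class, its fields, and the `recordings_per_sentence` helper are hypothetical, as is the example data; the methodology does not prescribe any particular metadata schema or validation method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Hypothetical record for one collected recording."""
    sentence_id: str
    speaker_id: str
    equipment: str    # e.g. "headset", "laptop microphone"
    environment: str  # e.g. "quiet room", "office"
    validated: bool   # outcome of whichever validation method is chosen


def recordings_per_sentence(recordings, validated_only=False):
    """Count recordings for each sentence, optionally only validated ones."""
    return Counter(
        r.sentence_id for r in recordings if r.validated or not validated_only
    )


# Made-up recordings, for illustration only.
recordings = [
    Recording("s1", "spk1", "headset", "quiet room", True),
    Recording("s1", "spk2", "laptop microphone", "office", False),
    Recording("s2", "spk1", "headset", "quiet room", True),
]
all_counts = recordings_per_sentence(recordings)
valid_counts = recordings_per_sentence(recordings, validated_only=True)
print(dict(all_counts))
print(dict(valid_counts))
```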

An example of a corpus built with this methodology is The SVLM Hebrew Wikipedia Corpus.

This effort is currently managed and maintained by Shay Palachy (shay.palachy@gmail.com), as part of the work on Common Voice: Hebrew. Your opinions and contributions are very welcome: you can either open issues to discuss specific topics and submit pull requests for suggested additions (preferably), or mail me directly at the above email address.
