HarikalarKutusu/cv-tbox-dataset-compiler

[FR] Enhance Text-Corpora Analysis

HarikalarKutusu opened this issue · 1 comment

Fix multiple problems/needs at once:

  • [DONE in Analyzer v0.13.0] The $text_corpus_stats.* data is rather large, so it needs to be split by language. The analyzer only works on one language-version of the dataset at a time anyway.
  • [DONE in Analyzer v0.13.0] Although new text corpus data is not exported to the git repo, older versions are still there and can be reached with git tools. So we can retrieve the data at a specific commit and analyze it. This way we can see the changes over time.
  • [DONE in Analyzer v0.13.0] Add more analysis
    • Grapheme distribution
    • Phoneme distribution (if supported)
  • Analyze text corpus usage in the buckets/splits with respect to the text corpus extracted above
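
As a rough illustration of the grapheme distribution and corpus-usage items above, here is a minimal Python sketch. It is not the analyzer's actual code: the function names `grapheme_counts` and `corpus_usage` are hypothetical, and graphemes are only approximated as a base character plus any following combining marks (full Unicode grapheme segmentation would need a dedicated library).

```python
import unicodedata
from collections import Counter


def grapheme_counts(sentences):
    """Count approximate graphemes (hypothetical helper, not the analyzer's API).

    A grapheme is approximated as one base character plus any combining
    marks that follow it. Whitespace is skipped; case is left unchanged.
    """
    counts = Counter()
    for sentence in sentences:
        cluster = ""
        for ch in unicodedata.normalize("NFC", sentence):
            if ch.isspace():
                # Whitespace ends the current cluster and is not counted.
                if cluster:
                    counts[cluster] += 1
                    cluster = ""
            elif unicodedata.combining(ch) and cluster:
                # Combining mark: attach it to the preceding base character.
                cluster += ch
            else:
                # New base character: flush the previous cluster first.
                if cluster:
                    counts[cluster] += 1
                cluster = ch
        if cluster:
            counts[cluster] += 1
    return counts


def corpus_usage(corpus_sentences, split_sentences):
    """Fraction of unique text-corpus sentences that appear in a split."""
    corpus = set(corpus_sentences)
    if not corpus:
        return 0.0
    used = corpus & set(split_sentences)
    return len(used) / len(corpus)
```

Calling `corpus_usage` once per bucket or split (for example train, dev, test, validated) would give a per-split usage ratio against the extracted text corpus, assuming sentences are compared as exact strings.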

Implemented in #33