[FR] Enhance Text-Corpora Analysis
HarikalarKutusu opened this issue · 1 comment
HarikalarKutusu commented
Fix multiple problems/needs at once:
- [DONE in Analyzer v0.13.0] The data in `$text_corpus_stats.*` is rather large, so it needs to be divided by language. The analyzer only works per language-version of the dataset anyway.
- [DONE in Analyzer v0.13.0] Although new text corpus data is not exported to the git repo, older versions are still there and can be reached with git tools. So we can get the data at a specific commit and analyze it, which lets us see how it changes over time.
- [DONE in Analyzer v0.13.0] Add more analysis
- Grapheme distribution
- Phoneme distribution (if supported)
- Analyze text corpus usage in the buckets/splits with respect to the text corpus extracted above
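
To make the "Grapheme distribution" item concrete, here is a rough sketch of what such a count could look like. It is not the analyzer's actual code: the function names `split_graphemes` and `grapheme_distribution` are hypothetical, and the splitting is only an approximation (a base character plus any combining marks that follow it) using the Python standard library, not full Unicode grapheme cluster segmentation.

```python
import unicodedata
from collections import Counter


def split_graphemes(text):
    """Approximate grapheme split (assumption: base char + following combining marks).

    This is a simplification, not full Unicode grapheme cluster segmentation.
    """
    clusters = []
    # NFC composes sequences like "e" + U+0301 into a single code point where possible.
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            # Attach a combining mark to the preceding base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def grapheme_distribution(sentences):
    """Count how often each (approximate) grapheme occurs, skipping whitespace."""
    counts = Counter()
    for sentence in sentences:
        counts.update(g for g in split_graphemes(sentence) if not g.isspace())
    return counts


if __name__ == "__main__":
    dist = grapheme_distribution(["aba", "b a"])
    for grapheme, count in dist.most_common():
        print(grapheme, count)
```

Case folding is left out on purpose, since it is language-specific (for example the dotted and dotless i in Turkish), so a per-language step would have to decide that.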
HarikalarKutusu commented
Implemented in #33