[FR] Enhance Text-Corpora Analysis

Question

HarikalarKutusu opened this issue 10 months ago · 1 comment

Fix multiple problems/needs at once:

[DONE in Analyzer v0.13.0] The $text_corpus_stats.* data is rather large, so it needs to be split per language. The analyzer only works on one language-version of the dataset at a time anyway.
[DONE in Analyzer v0.13.0] Although new text corpus data is no longer exported to the git repo, the older data is still there and can be reached with git tools. So we can check out the data at a specific commit and analyze it. This way we can see how it changes over time.
[DONE in Analyzer v0.13.0] Add more analysis
- Grapheme distribution
- Phoneme distribution (if supported)
Analyze text corpus usage in the buckets/splits with respect to the text corpus extracted above
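
To make the grapheme distribution and the split-usage items concrete, here is a minimal Python sketch. It is not the Analyzer's actual code: the function names (`graphemes`, `grapheme_distribution`, `split_usage`) are hypothetical, and grapheme clusters are only approximated as a base character plus any following combining marks, using the standard library. A full implementation would probably follow the Unicode segmentation rules instead.

```python
import unicodedata
from collections import Counter


def graphemes(text):
    """Yield approximate grapheme clusters (assumption: base char + combining marks)."""
    cluster = ""
    for ch in unicodedata.normalize("NFC", text):
        if cluster and unicodedata.combining(ch):
            # Combining mark: attach it to the current base character.
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


def grapheme_distribution(sentences):
    """Count approximate graphemes over all sentences, ignoring whitespace."""
    counts = Counter()
    for sentence in sentences:
        counts.update(g for g in graphemes(sentence) if not g.isspace())
    return counts


def split_usage(corpus_sentences, split_sentences):
    """Fraction of unique text-corpus sentences that appear in a bucket/split."""
    corpus = set(corpus_sentences)
    if not corpus:
        return 0.0
    return len(corpus & set(split_sentences)) / len(corpus)
```

For example, `grapheme_distribution(sentences).most_common(20)` would list the most frequent graphemes for one language, and `split_usage(corpus, validated)` would report how much of the text corpus has been used in a hypothetical `validated` split.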

Answer 1 · 2024-03-30T16:49:18.000Z

Implemented in #33