dariogoetz/keyboard_layout_optimizer

Upload scripts used to create corpus

Closed this issue · 6 comments

@dariogoetz, could you please add the scripts you used for creating corpus-files?
It would be neat if users had the ability to easily create new corpora themselves.

It is available in scripts/ngrams/clean_uni_leipzig_corpora.py. Or do you mean something else?

To be honest, I'm not sure what that script does, mainly because of this monster of a line:

res = re.sub("(\n)", lambda m, c=itertools.count(): m.group() if next(c) % 5 == 4 else " ", s)

Comments or a README describing each file would be helpful. In any case, the filename sounds like it's not what I'm looking for.
I'm asking for the script you used for creating 1-grams.txt, 2-grams.txt and 3-grams.txt out of a corpus(-text)-file.

This line replaces four out of every five line breaks in a file with a space. Each sentence in the corpus files is separated by a line break, which would otherwise give too many line breaks.
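For readers puzzling over the same line, here is an unrolled sketch of what it appears to do. The function name `join_line_breaks` and the `keep_every` parameter are made up for illustration and are not part of the script. In the original one-liner, the default argument `c=itertools.count()` is evaluated once when the lambda is created, so it acts as a counter shared across all matches.

```python
import itertools
import re


def join_line_breaks(s: str, keep_every: int = 5) -> str:
    """Replace line breaks with spaces, keeping only every `keep_every`-th one.

    Hypothetical helper: an unrolled version of the one-liner quoted above.
    """
    counter = itertools.count()  # yields 0, 1, 2, ... across all matches

    def replace(match: re.Match) -> str:
        # Keep the line break at match index 4, 9, 14, ... (for keep_every=5);
        # every other line break becomes a single space.
        if next(counter) % keep_every == keep_every - 1:
            return match.group()
        return " "

    return re.sub(r"\n", replace, s)


# Ten one-word "sentences", each terminated by a line break.
text = "".join(f"s{i}\n" for i in range(1, 11))
result = join_line_breaks(text)
print(repr(result))
```

With ten input lines, only the fifth and tenth line breaks survive, so five sentences end up on each output line.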

What you are looking for is the ngrams binary, if I understood you correctly. See item 7 under Structure at the end of README.md.

Ah, yes, that's exactly what I was looking for! I didn't think there would be a Rust file, so I just searched through the scripts directory.
Am I correct in assuming the intended workflow is

  1. clean_uni_leipzig_corpora.py
  2. ngrams
  3. ngram_merge?

Yes, this might be a valid workflow.

  1. clean_uni_leipzig_corpora.py takes a corpus file downloaded from the University of Leipzig and performs some pre-processing.
  2. ngrams takes the prepared corpus and generates files for 1-grams, 2-grams, and 3-grams.
  3. ngrams_merge can (optionally) take multiple sets of ngram-files (e.g., for different languages or from different sources) and merge them into a new set of ngram-files.
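To make steps 2 and 3 more concrete, here is a rough Python sketch of the underlying idea. It is not the Rust implementation from this repository: the functions `count_ngrams` and `merge_ngrams`, the use of character-level n-grams, and the weighted merge of normalized frequencies are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count all overlapping character n-grams of length n (assumed definition)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def merge_ngrams(sets: list[Counter], weights: list[float]) -> Counter:
    """Merge several n-gram counts into one set of weighted relative frequencies.

    Hypothetical merge rule: each set is normalized to sum to 1 and then
    scaled by its weight before being added to the result.
    """
    merged: Counter = Counter()
    for counts, weight in zip(sets, weights):
        total = sum(counts.values())
        if total == 0:
            continue  # skip empty sets to avoid dividing by zero
        for ngram, count in counts.items():
            merged[ngram] += weight * count / total
    return merged


# Step 2: count 2-grams in a tiny "corpus".
bigrams = count_ngrams("abab", 2)

# Step 3: merge 1-gram counts from two made-up sources with equal weight.
source_a = Counter({"a": 3, "b": 1})
source_b = Counter({"a": 1, "c": 1})
merged = merge_ngrams([source_a, source_b], [0.5, 0.5])
```

Normalizing before merging keeps a large corpus from drowning out a small one. Whether the actual binaries do this is not stated in this thread.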

Got it, thank you! :)