Upload scripts used to create corpus
Closed this issue · 6 comments
@dariogoetz, could you please add the scripts you used for creating corpus-files?
It would be neat if users had the ability to easily create new corpora themselves.
It is available in scripts/ngrams/clean_uni_leipzig_corpora.py. Or do you mean something else?
To be honest, I'm not sure what that script does, mainly because of this monster of a line:
res = re.sub("(\n)", lambda m, c=itertools.count(): m.group() if next(c) % 5 == 4 else " ", s)
Comments or a README describing each file would be helpful. In any case, the filename sounds like it's not what I'm looking for.
I'm asking for the script you used for creating 1-grams.txt, 2-grams.txt and 3-grams.txt out of a corpus (text) file.
This line replaces four out of every five line breaks in a file with a space. Each sentence in the corpus files sits on its own line, which gives too many line breaks.
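For readers who find the one-liner hard to parse, here is an unrolled sketch of the same idea. The function name join_lines and the group_size parameter are made up for illustration and do not appear in the script; the behaviour is intended to match the quoted re.sub call when group_size is 5.

```python
import itertools
import re


def join_lines(s: str, group_size: int = 5) -> str:
    """Keep every group_size-th line break and turn the others into spaces.

    Hypothetical helper: an unrolled version of the one-liner quoted above.
    """
    counter = itertools.count()  # counts the line breaks seen so far: 0, 1, 2, ...

    def replace(match: re.Match) -> str:
        # Line breaks number 4, 9, 14, ... (0-based) are kept, all others
        # are replaced by a single space.
        if next(counter) % group_size == group_size - 1:
            return match.group()
        return " "

    return re.sub(r"\n", replace, s)


sample = "a\nb\nc\nd\ne\nf\n"
print(repr(join_lines(sample)))  # 'a b c d e\nf '
```

With five sentences per output line, a corpus of one sentence per line shrinks to roughly a fifth as many lines.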
What you are looking for is the ngrams binary, if I understood you correctly. See item 7 of the Structure section at the end of README.md.
Ah, yes, that's exactly what I was looking for! I didn't think there would be a Rust file, so I only searched through the scripts directory.
Am I correct in assuming the intended workflow is the following?
1. clean_uni_leipzig_corpora.py
2. ngrams
3. ngram_merge
Yes, this might be a valid workflow.
1. clean_uni_leipzig_corpora.py takes a corpus file downloaded from the University of Leipzig and performs some pre-processing.
2. ngrams takes the prepared corpus and generates files for 1-grams, 2-grams, and 3-grams.
3. ngrams_merge can (optionally) take multiple sets of ngram-files (e.g., for different languages or from different sources) and merge them into a new set of ngram-files.
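To make the last two steps more concrete, here is a rough Python sketch of what counting and merging n-grams could look like. It is not the code of the Rust binaries: the function names are hypothetical, and it assumes character-level n-grams and plain unweighted summation when merging, neither of which is confirmed in this thread. Writing the results to files is left out because the on-disk format is not described here.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count all overlapping character n-grams in text (assumed granularity)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def merge_counts(*counters: Counter) -> Counter:
    """Sum several n-gram counts into one (assumption: no per-source weighting)."""
    total = Counter()
    for counter in counters:
        total.update(counter)  # Counter.update adds counts instead of replacing them
    return total


corpus_a = "abab"
corpus_b = "abc"

bigrams_a = count_ngrams(corpus_a, 2)
bigrams_b = count_ngrams(corpus_b, 2)
merged = merge_counts(bigrams_a, bigrams_b)

print(merged.most_common())
```

Calling count_ngrams with n set to 1, 2 and 3 would give the three sets of counts that correspond to the 1-gram, 2-gram and 3-gram files mentioned above.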
Got it, thank you! :)