Upload scripts used to create corpus

Question

Closed this issue 2 years ago · 6 comments

@dariogoetz, could you please add the scripts you used for creating corpus-files?
It would be neat if users had the ability to easily create new corpora themselves.

Answer 1 · 2022-10-06T13:42:15.000Z

It is available in scripts/ngrams/clean_uni_leipzig_corpora.py. Or do you mean something else?

Answer 2 · 2022-10-06T18:07:46.000Z

To be honest, I'm not sure what that script does, mainly because of this monster of a line:

res = re.sub("(\n)", lambda m, c=itertools.count(): m.group() if next(c) % 5 == 4 else " ", s)

Comments or a README describing each file would be helpful. In any case, the filename sounds like it's not what I'm looking for.
I'm asking for the script you used for creating 1-grams.txt, 2-grams.txt and 3-grams.txt out of a corpus(-text)-file.

Answer 3 · 2022-10-06T19:28:08.000Z

This line replaces four out of every five line breaks in a file with a space, keeping only every fifth one. The reason is that each sentence in the corpus files sits on its own line, which gives too many line breaks.
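For illustration, here is a minimal sketch of the same substitution written out more explicitly. The function name `join_lines`, the `keep_every` parameter, and the sample text are made up for this example and are not part of the script.

```python
import itertools
import re


def join_lines(text: str, keep_every: int = 5) -> str:
    """Replace line breaks with spaces, keeping only every `keep_every`-th one."""
    counter = itertools.count()  # counts the line breaks seen so far, starting at 0

    def replace(match: re.Match) -> str:
        # Line breaks number 4, 9, 14, ... (0-based) are kept, all others become a space.
        if next(counter) % keep_every == keep_every - 1:
            return match.group()
        return " "

    return re.sub(r"\n", replace, text)


# Hypothetical input: ten one-word "sentences", one per line.
sample = "".join(f"s{i}\n" for i in range(1, 11))
joined = join_lines(sample)
print(joined)
```

With this input, the ten lines collapse into two lines of five sentences each.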

If I understood you correctly, what you are looking for is the ngrams binary; see item 7 under Structure at the end of README.md.

Answer 4 · 2022-10-07T08:45:09.000Z

Ah, yes, that's exactly what I was looking for! I didn't think there would be a Rust file, so I only searched through the scripts directory.
Am I correct in assuming the intended workflow is

clean_uni_leipzig_corpora.py
ngrams
ngram_merge?

Answer 5 · 2022-10-07T09:13:27.000Z

Yes, this might be a valid workflow.

clean_uni_leipzig_corpora.py takes a corpus file downloaded from the University of Leipzig and performs some pre-processing.
ngrams takes the prepared corpus and generates files for 1-grams, 2-grams, and 3-grams.
ngrams_merge can (optionally) take multiple sets of ngram-files (e.g., for different languages or from different sources) and merge them into a new set of ngram-files.
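
To make the last two steps more concrete, here is a rough Python sketch of the idea only. It is not how the Rust binaries are implemented: it assumes character-level n-grams and a plain sum when merging, and the helper names `count_ngrams` and `merge_counts` as well as the sample strings are hypothetical.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count every run of n consecutive characters in `text` (assumed character n-grams)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def merge_counts(*counters: Counter) -> Counter:
    """Merge several n-gram counts by adding them up (assumption: unweighted sum)."""
    merged = Counter()
    for counter in counters:
        merged.update(counter)  # Counter.update adds counts instead of replacing them
    return merged


# Hypothetical mini-corpora standing in for two prepared corpus files.
first = count_ngrams("the cat", 2)
second = count_ngrams("the hut", 2)
merged = merge_counts(first, second)

# Most frequent 2-grams across both sources.
for ngram, count in merged.most_common(3):
    print(repr(ngram), count)
```

The actual tools may weight sources differently or normalize the counts, so treat the summation as a placeholder.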

Answer 6 · 2022-10-07T09:19:53.000Z

Got it, thank you! :)