Let an LLM do the heavy corpus-cleaning work!
Send files to the LLM and use it to clean the corpus.
- preprocess (regex, fold unprintable characters)
- send the docs to the LLM
- the LLM sends back the cleaned text
- save it
- modified "self-distillation", ref: https://arxiv.org/abs/2402.13669
- save.
- find a proper prompt
- text chunk generator
- save the docs
- write a prompt similar to self-distillation
- refactor the code
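
The preprocessing, chunking, and LLM round-trip steps listed above could be sketched roughly as follows. This is only a minimal illustration, not the project's actual code: the names `preprocess`, `iter_chunks`, `clean_document`, and the `clean_fn` callable (a stand-in for whatever LLM call is used) are hypothetical, "fold unprintables" is assumed to mean dropping control and format characters, and the 2000-character chunk size is an arbitrary placeholder.

```python
import re
import unicodedata
from typing import Callable, Iterator

_SPACE_RUN = re.compile(r"[ \t]+")   # runs of spaces/tabs
_BLANK_RUN = re.compile(r"\n{3,}")   # three or more newlines in a row


def preprocess(text: str) -> str:
    """Drop unprintable characters and collapse whitespace (assumed behavior)."""
    # Keep newlines and tabs; drop every other character whose Unicode
    # category starts with "C" (control, format, surrogate, unassigned).
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    text = _SPACE_RUN.sub(" ", text)
    text = _BLANK_RUN.sub("\n\n", text)
    return text.strip()


def iter_chunks(text: str, max_chars: int = 2000) -> Iterator[str]:
    """Yield chunks of at most max_chars, splitting on blank lines if possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    buf = ""
    for para in text.split("\n\n"):
        # A single paragraph longer than the limit is hard-split.
        while len(para) > max_chars:
            if buf:
                yield buf
                buf = ""
            yield para[:max_chars]
            para = para[max_chars:]
        if not para:
            continue
        candidate = f"{buf}\n\n{para}" if buf else para
        if len(candidate) <= max_chars:
            buf = candidate
        else:
            yield buf
            buf = para
    if buf:
        yield buf


def clean_document(
    text: str,
    clean_fn: Callable[[str], str],  # hypothetical: wraps the actual LLM call
    max_chars: int = 2000,
) -> str:
    """Preprocess, chunk, clean each chunk with clean_fn, and rejoin."""
    chunks = iter_chunks(preprocess(text), max_chars)
    return "\n\n".join(clean_fn(chunk) for chunk in chunks)
```

Passing the LLM call in as `clean_fn` keeps the sketch independent of any particular model or client library, and lets the pipeline be exercised with a trivial function (for example `str.upper`) before a real prompt is chosen. Saving the result is left to the caller.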