
Preprocessing data before pre-training and SFT

Primary language: Python

Lumix

This is an open-source project for preparing large language model data. Because most of the community's effort goes into pre-training and fine-tuning models, public projects rarely describe the details of how they clean their data.

I hope this project can take on as much of the data cleaning work as possible, so that everyone can focus on model training and fine-tuning.

Project structure