shjwudp/c4-dataset-script
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts for processing CommonCrawl data. It includes Chinese data processing and the cleaning methods from MassiveText.
Python · MIT
Stargazers
- akontra
- AlexDeng-AI
- BrightXiaoHan (Ifun Game)
- bryandeng (@Tencent)
- dfyisco
- ediie726
- EZ-hwh (Fudan University)
- fly51fly (PRIS)
- getinglxf
- gowithwind (Shenzhen)
- gradetwo
- israelgonzalezb (Las Vegas)
- Jadentan
- jhg543 (Xiamen, China)
- JiayiXuDaisy
- kenhktsui (London)
- ktrk115 (Tokyo, Japan)
- L1aoXingyu (Beijing, China)
- learnerynwei
- Lingeng56
- LiuPearl1 (VILab, Tianjin University)
- padeoe (@AegisAI)
- qhduan (知未智能KDF)
- ray075hl (NWPU)
- Sandalots (Volcanak)
- seralf (serendipity expert)
- sonack (HIT)
- sudahui
- suizhihao (Shanghai AI Lab)
- T-tssxuan
- valdas (Vilnius, Lithuania)
- wonderseen
- xiamuguizhi (Virtual animation company)
- yulundu (Carnegie Mellon University)
- zeyuanchen23 (Salesforce Research)
- zhouyizhuang-megvii (Megvii.Inc)