modelscope/data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Python · Apache-2.0
Issues
- 4
Heavy dependency of Data-Juicer
#398 opened by BeachWang - 1
[Feat] Enhance type hints and parameter validation
#416 opened by drcege - 1
[Bug]: undefined symbol: _ZN3c104cuda9SetDeviceE
#419 opened by lh61500 - 1
Running the command python tools/process_data.py --config train.yaml
#425 opened by abchbx - 6
[Bug]: Deduplication hash computation hangs at 100% and never proceeds to filtering
#387 opened by xiafeng-nb - 1
AssertionError
#420 opened by abchbx - 1
[Feat] Data-Juicer as a Service
#417 opened by drcege - 2
[Feat]: Add Ray actor support
#371 opened by drcege - 3
Efficient processing OPs for scanned images and pdf
#375 opened by yxdyc - 3
- 2
What on earth? Neither your Hugging Face space nor a self-hosted service will run, and the demo will not run either
#404 opened by coder4nlp - 1
analyzer or analyzer?
#409 opened by lilqz66 - 0
[Feat] Support `dj_batched_group_ops` that allows for the configuration and application of multiple operators in smaller, manageable batches
#413 opened by yxdyc - 0
[Feat] Support `PythonCodesOperator` and `BashCodesOperator` that wraps an existing python file, or some code snippets to be executed, such as the existing DJ tools.
#412 opened by yxdyc - 1
Is it possible to set multiple text_key values for a single op?
#380 opened by lihongxiacream - 0
- 10
How to download the YoukuMPLUG video dataset?
#341 opened by lucasjinreal - 3
- 2
[Bug]: librosa not work with np>1
#372 opened by drcege - 3
potential bug of checkpointing
#337 opened by drcege - 1
[Bug]: MODEL_ZOO is not reused in subprocesses
#370 opened by drcege - 4
cache files in /tmp/hf_datasets-*
#328 opened by simplew2011 - 3
- 5
Question about cleaning with the alphanumeric_filter operator
#267 opened by echo-valor - 1
- 2
Confused with the meaning of 'preprocess' time-consuming in the `reproduced_redpajama /README.md`
#383 opened by flyflypeng - 3
After image deduplication, is the image that appears earlier in the array the one that is kept?
#342 opened by HalcyonLiang - 14
- 4
[Bug]: Image-related operators report OOM even when GPU memory is sufficient
#378 opened by tian969 - 0
[Bug]: Memory leak in video OP
#369 opened by BeachWang - 2
- 0
How can the resources required by operators be prepared in advance?
#376 opened by tian969 - 1
- 2
- 4
[Bug]: RuntimeError: SimpleQueue objects should only be shared between processes through inheritance
#356 opened by MingdongHe - 2
ram_plus_swin_large_14m.pth invalid
#350 opened by arturia-Xayah - 2
- 0
[Bug]: bug when use the video_motion_score_filter
#344 opened by ycwfs - 3
- 4
Reports "error: Unrecognized arguments: -B -S -I -c"
#315 opened by HaleYang - 2
About Quality Classifier
#321 opened by koanho - 4
Why does stopwords_filter filter out samples below a certain threshold?
#307 opened by noforit - 6
Do filters support batch processing, and how is batch_size set?
#285 opened by Yang-QW - 5
hash calculate in ray deduplicator
#286 opened by simplew2011 - 3
Why do most of the refined recipes use simhash for deduplication?
#292 opened by sherrytonger - 1
- 2
- 1
[Question] Can't find evalutor.yaml on the path of `/workspace/data-juicer/demos`
#298 opened by BenWu11 - 0
Absolute path to relative path for multi-source
#278 opened by BeachWang - 8
[Bug]: process on ray occur "TypeError: 'str' object cannot be interpreted as an integer"
#281 opened by laolv421