Issues
Size mismatch
#24 opened by DuyguA - 1
harmful pp
#23 opened by jiangix-paper - 5
OSCAR 22.XX scope
#21 opened by Uinelj - 0
Missing pages in Common Crawl
#22 opened by hadiasghari - 2
strange datasets for Yue Chinese corpus
#1 opened by jerryIsHere - 14
Low size of Swahili Oscar
#16 opened by hadyelsahar - 1
West Flemish contains only two words
#7 opened by Uinelj - 7
ConnectionError: Couldn't reach https://huggingface.co/datasets/oscar-corpus/OSCAR-2109/resolve/main/OSCAR-2109.py
#8 opened by TDehaene - 0
Wu Chinese dataset is of bad quality.
#5 opened by Uinelj - 0
Scots language corpus is non linguistic?
#14 opened by Uinelj - 0
Quality warning: Neapolitan
#13 opened by Uinelj - 0
Quality warning: Somali
#12 opened by Uinelj - 0
Quality warning: Northern Frisian
#11 opened by Uinelj - 0
Quality warning: Chavacano
#10 opened by Uinelj - 1
Quality warning: Central Bikol
#9 opened by Uinelj - 0
[BUG] Encoding errors in OSCAR 21.09
#2 opened by stefan-it - 1
3835 records full of backslashes
#4 opened by stas00 - 0
Support for Tigrinya
#3 opened by tadeze