Issues
Python 3.10 has issues downloading data from Hugging Face: use Python 3.9
#4 opened by AkikoAizawa - 0
Introduce category-based filtering to Wikipedia
#30 opened by hkiyomaru - 1
Exclude the Books3 portion from the Pile dataset
#36 opened by hkiyomaru - 0
Apply ethical filtering to Japanese Wikipedia
#29 opened by hkiyomaru - 1
Create a validation split
#26 opened by hkiyomaru - 1
Add the `token_ids` field
#25 opened by hkiyomaru - 1
Apply filtering to the Stack dataset
#23 opened by hkiyomaru - 1
Improve Wikipedia text extraction
#22 opened by hkiyomaru - 1
Expired links to Wikipedia dumps
#27 opened by hkiyomaru - 1
Construct the corpus ver. 1
#13 opened by hkiyomaru - 1
Use the HF datasets library for tokenization
#14 opened by hkiyomaru - 0
Use Python 3.11
#15 opened by hkiyomaru - 2
Determine the license
#20 opened by hkiyomaru - 1
Japanese Wikipedia
#1 opened by hkiyomaru - 0
English Wikipedia
#2 opened by hkiyomaru