shaigue/pmi_masking
This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.
Python · MIT
Issues
Add support for RedPajama
#33 opened - 0
Add support for word-level tokenization
#30 opened - 1
deal with Wikipedia bug
#29 opened - 0
write a blog post
#28 opened - 0
try to disentangle the dataset loading from my code, so that anyone could provide their own dataset.
#24 opened - 0
add random sampling support?
#20 opened - 0
write a descriptive README.md
#18 opened - 2
Try to optimize `aggregate_ngram_counts`
#17 opened - 0
document the performance results
#16 opened - 1
remove redundant files from the repo
#15 opened - 1
integration with LLM training code
#14 opened - 0
reproduce results on wiki+bookcorpus
#10 opened - 0
add remote logging/monitoring
#9 opened - 0
figure out how to do code review
#3 opened - 0