EleutherAI/dps

Data processing system for polyglot

Python · Apache-2.0

Issues

dedup_job java.lang.UnsatisfiedLinkError
#81 opened 4 months ago by syedhasnainrazashah
3
Bug in the function `remove_repeated_text`
#79 opened a year ago by ohwi
0
[ja] `.filter` is used instead of `.map` for non-filter methods
#74 opened a year ago by mrorii
1
Chinese dedup memory error
#65 opened a year ago by hyeinhyun
1
[ja] replace Japanese PII
#49 opened a year ago by fujiki-1emon
0
[ja] reduce emoticon
#50 opened a year ago by fujiki-1emon
1
[ja] spam word filter
#51 opened a year ago by fujiki-1emon
0
[ja] refactor MinHashLSH-based near deduplication method
#62 opened a year ago by fujiki-1emon
0
Japanese pre-processing - remove text with low rate of Japanese stopwords
#52 opened a year ago by fujiki-1emon
4
Refactor RDD process to Dataframe process
#57 opened a year ago by Taekyoon
0
Need to ignore null or empty text during Korean text processing
#56 opened a year ago by Taekyoon
0
Improve Korean preprocessing algorithm
#54 opened a year ago by hyunwoongko
0
Add pre-processing for Japanese texts
#28 opened a year ago by fujiki-1emon
0
Replace html2text from Beautifulsoup
#32 opened 2 years ago by Taekyoon
1
Task consideration
#33 opened 2 years ago by hyunwoongko
3
Implement minhash dedup module
#34 opened 2 years ago by Taekyoon
0
Add huggingface tokenizers for data length statistics
#17 opened 2 years ago by Kaeun-Lee
0
Add job to separate train and validate data
#16 opened 2 years ago by Taekyoon
0
Add statistics by data category
#13 opened 2 years ago by donggrii
0
Add Toxic text labeler
#8 opened 2 years ago by Taekyoon
0
Add Text length Stats for datasets
#7 opened 2 years ago by Taekyoon
0
MassiveText Quality Filtering
#4 opened 2 years ago by jayseok-park
3
Add function for processing empty string
#18 opened 2 years ago by Ronalmoo
0
Update additional preprocess function
#23 opened 2 years ago by Ronalmoo
1
Remove `soynlp` library
#27 opened 2 years ago by Taekyoon
0
Add normalization of `?,:"!` in common preprocess job
#22 opened 2 years ago by Taekyoon
0
Add general text refinement job
#1 opened 2 years ago by Taekyoon
2
Add scripts to run hadoop cluster
#20 opened 2 years ago by Taekyoon
0
Add requirements-dev.txt
#2 opened 2 years ago by Taekyoon
0
Add guides to run dps jobs
#3 opened 2 years ago by Taekyoon
0
Add build of newspaper dataset as long-text data
#9 opened 2 years ago by Taekyoon
0