Issues
- 3
dedup_job java.lang.UnsatisfiedLinkError
#81 opened by syedhasnainrazashah - 0
Bug in the function `remove_repeated_text`
#79 opened by ohwi - 1
- 1
Chinese dedup memory error
#65 opened by hyeinhyun - 0
[ja] replace Japanese PII
#49 opened by fujiki-1emon - 1
[ja] reduce emoticon
#50 opened by fujiki-1emon - 0
[ja] spam word filter
#51 opened by fujiki-1emon - 0
- 4
Japanese pre-processing - remove text with low rate of Japanese stopwords
#52 opened by fujiki-1emon - 0
Refactor RDD process to Dataframe process
#57 opened by Taekyoon - 0
- 0
Improve Korean preprocessing algorithm
#54 opened by hyunwoongko - 0
Add pre-processing for Japanese texts
#28 opened by fujiki-1emon - 1
Replace html2text from Beautifulsoup
#32 opened by Taekyoon - 3
Task consideration
#33 opened by hyunwoongko - 0
Implement minhash dedup module
#34 opened by Taekyoon - 0
- 0
Add job to separate train and validate data
#16 opened by Taekyoon - 0
Add statistics by data category
#13 opened by donggrii - 0
Add Toxic text labeler
#8 opened by Taekyoon - 0
Add Text length Stats for datasets
#7 opened by Taekyoon - 3
MassiveText Quality Filtering
#4 opened by jayseok-park - 0
Add function for processing empty string
#18 opened by Ronalmoo - 1
Update additional preprocess function
#23 opened by Ronalmoo - 0
Remove `soynlp` library
#27 opened by Taekyoon - 0
Add normalize `?,:"!` in common preprocess job
#22 opened by Taekyoon - 2
Add general text refinement job
#1 opened by Taekyoon - 0
Add scripts to run hadoop cluster
#20 opened by Taekyoon - 0
Add requirements-dev.txt
#2 opened by Taekyoon - 0
Add guides to run dps jobs
#3 opened by Taekyoon - 0