[ja] reduce emoticon
Closed this issue · 1 comment
fujiki-1emon commented
Background
- After the current Japanese quality filtering, as far as I can see, there do not seem to be many low-quality texts, such as texts with many repeated emoticons.
- However, when we check the quality-filtered datasets in depth, and/or when we add other datasets to the current one, we might still find such repeated emoticons.
- So, we might need to implement a pre-processing step such as `japanese_reduce_emoticon`, modeled on the Korean one.
skjang54 commented
TODO
- www ⇒ max 3 times
- 笑 ⇒ max 1 time
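
A minimal sketch of how the two rules above could look in Python. Only the function name `japanese_reduce_emoticon` comes from this issue; the regex-based approach, the constant names, and the choice to handle only ASCII `w` are assumptions for illustration and are not taken from the Korean implementation.

```python
import re

# Assumed patterns derived from the TODO list (hypothetical names):
# a run of 4 or more ASCII "w" is cut down to exactly 3,
# a run of 2 or more "笑" is cut down to exactly 1.
_W_RUN = re.compile(r"w{4,}")
_WARAI_RUN = re.compile(r"笑{2,}")


def japanese_reduce_emoticon(text: str) -> str:
    """Collapse over-long runs of "w" and "笑" in Japanese text."""
    text = _W_RUN.sub("www", text)
    text = _WARAI_RUN.sub("笑", text)
    return text


if __name__ == "__main__":
    print(japanese_reduce_emoticon("ウケるwwwwwww"))  # ウケるwww
    print(japanese_reduce_emoticon("それな笑笑笑"))    # それな笑
```

Runs that are already within the limits (for example the `www` in a URL, or a single 笑) are left unchanged. Full-width `ｗ` is not covered by this sketch and would need its own rule if it appears in the data.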