EleutherAI/dps

[ja] reduce emoticon

Closed · 1 comment

Background

  • After the current Japanese quality filtering, as far as I can see, many low-quality texts still seem to remain, for example texts with long runs of repeated emoticons.
  • Moreover, when we check the quality-filtered datasets in depth, and/or add other datasets to the current one, we may find more of these repetitive emoticons.
  • So we may need to implement a pre-processing step such as japanese_reduce_emoticon, modeled on the existing Korean one.

TODO

  • www ⇒ max 3 times
  • 笑 ⇒ max 1 time
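
The two rules above could be sketched roughly as follows. This is only a minimal illustration, not the implementation in the repository: the function signature, the MAX_RUNS table, and the assumption that "www" refers to runs of the half-width character w are all hypothetical.

```python
import re

# Hypothetical limits taken from the TODO list: character -> maximum run length.
MAX_RUNS = {"w": 3, "笑": 1}


def japanese_reduce_emoticon(text, max_runs=MAX_RUNS):
    """Truncate runs of each listed character to its maximum allowed length."""
    for char, limit in max_runs.items():
        # Match only runs that are longer than the limit, e.g. (?:w){4,}
        pattern = re.compile(f"(?:{re.escape(char)}){{{limit + 1},}}")
        text = pattern.sub(char * limit, text)
    return text


print(japanese_reduce_emoticon("面白いwwwwww"))  # 面白いwww
print(japanese_reduce_emoticon("それな笑笑笑"))  # それな笑
```

Runs at or below the limit are left untouched, so a plain "www" (as in a URL) is not modified. Whether full-width ｗ or other variants should be covered as well is an open question that this sketch does not address.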