2017 Naver Hackday Winter Project

Useful when there are too many classes for ordinary classification. It currently extracts features with fastText and detects anomalies with PyNomaly.

Clean up Korean text.

  • Remove uncommon English words, using a dictionary of 10,000 common words (google-10000-english)
  • Split Korean and English words that are concatenated
  • Remove special characters
  • Split into morphemes with KoNLPy
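
The cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `clean_text`, `COMMON_ENGLISH`, and the `morphs` parameter are hypothetical names, the tiny word set stands in for the google-10000-english list, and keeping digits is an assumption. Splitting is done before filtering here so that glued English words can be looked up on their own.

```python
import re

# Hypothetical stand-in for the google-10000-english word list.
COMMON_ENGLISH = {"phone", "case", "usb"}

# Range of precomposed Hangul syllables (U+AC00 to U+D7A3).
HANGUL = "\uac00-\ud7a3"


def clean_text(text, common_english=COMMON_ENGLISH, morphs=None):
    """Return a list of cleaned tokens from a mixed Korean/English string."""
    # Insert a space wherever a Hangul syllable and a Latin letter touch.
    text = re.sub(rf"([{HANGUL}])([A-Za-z])", r"\1 \2", text)
    text = re.sub(rf"([A-Za-z])([{HANGUL}])", r"\1 \2", text)
    # Replace special characters with spaces (digits are kept: an assumption).
    text = re.sub(rf"[^{HANGUL}A-Za-z0-9\s]", " ", text)

    tokens = []
    for token in text.split():
        if re.fullmatch(r"[A-Za-z]+", token):
            # Drop English words that are not in the common-word list.
            if token.lower() not in common_english:
                continue
            token = token.lower()
        tokens.append(token)

    # Optional morpheme splitting, e.g. the `morphs` method of a KoNLPy tagger.
    if morphs is not None:
        tokens = [m for t in tokens for m in morphs(t)]
    return tokens
```

With KoNLPy installed, the `morphs` method of one of its taggers could be passed as `morphs`; without it, the function simply returns whitespace-separated tokens.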

Find the least anomalous and the supposedly anomalous elements from the given features.
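
As a rough sketch of this step, assume each element already has an anomaly score where a higher value means more anomalous (for example, an outlier probability computed from the feature vectors). The helper name `rank_by_anomaly` is hypothetical.

```python
def rank_by_anomaly(scores, k):
    """Return (least_anomalous, supposed_anomalies) as lists of k indices each."""
    # Indices sorted from lowest to highest anomaly score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    least_anomalous = order[:k]
    supposed_anomalies = order[::-1][:k]
    return least_anomalous, supposed_anomalies
```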

For a specific class C, choose 1000 elements from the class (positive sampling) and 25 elements from outside the class (negative sampling). Find the 50 supposed anomalies and count how many of the negative samples are among them.
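
The sampling and counting described above might look like the sketch below. The function names, the `items_by_class` mapping, and the fixed seed are assumptions made for illustration. The anomaly scores are taken as given, with higher meaning more anomalous.

```python
import random


def sample_eval_set(items_by_class, target, n_pos, n_neg, seed=0):
    """Draw n_pos items from the target class and n_neg items from all other classes."""
    rng = random.Random(seed)
    positives = rng.sample(items_by_class[target], n_pos)
    others = [x for c, xs in items_by_class.items() if c != target for x in xs]
    negatives = rng.sample(others, n_neg)
    items = positives + negatives
    is_negative = [False] * n_pos + [True] * n_neg
    return items, is_negative


def count_found_negatives(scores, is_negative, k):
    """Count negative samples among the k highest-scoring (supposed-anomaly) items."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if is_negative[i])
```

For the setting in the text this would be called with `n_pos=1000`, `n_neg=25`, and `k=50`. A higher count means the features separate class C from the rest more cleanly.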

Build model variants and pick the best one using the metric below.

  1. Build model variants along the following axes
    • Model Type
    • Feature Dimension
    • Learning Rate
    • Epochs
    • Use pretrained model or not
  2. Extract a feature vector for each element with each model
  3. Evaluate the metric for each model as follows
    • For a specific class C, choose 1000 elements from the class (positive sampling) and 50 elements from outside the class (negative sampling).
    • Find the 100 supposed anomalies and count how many of the negative samples are among them.
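
One way to enumerate the variants and select the best one is a plain grid search, sketched below. The grid values are placeholders rather than the settings used in the project, and `evaluate` is a hypothetical callable that would train a model for a variant, extract features, and return the count of negative samples found.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not the project's settings.
GRID = {
    "model": ["skipgram", "cbow"],
    "dim": [50, 100],
    "lr": [0.05, 0.1],
    "epoch": [5, 10],
    "pretrained": [False, True],
}


def iter_variants(grid):
    """Yield one dict per combination of the grid's values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def best_variant(grid, evaluate):
    """Return the variant with the highest metric value."""
    return max(iter_variants(grid), key=evaluate)
```

An exhaustive grid grows multiplicatively with each axis (32 variants for the placeholder grid above), so for larger search spaces a random subset of variants may be more practical.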

Dependencies