Useful when there are too many classes for ordinary classification. Currently extracts features with fasttext and detects anomalies using PyNomaly.
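A minimal end-to-end sketch of that idea, assuming a trained fasttext model saved as `model.bin` and a few already-cleaned texts (the file name, texts, and variable names are placeholders, not part of this repository):

```python
import fasttext
import numpy as np
from PyNomaly import loop

# Load a trained fasttext model (the path "model.bin" is a placeholder).
ft = fasttext.load_model("model.bin")

# Extract one feature vector per cleaned text.
texts = ["cleaned text one", "cleaned text two", "odd outlier text"]
features = np.array([ft.get_sentence_vector(t) for t in texts])

# Score each element with Local Outlier Probability (LoOP) from PyNomaly.
scores = loop.LocalOutlierProbability(features, n_neighbors=2).fit().local_outlier_probabilities

# A higher probability means the element is more likely an anomaly.
for text, score in zip(texts, scores):
    print(f"{float(score):.3f}  {text}")
```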
Clean up the Korean text (a rough sketch follows the list below).
- Remove uncommon English words using a dictionary of 10,000 common words (google-10000-english)
- Split concatenated Korean and English words
- Remove special characters
- Split into morphemes with KoNLPy
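A rough sketch of these cleanup steps, assuming the word list is stored as `google-10000-english.txt` and using KoNLPy's Okt tagger (the file path, tagger choice, and helper name are assumptions):

```python
import re
from konlpy.tag import Okt

okt = Okt()

# Common-English dictionary from google-10000-english (file path is an assumption).
with open("google-10000-english.txt", encoding="utf-8") as f:
    common_words = {line.strip().lower() for line in f if line.strip()}

def clean(text):
    # Split concatenated Korean and English runs, e.g. "삼성laptop" -> "삼성 laptop".
    text = re.sub(r"([가-힣]+)([a-zA-Z]+)", r"\1 \2", text)
    text = re.sub(r"([a-zA-Z]+)([가-힣]+)", r"\1 \2", text)
    # Remove special characters, keeping Hangul, Latin letters, digits, and spaces.
    text = re.sub(r"[^가-힣a-zA-Z0-9\s]", " ", text)

    tokens = []
    for token in text.split():
        if re.fullmatch(r"[a-zA-Z]+", token):
            # Drop uncommon English words that are not in the 10,000-word dictionary.
            if token.lower() in common_words:
                tokens.append(token.lower())
        else:
            # Split Korean tokens into morphemes with KoNLPy.
            tokens.extend(okt.morphs(token))
    return " ".join(tokens)

print(clean("삼성laptop 노트북!! amazing qwertyzzz"))
```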
Find the least-anomalous and the supposed-anomaly (most-anomalous) elements from the given feature vectors.
For a specific class C, choose 1,000 elements from that class (positive sampling) and 25 elements from other classes (negative sampling). Find the 50 supposed-anomaly elements and count how many negative samples are among them.
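A sketch of this sampling-and-counting step, assuming `features` and `labels` are NumPy arrays aligned by index (the function name, seed, and default counts are illustrative):

```python
import numpy as np
from PyNomaly import loop

def count_detected_negatives(features, labels, target_class,
                             n_pos=1000, n_neg=25, n_top=50, seed=0):
    """Sample positives/negatives, score with LoOP, and count how many
    negative samples appear among the top-n_top supposed-anomaly elements."""
    rng = np.random.default_rng(seed)
    pos_idx = rng.choice(np.where(labels == target_class)[0], n_pos, replace=False)
    neg_idx = rng.choice(np.where(labels != target_class)[0], n_neg, replace=False)

    sample = np.concatenate([features[pos_idx], features[neg_idx]])
    is_negative = np.concatenate([np.zeros(n_pos, bool), np.ones(n_neg, bool)])

    probs = loop.LocalOutlierProbability(sample).fit().local_outlier_probabilities
    probs = np.asarray(probs, dtype=float)

    top = np.argsort(probs)[::-1][:n_top]   # most anomalous first
    return int(is_negative[top].sum())      # negatives found among the supposed anomalies
```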
Build models with different variants and find the best model using the metric below (a condensed sketch follows the list).
- Build a model for each combination of the following variants:
  - Model type
  - Feature dimension
  - Learning rate
  - Epochs
  - Whether a pretrained model is used
- Extract a feature vector for each element with each model
- Evaluate the metric for each model as follows:
  - For a specific class C, choose 1,000 elements from that class (positive sampling) and 50 elements from other classes (negative sampling).
  - Find the 100 supposed-anomaly elements and count how many negative samples are among them.
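A condensed sketch of the benchmark loop, assuming a training corpus file `corpus.txt`, the `texts`/`labels` arrays and the `count_detected_negatives` helper from the sketches above, and an illustrative variant grid; it also presumes that finding more negative samples among the supposed anomalies indicates a better model, since negative samples come from other classes:

```python
import itertools
import numpy as np
import fasttext

# Illustrative variant grid; the actual values are not listed in this README.
model_types = ["skipgram", "cbow"]
dims = [100, 300]
lrs = [0.05, 0.1]
epoch_counts = [5, 25]

results = {}
for model_type, dim, lr, epoch in itertools.product(model_types, dims, lrs, epoch_counts):
    # Train one variant; a pretrained model could be loaded with
    # fasttext.load_model(...) instead to cover the "pretrained or not" variant.
    model = fasttext.train_unsupervised("corpus.txt", model=model_type,
                                        dim=dim, lr=lr, epoch=epoch)

    # Extract a feature vector per element with this model.
    features = np.array([model.get_sentence_vector(t) for t in texts])

    # Metric: negatives found among the top-100 supposed anomalies
    # (1,000 positive / 50 negative samples), per the description above.
    results[(model_type, dim, lr, epoch)] = count_detected_negatives(
        features, labels, target_class="C", n_pos=1000, n_neg=50, n_top=100)

best = max(results, key=results.get)
print("best variant:", best, "negatives detected:", results[best])
```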
- Naver Shopping data (not publicly available)
- Python 3 with the following libraries:
  - fasttext and its pretrained models
  - google-10000-english