This repository contains experiments on language modeling for text classification.
- Prepare train/test/val
- Get TM results
- Implement QRNN
- Implement BiQRNN
- Implement simple LSTM
- Implement BiLSTM
- Add CNN model
- Add SVM and XGBoost tools
- Initial classification results
- Implement VAE
- Prepare new test/train/ver with extracted entities
- Find out why loss becomes NaN in QRNN (too high learning rate)
- Preprocess embedding data to zero mean and unit variance
- Check on models_dir/model_1542027692.h5 weights (w2v embedding)
- Add cross-validation

=======================================================================
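The "zero mean and unit variance" item above could be sketched roughly as follows. This is only an illustration: it assumes the embeddings are available as a list of equal-length float vectors and that the scaling is done per dimension, neither of which is confirmed by these notes. The name `standardize_embeddings` and the `eps` parameter are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_embeddings(vectors, eps=1e-8):
    """Scale every embedding dimension to zero mean and unit variance.

    Assumption: `vectors` is a non-empty list of equal-length float lists.
    `eps` (hypothetical) guards against division by zero for constant dimensions.
    """
    # Transpose rows into per-dimension columns.
    columns = list(zip(*vectors))
    means = [fmean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return [
        [(value - mean) / (std + eps) for value, mean, std in zip(row, means, stds)]
        for row in vectors
    ]


if __name__ == "__main__":
    toy = [[1.0, 10.0], [3.0, 30.0]]
    print(standardize_embeddings(toy))
```

In practice the statistics would be computed on the training vocabulary only and reused for validation and test data.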
- Prepare test/train/verif sets with less than 10 tokens
- Check processed comments (reprocess), and other sets
- Prepare model with fasttext embeddings (1 - without preprocessing)
- Prepare model with fasttext embeddings (2 - lemmatized)
- Reduce the dictionary and substitute rare words with oov (?)
- Change percentage of positive examples in training set (?)
- Tune model with hyperopt
- Check model_1542229255 on comments with more than 50 tokens
- Prepare report on language model
- Rewrite to normal pipeline
- Add representation in latent space from VAE

=======================================================================
- Try simple BiLSTM
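For the "reduce the dictionary and substitute rare words with oov" idea above, a minimal sketch might look like this. It assumes the texts are already tokenized into lists of strings; the `<oov>` placeholder, the `min_count` threshold, and the function name are hypothetical choices, not taken from this repository.

```python
from collections import Counter

OOV_TOKEN = "<oov>"  # hypothetical placeholder token


def replace_rare_tokens(tokenized_texts, min_count=2, oov_token=OOV_TOKEN):
    """Keep tokens seen at least `min_count` times; map the rest to `oov_token`."""
    # Count token frequencies over the whole corpus.
    counts = Counter(token for text in tokenized_texts for token in text)
    vocabulary = {token for token, count in counts.items() if count >= min_count}
    replaced = [
        [token if token in vocabulary else oov_token for token in text]
        for text in tokenized_texts
    ]
    return replaced, vocabulary


if __name__ == "__main__":
    texts = [["good", "hotel"], ["good", "food"]]
    print(replace_rare_tokens(texts))
```

The vocabulary should be built from the training set only, so that rare-word statistics do not leak from the test or verification sets.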
- Find out what's happening inside of neural network (LIME)
- Prepare ELMo embeddings on raw texts (in progress)
- Add context (?)

More data
- Divide into chunks
- Add TripAdvisor processed comments to train/test
- Introduce new train/test/ver v5

Cleaner data
- Fix dirty data issue (html, phone instead of id)
- Create set with source labels (TA, PS_pos, PS_neg, OR_pos, OR_neg, OT)
- Create synonyms replacement
- BiQRNN fasttext/w2v (time, results, loss plot)
- BiLSTM fasttext/w2v (time, results, loss plot)
- VAE fasttext/w2v (time, results, loss plot; latent space clustering if possible; example transition from a negative to a positive comment)
- Optimization with hyperopt
- Experiment with ELMo embeddings (not sure how yet (? put directly to input without )) pretrained/fine-tuned (if possible)
- Check how fasttext performs on unlemmatized input for the best-performing models above (prepare this input)
- Try hierarchical attention network
- Create new verification set with 500 samples (50 positive, the rest negative)
- Look at how text length relates to correct/incorrect classification results
- Change fit to fit_generator + add batch generation
- Calculate averaged word embeddings
- Perform cross-validation for XGBoost and SVM
- Obtain results for short and long texts
- Most of the examples in the test set (15 samples from a VK group with negative comments) were misclassified due to the ORG tag
- Pretrained LM for Russian:

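The "averaged word embeddings" item in the list above can be illustrated with a short sketch. It assumes a plain dict mapping tokens to equal-length float vectors (for example, exported from a w2v or fasttext model); the function name and the zero-vector fallback for texts without known tokens are hypothetical choices.

```python
def average_embedding(tokens, embeddings, dim):
    """Return the element-wise mean of the vectors of all known tokens.

    Assumptions: `embeddings` maps token -> list of `dim` floats;
    unknown tokens are skipped; a zero vector is returned if none are known.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates over dimensions, so each column is averaged separately.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    toy_embeddings = {"good": [1.0, 2.0], "hotel": [3.0, 4.0]}
    print(average_embedding(["good", "hotel", "unknown"], toy_embeddings, dim=2))
```

Fixed-size vectors like these could then serve as input features for the SVM and XGBoost baselines.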