Usage:
-
Create a data folder, and put your data file "text8.bin", "train.tsv" in the folder.
-
Examples:
python experiment.py -h
usage: experiment.py [-h] [--cv CV] [--play_ratio PLAY_RATIO]
[--data_folder DATA_FOLDER] [--train_ratio TRAIN_RATIO]
[--quiet] [--pooling POOLING]
Experiments on sentiment analysis of kaggle movie reviews
optional arguments:
-h, --help show this help message and exit
--cv CV cross-validation folder number (default=2)
--play_ratio PLAY_RATIO
how much of training data will be used (default=1.0).
Range: (0.,1.]
--data_folder DATA_FOLDER
folder of word2vec model/corpus, train/test data
(default="../data")
--train_ratio TRAIN_RATIO
training ratio. For example, 0.3, 0.5, 0.7
--quiet don't show verbose info
--pooling POOLING how to aggregate a set of word vectors into one vector
(default="max") // choices: max, min, sum, mean
python experiment.py --play_ratio 0.01 --cv 3
use only 1% data (i.e., 1560 phrases out of 156,000) for experiment
python experiment.py --cv 3 --pooling min
use min-pooling to aggregate vectors of all words in a phrase
python experiment.py --data_folder ../data --play_ratio 0.01
parameters:
data_folder = ../data
play_ratio = 0.01
quiet = False
train_ratio = None
pooling = max
cv = 2
time elapsed after read training data: 0:00:00.229506
dataset size of your experiment: 1560
time elapsed after loading word2vec pretrained model: 0:00:04.652768
number of word2vec word used in movie reviews: 504
phrases without any word in word2vec_vocab: 59
time elapsed after creating word2vec matrix: 0:00:05.458077
score: 0.671795
[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 1.1s
score: 0.669231
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 1.1s finished
scores: [ 0.67179487 0.66923077]
time elapsed after finishing CV validation: 0:00:06.721579