higerra/ridi_imu

Training Time and Dataset Split

harsh-agarwal opened this issue · 4 comments

I am using an Intel i7-4790 with 24 GB of RAM to train the network. Here is a screenshot of two terminals, one displaying the training window and the other showing how long it has been running.
[screenshot from 2018-08-02 16-01-14: training window and elapsed running time]
I would be glad if you could tell me how long it took you to train the network :)

Another problem is the dataset split. You have a list of the training data, but during training the script splits the dataset so that just one data point goes to the test set and the rest goes to training. That's a bit confusing. Could you please shed some more light on this?

Hi, Harsh,

The model being trained is technically a Support Vector Machine, or more concretely, a cascaded set of support vector machines. Training such models on a large training set is expected to be slow. The screenshot shows that it is still searching for the best hyper-parameters with the grid search (essentially an exhaustive search). The easiest way is to bypass this grid search by providing the hyper-parameters directly. Copy the following text into an "option.txt" file and pass the path to this file to the "--option" argument (an example invocation follows the option block).

v1.0
4 2
SVM RBF 1 0.000833 10.000000 0.010000 10000
SVR RBF 1 0.000833 1.000000 0.001000 10000
SVR RBF 1 0.000833 1.000000 0.001000 10000
SVR RBF 1 0.000833 10.000000 0.001000 10000
SVR RBF 1 0.000833 1.000000 0.010000 10000
SVR RBF 1 0.000833 1.000000 0.001000 10000
SVR RBF 1 0.000833 1.000000 0.001000 10000
SVR RBF 1 0.000833 10.000000 0.001000 10000
SVR RBF 1 0.000833 10.000000 0.001000 10000
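For example, assuming the option file above is saved as "option.txt" in the current directory (the training-list path is just a placeholder, not a real path), the invocation would look like:

  python regression_cascade.py --list <path_to_train_list> --option option.txt

With the hyper-parameters supplied up front, the script skips the grid search entirely.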

Another way is to subsample the dataset during the grid search by passing a larger number to the "--subsample" argument. For example, with --subsample=3, only one third of the dataset will be used. The typical workflow I used is (a concrete sketch follows the list):

  1. Run regression_cascade.py with "--subsample 3". This will execute the grid search and write an "option.txt" file into the output model directory.
  2. Re-run regression_cascade.py with "--subsample 1 --option <path to the option.txt from step 1>". This will use all the data for training. Since the hyper-parameters are provided directly, no grid search is needed.
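As a concrete sketch of this two-step workflow (the paths are placeholders, and I am omitting whatever output-directory argument the script takes):

  # Step 1: grid search on one third of the samples; writes option.txt into the model directory
  python regression_cascade.py --list <path_to_train_list> --subsample 3

  # Step 2: train on all samples, reusing the hyper-parameters found in step 1
  python regression_cascade.py --list <path_to_train_list> --subsample 1 --option <model_dir>/option.txt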

Hope these help.

As for the data splitting, that's probably legacy code that I didn't use. There are two ways to split a dataset for training/testing: at the sample level and at the sequence level. The "--train_ratio" argument is used for sample-level splitting. But for our problem, sequence-level splitting makes more sense, because each sequence generates a lot of overlapping samples, so a sample-level split can place near-identical samples in both the training and test sets. The numbers under the two splitting strategies are similar, though.

Anyway, you can pass something like 0.75 to "--train_ratio", and the script will use 75% of the samples for training and the remaining 25% for testing.
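To make the difference between the two strategies concrete, here is a minimal sketch in Python (not code from this repository; the sample and sequence structures are hypothetical) of sample-level versus sequence-level splitting:

  import random

  def split_samples(samples, train_ratio=0.75):
      # Sample-level split: shuffle all samples together. Overlapping
      # samples cut from the same sequence can land on both sides,
      # leaking near-duplicate data from training into testing.
      shuffled = list(samples)
      random.shuffle(shuffled)
      cut = int(len(shuffled) * train_ratio)
      return shuffled[:cut], shuffled[cut:]

  def split_sequences(sequences, train_ratio=0.75):
      # Sequence-level split: assign whole sequences to one side, so no
      # test sample overlaps a training sample from the same recording.
      shuffled = list(sequences)
      random.shuffle(shuffled)
      cut = int(len(shuffled) * train_ratio)
      train = [s for seq in shuffled[:cut] for s in seq]
      test = [s for seq in shuffled[cut:] for s in seq]
      return train, test

Both functions hold out roughly 25% of the data, but only the sequence-level version guarantees that overlapping windows from one recording never straddle the train/test boundary.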

Thank you so much for your quick response :)

I will make the changes and get back to you on this.

Please help me run this code and check whether my command is correct:

python regression_cascade.py --list .../gyroscope/ridi_data_publish_v2/data_publish_v2////data.csv .....???

Thank you @higerra , @harsh-agarwal ...