oseiskar/autosubsync

Error during cross-validation encountering "empty slice"

mdkberry opened this issue · 3 comments

I am getting the following error when running train_and_test adapted to a Python script, on Windows with conda. The training step seems to work, but it runs into errors later, in the cross-validation part:

(whisper) C:\Users\admin\Documents\Python\Whisper>python train_and_test.py
{'video': 'C:/Users/admin/Documents/Python/Whisper/training/example_films/TheIceman2012.mp4', 'subtitles': 'C:/Users/admin/Documents/Python/Whisper/training/example_films/TheIceman2012.srt', 'language': 'en'}
computing features
file 1
training data extracted, shape (126548, 50)
training...
LogisticRegression(C=0.001, penalty='l1', solver='liblinear')
finding sync bias
-0.2
training accuracy 0.7273287606283781
bias 0.2 s
serializing model to trained-model.bin
loaded training features of size (126548, 50)
Cross-validation fold 1/4
[]
Training... (0, 50)
C:\Users\admin\.conda\envs\whisper\lib\site-packages\numpy\core\fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
C:\Users\admin\.conda\envs\whisper\lib\site-packages\numpy\core\_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "C:\Users\admin\Documents\Python\Whisper\training\cross_validate.py", line 96, in <module>
    trained_model = model.train(train_x, train_meta.label, train_meta, verbose=True)
  File "C:\Users\admin\.conda\envs\whisper\lib\site-packages\autosubsync\model.py", line 56, in train
    speech_detection.fit(training_x_normalized, training_y, sample_weight=training_weights)
  File "C:\Users\admin\.conda\envs\whisper\lib\site-packages\sklearn\linear_model\_logistic.py", line 1196, in fit
    X, y = self._validate_data(
  File "C:\Users\admin\.conda\envs\whisper\lib\site-packages\sklearn\base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\admin\.conda\envs\whisper\lib\site-packages\sklearn\utils\validation.py", line 1106, in check_X_y
    X = check_array(
  File "C:\Users\admin\.conda\envs\whisper\lib\site-packages\sklearn\utils\validation.py", line 931, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 250)) while a minimum of 1 is required by LogisticRegression.
Traceback (most recent call last):
  File "C:\Users\admin\Documents\Python\Whisper\train_and_test.py", line 11, in <module>
    subprocess.run([python_executable, cross_validate_script], check=True)
  File "C:\Users\admin\.conda\envs\whisper\lib\subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', 'training/cross_validate.py']' returned non-zero exit status 1.
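For reference, my train_and_test.py is just a small wrapper that runs the training and cross-validation scripts via subprocess, roughly like this (a reconstruction; apart from training/cross_validate.py, the script paths are my own local choices):

```python
# Rough reconstruction of my wrapper (train_and_test.py), replacing the repo's
# shell script on Windows. Apart from training/cross_validate.py, the paths
# below are my own local assumptions.
import subprocess

python_executable = 'python'                          # the conda env's python
train_script = 'training/train.py'                    # assumed/local path
cross_validate_script = 'training/cross_validate.py'

subprocess.run([python_executable, train_script], check=True)
subprocess.run([python_executable, cross_validate_script], check=True)
```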

Is there a file size or length limit for the videos used for training the models, or maybe characters in the .srt file it doesn't like? I was using a 1 hr 45 min movie.

Info on the mp4 file:
[screenshot: mp4 details]

I noticed there is some color and font information in the SRT; maybe that is an issue? Here is the first section from the .srt:


1
00:00:22,040 --> 00:00:23,234
<font color="#A9F5F2">[PRISON DOOR OPENS]</font>

2
00:00:27,640 --> 00:00:30,871
<font color="#A9F5F2">[PRISON DOOR CLOSES, BUZZES]</font>

3
00:00:37,880 --> 00:00:41,759
<font color="#A9F5F2">[DOOR OPENS, CLOSES]</font>

4
00:00:43,680 --> 00:00:45,671
<font color="#A9F5F2">[MALE VOICE]</font>
Mr. Kuklinski.

5
00:00:47,840 --> 00:00:51,037
Do you have any regrets
for the things you've done?
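If the tags turn out to be the problem, I could strip them from the .srt before training with something like this (just a sketch; 'input.srt' and 'clean.srt' are placeholder file names):

```python
# Sketch: strip <font ...> / </font> tags from an SRT before using it for training.
# File names are placeholders.
import re

with open('input.srt', encoding='utf-8-sig') as f:
    text = f.read()

clean = re.sub(r'</?font[^>]*>', '', text)  # remove <font color="..."> and </font>

with open('clean.srt', 'w', encoding='utf-8') as f:
    f.write(clean)
```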

A quick comment on this too. There should be no time limit on the length of the movie file, but it is definitely possible that some SRT variants do not work if they have non-standard tokens. However, for the empty slice error, I would first suspect some rather simple logical error related to cross-validation. Do you have enough files in the list to do that?
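To illustrate the kind of thing I mean (this is not the actual cross_validate.py code, just a minimal sketch): if the folds are taken over the list of training files, a 4-fold split over a single file leaves an empty training set in at least one fold, which is exactly what produces the "Mean of empty slice" warning and the "0 sample(s)" error from LogisticRegression.

```python
# Minimal illustration (not the real cross_validate.py logic): splitting 4 folds
# over the list of training files leaves nothing to train on when the list has
# only one file.
files = ['TheIceman2012.mp4']   # only one training file
n_folds = 4

for fold in range(n_folds):
    test_files = files[fold::n_folds]                       # held-out files for this fold
    train_files = [f for f in files if f not in test_files]
    print('fold %d/%d train: %r test: %r' % (fold + 1, n_folds, train_files, test_files))
    # fold 1/4 -> train: []  => a (0, N) feature matrix => "Mean of empty slice"
    # and "Found array with 0 sample(s)" when LogisticRegression.fit is called
```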

Also a related generic comment: if your main intention is not customizing the package, then it should not be necessary to retrain/refit the model by running train_and_test.py. You can just use the original binary model file in the PyPI package (see #11).
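That is, after installing the package from PyPI, something like the following should be enough (file names here are just placeholders):

```python
# Using the pre-trained model bundled with the PyPI package; no retraining needed.
# File names are placeholders.
import autosubsync

autosubsync.synchronize('movie.mp4', 'movie.srt', 'movie_synced.srt')
```

or, equivalently, the command-line tool: `autosubsync movie.mp4 movie.srt movie_synced.srt`.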

if your main intention is not customizing the package, then it should not be necessary to retrain/refit the model

That would be preferable.

But with my attempt to run train_and_test.py, I noticed it created some files in the '/training/data' folder that are not present in the trained model from the .whl linked in #11.

[screenshot: created files]

But with my attempt to run train_and_test.py, I noticed it created some files in the '/training/data' folder that are not present in the trained model from the .whl linked in #11.

Those files are only relevant during training; they are not needed to use the final trained model.