Ghadjeres/DeepBach

Empty training split for small dataset - incorrect rounding

Closed this issue · 0 comments

When only one valid chorale is read from a directory of MIDI files, loading the pickled dataset fails with:

DeepBach/data_utils.py in generator_from_raw_dataset(batch_size, timesteps, voice_index, phase, percentage_train, pickled_dataset, transpose)
    365 
    366     while True:
--> 367         chorale_index = np.random.choice(chorale_indices)
    368         extended_chorale = np.transpose(X[chorale_index])
    369         chorale_metas = X_metadatas[chorale_index]

mtrand.pyx in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:17200)()

ValueError: a must be non-empty

The problem is that the training split size calculation is incorrect.

int(len(X) * percentage_train) with len(X) == 1 and percentage_train == 0.8 evaluates to 0.8, which int() truncates to 0. The training split is then empty, and np.random.choice([]) fails.
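A minimal sketch reproducing the failure (X here is just a stand-in for the loaded chorale list, not the repository's actual data):

```python
import numpy as np

# Stand-in for a dataset containing a single chorale.
X = ["chorale_0"]
percentage_train = 0.8

# The faulty calculation: int() truncates 0.8 down to 0.
training_size = int(len(X) * percentage_train)
print(training_size)  # 0 -> empty training split

chorale_indices = np.arange(training_size)  # empty array
try:
    np.random.choice(chorale_indices)
except ValueError as e:
    print(e)  # a must be non-empty
```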

A more correct approach is to use round():

training_size = int(round(len(X) * percentage_train))

Still, with a dataset of size 1 the test split would be empty. So, e.g. for percentage_train == 0.8, the minimum dataset size that yields both a non-empty training split and a non-empty test split is 3.
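To illustrate the split sizes produced by the proposed fix (the helper name split_sizes is illustrative, not from the repository):

```python
def split_sizes(n, percentage_train=0.8):
    """Training/test split sizes using the proposed round() fix."""
    training_size = int(round(n * percentage_train))
    return training_size, n - training_size

# With percentage_train == 0.8:
print(split_sizes(1))  # (1, 0) - test split still empty
print(split_sizes(2))  # (2, 0) - test split still empty
print(split_sizes(3))  # (2, 1) - smallest n with both splits non-empty
```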