Fix train-test partitioning on data iterators
georgymh opened this issue · 0 comments
The data iterator in `data/iterators.py` currently partitions the dataset in a naive way: it reverses the list of `.csv`'s inside the dataset folder, takes the top `floor(max_count / batch_size) * batch_size` datapoints of each `.csv`, and yields them in batches.

This is incorrect for two reasons: we instead need to yield a total of `max_count` datapoints across *all* `.csv`'s, and the training and testing sets currently overlap (they should form a partition over the entire dataset, with the exception of the occasional datapoints that can't fill a batch of size `batch_size`). The current behavior also causes the following error on validation jobs with `batch_size > 1`: `ValueError: array split does not result in an equal division`.
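A minimal sketch of the intended behavior, with hypothetical names (`partition_dataset`, `iterate_batches` are illustrative, not the actual functions in `data/iterators.py`): the train and test sets are disjoint slices of the data, each trimmed to a multiple of `batch_size` so that a later equal-sized split cannot fail.

```python
def partition_dataset(datapoints, split=0.8, batch_size=4):
    """Split datapoints into disjoint train/test sets, each trimmed so
    its length is a multiple of batch_size (leftover points are dropped)."""
    cut = int(len(datapoints) * split)
    train, test = datapoints[:cut], datapoints[cut:]

    def trim(xs):
        return xs[: (len(xs) // batch_size) * batch_size]

    return trim(train), trim(test)


def iterate_batches(datapoints, batch_size=4):
    """Yield consecutive batches of exactly batch_size datapoints."""
    for i in range(0, len(datapoints), batch_size):
        yield datapoints[i : i + batch_size]
```

Because each set's length is a multiple of `batch_size` and the slices never overlap, every yielded batch is full and the train/test sets form a partition (minus the trimmed remainder).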
Additionally, `count_datapoints()` should take into account the requirement that every `.csv` has a header line (i.e., `count` could start at -1), and this requirement should be enforced somewhere in the Dataset Manager (could be a separate issue/PR).