georgymh/decentralized-ml

Fix train-test partitioning on data iterators

georgymh opened this issue · 0 comments

The data iterator in data/iterators.py currently partitions the dataset in a naive way: it reverses the list of .csv's inside the dataset folder, takes the top `floor(max_count / batch_size) * batch_size` datapoints of *each* .csv, and yields them in batches.
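For illustration, here is a rough sketch of the behavior described above (the names and structure are hypothetical, not the actual code in data/iterators.py):

```python
import os
import pandas as pd

def naive_iterator(dataset_path, batch_size, max_count):
    csv_files = sorted(f for f in os.listdir(dataset_path) if f.endswith('.csv'))
    for csv_file in reversed(csv_files):  # reverses the list of .csv's
        df = pd.read_csv(os.path.join(dataset_path, csv_file))
        # takes floor(max_count / batch_size) * batch_size rows of EACH file,
        # so up to that many datapoints are yielded per .csv, not in total
        limit = (max_count // batch_size) * batch_size
        rows = df.iloc[:limit]
        for start in range(0, len(rows), batch_size):
            yield rows.iloc[start:start + batch_size]
```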

This is incorrect for two reasons. First, we should yield a total of max_count datapoints across *all* .csv's, not per file. Second, the training and testing sets currently overlap; they should form a partition over the entire dataset (with the exception of the occasional datapoints that can't fill a batch of size batch_size). The current behavior also causes the following error on validation jobs with batch_size > 1: `ValueError: array split does not result in an equal division`.
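A minimal sketch of one possible fix, assuming the dataset fits in memory (the `partitioned_iterators` helper and `split` parameter are hypothetical, not the project's actual API): cap the total at max_count, split into disjoint train/test sets before batching, and drop the trailing remainder so every batch is full:

```python
import os
import pandas as pd

def partitioned_iterators(dataset_path, batch_size, max_count, split=0.8):
    csv_files = sorted(f for f in os.listdir(dataset_path) if f.endswith('.csv'))
    frames = [pd.read_csv(os.path.join(dataset_path, f)) for f in csv_files]
    # cap the TOTAL number of datapoints across all .csv's at max_count
    data = pd.concat(frames, ignore_index=True).iloc[:max_count]

    # disjoint train/test split: together they partition the capped dataset
    cutoff = int(len(data) * split)
    train, test = data.iloc[:cutoff], data.iloc[cutoff:]

    def batches(df):
        # drop the trailing remainder that can't fill a full batch, which also
        # avoids "ValueError: array split does not result in an equal division"
        usable = (len(df) // batch_size) * batch_size
        for start in range(0, usable, batch_size):
            yield df.iloc[start:start + batch_size]

    return batches(train), batches(test)
```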

Also, count_datapoints() should take into account that every .csv is required to have a header line (i.e., each per-file count could start at -1), and this requirement should be enforced somewhere in the Dataset Manager (possibly as a separate issue/PR).
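A hedged sketch of a header-aware count_datapoints(), assuming it counts datapoints across all .csv's in a folder (the real signature may differ):

```python
import os

def count_datapoints(dataset_path):
    total = 0
    for f in os.listdir(dataset_path):
        if not f.endswith('.csv'):
            continue
        with open(os.path.join(dataset_path, f)) as fp:
            # start at -1 so the mandatory header line isn't counted as data
            count = -1
            for _ in fp:
                count += 1
            total += max(count, 0)  # guard against empty files
    return total
```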