split creates folds of different sizes
I noticed that it creates folds where the last train/test pair has a different size than the rest.
Is this expected?
;; assumes the tablecloth API is aliased as tc
(require '[tablecloth.api :as tc])

(def splits
  (-> (tc/dataset {:x (range 37)})
      (tc/split->seq :kfold {:k 5})))

{:train-sizes (map #(tc/row-count (:train %)) splits)
 :test-sizes  (map #(tc/row-count (:test %)) splits)}
;; => {:train-sizes (29 29 29 29 32), :test-sizes (8 8 8 8 5)}
Yes, it's expected. k-fold doesn't shuffle or resample; it simply cuts the dataset into chunks (using partition-all).
There is no way to always have folds of the same size: 37 is not divisible by 5, so the result depends on how rounding is done.
In our case, partitioning produces these fold sizes:
(map count (partition-all (/ 37 5) (range 37)))
;; => (8 8 8 8 5)
This is the main reason for what you observe.
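Judging from the output above, the ratio 37/5 acts like a chunk size of 8, i.e. the division is rounded up. A minimal sketch that makes this rounding explicit; `ceil-div` is a hypothetical helper introduced only for illustration, not part of the library:

```clojure
;; Hypothetical helper: integer ceiling of n / k,
;; i.e. the chunk size the partitioning effectively uses.
(defn ceil-div [n k]
  (quot (+ n k -1) k))

(ceil-div 37 5)
;; => 8

;; Four full chunks of 8 leave only 5 rows for the last one.
(map count (partition-all (ceil-div 37 5) (range 37)))
;; => (8 8 8 8 5)
```

With the chunk size rounded up, the whole remainder deficit lands on the last fold, which is why it is the only one that differs.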
I think this is most visible with very, very small datasets. For bigger ones the difference is minor.
(map count (partition-all (/ 1037 5) (range 1037)))
;; => (208 208 208 208 205)
What we can aim for is a better balance: instead of (8 8 8 8 5) we may want (7 8 7 8 7). Any idea how to produce such a balanced split?
I would expect folds of the "same size"
(so different splits, but the same size):
{:train-sizes (29 29 29 29 29), :test-sizes (8 8 8 8 8)}
Isn't this the default (and the simplest) way to do it?
But now I see that this is not really possible.
If the number of rows is not divisible by k, we cannot have this.
I need to rethink this; maybe there is no issue at all.
I now understand your explanation, and you are right. We can close it.
I found a function which does the partition-all in a balanced way:
(defn piles [m xs]
  (let [cnt (count xs)
        l (quot cnt m)
        r (rem cnt m)
        k (* (inc l) r)]
    (concat
     (partition-all (inc l) (take k xs))
     (partition-all l (drop k xs)))))
It might be nice to use it to make more balanced folds.
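To see what such a balanced split should look like, the fold sizes can be computed directly from the quotient and remainder, following the same arithmetic as the function above. This is only a sketch; `balanced-sizes` is a hypothetical name used for illustration:

```clojure
;; Hypothetical helper: fold sizes for n rows split into k balanced folds.
;; The first r folds get one extra row, so sizes differ by at most one.
(defn balanced-sizes [n k]
  (let [l (quot n k)   ; base fold size
        r (rem n k)]   ; number of folds that receive an extra row
    (concat (repeat r (inc l))
            (repeat (- k r) l))))

(balanced-sizes 37 5)
;; => (8 8 7 7 7)

(balanced-sizes 1037 5)
;; => (208 208 207 207 207)
```

For 37 rows and 5 folds this gives two folds of 8 and three of 7, compared with (8 8 8 8 5) from the plain partitioning.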
Thanks! Included the above code in 7.000-beta-27.