scicloj/tablecloth

split creates folds of different sizes

Closed this issue · 9 comments

I noticed that it creates folds where the last pair has a different size than the rest.
Is this expected?

(def splits
  (-> (tc/dataset {:x (range 37)})
      (tc/split->seq :kfold {:k 5})))

{:train-sizes (map #(tc/row-count (:train %)) splits)
 :test-sizes  (map #(tc/row-count (:test %)) splits)}
;; => {:train-sizes (29 29 29 29 32), :test-sizes (8 8 8 8 5)}

Yes, it's expected. k-fold doesn't shuffle or resample; it simply cuts the dataset into chunks (using partition-all).

Yes, but why do the chunks have different sizes?
The last fold has 32 rows in train and 5 in test,
while all other folds have 29 train + 8 test.
(The sum is 37, so it is indeed constant, as it should be.)

Looking at this:

[image]

it seems to say that all folds are always of the same size.

There is no way to have all folds always of the same size: 37 is not divisible by 5, so the result depends on how rounding is done.
In our case the chunk size 37/5 is effectively rounded up to 8, and partitioning produces these fold sizes:

(map count (partition-all (/ 37 5) (range 37)))
;; => (8 8 8 8 5)

So this is the main reason for what you observe.
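To connect these chunk sizes with the train/test sizes reported at the top of the thread, here is a minimal sketch. `kfold-sizes` is a hypothetical helper written for illustration, not tablecloth's implementation; it assumes each chunk serves as the test set once, the remaining rows form the train set, and the chunk size is n/k rounded up.

```clojure
;; Hypothetical helper for illustration only, not part of tablecloth.
(defn kfold-sizes
  "Sizes of the train/test pairs when n rows are cut, without shuffling,
  into consecutive chunks of ceil(n/k) rows."
  [k n]
  (let [chunk-size (quot (+ n (dec k)) k) ; ceil(n/k) using integer arithmetic
        test-sizes (map count (partition-all chunk-size (range n)))]
    {:train-sizes (map #(- n %) test-sizes) ; train = all rows outside the test chunk
     :test-sizes  test-sizes}))

(kfold-sizes 5 37)
;; => {:train-sizes (29 29 29 29 32), :test-sizes (8 8 8 8 5)}
```

Only the last chunk absorbs the remainder, which is why a single fold differs from the others.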

I think this is most visible with very, very small datasets. For bigger ones the difference is minor.

(map count (partition-all (/ 1037 5) (range 1037)))
;; => (208 208 208 208 205)

What we can aim for is a better balance, so instead of (8 8 8 8 5) we may want (7 8 7 8 7). Any idea how to produce such a balanced split?

I would expect "same size"
(so different splits, but the same size):

{:train-sizes (29 29 29 29 29), :test-sizes (8 8 8 8 8)}

Isn't this even the default (and simplest) way to do it?

But now I see that this is not really possible.
If the number of rows is not divisible by k, then we cannot have this.

I need to re-think this, maybe there is no issue at all.

I understand your explanation now, and you are right. We can close it.

I found a function which does the partition-all in a balanced way:

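(defn piles
  "Splits xs into m chunks whose sizes differ by at most one
  (fewer chunks if xs has fewer than m elements)."
  [m xs]
  (let [cnt (count xs)
        l   (quot cnt m)      ; base chunk size
        r   (rem cnt m)       ; number of chunks that get one extra element
        k   (* (inc l) r)]    ; elements consumed by the larger chunks
    (concat
      (partition-all (inc l) (take k xs))
      (partition-all l (drop k xs)))))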

It might be nice to use it to make more balanced folds.
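As a quick check of what piles yields, the chunk sizes can be worked out directly from quot and rem. `pile-sizes` below is a hypothetical helper for illustration, not part of tablecloth, and it assumes cnt is at least m.

```clojure
;; Hypothetical helper for illustration only, not part of tablecloth.
(defn pile-sizes
  "Chunk sizes that piles produces for cnt elements split into m chunks
  (assuming cnt >= m): r chunks of size l+1, then m-r chunks of size l."
  [m cnt]
  (let [l (quot cnt m)
        r (rem cnt m)]
    (concat (repeat r (inc l))
            (repeat (- m r) l))))

(pile-sizes 5 37)
;; => (8 8 7 7 7)

(pile-sizes 5 1037)
;; => (208 208 207 207 207)
```

The sizes differ by at most one. The larger chunks come first, so the result is (8 8 7 7 7) rather than the interleaved (7 8 7 8 7) mentioned earlier, but the balance is the same.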

Thanks! I've included the piles code above in 7.000-beta-27.