JuliaStats/KernelDensity.jl

Sensible Default bin size

Closed this issue · 8 comments

Currently the kde methods require the user to provide either the number of bins or the midpoints, or else they default to 2048 bins. This can be a problem for small datasets. It seems like it would be nice to have a sensible default like the one in http://stats.stackexchange.com/questions/798/calculating-optimal-number-of-bins-in-a-histogram or some other rule of thumb.

```julia
using StatsBase  # provides iqr

bin_size = 2 * iqr(data) * length(data)^(-1/3)
midpoints = minimum(data):bin_size:maximum(data)
```
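For anyone who wants to try the rule of thumb on concrete numbers, here is a minimal sketch that uses only the `Statistics` standard library. The helper name `fd_bin_width` and the sample data are made up for illustration and are not part of KernelDensity.jl.

```julia
using Statistics  # standard library: provides quantile

# Freedman-Diaconis bin width: 2 * IQR * n^(-1/3).
# `fd_bin_width` is a hypothetical helper name, not a KernelDensity.jl function.
function fd_bin_width(data::AbstractVector{<:Real})
    q25, q75 = quantile(data, [0.25, 0.75])  # lower and upper quartiles
    return 2 * (q75 - q25) * length(data)^(-1/3)
end

data = collect(1.0:100.0)          # illustrative sample, not real data
bin_size = fd_bin_width(data)

# Midpoints run from the smallest to the largest observation.
midpoints = minimum(data):bin_size:maximum(data)
```

With only 100 observations this gives a handful of midpoints, far fewer than the current default of 2048.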

That seems like a good idea to me. Would you be interested in putting together a PR for this?

Yep, I'm already working on it.

Note that the choice of the number of bins here should be different from the choice for a histogram.

In a histogram, you choose the number of bins as a method of avoiding overfitting (i.e. regularization).

For a KDE, the number of bins just affects the numerical resolution of the resulting function, so you want to choose as many as your computational budget allows (up to the resolution of your screen, or whatever needs you have). Ideally it should also be a power of 2 to gain the most advantage from the FFTs. The regularization is handled by the kernel function.
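As a rough sketch of that point: pick the resolution you actually need, then round it up to a power of 2. The helpers `fft_friendly_npoints` and `grid_step` below are hypothetical names used for illustration. They are not part of KernelDensity.jl, and the interval `[-4, 4]` is an arbitrary example.

```julia
# `fft_friendly_npoints` is a hypothetical helper, not a KernelDensity.jl function.
# It rounds a desired grid resolution up to the next power of two.
fft_friendly_npoints(desired::Integer) = nextpow(2, desired)

# Spacing between adjacent evaluation points on [lo, hi] with n points.
grid_step(lo::Real, hi::Real, n::Integer) = (hi - lo) / (n - 1)

npoints = fft_friendly_npoints(1500)   # rounds up to a power of two
dx = grid_step(-4.0, 4.0, npoints)     # resolution of the evaluated density
```

The resulting count only sets how finely the density is sampled. It would presumably be passed to `kde` through its `npoints` keyword, assuming the installed version exposes that keyword, while the amount of smoothing stays with the kernel bandwidth.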

The 2048 was admittedly a pretty arbitrary pick, based on scaling up R's choice (512) by a bit.

Thanks for the comment, I hadn't noticed that. This ticket seems rather pointless, then. Unless there is another reason to do it?

We could implement a different, more data-aware default than 2048. Perhaps there's some literature that recommends something along those lines for kernel density estimation rather than histograms?

I think Simon is right, it doesn't seem to make any difference for the resulting density other than sampling.

> I think Simon is right

Agreed. After all, when isn't he right? 😄

Whenever he is talking to his wife/girlfriend ;)