Distinguishing optimal images
Is your feature request related to a problem or opportunity? Please describe.
Even with reasonably large dartboard cells, the loss profiles in a cross-val loop remain very similar (see e.g. #214). I'm not sure whether the fact that many regularizer/strength combinations fit the observed data almost equally well means the model simply needs more data to distinguish between optimizations using different regularizer strengths, or even different regularizers. Maybe the predictive accuracy of the model for unobserved points scales strongly with the number of points?
Describe the solution you'd like
I'm not sure how much of a change to the modeling approach this would require. Is there a better way than CV to distinguish between models?
Describe alternatives you've considered
Within the CV framework, I could test how much of a difference it makes to:
- decrease the number of k-folds
- increase the dartboard cell size
- compare cross-val results for the same dataset when different fractions of it are withheld entirely from the modeling process, to see how strongly the inference scales with the number of points (see the sketch after this list)
- other ways to distinguish better between cross-vals with different regularizers/strengths? I'm sure there are standard ML ways of doing this...
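For the third bullet, here's roughly what I have in mind (just a sketch, not the actual pipeline; `u`, `v`, `data`, `weight` are placeholders for whatever the loader actually returns):

```python
# Sketch: withhold a random fraction of the visibilities entirely before
# building the CV folds, to test how strongly the inference scales with the
# number of points. `u`, `v`, `data`, `weight` are placeholder arrays.
import numpy as np

rng = np.random.default_rng(42)

def withhold_fraction(u, v, data, weight, frac=0.2):
    """Randomly drop `frac` of the visibilities before any train/validation split."""
    n = len(data)
    keep = rng.permutation(n)[: int((1 - frac) * n)]
    return u[keep], v[keep], data[keep], weight[keep]

# e.g. rerun the existing cross-val loop on 80%, 60%, 40% of the data:
# for frac in [0.2, 0.4, 0.6]:
#     u_k, v_k, d_k, w_k = withhold_fraction(u, v, data, weight, frac)
#     ... build dartboard folds and run the usual loop ...
```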
One useful comparison point would be doing just a single fold, i.e., a normal train/validation split. Do we get meaningfully better constraints on the best hyperparameters with K=5, 10, etc.? Enough has changed since we first brought in K-fold that it would be good to (re)establish baselines for how well or poorly we are doing.
For example, on a single loss plot, could we chart the train and validation behavior for a few different regularizer strengths? This would be a nice way to show whether we can easily distinguish between good and bad choices.
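Something like this is the single-panel diagnostic I mean (a rough sketch; `histories` is a hypothetical dict mapping lambda to `(train_losses, val_losses)` lists collected during the existing loop):

```python
# Sketch of the diagnostic plot: train (solid) and validation (dashed) loss
# vs. iteration for a few regularizer strengths, all on one axis.
import matplotlib.pyplot as plt

def plot_train_validate(histories):
    fig, ax = plt.subplots()
    for lam, (train_loss, val_loss) in histories.items():
        (line,) = ax.plot(train_loss, label=f"train, $\\lambda$={lam:.0e}")
        ax.plot(val_loss, linestyle="--", color=line.get_color(),
                label=f"validate, $\\lambda$={lam:.0e}")
    ax.set_xlabel("iteration")
    ax.set_ylabel("loss")
    ax.legend()
    return fig
```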
I don't know if all of these will render, but here's a comparison of 1, 5, or 10 k-folds for sparsity with lambda values of [1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0].
- The difference in the loss evolution, the final CV score, and the final image and imaged residuals is negligible when varying the number of k-folds for a given lambda value.
- There is variation in the loss evolution at intermediate iterations across regularizer strengths, in the regime of strengths that give more realistic images (not grossly under- or over-regularized). But the CV scores are still effectively identical between these cases, and effectively the same as in the cases with clear under- or over-regularization (which are evident from the images).
I ran a train/test loop with the .asdf dataset of IM Lup with no averaging, and am getting a very similar result to that with the time- and frequency-averaged dataset I've been using. It doesn't seem like this is the source of the poor fit at short baselines.
Interesting, what do the residuals look like in units of their sigma, for the real and imaginary components?
I'm not sure, the NuFFT also crashes. But since the images are effectively identical, the model and residual visibilities should be too. E.g., the images' total flux is the same to 2 parts in 1000, so they're both underfitting to the same degree at short baselines. If you want to run more tests with the full dataset, it's probably more practical to run on your cluster. I can send a script to reproduce the pipeline in its current state if needed.
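If/when the prediction does run, this is roughly the check I'd do (just a sketch; `data_vis`, `model_vis`, and `weight` are placeholders, and I'm assuming sigma = 1/sqrt(weight)):

```python
# Sketch: residuals normalized by sigma, split into real and imaginary parts.
# Assumes visibility weights are 1/sigma**2; arrays are placeholders.
import numpy as np

def normalized_residuals(data_vis, model_vis, weight):
    sigma = 1.0 / np.sqrt(weight)
    resid = data_vis - model_vis
    return resid.real / sigma, resid.imag / sigma

# res_re, res_im = normalized_residuals(data_vis, model_vis, weight)
# print(np.mean(res_re), np.std(res_re))  # should be ~0 and ~1 if well fit
# print(np.mean(res_im), np.std(res_im))
```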
Ah, right, memory is also an issue with the NuFFT. Predicting batches of visibilities at a time is probably the way to solve it. I'll need to go down this route for the SGD work, so hopefully that will illuminate the individual visibility residuals.
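Roughly the batching pattern I have in mind (just a sketch; `predict_vis` is a hypothetical stand-in for whatever call ends up doing the NuFFT prediction, and only the chunking matters here):

```python
# Sketch: predict model visibilities in chunks so the full uv list never has
# to go through the NuFFT at once. `predict_vis(u, v)` is a hypothetical
# forward call returning model visibilities at the given uv points.
import numpy as np

def predict_in_batches(predict_vis, u, v, batch_size=100_000):
    chunks = []
    for start in range(0, len(u), batch_size):
        sl = slice(start, start + batch_size)
        chunks.append(predict_vis(u[sl], v[sl]))
    return np.concatenate(chunks)
```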
Yeah I made an issue this morning to that end, #224. Let me know how it goes!
Closing as out of date and out of scope for v0.3 redesign