ConorMacBride/mcalf

The FittingIBIS and LabellingTutorial notebooks expect specific inputs that are not provided

Closed this issue · 8 comments

Both of these example notebooks expect multiple specific files that are not provided, and many users will not have anything to hand that can be simply slotted in to replace these. It would be good to have some real data (if only a few pixels) that can work here, simply to allow for ease of testing with real data, and the troubles that typically come with (noise/non-Voigt profiles etc). Since this sort of data is not a great fit for git, it could always be added as a binary file with the release, so as to be downloaded separately if desired.

Hi @Goobley, thank you for this suggestion. I've been delaying packaging real data as I need to work out the licence compatibility between the code and the data. I'll contact the IBIS team about including some data here. I'll probably include this data in a /examples/data directory. If the licences are not compatible, I may have to host the data in a different repository and provide a link and an SHA hash in the FittingIBIS and LabellingTutorial examples.

Once this data is included, I'll also update FittingIBIS.ipynb and LabellingTutorial.ipynb so they can be run with this real data without the user needing to modify the example's code. Since the v0.2.0 release, I've added a new example (https://mcalf.macbride.me/en/latest/gallery/models/plot_ibis8542model.html). Although it runs as-is, its input data is randomly generated.

Hi @Goobley, I've now merged pull requests (#31, #32) where I have added some real IBIS data. I have also updated LabellingTutorial.ipynb so it runs as-is, loading the IBIS spectra and selecting a random sample to label using the script. Instead of updating FittingIBIS.ipynb to use this data, I created a new gallery example (https://mcalf.macbride.me/en/latest/gallery/models/plot_ibis8542data.html), as the gallery allows for the output of the script to be displayed nicely. The content of FittingIBIS.ipynb has mostly been replaced by gallery examples, however, I'd rather not remove it just yet.

As well as the spectral data, I have also added labelled data for training a neural network with, which is also used in the new example. Also included is a file of pre-calculated results from fitting the array of spectra. This lets users quickly see sample results without having to do the computationally intensive task of fitting. I also include the code for doing the fitting. (Pre-calculated results are also necessary as the documentation hosting provider would not have the resources to do this much fitting when building the documentation.)

Let me know if you'd like anything more changed here, and feel free to close this issue if you are happy with all of this. Thanks!

Hi,
The updated examples look good, thanks for this. However, this comment drew my attention to something I hadn't noticed before:

Results may differ as there is a random factor when training the neural network.

This should only arise from not being able to fix the initialisation of the MLP in the code (the MLP takes a random_state kwarg), and the combination of this, and not having a standardised method for saving/loading the models concerns me. Machine learning has to be reproducible, which I can't see that your classifier model currently is. Retraining something very small like this MLP is fine if the RandomState is fixed so as to make it reproducible.
If I'm missing something here please let me know.

Hi @Goobley, this is a very good point. The standard way that I (and I think others who use this classifier class) save and reload a trained model is to save it to a python pickle file. For example, to save a trained neural network,

import pickle
pkl = open('trainedneuralnetwork.pkl', 'wb')
pickle.dump(model.neural_network, pkl)
pkl.close()

And then to load it again later,

import pickle
pkl = open('trainedneuralnetwork.pkl', 'rb')
model.neural_network = pickle.load(pkl)  # Overwrite the default untrained model

This is covered in FittingIBIS.ipynb. I can add this code into the new example, or would you prefer if I allow the random_state kwarg to be used with the IBIS8542Model.train() method? (I should probably do both.)

Both of those would be ideal, indeed.

As stated here, pickle isn't a good long term storage model for networks. I accept that the training data will persist, so allowing the MLP to be seeded correctly to reproduce the output is an important component too.

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:

  • The training data, e.g. a reference to an immutable snapshot

  • The python source code used to generate the model

  • The versions of scikit-learn and its dependencies

  • The cross validation score obtained on the training data

Hi @Goobley, I've now merged a pull request (#33) where I have allowed a random_state parameter to be passed through to the scikit-learn MLPClassifier object. I've also updated the example (https://mcalf.macbride.me/en/latest/gallery/models/plot_ibis8542data.html) with details on random_state and saving to pickle files. The random_state and pickel options are not perfect for reproducibility, but until a more robust solution is provided by scikit-learn (i.e. not as dependent on system configurations such as package dependency versions) I guess they will have to do.

Hi, I'm happy with this now! Just so you know, there's a couple of typos to 'pickel' in the docs on that example. Feel free to close this issue once you've fixed that!

Excellent! Thanks for spotting the typo. Grepping through the project I could only find one pickel so I think I got them all now. Thanks!