The FittingIBIS and LabellingTutorial notebooks expect specific inputs that are not provided
Closed this issue · 8 comments
Both of these example notebooks expect multiple specific files that are not provided, and many users will not have anything to hand that can simply be slotted in to replace them. It would be good to have some real data (if only a few pixels) that can work here, simply to allow for ease of testing with real data and the troubles that typically come with it (noise, non-Voigt profiles, etc.). Since this sort of data is not a great fit for git, it could always be added as a binary file with the release, to be downloaded separately if desired.
Hi @Goobley, thank you for this suggestion. I've been delaying packaging real data as I need to work out the licence compatibility between the code and the data. I'll contact the IBIS team about including some data here. I'll probably include this data in a /examples/data directory. If the licences are not compatible, I may have to host the data in a different repository and provide a link and an SHA hash in the FittingIBIS and LabellingTutorial examples.
Once this data is included, I'll also update FittingIBIS.ipynb and LabellingTutorial.ipynb so they can be run with this real data without the user needing to modify the example's code. Since the v0.2.0 release, I've added a new example (https://mcalf.macbride.me/en/latest/gallery/models/plot_ibis8542model.html). Although it runs as-is, its input data is randomly generated.
Hi @Goobley, I've now merged pull requests (#31, #32) where I have added some real IBIS data. I have also updated LabellingTutorial.ipynb so it runs as-is, loading the IBIS spectra and selecting a random sample to label using the script. Instead of updating FittingIBIS.ipynb to use this data, I created a new gallery example (https://mcalf.macbride.me/en/latest/gallery/models/plot_ibis8542data.html), as the gallery allows the output of the script to be displayed nicely. The content of FittingIBIS.ipynb has mostly been replaced by gallery examples; however, I'd rather not remove it just yet.
As well as the spectral data, I have also added labelled data for training a neural network, which is also used in the new example. Also included is a file of pre-calculated results from fitting the array of spectra. This lets users quickly see sample results without having to do the computationally intensive fitting themselves. I also include the code for doing the fitting. (Pre-calculated results are also necessary as the documentation hosting provider would not have the resources to do this much fitting when building the documentation.)
Let me know if you'd like anything more changed here, and feel free to close this issue if you are happy with all of this. Thanks!
Hi,
The updated examples look good, thanks for this. However, this comment drew my attention to something I hadn't noticed before:
Results may differ as there is a random factor when training the neural network.
This should only arise from not being able to fix the initialisation of the MLP in the code (the MLP takes a random_state kwarg), and the combination of this and not having a standardised method for saving/loading the models concerns me. Machine learning has to be reproducible, and I can't see that your classifier model currently is. Retraining something very small like this MLP is fine, provided the RandomState is fixed so that the training is reproducible.
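To illustrate the point about seeding, here is a minimal sketch using scikit-learn's MLPClassifier directly (not the mcalf API), with toy data standing in for labelled spectra: fixing random_state makes two independent training runs produce bit-identical weights.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy training data standing in for labelled spectra
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] > 0).astype(int)

def train(seed):
    # Fixing random_state makes weight initialisation and batch
    # shuffling deterministic, so training is reproducible
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=200, random_state=seed)
    clf.fit(X, y)
    return clf

a, b = train(42), train(42)
# Identical seeds on identical data give identical trained weights
assert all(np.array_equal(w1, w2) for w1, w2 in zip(a.coefs_, b.coefs_))
```

Note this only guarantees reproducibility on the same scikit-learn version and configuration; it does not replace saving the trained model.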
If I'm missing something here please let me know.
Hi @Goobley, this is a very good point. The standard way that I (and, I think, others who use this classifier class) save and reload a trained model is to save it to a Python pickle file. For example, to save a trained neural network,
```python
import pickle

# Save the trained network to a pickle file
with open('trainedneuralnetwork.pkl', 'wb') as pkl:
    pickle.dump(model.neural_network, pkl)
```
And then to load it again later,
```python
import pickle

# Overwrite the default untrained model with the saved one
with open('trainedneuralnetwork.pkl', 'rb') as pkl:
    model.neural_network = pickle.load(pkl)
```
This is covered in FittingIBIS.ipynb. I can add this code into the new example, or would you prefer if I allow the random_state kwarg to be used with the IBIS8542Model.train() method? (I should probably do both.)
Both of those would be ideal, indeed.
As stated here, pickle isn't a good long-term storage format for networks. I accept that the training data will persist, so allowing the MLP to be seeded correctly to reproduce the output is an important component too.
In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved alongside the pickled model:
The training data, e.g. a reference to an immutable snapshot
The python source code used to generate the model
The versions of scikit-learn and its dependencies
The cross validation score obtained on the training data
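One way to follow this advice is to bundle the metadata into the same pickle as the model. The sketch below uses scikit-learn directly with toy data; the training-data snapshot name is hypothetical, and in practice it would reference the real labelled IBIS data.

```python
import pickle
import numpy as np
import sklearn
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the labelled training data
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = (X[:, 0] > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=3)  # record the CV score before refitting
clf.fit(X, y)

bundle = {
    'model': clf,
    'sklearn_version': sklearn.__version__,     # dependency versions
    'training_data': 'ibis_training_data_v1',   # hypothetical snapshot reference
    'cv_score': scores.mean(),                  # cross validation score
}
with open('trainedneuralnetwork.pkl', 'wb') as pkl:
    pickle.dump(bundle, pkl)

# Reloading recovers both the model and the metadata needed to rebuild it
with open('trainedneuralnetwork.pkl', 'rb') as pkl:
    restored = pickle.load(pkl)
```

The source code used to generate the model would live in version control rather than in the pickle, so a commit hash could be added to the bundle as well.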
Hi @Goobley, I've now merged a pull request (#33) where I have allowed a random_state parameter to be passed through to the scikit-learn MLPClassifier object. I've also updated the example (https://mcalf.macbride.me/en/latest/gallery/models/plot_ibis8542data.html) with details on random_state and saving to pickle files. The random_state and pickle options are not perfect for reproducibility, but until a more robust solution is provided by scikit-learn (i.e. one less dependent on system configuration, such as package dependency versions), I guess they will have to do.
Hi, I'm happy with this now! Just so you know, there are a couple of 'pickel' typos in the docs on that example. Feel free to close this issue once you've fixed them!
Excellent! Thanks for spotting the typo. Grepping through the project, I could only find one pickel, so I think I've got them all now. Thanks!