Implementation of Spectroscopy File Readers

Question

Implementation of Spectroscopy File Readers

Opened this issue 2 years ago · 18 comments

Describe the functionality you would like to see.

We want to provide readers for different luminescence spectroscopy file formats. In particular,

Jobin-Yvon XML format (.xml) - see hyperspy/rosettasciio#25
- Add Writing of Jobin-Yvon XML files?
- Add IO of Jobin-Yvon HDF5 files?
S&I Trivista XML format (.tvf) - see hyperspy/rosettasciio#27
Renishaw Wire format (.wdf) - see hyperspy/rosettasciio#55
Grams/Galactic SPC file format (.spc)
Hamamatsu streak camera binary file format (...) - see hyperspy/rosettasciio#87

Describe the context

Is it already possible to register file reader extensions with HyperSpy @ericpre and @francisco-dlp? As these are not electron microscopy related file formats, I would propose to have them within LumiSpy - but it would be great to register them so that they work with hs.load similar to how we extend HyperSpy by additional signal types.

Additional information

@pietsjoh would work on the implementation of the readers

Answer 1 · 2022-06-15T12:54:47.000Z

For Renishaw .wdf there is a python parser renishawWiRE based on the Matlab parser.

.spc files can be read by the python packages rohanisaac/spc or its fork spc_spectra. There is also a parser in R.

Both formats can also be read by Gwyddion: http://gwyddion.net/module-list-nocss.en.php#spcfile

Answer 2 · 2022-06-15T13:52:39.000Z

No, registering IO is not implemented. There is a PR to split the IO hyperspy/hyperspy#2174. I agree that it would be good to implement IO plugin registration feature.

Answer 3 · 2022-06-19T20:01:34.000Z

No, registering IO is not implemented. There is a PR to split the IO hyperspy/hyperspy#2174. I agree that it would be good to implement IO plugin registration feature.

If I have a look at the discussion, the intention seems to vary between providing a generic scientific reader package with RosettaSciIO and a more EM focussed package. And there are pro&cons for having a single reader package against several focused on certain fields. So I am not fully decided where to place such readers. Using the HyperSpy reader API makes sense, I guess. And registering them with HyperSpy would really be nice. However, I think it would need someone with some experience on the extensions to implement that.

@hakonanes & @dnjohnstone, I have seen that kikuchipy and pyxem also define custom readers - are they entirely independent or based on the hyperspy-api? Would you also be interested to register them with the hyperspy-io functions?

Answer 4 · 2022-06-20T08:22:23.000Z

All readers in kikuchipy are independent of HyperSpy. But we mirror HyperSpy's API with kikuchipy.load() and kikuchipy.signals.EBSD.save() (exchange EBSD for any other supported signal in kikuchipy). Which reader/writer to call in these functions are decided based on the file extension of filename (and some additional "footprints" for HDF5 files). Thanks to the extension API of HyperSpy, any signals provided by kikuchipy can be "cast" to other supported ones via EBSD.set_signal_type() (kikuchipy's extension file).

The choice to not include the plugins (started in 2019) in HyperSpy was made because (1) I wanted the readers to return a kikuchipy EBSD signal and not HyperSpy's Signal2D, and (2) I wanted them released right away. Many arguments can be made both for and against including plugins of signals not implemented in HyperSpy in HyperSpy's IO, which is the case for kikuchipy's plugins, but I don't see a good reason to upstream the kikuchipy plugins to HyperSpy at the moment.

Answer 5 · 2022-06-20T20:59:50.000Z

The ideas being a single separate library are:

independent development from hyperspy
this library can be used outside hyperspy
avoid dispersion of community efforts/knowledge
provide consistent API for all IO plugins

Answer 6 · 2022-06-21T08:18:17.000Z

Similar to kikuchipy, I would favor keeping the readers in LumiSpy for the moment. Though mirroring the API seems like the easier, but subideal solution to me.

Instead, I think having the extension mechanism for IO would be great (and probably would be needed after the split so that people can keep using hs.load?). However, I don't really know how to go about implementing the extension mechanism.

Answer 7 · 2022-06-21T12:11:03.000Z

OK, I'll resurrect hyperspy/hyperspy#2174. I submitted the draft to RnM, but actually, since it doesn't touch the API at all, I could well resubmitted it to Rnm, isn't it?

Due to popular demand, I will implement a plugin system like the extensions one. I imagine the following structure: HyperSpy import all IO plugins from RosettaSCIO. RosettaSCIO itself just provides light machinery to register IO plugins in other repositories. Those could be standalone IO libraries (e.g. RosettaEMIO) or more general packages (e.g. LumiSpy). What do you think?

Answer 8 · 2022-06-21T12:55:57.000Z

Sounds fair.

An alternative route could be to develop the plugin/extension system in a separate PR to Rnm and in a second step include it in the split.

Answer 9 · 2022-06-22T10:01:08.000Z

My impression is that plugins or entry points is not right solution here and this is the wrong direction of travel, as this would facilitate dispersion of efforts. For example, in the case of mib reader in pyxem, there have been issue with from the start and now it is very difficult to fix it without breaking it - I have tried to improve the situation but it was really a pain and I am under the impression, that this reader needs to be restarted from scratch. At the moment, I don't need it, so I haven't push it any further.

There were reasons/blockers in the past to have plugins in pyxem or kikuchipy, as mentioned by @hakonanes above :

return specify class - fixed by signal registration mechanism
independent development, i.e. be able to release quickly - there are now a lot of automated tooling and an independent library will fix this

Since then, there have been very good progress to address these issue and now, we do have reasons to get all io reader/writer in an independent library. Here I am trying to elaborate more on the reasons which have discussed here and elsewhere:

metadata handling would most likely be easier, for example when it comes to update the metadata or define one single point of source for documenting metadata
implement consistent reader/writer API
coordinate community efforts, which would be useful for future development - likely more difficult if readers are scattered across libraries.
maintenance will be easier, for similar reason as the previous point.
file can contain different type of data which will fit in different libraries, for example, a oxford instrument file (iop or h5oina) containing EBSD and EDS data, in what library should the IO reader should go? I would imagine that we would have more and more of these cases over time.

There is some degree subjectivity/speculation in some of the points above but based on my experience, I think there are fair.

Indeed of scattering the reader/I would suggest that we do the following:

Split IO from hyperspy (hyperspy/hyperspy#2174) as it is, without any change in the API and without plugins / entry points
Make a release of the IO package
Make a minor release of hyperspy using this IO package and do "major pinning" (.i.e <1.0) to the IO library in hyperspy. When the API changes in the IO library, hyperspy will need to be changed accordingly in a major release.
Contribute IO reader / writer to the independent IO library. New reader will be added in minor release of the IO library, independently from any library (hyperspy, lumispy, kikuchipy, pyxem, etc.)

The only thing that we need to be careful with this approach is that we don't break the hyperspy API when splitting the IO part, until hyperspy 2.0 and there are solutions for this issue, either a IO shim in hyperspy or work with two development branch in the IO library, one with the API frozen and the other with the new API.

I think that points 1 to 3 can be done quickly and importantly this would provide a good platform for future development, for example, by starting to improve to metadata situation or by developing the new API of the IO library independently from any other libraries.

Answer 10 · 2022-06-22T10:44:49.000Z

In principle, I am happy with either solution - just we need to know where to start the development of the new readers.

Just a few points that come to my mind:

A single IO library without sub-libraries will become quite large and some people might not be interested in having support for all those file formats.
On the other hand, providing a nexus reader for a large variety of data, mainly differing in the metadata, would make it hard to decide where to put this reader if we split up the package.
Having an integrated IO library is kind of a selling point for HyperSpy and its extension-packages. We have to then advertise well in the Reader package that integration with the HyperSpy family has a lot of benefits and we endorse people to use that library for handling the data (dm3/4 support is what brought me to HyperSpy in the first place).

Answer 11 · 2022-06-22T11:44:19.000Z

In principle, I am happy with either solution - just we need to know where to start the development of the new readers.

I think that the approach suggested above is ready to go, can be done quickly and simply needs a push. @francisco-dlp, does it sound like a fair statement?

Just a few points that come to my mind:

A single IO library without sub-libraries will become quite large and some people might not be interested in having support for all those file formats.

For users, the size of library (as in file size to download) will not be large if we don't package the test data and use pooch to download test data when running the test suite. For contributors, I would expect that it would be more simpler/streamline as most reader should be fairly standalone and we would not have to deal with the IO parsing mechanism in hyperspy, which make thing a bit opaque at first or for new contributor. Basically, splitting it should also lower the barrier for contributing in comparison to contributing to hyperspy.

Having an integrated IO library is kind of a selling point for HyperSpy and its extension-packages. We have to then advertise well in the Reader package that integration with the HyperSpy family has a lot of benefits and we endorse people to use that library for handling the data (dm3/4 support is what brought me to HyperSpy in the first place).

Yes, agreed. Another side of the coin is that several people have implemented readers elsewhere, which were already present in hyperspy because they are not interested to use hyperspy or they have other reasons. This is what I have been told several time in person and I am sure that I am not the only one to have told this!
One aim is to provide a platform which would address these issues and this is something which should be possible with introducing new API of the IO library. This would be a very good milestone for the community!
I don't know if we should endorse a particular library/framework, I would prefer to simply mention libraries using it - so that user knows what is available - and leave them make their own opinion. Of course, some libraries will have a strong case for various applications! ;)

Answer 12 · 2022-06-22T12:53:20.000Z

OK, so I guess the way forward for us would be to start implementing the new readers in a local branch of HyperSpy (instead of directly in LumiSpy) and then hopefully the basic split will be in time so that we can migrate the code to the new repo and do a PR there.

Answer 13 · 2022-06-22T14:45:30.000Z

If you need it to start now or very soon, yes, this is what I would do: start in the branch of a hyperspy fork and once the IO library repository is created, move the code to a fork of the IO library repo (from a quick google search, it should be possible to keep the history if necessary).

@francisco-dlp would know better about that and would be best to confirm it, as he started the split in hyperspy/hyperspy#2174! :)

Answer 14 · 2022-06-22T15:44:15.000Z

Yes, we are planning to start in the next few weeks.

Answer 15 · 2022-06-22T15:59:02.000Z

I have restarted working on #2174. I will resubmit a working prototype next week and create a new repository for rosettascio where you could submit your new readers.

I agree with @ericpre in that a single IO library is better in most respects than having readers scattered around multiple libraries and packages. However, HyperSpy is increasingly used beyond EM, so, while it'll be good to have all the EM readers in the same repository, I don't think that it is feasible to get readers from disparate scientific communities in the same library. Therefore, it seems to me that we should also offer a plugin system so that other communities can plug their IO libraries into HyperSpy.

Answer 16 · 2022-06-22T17:09:46.000Z

If you need it to start now or very soon, yes, this is what I would do: start in the branch of a hyperspy fork and once the IO library repository is created, move the code to a fork of the IO library repo (from a quick google search, it should be possible to keep the history if necessary).

@jlaehne I just wanted to add to this discussion that this is the way I personally approached the new I/O reader for the Attolight files (before @LMSC-NTappy wrote the full version). Working on your own branch in the main HyperSpy is a good compromise, and you can also create a lumispy install tag in setup.py that installs the branch of interest (as we did for the non linear axis for a long while).

Answer 17 · 2022-07-05T13:11:59.000Z

As discussed above, I have restarted working on splitting the IO code. I've just submitted a WIP PR that supersedes #2174, see #2972. I am very interested in your feedback.

Answer 18 · 2022-07-23T18:00:35.000Z

With RosettaSciIO now split from HyperSpy, the new readers can now be contributed there.