Loading pandas hdf5 files
p-j-smith opened this issue · 2 comments
Hi, I've been trying out ExeTera (mainly just creating and loading simple datasets so far) and it seems really nice. I see there's an issue for storing dataframes in Pandas format (#201) but I'm wondering whether it's possible to load hdf5 files written by Pandas? I had a read of the wiki but didn't see this mentioned, although I may have missed it.
I've tried the following:
from exetera.core.session import Session
with Session() as s:
    pandas_dataset = s.open_dataset('my-data-from-pandas.hdf5', 'r', name='pandas')
But I get the following error:
Traceback (most recent call last):
  File "/Users/paul/github/ExeTera/tempnotebook/trying_out_exetera.py", line 4, in <module>
    pandas_dataset = s.open_dataset('tempnotebook/my-data-from-pandas.hdf5', 'r', name='pandas')
  File "/Users/paul/github/ExeTera/exetera/core/session.py", line 86, in open_dataset
    self.datasets[name] = ds.HDF5Dataset(self, dataset_path, mode, name)
  File "/Users/paul/github/ExeTera/exetera/core/dataset.py", line 54, in __init__
    dataframe = edf.HDF5DataFrame(self, group, h5group=h5group)
  File "/Users/paul/github/ExeTera/exetera/core/dataframe.py", line 64, in __init__
    self._columns[subg] = dataset.session.get(h5group[subg])
  File "/Users/paul/github/ExeTera/exetera/core/session.py", line 888, in get
    raise ValueError(f"'{field}' is not a well-formed field")
ValueError: '<HDF5 dataset "axis0": shape (40,), type "|S15">' is not a well-formed field
I assume this is to do with the column names, because I have 40 columns in the dataframe.
Hi. Yes, it would potentially be useful to be able to import a pandas HDF5 file, but it would have to be an import operation rather than opening the file directly. The pandas HDF5 format closely reflects the internal data structures that pandas uses to represent data, while ExeTera lays out its data in the way that suits ExeTera, so the two uses of HDF5 are quite different. I could see us adding an import from pandas HDF5, however.
If you want to load data from a pandas HDF5 file into ExeTera in the meantime, the best thing to do would be to load the dataset with pandas and construct an ExeTera field for each series in a loop.
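A rough sketch of that loop might look like the following. It is not an official importer, and it rests on several assumptions: that your installed ExeTera version provides `create_dataframe`, `create_numeric`, `create_indexed_string` and `field.data.write` as described in the wiki, that `pd.read_hdf` can read the file (this needs PyTables), and that every non-numeric column can reasonably be stored as strings. The helper names `plan_field` and `import_pandas_hdf5`, the dataset name `'dest'` and the default dataframe name `'imported'` are made up for illustration.

```python
def plan_field(kind, name):
    """Choose a field type from a NumPy dtype's .kind and .name.

    Hypothetical helper: signed/unsigned integers, floats and booleans
    become numeric fields; everything else is stored as strings.
    """
    if kind in ("i", "u", "f", "b"):
        return ("numeric", name)
    return ("indexed_string", None)


def import_pandas_hdf5(pandas_path, exetera_path, frame_name="imported"):
    """Copy each column of a pandas HDF5 file into a new ExeTera dataset."""
    # Imported lazily so that plan_field can be used on its own.
    import pandas as pd
    from exetera.core.session import Session

    # Pass key=... to read_hdf if the file holds more than one pandas object.
    pdf = pd.read_hdf(pandas_path)

    with Session() as s:
        ds = s.open_dataset(exetera_path, "w", "dest")
        df = ds.create_dataframe(frame_name)
        for col in pdf.columns:
            series = pdf[col]
            field_type, nformat = plan_field(series.dtype.kind, series.dtype.name)
            if field_type == "numeric":
                # Assumes ExeTera accepts this dtype name as its nformat.
                df.create_numeric(str(col), nformat).data.write(series.to_numpy())
            else:
                # Missing values become the string 'nan' here; handle them
                # explicitly if that matters for your data.
                values = series.astype(str).tolist()
                df.create_indexed_string(str(col)).data.write(values)
```

You would call it as `import_pandas_hdf5('my-data-from-pandas.hdf5', 'my-data-exetera.hdf5')`, where the second path is a placeholder for a new file. Dates and categorical columns would end up as strings with this sketch, so they probably deserve their own handling.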
Okay, thanks for the explanation 🙂